[BioSQL-l] Re: load_ncbi_taxonomy

Hilmar Lapp hlapp at gnf.org
Sun Jun 1 13:45:13 EDT 2003


:-)

I don't have a strong feeling in this case either, as with the new taxon
tables we've made a strong move towards deferring taxon data to an
authority (whose errors are beyond our control).

Obviously, if we remove the constraint completely, there is nothing
to prevent you from deleting a connecting node (e.g., 'Primates' :),
so someone who messed up his taxon data wouldn't notice until it's too
late. I'd be OK with that risk in this case ...
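
For the record, once the constraint is gone nothing stops a statement
like the one below, and the children of the deleted node silently become
orphans (purely illustrative; 9443 is the NCBI taxon id for Primates):

	DELETE FROM taxon WHERE ncbi_taxon_id = 9443;  -- bye-bye, Primates

The orphan query quoted further down in this thread is then the only
way to notice the damage after the fact.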

	-hilmar

On Sunday, June 1, 2003, at 03:51  AM, Aaron J Mackey wrote:

>
> I know how much you love constraints, Hilmar, but I think that dropping
> the constraint may be the best way to go (in general, when slurping from
> outside sources, we always seem to run into things like this, various
> small violations of basic assumptions; constraints help you to recognize
> them, but also create obstacles for continuing work).
>
> -Aaron
>
> On Sun, 1 Jun 2003, Hilmar Lapp wrote:
>
>> I was testing with a download from NCBI that was a couple of months
>> old. I went 'real' now with a current download - and voila - it
>> doesn't work anymore!
>>
>> It turns out the current NCBI taxon database contains nodes that
>> reference nonexistent parents (!). This is what you call a reference
>> database. Wonderful.
>>
>> biosql=> select * from taxon t where not exists
>>     (select 1 from taxon n where n.taxon_id = t.parent_taxon_id);
>>  taxon_id | ncbi_taxon_id | parent_taxon_id | node_rank | genetic_code | mito_genetic_code | left_value | right_value
>> ----------+---------------+-----------------+-----------+--------------+-------------------+------------+-------------
>>    120645 |        120645 |          232080 | no rank   |            1 |                 4 |     205944 |      205945
>>    169388 |        169388 |          232080 | no rank   |            1 |                 4 |     205937 |      205956
>>     13617 |         13617 |          232194 | genus     |            1 |                 5 |     280206 |      280223
>>    114837 |        114837 |          232208 | species   |            1 |                 4 |     210892 |      210893
>>    213248 |        213248 |          232115 | species   |           11 |                 0 |      79778 |       79779
>>    213860 |        213860 |          232120 | genus     |            1 |                 5 |     280412 |      280415
>>    227072 |        227072 |          232120 | genus     |            1 |                 5 |            |
>>    227073 |        227073 |          232149 | species   |            1 |                 5 |            |
>>    227074 |        227074 |          232152 | species   |            1 |                 5 |            |
>>
>>
>> I checked some of these in the nodes.dmp file, and the parents are
>> indeed absent.
>>
>> So I guess we either
>>
>> 	1) forget about enforcing a foreign key constraint on parent_taxon_id
>> (as NCBI apparently happily doesn't do that either)
>>
>> or
>>
>> 	2) audaciously delete the nodes with bogus parents at the end of
>> uploading the nodes (see the sketch below), and leave it to NCBI to
>> bring them back next time with a better parent designation.
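>>
>> For option 2, the cleanup at the end of the upload would be roughly
>> the following (just a sketch, not what the script necessarily does;
>> note it has to be repeated until no rows are deleted, because removing
>> a node can in turn orphan that node's own children):
>>
>> 	DELETE FROM taxon
>> 	 WHERE parent_taxon_id IS NOT NULL
>> 	   AND NOT EXISTS (SELECT 1 FROM taxon p
>> 	                    WHERE p.taxon_id = taxon.parent_taxon_id);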
>>
>> Any votes?
>>
>> 	-hilmar
>>
>> On Sunday, June 1, 2003, at 03:21  AM, Hilmar Lapp wrote:
>>
>>> I've finally got this under control, with some optimizations in the
>>> code, and most importantly some temporary surgical removal of
>>> constraints.
>>>
>>> It turned out there were two performance killers: 1) the insert rule
>>> on the taxon table (which would need the ncbi_taxon_id index to be
>>> analyzed regularly in order to perform properly), and 2) the
>>> self-referential foreign key constraint (even though it was
>>> deferred!). At first it also looked as though the nested set rebuild
>>> phase needed regular VACUUM ANALYZE runs (which require a
>>> commit/begin cycle), but it now seems that's not necessarily true. I
>>> wrote a few plpgsql functions as an API that the script uses to
>>> remove and restore the constraints; you will need to install those
>>> functions to take advantage of this.
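>>>
>>> To give an idea, a function pair along these lines (only a sketch;
>>> the function and constraint names here are made up and will differ
>>> from what the script actually installs):
>>>
>>> CREATE OR REPLACE FUNCTION unconstrain_taxon() RETURNS INTEGER AS '
>>> BEGIN
>>>     -- drop the self-referential FK so bulk inserts don''t pay for it
>>>     ALTER TABLE taxon DROP CONSTRAINT fktaxon_taxon;
>>>     RETURN 1;
>>> END;
>>> ' LANGUAGE plpgsql;
>>>
>>> CREATE OR REPLACE FUNCTION constrain_taxon() RETURNS INTEGER AS '
>>> BEGIN
>>>     -- restore the constraint once the load has finished
>>>     ALTER TABLE taxon ADD CONSTRAINT fktaxon_taxon
>>>         FOREIGN KEY (parent_taxon_id) REFERENCES taxon (taxon_id)
>>>         DEFERRABLE INITIALLY DEFERRED;
>>>     RETURN 1;
>>> END;
>>> ' LANGUAGE plpgsql;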
>>>
>>> Anyway, the whole NCBI taxon database now loads on my laptop in 30-40
>>> mins, which is not too bad I think. It should be considerably faster
>>> on a beefy server with multiple disks (i.e., with the transaction log
>>> on a different disk from the table files).
>>>
>>> I also added a check to the nested set rebuild that skips updates
>>> when they are unnecessary, and the children are now fetched in a
>>> fixed order so that the left/right values always come out the same.
>>> This should make subsequent updates much faster.
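>>>
>>> In SQL terms the idea is roughly this (a sketch, not the literal
>>> statements from the script):
>>>
>>> -- fetch children in a deterministic order so that repeated rebuilds
>>> -- assign the same left/right values
>>> SELECT taxon_id FROM taxon
>>>  WHERE parent_taxon_id = ? ORDER BY ncbi_taxon_id;
>>>
>>> -- rewrite a row only if its nested set values actually changed
>>> UPDATE taxon SET left_value = ?, right_value = ?
>>>  WHERE taxon_id = ?
>>>    AND (left_value IS NULL OR right_value IS NULL
>>>         OR left_value <> ? OR right_value <> ?);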
>>>
>>> 	-hilmar
>>>
>>> On Friday, May 30, 2003, at 12:33  PM, Aaron J Mackey wrote:
>>>
>>>>
>>>> I've had the same response from Postgres; dropping the constraint
>>>> checking seemed to help, but it was still slow (on a beefy database
>>>> server, it finishes in around 20 hours, with no other load).  I
>>>> suspect that maintaining all the indices is what makes it slow
>>>> (though getting them back will still take time, even if you drop
>>>> them while loading the data).
>>>>
>>>> Postgres is not a speedy database.  It has other nice features,
>>>> though.
>>>>
>>>> -Aaron
>>>>
>>>> On Fri, 30 May 2003, Hilmar Lapp wrote:
>>>>
>>>>> I'm adding a bit of documentation, and among the more important
>>>>> things, I'm adding support for Oracle and MySQL and some more
>>>>> verbosity about what's going on. I also fixed the problem with the
>>>>> ncbi_taxon_id mysteriously missing in the uploaded data (I suspect
>>>>> the root cause is a perl 5.6.0-related bug).
>>>>>
>>>>> The snag I've hit is with Postgres. It's unbelievably slow on a
>>>>> fresh database, when really all it should be doing in the first
>>>>> pass is inserting nodes. It starts at ~370 rows/s (this is *after*
>>>>> dropping the insert rule on taxon, otherwise it's very slow right
>>>>> away) and drops to ~40 rows/s and ~20 rows/s after 60k and 130k
>>>>> rows, respectively. Updating the nested set values proceeds at 3-5
>>>>> rows/s (!). So it wouldn't even finish overnight.
>>>>>
>>>>> This is data from my PowerBook G4, 800 MHz, single disk, 4096
>>>>> shared buffers in Pg, Postgres 7.3.2. Certainly not the most
>>>>> suitable hardware configuration, but the rest should be OK. Note
>>>>> that in comparison MySQL does ~2000 rows/s during the insert phase
>>>>> and ~500 rows/s during the update phase, on the same hardware.
>>>>>
>>>>> Is anyone else running this script under Postgres? What's the
>>>>> performance? If it does finish in a reasonable time, what is your
>>>>> Pg configuration, and do you take special measures, like dropping
>>>>> an index beforehand?
>>>>>
>>>>> There are still various options left to speed up the whole thing,
>>>>> but those must be entirely driver-specific then. In particular
>>>>> (sketched below),
>>>>>
>>>>> 	- vacuum analyze after the insert phase
>>>>> 	- drop all indexes on taxon except the primary key during the
>>>>> insert phase (many of them are needed during the update phase,
>>>>> though)
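>>>>>
>>>>> For Pg that would amount to something like this (the index names
>>>>> are made up for illustration; the real ones depend on the schema):
>>>>>
>>>>> -- after the insert phase, refresh the planner statistics
>>>>> VACUUM ANALYZE taxon;
>>>>>
>>>>> -- before the insert phase, keep only the primary key
>>>>> DROP INDEX taxon_ncbi_taxon_id_idx;
>>>>> DROP INDEX taxon_parent_taxon_id_idx;
>>>>> -- ... bulk load ...
>>>>> -- then recreate the indexes the update phase needs
>>>>> CREATE UNIQUE INDEX taxon_ncbi_taxon_id_idx ON taxon (ncbi_taxon_id);
>>>>> CREATE INDEX taxon_parent_taxon_id_idx ON taxon (parent_taxon_id);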
>>>>>
>>>>> I'm just wondering whether it's only me doing something wrong. I
>>>>> doubt that I screwed up the code to this effect, as I haven't
>>>>> really touched the meat of it (SQL statements, code flow,
>>>>> algorithm).
>>>>>
>>>>> Unless Aaron wants me to, I'm not particularly happy committing
>>>>> the updated version with the fixes before I can get this to the
>>>>> point of making at least one full test run on Pg... (apart from
>>>>> MySQL and Oracle).
>>>>>
>>>>> 	-hilmar
>>>>>
>>>>
>>>> --
>>>>  Aaron J Mackey
>>>>  Pearson Laboratory
>>>>  University of Virginia
>>>>  (434) 924-2821
>>>>  amackey at virginia.edu
>>>>
>>>>
>>>>
>>> --
>>> -------------------------------------------------------------
>>> Hilmar Lapp                            email: lapp at gnf.org
>>> GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
>>> -------------------------------------------------------------
>>>
>>> _______________________________________________
>>> BioSQL-l mailing list
>>> BioSQL-l at open-bio.org
>>> http://open-bio.org/mailman/listinfo/biosql-l
>>>
>>
>
> -- 
>  Aaron J Mackey
>  Pearson Laboratory
>  University of Virginia
>  (434) 924-2821
>  amackey at virginia.edu
>
>
>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------


