[BioSQL-l] Re: load_ncbi_taxonomy

Hilmar Lapp hlapp at gnf.org
Sun Jun 1 04:31:28 EDT 2003


I was testing with a download from NCBI that was a couple of months old. I  
went 'real' now with a current download - and voilà - it doesn't work  
anymore!

It turns out the current NCBI taxon database contains nodes that  
reference nonexistent parents (!). And this is what they call a reference  
database. Wonderful.

biosql=> select * from taxon t where not exists
biosql->   (select 1 from taxon n where n.taxon_id = t.parent_taxon_id);
 taxon_id | ncbi_taxon_id | parent_taxon_id | node_rank | genetic_code | mito_genetic_code | left_value | right_value
----------+---------------+-----------------+-----------+--------------+-------------------+------------+-------------
   120645 |        120645 |          232080 | no rank   |            1 |                 4 |     205944 |      205945
   169388 |        169388 |          232080 | no rank   |            1 |                 4 |     205937 |      205956
    13617 |         13617 |          232194 | genus     |            1 |                 5 |     280206 |      280223
   114837 |        114837 |          232208 | species   |            1 |                 4 |     210892 |      210893
   213248 |        213248 |          232115 | species   |           11 |                 0 |      79778 |       79779
   213860 |        213860 |          232120 | genus     |            1 |                 5 |     280412 |      280415
   227072 |        227072 |          232120 | genus     |            1 |                 5 |            |
   227073 |        227073 |          232149 | species   |            1 |                 5 |            |
   227074 |        227074 |          232152 | species   |            1 |                 5 |            |


I checked some of these in the nodes.dmp file; the parents are indeed  
absent.
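For anyone who wants to double-check their own download, a quick sketch of that check in Python (not part of the load script; it assumes the standard dump format, i.e. fields separated by "\t|\t" with tax_id first and parent tax_id second):

```python
# Sketch: scan NCBI's nodes.dmp for nodes whose parent tax_id never
# appears as a node itself. Assumes the usual dump layout: fields
# separated by "\t|\t", first field = tax_id, second = parent tax_id.

def find_orphans(path="nodes.dmp"):
    parents = {}  # tax_id -> parent tax_id
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip("\t|\n").split("\t|\t")
            parents[fields[0]] = fields[1]
    # a node is an orphan if its parent is not itself a node
    return [tid for tid, par in parents.items() if par not in parents]
```

Running this against a current nodes.dmp should list the same tax_ids as the query above.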

So I guess we either

	1) forget about enforcing a foreign key constraint on parent_taxon_id  
(as NCBI apparently happily doesn't enforce it either)

or

	2) audaciously delete the nodes with bogus parents at the end of  
loading the nodes, and leave it to NCBI to bring them back next time  
with a better parent designation.
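For the record, option 2 would boil down to repeating a single DELETE until it affects no rows (since pruning an orphan can orphan its own children). A sketch, demonstrated on an in-memory SQLite copy of a minimal taxon table rather than Postgres, and guarding the root node, whose parent_taxon_id is NULL:

```python
# Sketch of option 2: after loading, repeatedly prune nodes whose
# parent is missing until no orphans remain. The table here is a
# stripped-down stand-in for BioSQL's taxon table; the real script
# would issue the same DELETE against Postgres.
import sqlite3

PRUNE = """
DELETE FROM taxon
 WHERE parent_taxon_id IS NOT NULL  -- don't touch the root
   AND NOT EXISTS (SELECT 1 FROM taxon n
                   WHERE n.taxon_id = taxon.parent_taxon_id)
"""

def prune_orphans(db):
    """Run the prune statement to a fixed point; return rows deleted."""
    total = 0
    while True:
        n = db.execute(PRUNE).rowcount
        if n == 0:
            return total
        total += n
```

The fixed-point loop matters: a single pass would leave behind the descendants of a deleted orphan.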

Any votes?

	-hilmar

On Sunday, June 1, 2003, at 03:21  AM, Hilmar Lapp wrote:

> I've finally got this under control, with some optimizations in the  
> code and, most importantly, some temporary surgical removal of  
> constraints.
>
> It turned out there were two performance killers: 1) the insert rule  
> on the taxon table (which needs the ncbi_taxon_id index to be  
> analyzed regularly in order to perform properly), and 2) the  
> self-referential foreign key constraint (even though it was  
> deferred!). At first it also looked as though the nested set rebuild  
> phase needed regular VACUUM ANALYZE runs (which require a  
> commit/begin cycle), but it now seems that's not necessarily true. I  
> wrote a few pgsql functions as an API that the script uses to remove  
> and restore the constraints. You will need to install those functions  
> to take advantage of this.
>
> Anyway, the whole NCBI taxon database loads now on my laptop in 30-40  
> mins, which is not too bad I think. It should be considerably faster  
> on a beefy server with multiple disks (i.e., with the transaction log  
> on another disk than the table files).
>
> I also changed the nested set rebuild so that it skips updates when  
> they are unnecessary, and fetches the children in a fixed order so  
> that they always come out the same. This should make subsequent  
> updates much faster.
>
> 	-hilmar
>
> On Friday, May 30, 2003, at 12:33  PM, Aaron J Mackey wrote:
>
>>
>> I've had the same response from Postgres; dropping the constraint  
>> checking seemed to help, but it was still slow (on a beefy database  
>> server, it finishes in around 20 hours, with no other load). I  
>> suspect that maintaining all the indices is what's slow (but  
>> rebuilding them afterwards will still take time, even if you drop  
>> them while loading the data).
>>
>> Postgres is not a speedy database.  It has other nice features,  
>> though.
>>
>> -Aaron
>>
>> On Fri, 30 May 2003, Hilmar Lapp wrote:
>>
>>> I'm adding a bit of documentation, and among the more important  
>>> things, I'm adding support for Oracle and MySQL and some more  
>>> verbosity about what's going on. I also fixed the problem with the  
>>> ncbi_taxon_id being mysteriously missing in the uploaded data (I  
>>> suspect the root cause is a perl 5.6.0-related bug).
>>>
>>> The snag I've hit is with Postgres. It's unbelievably slow on a fresh
>>> database, when really all it should be doing in the first pass is
>>> inserting nodes. It starts with ~370 rows/s (this is *after* dropping
>>> the insert rule on taxon, otherwise it's very slow right away) and
>>> drops down to ~40 rows/s and 20 rows/s after 60k and 130k rows,
>>> respectively. Updating the nested set values takes place at 3-5  
>>> rows/s
>>> (!). So, it wouldn't even finish over night.
>>>
>>> This is data from my PowerBook G4, 800 MHz, single disk, 4096 shared
>>> buffers in Pg, Postgres 7.3.2. Certainly not the most suitable  
>>> hardware
>>> configuration, but the rest should be OK. Note that in comparison  
>>> MySQL
>>> does ~2000 rows/s during the insert phase and ~500 rows/s during the
>>> update phase, on the same hardware.
>>>
>>> Is anyone else running this script under Postgres? What's the
>>> performance? If it does finish in a reasonable time, what is your Pg
>>> configuration, and do you take special measures, like dropping an  
>>> index
>>> before or something?
>>>
>>> There are still various options left to speed up the whole thing, but
>>> that must be entirely driver specific then. In particular,
>>>
>>> 	- vacuum analyze after the insert phase
>>> 	- drop all indexes on taxon except the primary key during the insert
>>> phase (many of them are needed though during the update phase)
>>>
>>> I'm just wondering whether it's only me and I'm doing something  
>>> wrong.
>>> I doubt that I screwed up the code to this effect as I haven't really
>>> touched the meat of it (SQL statements, code flow, algorithm).
>>>
>>> Unless Aaron wants me to, I'm not particularly happy committing the  
>>> updated version with the fixes before I can get this to the point  
>>> of making at least one full test run on Pg (apart from MySQL and  
>>> Oracle).
>>>
>>> 	-hilmar
>>>
>>
>> -- 
>>  Aaron J Mackey
>>  Pearson Laboratory
>>  University of Virginia
>>  (434) 924-2821
>>  amackey at virginia.edu
>>
>>
>>
> -- 
> -------------------------------------------------------------
> Hilmar Lapp                            email: lapp at gnf.org
> GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
> -------------------------------------------------------------
>
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at open-bio.org
> http://open-bio.org/mailman/listinfo/biosql-l
>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------
