[BioSQL-l] load_ncbi_taxonomy.pl

Sat Aug 2 00:15:58 UTC 2008

These sound like reasonable times, depending on your machine  
configuration. I suspect that PostgreSQL might even be a bit faster,  
as that's a similar time to what I'm observing on my laptop.

BTW if you provide --verbose=2 on the command line you'll get rows/time
statistics. The slowest steps (recomputing nested set values, and
inserting taxon names) average between 900 and 1800 rows/s on my laptop,
depending on what else is going on (I suspect the Spotlight indexer
contends for the disk drive on occasion). The faster steps (e.g.
inserting taxon nodes) I observe at 2500-4000 rows/s.
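
For a rough sense of what such rates mean in wall-clock time, here is a
back-of-the-envelope sketch. The row count is a made-up placeholder (not
the actual size of nodes.dmp or names.dmp), and the helper names are
purely illustrative; only the rows/s figures come from the numbers above.

```python
def step_seconds(rows, rows_per_sec):
    """Estimated seconds to process `rows` at a steady rows/s rate."""
    if rows_per_sec <= 0:
        raise ValueError("rows_per_sec must be positive")
    return rows / rows_per_sec


def fmt_duration(seconds):
    """Format a duration in seconds as e.g. '2m05s'."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m{secs:02d}s"


# Hypothetical row count, chosen only for illustration.
ROWS = 1_000_000

# Rates quoted above: slow steps 900-1800 rows/s, fast steps 2500-4000 rows/s.
for label, rate in (
    ("slow step, low end", 900),
    ("slow step, high end", 1800),
    ("fast step, low end", 2500),
    ("fast step, high end", 4000),
):
    print(f"{label:20s} {rate:5d} rows/s -> {fmt_duration(step_seconds(ROWS, rate))}")
```

Plugging in the real row counts (e.g. from counting lines in the .dmp
files) should give an estimate you can compare against the --verbose=2
output.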

Thanks for all the testing, it's much appreciated!

	-hilmar

On Aug 1, 2008, at 7:24 PM, Peter wrote:

> On Fri, Aug 1, 2008 at 10:04 PM, Hilmar Lapp <hlapp at gmx.net> wrote:
>> Sounds like I at least managed to silence all the complaining of the
>> script ;-) How long did it run? Was it similar to what you've seen
>> earlier or outrageously longer?
>>
>
> I just ran it again (so updating an already complete database):
>
> $ time perl ./load_ncbi_taxonomy.pl --dbname bioseqdb --driver mysql
> --dbuser root --download true
> Downloading NCBI taxon database to taxdata
> Unable to close datastream at ./load_ncbi_taxonomy.pl line 726
> Loading NCBI taxon database in taxdata:
>        ... retrieving all taxon nodes in the database
>        ... reading in taxon nodes from nodes.dmp
>        ... insert / update / delete taxon nodes
>        ... updating new parent IDs
>        ... (committing nodes)
>        ... rebuilding nested set left/right values
>        ... reading in taxon names from names.dmp
>        ... deleting old taxon names
>        ... inserting new taxon names
>        ... cleaning up
> Done.
>
> real    18m29.409s
> user    2m28.149s
> sys     0m18.025s
>
> Some of that is of course the download time, so without that:
>
> $ time perl ./load_ncbi_taxonomy.pl --dbname bioseqdb --driver mysql
> --dbuser root
> Loading NCBI taxon database in taxdata:
>        ... retrieving all taxon nodes in the database
>        ... reading in taxon nodes from nodes.dmp
>        ... insert / update / delete taxon nodes
>        ... updating new parent IDs
>        ... (committing nodes)
>        ... rebuilding nested set left/right values
>        ... reading in taxon names from names.dmp
>        ... deleting old taxon names
>        ... inserting new taxon names
>        ... cleaning up
> Done.
>
> real    13m18.777s
> user    2m17.285s
> sys     0m14.821s
>
> This is slow, with plenty of disk activity during the taxon names bit.
> However, I haven't got the equivalent numbers from the previous
> script to hand (and it's after midnight here so I won't re-run it now).
> I'd have guessed it used to be about 10 minutes on this machine
> though, i.e. it is probably taking longer, but it was already longer
> than I liked.
>
> I don't know if that helped, but as I said, I hope to do a more
> thorough job later on.
>
> Peter

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================