[BioSQL-l] load_ncbi_taxonomy.pl
Hilmar Lapp
hlapp at gmx.net
Fri Aug 1 20:15:58 EDT 2008
These sound like reasonable times, depending on your machine
configuration. I suspect that PostgreSQL might even be a bit faster,
as that's a similar time to what I'm observing on my laptop.
BTW if you provide --verbose=2 on the command line you'll get rows/time
statistics. The slowest steps (recomputing nested set values, and
inserting taxon names) average between 900 and 1800 rows/s on my laptop,
depending on what else is going on (I suspect the Spotlight indexer
contends for the disk drive on occasion). The faster steps (e.g.
inserting taxon nodes) I observe at up to 2500-4000 rows/s.
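
As a rough way to read the `time` figures quoted further down, the
sketch below converts the real/user/sys values from the run without the
download into seconds and works out what share of the wall-clock time
the Perl client spent on the CPU. The helper names (to_seconds,
cpu_share) are made up for illustration and are not part of
load_ncbi_taxonomy.pl. Keep in mind that user/sys cover only the client
process, so time spent inside the MySQL server or waiting on the disk
shows up only in "real".

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a shell `time` duration such as '13m18.777s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised duration: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def cpu_share(real: str, user: str, sys_: str) -> float:
    """Fraction of wall-clock time the client process spent on the CPU."""
    return (to_seconds(user) + to_seconds(sys_)) / to_seconds(real)


# Figures copied from the quoted run without --download.
share = cpu_share("13m18.777s", "2m17.285s", "0m14.821s")
print(f"client CPU share of wall time: {share:.0%}")
```

A low share here would be consistent with the load being bound by the
database and the disk rather than by the script itself, though it does
not prove it.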
Thanks for all the testing, it's much appreciated!
-hilmar
On Aug 1, 2008, at 7:24 PM, Peter wrote:
> On Fri, Aug 1, 2008 at 10:04 PM, Hilmar Lapp <hlapp at gmx.net> wrote:
>> Sounds like I at least managed to silence all the complaining of
>> the script ;-) How long did it run? Was it similar to what you've
>> seen earlier or outrageously longer?
>>
>
> I just ran it again (so updating an already complete database):
>
> $ time perl ./load_ncbi_taxonomy.pl --dbname bioseqdb --driver mysql
> --dbuser root --download true
> Downloading NCBI taxon database to taxdata
> Unable to close datastream at ./load_ncbi_taxonomy.pl line 726
> Loading NCBI taxon database in taxdata:
> ... retrieving all taxon nodes in the database
> ... reading in taxon nodes from nodes.dmp
> ... insert / update / delete taxon nodes
> ... updating new parent IDs
> ... (committing nodes)
> ... rebuilding nested set left/right values
> ... reading in taxon names from names.dmp
> ... deleting old taxon names
> ... inserting new taxon names
> ... cleaning up
> Done.
>
> real 18m29.409s
> user 2m28.149s
> sys 0m18.025s
>
> Some of that is of course the download time, so without that:
>
> $ time perl ./load_ncbi_taxonomy.pl --dbname bioseqdb --driver mysql
> --dbuser root
> Loading NCBI taxon database in taxdata:
> ... retrieving all taxon nodes in the database
> ... reading in taxon nodes from nodes.dmp
> ... insert / update / delete taxon nodes
> ... updating new parent IDs
> ... (committing nodes)
> ... rebuilding nested set left/right values
> ... reading in taxon names from names.dmp
> ... deleting old taxon names
> ... inserting new taxon names
> ... cleaning up
> Done.
>
> real 13m18.777s
> user 2m17.285s
> sys 0m14.821s
>
> This is slow, with plenty of disk activity during the taxon names bit.
> However, I haven't got the equivalent numbers from the previous
> script to hand (and it's after midnight here so I won't re-run it now).
> I'd have guessed it used to be about 10 minutes on this machine
> though, i.e. it is probably taking longer, but it was already longer
> than I liked.
>
> I don't know if that helped, but as I said, I hope to do a more
> thorough job later on.
>
> Peter
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================