[Bioperl-l] [BioSQL-l] Loading sequences with novel NCBI taxon_id

Sat Apr 12 01:32:14 UTC 2008

Chris Fields wrote:
> The counter to that perspective (using new sequences with old tax info) 
> would be to regularly update NCBI taxonomy, particularly in 
> circumstances prior to adding new sequences.  Hilmar mentioned that once 
> tax is loaded it doesn't take as long to update, so you could set up a 
> cron job to update regularly.
> 
> I remember someone mentioning weekly or monthly updates on the list 
> quite a while ago, but I'm unsure how often NCBI updates tax information 
> (i.e. with every release, monthly, weekly, etc).  I can see instances 
> popping up where you used the an up-to-date taxonomy but a new sequence 
> contains a tax ID not present.  I think bioperl-db handles these but I'm 
> not sure what other Bio* do.
> 

I spent some time benchmarking this and inspecting the mysql log files.
The current load_ncbi_taxonomy.pl script with minor modification to
show timestamps does this on initial import into mysql and then update
of the database using exactly same dataset (but anyway it has to walk
through all the data):

$ ./load_ncbi_taxonomy.pl --dbname=biosqldb --driver=mysql --host=127.0.01 \
  --port=3306 --directory=/home/mmokrejs/bioinformatics/databases/ncbitax/dump \
  --chunksize=0 --verbose=2 --mycnf=~/.my.cnf
Sat Apr 12 01:58:43 MEST 2008
Loading NCBI taxon database in /home/mmokrejs/bioinformatics/databases/ncbitax/dump:
       ... retrieving all taxon nodes in the database
Sat Apr 12 01:58:43 MEST 2008
       ... reading in taxon nodes from nodes.dmp
Sat Apr 12 01:58:58 MEST 2008
       ... insert / update / delete taxon nodes
                10000/421098 done (in 5 secs, 2000.0 rows/s)
                20000/421098 done (in 4 secs, 2500.0 rows/s)
...
                420000/421098 done (in 4 secs, 2500.0 rows/s)
Sat Apr 12 02:02:21 MEST 2008
       ... (committing nodes)
Sat Apr 12 02:02:21 MEST 2008
       ... rebuilding nested set left/right values
                10000 done (in 24 secs, 416.7 rows/s)
                20000 done (in 26 secs, 384.6 rows/s)
                30000 done (in 24 secs, 416.7 rows/s)
...
                420004 done (in 23 secs, 434.8 rows/s)
Sat Apr 12 02:19:25 MEST 2008
       ... reading in taxon names from names.dmp
Sat Apr 12 02:19:25 MEST 2008
       ... deleting old taxon names
Sat Apr 12 02:19:25 MEST 2008
       ... inserting new taxon names
                10000 done (in 8 secs, 1250.0 rows/s)
                20000 done (in 8 secs, 1250.0 rows/s)
...
                580000 done (in 5 secs, 2000.0 rows/s)
Sat Apr 12 02:24:48 MEST 2008
       ... cleaning up
Sat Apr 12 02:24:49 MEST 2008
Done.
$

I decided to re-import the same data to mimic at least somehow
the future updates, although no record should be UPDATEd,
except zapping left and right values with NULL. :((

$ ./load_ncbi_taxonomy.pl --dbname=biosqldb --driver=mysql --host=127.0.01
  --port=3306 --directory=/home/mmokrejs/bioinformatics/databases/ncbitax/dump \
  --chunksize=0 --verbose=2 --mycnf=~/.my.cnf
Sat Apr 12 02:35:20 MEST 2008
Loading NCBI taxon database in /home/mmokrejs/bioinformatics/databases/ncbitax/dump:
        ... retrieving all taxon nodes in the database
Sat Apr 12 02:35:26 MEST 2008
       ... reading in taxon nodes from nodes.dmp
Sat Apr 12 02:35:46 MEST 2008
       ... insert / update / delete taxon nodes
                10000/421098 done (in 0 secs, 10000.0 rows/s)
                20000/421098 done (in 0 secs, 10000.0 rows/s)
...
                410000/421098 done (in 0 secs, 10000.0 rows/s)
                420000/421098 done (in 0 secs, 10000.0 rows/s)
Sat Apr 12 02:35:55 MEST 2008
       ... (committing nodes)
Sat Apr 12 02:35:55 MEST 2008
       ... rebuilding nested set left/right values
                10000 done (in 9 secs, 1111.1 rows/s)
                20000 done (in 9 secs, 1111.1 rows/s)
...
                410004 done (in 8 secs, 1250.0 rows/s)
                420004 done (in 9 secs, 1111.1 rows/s)
Sat Apr 12 02:41:54 MEST 2008
       ... reading in taxon names from names.dmp
Sat Apr 12 02:41:54 MEST 2008
       ... deleting old taxon names
Sat Apr 12 02:41:55 MEST 2008
       ... inserting new taxon names
                10000 done (in 5 secs, 2000.0 rows/s)
                20000 done (in 5 secs, 2000.0 rows/s)
...
                570000 done (in 6 secs, 1666.7 rows/s)
                580000 done (in 5 secs, 2000.0 rows/s)
Sat Apr 12 02:47:27 MEST 2008
       ... cleaning up
Sat Apr 12 02:47:27 MEST 2008
Done.
$ ls -la /var/log/mysql/mysql.log 
-rw-rw---- 1 mysql mysql 483443314 Apr 12 03:15 /var/log/mysql/mysql.log
$

Pentium4 M laptop, 1.8GHz, 1 GB RAM, mysql-5.0.56 with enabled
SQL text logging, the slow version of logging all SQL commands
compared to binary logging. The log was cleared before the tests.
I could provide some bits from the log or upload it somewhere
if anybody else would like to dig into the details.

I believe the recalculation step could be made faster. See what
happens:

                     31 Query       SELECT taxon_id, left_value, right_value FROM taxon WHERE parent_taxon_id = '1' ORDER BY ncbi_taxon_id
                     31 Query       SELECT taxon_id, left_value, right_value FROM taxon WHERE parent_taxon_id = '10239' ORDER BY ncbi_taxon_id
                     31 Query       SELECT taxon_id, left_value, right_value FROM taxon WHERE parent_taxon_id = '12333' ORDER BY ncbi_taxon_id
                     31 Query       SELECT taxon_id, left_value, right_value FROM taxon WHERE parent_taxon_id = '12335' ORDER BY ncbi_taxon_id
                     31 Query       UPDATE taxon SET left_value = NULL, right_value = NULL WHERE left_value = '4'
                     31 Query       UPDATE taxon SET left_value = NULL, right_value = NULL WHERE right_value = '5'
                     31 Query       UPDATE taxon SET left_value = '4', right_value = '5' WHERE taxon_id = '12335'
                     31 Query       SELECT taxon_id, left_value, right_value FROM taxon WHERE parent_taxon_id = '12340' ORDER BY ncbi_taxon_id
                     31 Query       UPDATE taxon SET left_value = NULL, right_value = NULL WHERE left_value = '6'
                     31 Query       UPDATE taxon SET left_value = NULL, right_value = NULL WHERE right_value = '7'
                     31 Query       UPDATE taxon SET left_value = '6', right_value = '7' WHERE taxon_id = '12340'

The columns left_value and right_value have NULL value upon
the table is created, so no need to write again NULL into
them. This would mean writing a wrapper function which would
mimic update() but before doing that it would do 'SELECT * FROM',
compare the values with those to be written and include in the
final UPDATE statement only those columns for which values have
been changed. We use such a smart wrapper for our code in python.
;-)

When the columns for left and right are to be made NULL during
update of an existing database, I think it would be much faster
to drop the columns and re-create them again with NULL values.

I think it could be investigated more the possibility to create
empty taxon and taxon_name tables as MyISAM tables and only after
all the import and updates they could be converted into InnoDB
tables. One would have to probably think a bit more of the foreign
keys but it might be they would not even be lost during the conversion
back and forth.

Actually, easy to check. Dump your current taxon and taxon_name
tables (maybe even without sql data using --without-data), run
'ALTER TABLE taxon ... type=MyISAM'
followed by
'ALTER TABLE taxon ... type=InnoDB'
dump again the database structure and compare by diff with
the original.

But, time for sleep here.
Martin