[BioSQL-l] Concerns the update of BioSQL.taxon table

Wed Mar 26 11:29:24 UTC 2008

Thank you Peter for the correct email of the BioSQL list.

No, this is not linked to the BioPython 1.45 upgrade: the behavior is the same as with 1.44. My problem comes from the fact that BioSQL schema version 1.0.0 defines a *unique* index on taxon.ncbi_taxon_id. I did not have this index before.

I have written a script that connects to the NCBI taxonomy database and gets the XML data for the species. It then updates the taxon table, replacing the NULL ncbi_taxon_id and node_rank values with their real values for the whole lineage. I call it after loading the BioSeqs into the database.
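
For readers who have not seen the script, the lineage extraction step could look roughly like the sketch below. It is not the actual script: the function name parse_lineage and the trimmed SAMPLE_XML are illustrative assumptions, the sample only mimics the general shape of an NCBI Taxonomy XML record (TaxaSet/Taxon/LineageEx), and no network call is made.

```python
import xml.etree.ElementTree as ET

# Trimmed, illustrative sample shaped like an NCBI Taxonomy XML record.
# Only three levels are kept; the IDs are the ones quoted in this post.
SAMPLE_XML = """<TaxaSet>
  <Taxon>
    <TaxId>229391</TaxId>
    <ScientificName>Nannophya pygmaea</ScientificName>
    <Rank>species</Rank>
    <LineageEx>
      <Taxon><TaxId>6964</TaxId><ScientificName>Libellulidae</ScientificName><Rank>family</Rank></Taxon>
      <Taxon><TaxId>229390</TaxId><ScientificName>Nannophya</ScientificName><Rank>genus</Rank></Taxon>
    </LineageEx>
  </Taxon>
</TaxaSet>"""


def parse_lineage(xml_text):
    """Return [(ncbi_taxon_id, node_rank), ...] ordered from root to species.

    Hypothetical helper: assumes one top-level <Taxon> with a <LineageEx> child.
    """
    root = ET.fromstring(xml_text)
    taxon = root.find("Taxon")
    lineage = [
        (int(node.findtext("TaxId")), node.findtext("Rank"))
        for node in taxon.findall("LineageEx/Taxon")
    ]
    # The species itself is not part of LineageEx, so append it last.
    lineage.append((int(taxon.findtext("TaxId")), taxon.findtext("Rank")))
    return lineage


lineage = parse_lineage(SAMPLE_XML)
```

Each (ncbi_taxon_id, node_rank) pair can then be written into the matching taxon row, walking up the parent_taxon_id chain from the species.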

Example:
I load a BioSeq for Nannophya pygmaea, then I run my script to update the ncbi_taxon_id and node_rank:
+----------+---------------+-----------------+--------------+
| taxon_id | ncbi_taxon_id | parent_taxon_id | node_rank    |
+----------+---------------+-----------------+--------------+
|       13 |          2759 |            NULL | superkingdom |
|       14 |         33208 |              13 | kingdom      |
|       15 |          6656 |              14 | phylum       |
|       16 |          6960 |              15 | superclass   |
|       17 |         50557 |              16 | class        |
|       18 |          7496 |              17 | no rank      |
|       19 |         33339 |              18 | subclass     |
|       20 |          6961 |              19 | order        |
|       21 |          6962 |              20 | suborder     |
|       22 |          6964 |              21 | family       |
|       23 |        229390 |              22 | genus        |
|       24 |        229391 |              23 | species      |
+----------+---------------+-----------------+--------------+

No problem.

Now I load another Libellulidae (Orthetrum sabina): 'empty/NULL' taxon records are inserted by the BioPython db.load() function:
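+----------+---------------+-----------------+--------------+
| taxon_id | ncbi_taxon_id | parent_taxon_id | node_rank    |
+----------+---------------+-----------------+--------------+
|       25 |          NULL |            NULL | NULL         |
|       26 |          NULL |              25 | NULL         |
|       27 |          NULL |              26 | NULL         |
|       28 |          NULL |              27 | NULL         |
|       29 |          NULL |              28 | NULL         |
|       30 |          NULL |              29 | NULL         |
|       31 |          NULL |              30 | NULL         |
|       32 |          NULL |              31 | NULL         |
|       33 |          NULL |              32 | NULL         |
|       34 |          NULL |              33 | NULL         |
|       35 |          NULL |              34 | genus        |
|       36 |        320892 |              35 | species      |
+----------+---------------+-----------------+--------------+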

Then I try to run my script. This time the update fails because record 34 is the SAME family as record 22, and hence gets the same ncbi_taxon_id: 'duplicate entry on key 2'.
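
The failure can be reproduced in isolation with a small sketch. The assumptions here: an in-memory SQLite database stands in for the real server, the taxon table is reduced to the four columns shown above, and the index name taxon_ncbi_idx and the helper try_update are made up for the example. The error text differs from the one quoted above, but the cause is the same unique index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced stand-in for the BioSQL taxon table (only the columns shown above).
conn.execute(
    "CREATE TABLE taxon ("
    " taxon_id INTEGER PRIMARY KEY,"
    " ncbi_taxon_id INTEGER,"
    " parent_taxon_id INTEGER,"
    " node_rank TEXT)"
)
# The unique index that triggers the problem (index name is hypothetical).
conn.execute("CREATE UNIQUE INDEX taxon_ncbi_idx ON taxon (ncbi_taxon_id)")

# Record 22: the family, already updated after the first load.
conn.execute("INSERT INTO taxon VALUES (22, 6964, 21, 'family')")
# Records 33 and 34: NULL placeholders from the second load.
# Several NULLs do not violate the unique index.
conn.execute("INSERT INTO taxon VALUES (33, NULL, 32, NULL)")
conn.execute("INSERT INTO taxon VALUES (34, NULL, 33, NULL)")


def try_update(connection):
    """Try to give record 34 the family's NCBI id; report what happened."""
    try:
        connection.execute(
            "UPDATE taxon SET ncbi_taxon_id = 6964, node_rank = 'family'"
            " WHERE taxon_id = 34"
        )
        return "updated"
    except sqlite3.IntegrityError:
        # The unique index rejects a second row with ncbi_taxon_id 6964.
        return "rejected"


result = try_update(conn)
```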

Either this *unique* index is new and this is a BioSQL "issue" (as I said, this index did not exist in my previous BioSQL database, so I never ran into this before), or the way BioPython "repeats" existing taxa is incorrect/not compatible. In that case, when inserting the second BioSeq, record 34 should not be created; instead, record 35 (the genus) should "point" to the already existing family at record 22 as its parent.
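
To make the second option concrete, here is a minimal sketch of a loader that reuses an existing taxon whenever its ncbi_taxon_id is already in the table. This is not how BioPython's db.load() works, and it presumes the NCBI ids are known at load time. The names make_db and load_lineage, the SQLite stand-in, and the Orthetrum genus id are all assumptions for illustration; in particular the genus id is a placeholder, since the post does not give the real one.

```python
import sqlite3

# Placeholder only: NOT a real NCBI id (the real genus id is not in the post).
HYPOTHETICAL_ORTHETRUM_GENUS_ID = 900000001


def make_db():
    """Create an in-memory stand-in for the taxon table with a unique NCBI id."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE taxon ("
        " taxon_id INTEGER PRIMARY KEY,"
        " ncbi_taxon_id INTEGER UNIQUE,"
        " parent_taxon_id INTEGER,"
        " node_rank TEXT)"
    )
    return conn


def load_lineage(conn, lineage):
    """Insert a root-to-species lineage, reusing rows that already exist.

    lineage is a list of (ncbi_taxon_id, node_rank) pairs.
    Returns the taxon_id of the last (most specific) node.
    """
    parent_id = None
    for ncbi_id, rank in lineage:
        row = conn.execute(
            "SELECT taxon_id FROM taxon WHERE ncbi_taxon_id = ?", (ncbi_id,)
        ).fetchone()
        if row is not None:
            # Already present: point the next level at the existing record.
            parent_id = row[0]
        else:
            cur = conn.execute(
                "INSERT INTO taxon (ncbi_taxon_id, parent_taxon_id, node_rank)"
                " VALUES (?, ?, ?)",
                (ncbi_id, parent_id, rank),
            )
            parent_id = cur.lastrowid
    return parent_id


conn = make_db()
# Abbreviated lineages: only family, genus and species are kept.
load_lineage(conn, [(6964, "family"), (229390, "genus"), (229391, "species")])
load_lineage(
    conn,
    [(6964, "family"), (HYPOTHETICAL_ORTHETRUM_GENUS_ID, "genus"), (320892, "species")],
)
```

With this approach the second load adds only the new genus and species; the family row is shared, so the unique index is never violated.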

I would therefore like confirmation from the BioSQL team that the unique index is valid. If it is, we can then have a separate BioPython discussion about how to improve the management of the taxon table.

Best regards,

Eric
