[BioSQL-l] [BioPython] Concerns the update of BioSQL.taxon table

Wed Mar 26 12:30:50 UTC 2008

On Wed, Mar 26, 2008 at 11:29 AM, Eric Gibert <ericgibert at yahoo.fr> wrote:
> Thank you Peter for the correct email of the BioSQL list.
>
> No, it is not something linked to BioPython 1.45 upgrade: same behavior as 1.44.
> My problem is linked to the fact that the BioSQL schema version 1.0.0 defines a
> *unique* index on taxon.ncbi_taxon_id. I did not have this index before.
>
>  I have written a script that connects to the taxonomy database of NCBI and gets
>  the XML data for the species. Then it updates the taxon table, replacing the
>  ncbi_taxon_id and node_rank NULL by their values for all the lineage. I call it
>  after the loading of BioSeqs in the database.

So you wrote your own version of the BioSQL perl script load_ncbi_taxonomy.pl?

>  Example:
>  I load a BioSeq for Nannophya pygmaea then I run my script to update the  ncbi_taxon_id and rank:
>  +----------+---------------+-----------------+--------------+
>  | taxon_id | ncbi_taxon_id | parent_taxon_id | node_rank    |
>  +----------+---------------+-----------------+--------------+
>  |       13 |          2759 |            NULL | superkingdom |
>  |       14 |         33208 |              13 | kingdom      |
>  |       15 |          6656 |              14 | phylum       |
>  |       16 |          6960 |              15 | superclass   |
>  |       17 |         50557 |              16 | class        |
>  |       18 |          7496 |              17 | no rank      |
>  |       19 |         33339 |              18 | subclass     |
>  |       20 |          6961 |              19 | order        |
>  |       21 |          6962 |              20 | suborder     |
>  |       22 |          6964 |              21 | family       |
>  |       23 |        229390 |              22 | genus        |
>  |       24 |        229391 |              23 | species      |
>
>  No problem.
>
>  Now I insert/load another Libellulidae (Orthetrum sabina): 'empty/NULL'
>  taxon records are inserted by the db.load() BioPython function:

These records are "guess work" based on the lineage in the GenBank
file - we don't know the NCBI taxon ids, so they are NULL, nor the
rank, but there is a scientific name in the linked taxon_name table.  I
am open to the idea of not writing this guessed lineage, and just
writing one entry for the species and the given NCBI taxon ID.

However, as the new entry Orthetrum sabina should share some of its
lineage with Nannophya pygmaea, then I agree Biopython *should* be
re-using those existing taxon entries, if it can match them safely
using the scientific name.  Re-reading the relevant bit of old code,
it doesn't seem to do this.  I've filed bug 2475:
http://bugzilla.open-bio.org/show_bug.cgi?id=2475
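
To make the idea of re-using existing taxon entries concrete, here is a
minimal sketch. It is not what Biopython currently does. It assumes a
cut-down version of the taxon and taxon_name tables in an in-memory SQLite
database, and the function names (make_demo_db, get_or_create_lineage) and
the abbreviated lineages are hypothetical, chosen only for illustration. A
node is re-used only when both its scientific name and its parent match,
which guards against some homonyms but not all of them.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory SQLite database with a cut-down taxon schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE taxon (
            taxon_id        INTEGER PRIMARY KEY,
            ncbi_taxon_id   INTEGER UNIQUE,   -- NULL until looked up at NCBI
            parent_taxon_id INTEGER,
            node_rank       TEXT
        );
        CREATE TABLE taxon_name (
            taxon_id   INTEGER NOT NULL,
            name       TEXT NOT NULL,
            name_class TEXT NOT NULL
        );
        """
    )
    return conn


def get_or_create_lineage(conn, lineage):
    """Walk a lineage (root first) and return the taxon_id of its last node.

    An existing node is re-used when it has the same scientific name AND the
    same parent; otherwise a new node with NULL ncbi_taxon_id is inserted.
    """
    parent_id = None
    for name in lineage:
        # "IS ?" is SQLite's NULL-safe comparison, so it also matches the
        # root node, whose parent is NULL. Other databases spell this
        # differently.
        row = conn.execute(
            "SELECT t.taxon_id FROM taxon t "
            "JOIN taxon_name n ON n.taxon_id = t.taxon_id "
            "WHERE n.name = ? AND n.name_class = 'scientific name' "
            "AND t.parent_taxon_id IS ?",
            (name, parent_id),
        ).fetchone()
        if row is not None:
            parent_id = row[0]  # re-use the existing node
            continue
        cur = conn.execute(
            "INSERT INTO taxon (parent_taxon_id) VALUES (?)", (parent_id,)
        )
        conn.execute(
            "INSERT INTO taxon_name (taxon_id, name, name_class) "
            "VALUES (?, ?, 'scientific name')",
            (cur.lastrowid, name),
        )
        parent_id = cur.lastrowid
    return parent_id


# Abbreviated, illustrative lineages (not the full NCBI lineage).
conn = make_demo_db()
shared = ["Eukaryota", "Metazoa", "Arthropoda", "Insecta", "Odonata", "Libellulidae"]
first = get_or_create_lineage(conn, shared + ["Nannophya", "Nannophya pygmaea"])
second = get_or_create_lineage(conn, shared + ["Orthetrum", "Orthetrum sabina"])
```

With this approach the second load adds only the genus and species rows,
and the new genus points at the family row created by the first load.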

This is actually a tricky problem, requiring some 'clever' parent
linkage as you said in your earlier email.  Hilmar wrote this about
the equivalent code in BioPerl:

>>  It's pretty unreliable actually. There is not only synonymy but also
>>  rampant homonymy in taxonomic names. There are plenty of examples
>>  for the same scientific name in use for a plant and for some animal, for
>>  example. So in order to be unambiguous you will need to know (and
>>  check) the kingdom.

See http://lists.open-bio.org/pipermail/biosql-l/2008-March/001207.html

Eric wrote:
>  then I try to run my script: this time I have an update failure because the
> record 34 is the SAME family hence same ncbi_taxon_id as record 22:
> 'duplicate entry on key 2'.
>
>  Either this *unique* index is new and it is a BioSQL "issue" (as said, this index
> did not exist in my previous BioSQL db so I never encountered this issue before),

Hopefully Hilmar from BioSQL can answer this.

> OR the way BioPython "repeats" existing taxa is incorrect/not compatible.
> In that case, when inserting the second BioSeq, record 34 should not be created
> but record 35 (the genus) should "point" to the already existing family at record
> 22 as its father.

This example might be easier to follow if the scientific names from
the taxon_name were included.  I would check the lineage but the NCBI
webpage is being very slow for me right now.

In the short term, as a quick fix, your script could first remove
taxon entries with a blank NCBI taxon ID (and clear any keys pointing
to them).  Not elegant - but it would work.
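
As a rough sketch of that quick fix, assuming simplified taxon, taxon_name
and bioentry tables in an in-memory SQLite database: the function name
remove_guessed_taxa is hypothetical, record 22 and its NCBI taxon ID come
from Eric's table, and records 34 to 36 plus the species NCBI taxon ID
999999 are placeholders made up for the demonstration. The same-table
subquery in the UPDATE works in SQLite; other databases may need it
written differently.

```python
import sqlite3


def remove_guessed_taxa(conn):
    """Delete taxon rows with no NCBI taxon ID and clear references to them.

    Returns the number of taxon rows removed.
    """
    guessed = "SELECT taxon_id FROM taxon WHERE ncbi_taxon_id IS NULL"
    with conn:  # single transaction: commit on success, roll back on error
        # Detach any bioentry that points at a guessed taxon.
        conn.execute(
            f"UPDATE bioentry SET taxon_id = NULL WHERE taxon_id IN ({guessed})"
        )
        # Drop the names attached to guessed taxa.
        conn.execute(f"DELETE FROM taxon_name WHERE taxon_id IN ({guessed})")
        # Surviving rows must not keep a parent link to a row about to go.
        conn.execute(
            f"UPDATE taxon SET parent_taxon_id = NULL "
            f"WHERE parent_taxon_id IN ({guessed})"
        )
        removed = conn.execute(
            "DELETE FROM taxon WHERE ncbi_taxon_id IS NULL"
        ).rowcount
    return removed


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE taxon (
        taxon_id        INTEGER PRIMARY KEY,
        ncbi_taxon_id   INTEGER UNIQUE,
        parent_taxon_id INTEGER,
        node_rank       TEXT
    );
    CREATE TABLE taxon_name (taxon_id INTEGER, name TEXT, name_class TEXT);
    CREATE TABLE bioentry (bioentry_id INTEGER PRIMARY KEY, taxon_id INTEGER);

    INSERT INTO taxon VALUES (22, 6964, 21, 'family');  -- already updated
    INSERT INTO taxon VALUES (34, NULL, 33, NULL);      -- guessed family
    INSERT INTO taxon VALUES (35, NULL, 34, NULL);      -- guessed genus
    INSERT INTO taxon VALUES (36, 999999, 35, NULL);    -- species, placeholder ID
    INSERT INTO taxon_name VALUES (34, 'Libellulidae', 'scientific name');
    INSERT INTO taxon_name VALUES (35, 'Orthetrum', 'scientific name');
    INSERT INTO taxon_name VALUES (36, 'Orthetrum sabina', 'scientific name');
    INSERT INTO bioentry VALUES (1, 36);
    """
)
removed = remove_guessed_taxa(conn)
```

After this runs, only rows that carry an NCBI taxon ID remain, and the
species row has a NULL parent until the update script fills in its
lineage again.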

Thanks Eric

Peter