[BioSQL-l] Loading sequences with novel NCBI taxon id

Mon Mar 17 16:08:43 UTC 2008

On Sun, Mar 16, 2008 at 10:54 PM, Hilmar Lapp <hlapp at gmx.net> wrote:
>  > Should we [Biopython] also try and add the lineage into the taxon/
>  > taxon_name tables, linking to existing entries based on matching scientific
>  > names where possible?  Or, should we just add a single taxonomy entry
>  > for the new species, with no lineage links at all?
>
>  This should probably depend on how good or complete the lineage
>  information is that you have. BioPerl parses this out of the sequence
>  files (for formats that have it, such as GenBank, EMBL, UniProt), and
>  so except for exotic clades that don't follow the typical patterns it
>  is usually in good shape (though one might say that the majority of
>  clades are exotic).

I'm currently testing with GenBank, EMBL and SwissProt/UniProt files.
Some of these files are several years old, and include horrible
multi-species SwissProt files with "species" names longer than 255
characters, etc.  The good news is that, as you pointed out on another
thread on the BioSQL mailing list earlier this month, SwissProt doesn't
seem to do this anymore.

>  Moreover, it's worth noting that the NCBI taxonomy often contains
>  more nodes in a lineage than are shown in the GenBank record. In this
>  case, unless you know which levels (ranks) to print and which not to,
>  having the full NCBI taxonomy information may in fact cause problems
>  for round-tripping.

I've come to accept that taxonomy information won't always survive a round trip.

>  > The old Biopython code also used to add taxon table entries for the
>  > full lineage - trying to reuse existing entries based on string
>  > matching to the scientific name field in the taxon_name table.  This
>  > strikes me as a little unreliable (which is why I used the term "guess
>  > work" in my earlier email).
>
>  It's pretty unreliable actually. There is not only synonymy but also
>  rampant homonymy in taxonomic names. There are plenty of examples for
>  the same scientific name in use for a plant and for some animal, for
>  example. So in order to be unambiguous you will need to know (and
>  check) the kingdom.

I don't think the current Biopython code for recording the lineages checks the
kingdom.  Could someone point me at the relevant bit of BioPerl, so I can see
exactly what it does?
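
To make the homonym problem concrete, here is a minimal sketch in Python.
It is not the Biopython or BioPerl code. The rows are a hypothetical
in-memory stand-in for the taxon/taxon_name tables, the taxon IDs are
placeholders rather than real NCBI IDs, and find_taxon is an illustrative
helper name. "Morus" is used because it is a genus name for both mulberries
(plants) and gannets (birds).

```python
from collections import defaultdict

# Hypothetical rows: (taxon_id, scientific name, kingdom).
# The IDs are placeholders, NOT real NCBI taxon IDs.
ROWS = [
    (1001, "Morus", "Viridiplantae"),  # mulberry genus
    (1002, "Morus", "Metazoa"),        # gannet genus
    (1003, "Homo", "Metazoa"),
]


def build_index(rows):
    """Map each scientific name to every (taxon_id, kingdom) that uses it."""
    index = defaultdict(list)
    for taxon_id, name, kingdom in rows:
        index[name].append((taxon_id, kingdom))
    return index


def find_taxon(index, name, kingdom=None):
    """Return the single matching taxon_id, or None if absent or ambiguous."""
    candidates = index.get(name, [])
    if kingdom is not None:
        candidates = [c for c in candidates if c[1] == kingdom]
    if len(candidates) == 1:
        return candidates[0][0]
    return None  # no match, or a homonym that the name alone cannot resolve


INDEX = build_index(ROWS)
```

Matching on the name alone refuses to pick a "Morus" entry, while adding
the kingdom resolves it. Returning None instead of the first hit is a
deliberate choice in this sketch, so a caller never silently links a
lineage to the wrong clade.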

Hilmar Lapp wrote:
>  If I remember correctly, the script makes (and hence expects) the
>  primary key and the NCBI taxonomy ID to be identical.
>  ...
>  Doing that isn't a big deal but I guess this could also be fixed in
>  load_ncbi_taxonomy.pl so that it doesn't need to rely on this
>  assumption. Would someone mind filing the bug report? (We have a
>  BioSQL category now on bugzilla.open-bio.org.)

I've filed Bug 2470 on this: http://bugzilla.open-bio.org/show_bug.cgi?id=2470
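
For illustration, here is a minimal sketch of a lookup that does not rely
on the primary key and the NCBI taxonomy ID being identical. It is not the
fix for load_ncbi_taxonomy.pl. It assumes a cut-down taxon table with only
the taxon_id and ncbi_taxon_id columns, uses an in-memory SQLite database
in place of a real BioSQL instance, and the function names are made up for
this example.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a simplified taxon table (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE taxon ("
        " taxon_id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " ncbi_taxon_id INTEGER UNIQUE)"
    )
    return conn


def get_or_create_taxon(conn, ncbi_taxon_id):
    """Return the internal taxon_id for an NCBI taxon ID, adding a row if needed."""
    row = conn.execute(
        "SELECT taxon_id FROM taxon WHERE ncbi_taxon_id = ?",
        (ncbi_taxon_id,),
    ).fetchone()
    if row is not None:
        return row[0]
    # The database assigns the primary key; it need not equal the NCBI ID.
    cursor = conn.execute(
        "INSERT INTO taxon (ncbi_taxon_id) VALUES (?)",
        (ncbi_taxon_id,),
    )
    return cursor.lastrowid
```

Looking rows up by ncbi_taxon_id means a loader can add a novel taxon
first and a later taxonomy update can still find and reuse that row,
whatever primary key it was given.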

Regards,

Peter