[BioSQL-l] Loading sequences with novel NCBI taxon id

Sun Mar 16 22:54:45 UTC 2008

On Mar 16, 2008, at 3:16 PM, Peter wrote:

> [...] I (Peter) wrote:
>>>> Do you think when faced with a novel taxon id, Biopython/BioPerl/...
>>>> could write some minimal taxonomy entry (without any guess work
>>>> based on the species name), in order to record the sequence's taxon
>
> Hilmar Lapp replied:
>>> This is what Bioperl-db does. There isn't any guesswork. If
>>> Bio::Species has lineage information it will also insert the lineage
>>> information, though.
>
> I am planning to fix Biopython so that once again, it will record the
> taxon id against new sequences if the species is already in the table,
> and add it to the taxonomy if it isn't there already.
>
> Should we also try and add the lineage into the taxon/taxon_name
> tables, linking to existing entries based on matching scientific names
> where possible?  Or, should we just add a single taxonomy entry for
> the new species, with no lineage links at all?

This should probably depend on how good or complete the lineage  
information is that you have. BioPerl parses this out of the sequence  
files (for formats that have it, such as GenBank, EMBL, UniProt), and  
so except for exotic clades that don't follow the typical patterns it  
is usually in good shape (though one might say that the majority of  
clades are exotic).

Moreover, it's worth noting that the NCBI taxonomy often contains  
more nodes in a lineage than are shown in the GenBank record. In this  
case, unless you know which levels (ranks) to print and which not to,  
having the full NCBI taxonomy information may in fact cause problems  
for round-tripping.

>
> The old Biopython code also used to add taxon table entries for the
> full lineage - trying to reuse existing entries based on string
> matching to the scientific name field in the taxon_name table.  This
> strikes me as a little unreliable (which is why I used the term "guess
> work" in my earlier email).

It's pretty unreliable actually. There is not only synonymy but also
rampant homonymy in taxonomic names. There are plenty of cases of the
same scientific name being in use for a plant and for some animal, for
example. So in order to be unambiguous you will need to know (and
check) the kingdom.
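
To illustrate the homonymy problem, here is a minimal Python sketch. It
is not the Biopython or Bioperl-db code. It assumes a simplified subset
of the BioSQL taxon and taxon_name tables in an in-memory SQLite
database, and the names ("Aus", "Bus"), the NCBI IDs, and the helper
functions are all hypothetical. It does not check the kingdom; it only
refuses to pick a taxon when a name matches more than one entry.

```python
import sqlite3

# Simplified subset of the BioSQL taxon / taxon_name tables (illustration only).
DEMO_SCHEMA = """
CREATE TABLE taxon (
    taxon_id      INTEGER PRIMARY KEY,
    ncbi_taxon_id INTEGER UNIQUE
);
CREATE TABLE taxon_name (
    taxon_id   INTEGER NOT NULL REFERENCES taxon(taxon_id),
    name       TEXT NOT NULL,
    name_class TEXT NOT NULL
);
"""


def make_demo_db():
    """Build an in-memory database with two homonymous taxa (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(DEMO_SCHEMA)
    # "Aus" stands in for a name used for both a plant and an animal.
    rows = [(1, 900001, "Aus"), (2, 900002, "Aus"), (3, 900003, "Bus")]
    for taxon_id, ncbi_id, name in rows:
        conn.execute(
            "INSERT INTO taxon (taxon_id, ncbi_taxon_id) VALUES (?, ?)",
            (taxon_id, ncbi_id),
        )
        conn.execute(
            "INSERT INTO taxon_name (taxon_id, name, name_class) "
            "VALUES (?, ?, 'scientific name')",
            (taxon_id, name),
        )
    return conn


def find_taxa_by_name(conn, name):
    """Return every (taxon_id, ncbi_taxon_id) whose scientific name matches."""
    cur = conn.execute(
        "SELECT t.taxon_id, t.ncbi_taxon_id FROM taxon t "
        "JOIN taxon_name n ON n.taxon_id = t.taxon_id "
        "WHERE n.name = ? AND n.name_class = 'scientific name' "
        "ORDER BY t.taxon_id",
        (name,),
    )
    return cur.fetchall()


def unambiguous_taxon(conn, name):
    """Return the taxon_id only if exactly one taxon carries this name."""
    rows = find_taxa_by_name(conn, name)
    return rows[0][0] if len(rows) == 1 else None
```

With the demo data, looking up "Aus" finds two taxa, so
unambiguous_taxon() returns None rather than linking to an arbitrary one
of them.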

> I am also concerned that this complicates the clean up operation  
> for load_ncbi_taxonomy.pl, but have not looked into this.

It shouldn't. The script makes no distinction between tip (species or
subspecies) nodes and internal nodes.

>
> Hilmar Lapp wrote:
>>> If I remember correctly, the script makes (and hence expects) the
>>> primary key and the NCBI taxonomy ID to be identical.
>
> Really? Perhaps I have misunderstood you.  That would cause problems
> if we want to record a new sequence entry with species information but
> no NCBI taxonomy ID (e.g. an in house sequencing project).  The
> Biopython code doesn't seem to assume the taxon table ID bears any
> resemblance to the NCBI taxonomy ID.  When creating new taxon
> table entries, we let the database assign the taxon table id
> (primary key).

Right, that's what I said Bioperl-db does too, and is the reason I  
had to regularly run that SQL script that would migrate the primary  
keys.

Doing that isn't a big deal but I guess this could also be fixed in  
load_ncbi_taxonomy.pl so that it doesn't need to rely on this  
assumption. Would someone mind filing the bug report? (We have a  
BioSQL category now on bugzilla.open-bio.org.)
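
For illustration, a minimal Python sketch of a loader that keeps the
primary key and the NCBI taxonomy ID separate. It is not the Bioperl-db
or Biopython implementation. It assumes a simplified subset of the
BioSQL taxon and taxon_name tables in SQLite; get_or_create_taxon() is
a hypothetical helper, and defaulting node_rank to 'species' is an
assumption. The entry is looked up by ncbi_taxon_id, and if it is
missing a single entry with no lineage links is added, with the
database assigning taxon_id.

```python
import sqlite3

# Simplified subset of the BioSQL taxon / taxon_name tables (illustration only).
MINIMAL_SCHEMA = """
CREATE TABLE taxon (
    taxon_id        INTEGER PRIMARY KEY AUTOINCREMENT,
    ncbi_taxon_id   INTEGER UNIQUE,
    parent_taxon_id INTEGER,
    node_rank       TEXT
);
CREATE TABLE taxon_name (
    taxon_id   INTEGER NOT NULL REFERENCES taxon(taxon_id),
    name       TEXT NOT NULL,
    name_class TEXT NOT NULL,
    UNIQUE (taxon_id, name, name_class)
);
"""


def get_or_create_taxon(conn, ncbi_taxon_id, scientific_name, node_rank="species"):
    """Return the taxon_id for an NCBI taxon id, adding a minimal entry if needed."""
    # Look up by the NCBI id column, never by assuming it equals the primary key.
    row = conn.execute(
        "SELECT taxon_id FROM taxon WHERE ncbi_taxon_id = ?", (ncbi_taxon_id,)
    ).fetchone()
    if row is not None:
        return row[0]
    # No lineage: parent_taxon_id stays NULL; the database assigns taxon_id.
    cur = conn.execute(
        "INSERT INTO taxon (ncbi_taxon_id, node_rank) VALUES (?, ?)",
        (ncbi_taxon_id, node_rank),
    )
    taxon_id = cur.lastrowid
    conn.execute(
        "INSERT INTO taxon_name (taxon_id, name, name_class) "
        "VALUES (?, ?, 'scientific name')",
        (taxon_id, scientific_name),
    )
    return taxon_id


conn = sqlite3.connect(":memory:")
conn.executescript(MINIMAL_SCHEMA)
first = get_or_create_taxon(conn, 9606, "Homo sapiens")
again = get_or_create_taxon(conn, 9606, "Homo sapiens")
```

Here the second call returns the same taxon_id as the first without
adding a row, and that taxon_id is whatever the database assigned, not
9606.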

Cheers,

	-hilmar

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================