[BioSQL-l] Loading sequences with novel NCBI taxon id

Peter biopython at maubp.freeserve.co.uk
Sun Mar 16 19:16:04 UTC 2008


On Fri, Mar 14, 2008 Mark Schreiber wrote:
> From memory BioJava will add it if it is not already in there. If the
> taxid can be found then the system connects you with whatever is in
> that taxid, it doesn't overwrite it.

BioPerl does this too, so there is consensus on this at least.  But see
below regarding the lineage.

>  This has two curious side effects. Because the details associated with
>  a taxid sometimes change (eg common name changes a lot) you can get
>  connected to an outdated version (if your record is newer than your
>  NCBI taxonomy) or you can get connected with a version that is newer
>  than your record which means when you round-trip you don't get
>  complete identity.

This is understandable, even if a little unexpected.

I (Peter) wrote:
>  > > Do you think when faced with a novel taxon id, Biopython/BioPerl/...
>  > > could write some minimal taxonomy entry (without any guess work based
>  > > on the species name), in order to record the sequence's taxon

Hilmar Lapp replied:
>  > This is what Bioperl-db does. There isn't any guesswork. If
>  > Bio::Species has lineage information it will also insert the lineage
>  > information, though.

I am planning to fix Biopython so that once again, it will record the
taxon id against new sequences if the species is already in the table,
and add it to the taxonomy if it isn't there already.
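
As a minimal sketch of this behaviour (not the actual Biopython code): the
example below uses sqlite3 and a cut-down version of the taxon/taxon_name
tables. The column subset, the helper name get_or_create_taxon and the
'species' rank are assumptions made for illustration only. An existing
entry is reused and never overwritten; a novel taxon id gets a single
entry with no lineage links.

```python
import sqlite3


def get_or_create_taxon(conn, ncbi_taxon_id, scientific_name):
    """Return the taxon table key for ncbi_taxon_id, adding a minimal entry if needed.

    Hypothetical helper for illustration; not part of Biopython.
    """
    row = conn.execute(
        "SELECT taxon_id FROM taxon WHERE ncbi_taxon_id = ?", (ncbi_taxon_id,)
    ).fetchone()
    if row is not None:
        # Already known: link to whatever is there, do not overwrite it.
        return row[0]
    # Novel taxon id: add one entry, leaving parent_taxon_id NULL (no lineage).
    cur = conn.execute(
        "INSERT INTO taxon (ncbi_taxon_id, node_rank) VALUES (?, 'species')",
        (ncbi_taxon_id,),
    )
    taxon_id = cur.lastrowid
    conn.execute(
        "INSERT INTO taxon_name (taxon_id, name, name_class) "
        "VALUES (?, ?, 'scientific name')",
        (taxon_id, scientific_name),
    )
    return taxon_id


# Simplified schema (assumed subset of the real tables).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE taxon (
        taxon_id INTEGER PRIMARY KEY,
        ncbi_taxon_id INTEGER UNIQUE,
        parent_taxon_id INTEGER,
        node_rank TEXT
    );
    CREATE TABLE taxon_name (
        taxon_id INTEGER NOT NULL REFERENCES taxon(taxon_id),
        name TEXT NOT NULL,
        name_class TEXT NOT NULL
    );
    """
)

first = get_or_create_taxon(conn, 9606, "Homo sapiens")   # inserts
second = get_or_create_taxon(conn, 9606, "Homo sapiens")  # reuses
```

Calling the helper twice with the same NCBI taxon id leaves exactly one
row in the taxon table, and that row has no parent.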

Should we also try and add the lineage into the taxon/taxon_name
tables, linking to existing entries based on matching scientific names
where possible?  Or, should we just add a single taxonomy entry for
the new species, with no lineage links at all?

The old Biopython code also used to add taxon table entries for the
full lineage - trying to reuse existing entries based on string
matching to the scientific name field in the taxon_name table.  This
strikes me as a little unreliable (which is why I used the term "guess
work" in my earlier email).  I am also concerned that this complicates
the clean up operation for load_ncbi_taxonomy.pl, but have not looked
into this.

Hilmar Lapp wrote:
>  > If I remember correctly, the script makes (and hence expects) the
>  > primary key and the NCBI taxonomy ID to be identical.

Really? Perhaps I have misunderstood you.  That would cause problems
if we want to record a new sequence entry with species information but
no NCBI taxonomy ID (e.g. an in house sequencing project).  The
Biopython code doesn't seem to assume the taxon table ID bears any
resemblance to the NCBI taxonomy ID.  When creating new taxon
table entries, we let the database assign the taxon table id
(primary key).
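
To illustrate the distinction, here is a small sketch using sqlite3 and a
cut-down taxon table (the column subset and the helper name add_taxon are
assumptions for illustration, not the real schema or Biopython code). The
database assigns the primary key, so it need not match the NCBI taxonomy
ID, and an entry with no NCBI taxonomy ID can still be stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified table: taxon_id is assigned by the database,
# ncbi_taxon_id is optional (NULL for in-house species).
conn.execute(
    """
    CREATE TABLE taxon (
        taxon_id INTEGER PRIMARY KEY,
        ncbi_taxon_id INTEGER UNIQUE,
        node_rank TEXT
    )
    """
)


def add_taxon(conn, ncbi_taxon_id=None):
    """Insert a taxon row and return the database-assigned primary key.

    Hypothetical helper for illustration only.
    """
    cur = conn.execute(
        "INSERT INTO taxon (ncbi_taxon_id, node_rank) VALUES (?, 'species')",
        (ncbi_taxon_id,),
    )
    return cur.lastrowid


human_pk = add_taxon(conn, 9606)   # NCBI taxon id for Homo sapiens
in_house_pk = add_taxon(conn)      # in-house species, no NCBI taxon id
```

If the loading script instead requires the primary key to equal the NCBI
taxonomy ID, the second row has no valid key to use.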

Peter


