[BioSQL-l] Loading sequences with novel NCBI taxon id

Fri Mar 14 09:48:38 EDT 2008

From memory, BioJava will add it if it is not already in there. If the
taxid can be found, the system connects you with whatever is already
stored under that taxid; it doesn't overwrite it.

This has two curious side effects. Because the details associated with
a taxid sometimes change (e.g. the common name changes a lot), you can
get connected to an outdated version (if your record is newer than your
NCBI taxonomy), or you can get connected to a version that is newer
than your record, which means that when you round-trip you don't get
back a completely identical record.
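
A minimal sketch of the look-up-or-insert behaviour described above, in
Python with an in-memory SQLite database. This is not BioJava's actual
code: the two tables are a cut-down stand-in for the BioSQL taxon and
taxon_name tables, the function names (make_db, get_or_create_taxon,
scientific_name) are hypothetical, and taxon id 999999 and the species
names are made up for illustration.

```python
import sqlite3


def make_db():
    """Create a cut-down, in-memory stand-in for taxon and taxon_name."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE taxon (
            taxon_id      INTEGER PRIMARY KEY,
            ncbi_taxon_id INTEGER UNIQUE
        );
        CREATE TABLE taxon_name (
            taxon_id   INTEGER NOT NULL REFERENCES taxon(taxon_id),
            name       TEXT NOT NULL,
            name_class TEXT NOT NULL
        );
    """)
    return conn


def get_or_create_taxon(conn, ncbi_taxon_id, scientific_name):
    """Return the taxon_id for an NCBI taxon id, inserting only if missing.

    An existing row is reused as-is; the name carried by the incoming
    record is ignored rather than written over the stored one.
    """
    row = conn.execute(
        "SELECT taxon_id FROM taxon WHERE ncbi_taxon_id = ?",
        (ncbi_taxon_id,),
    ).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute(
        "INSERT INTO taxon (ncbi_taxon_id) VALUES (?)", (ncbi_taxon_id,)
    )
    taxon_id = cur.lastrowid
    conn.execute(
        "INSERT INTO taxon_name (taxon_id, name, name_class) VALUES (?, ?, ?)",
        (taxon_id, scientific_name, "scientific name"),
    )
    return taxon_id


def scientific_name(conn, taxon_id):
    """Fetch the stored scientific name for a taxon."""
    return conn.execute(
        "SELECT name FROM taxon_name "
        "WHERE taxon_id = ? AND name_class = 'scientific name'",
        (taxon_id,),
    ).fetchone()[0]


if __name__ == "__main__":
    conn = make_db()
    first = get_or_create_taxon(conn, 999999, "Examplus oldus")
    # A later record carries a different name for the same taxid.
    second = get_or_create_taxon(conn, 999999, "Examplus novus")
    print(first == second, scientific_name(conn, first))
```

The second record is attached to the taxon stored by the first one, so
reading it back yields the stored name rather than the name it was
loaded with. That is the round-trip mismatch.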

For compatibility across the projects some kind of consensus would be good.

- Mark

On Fri, Mar 14, 2008 at 7:41 AM, Hilmar Lapp <hlapp at gmx.net> wrote:
>
>
> On Mar 13, 2008, at 7:13 PM, Peter wrote:
>
> > On Thu, Mar 13, 2008 at 10:51 PM, Hilmar Lapp <hlapp at gmx.net> wrote:
> >> [...]
>
> >>  The load_ncbi_taxonomy.pl script is designed to update the taxon
> >>  tables in a non-disruptive way, and if there weren't many changes it
> >>  shouldn't actually take that long (except that recalculating the
> >>  nested set values may take a couple of minutes).
> >
> > Do you think when faced with a novel taxon id, Biopython/BioPerl/...
> > could write some minimal taxonomy entry (without any guess work based
> > on the species name), in order to record the sequence's taxon
>
> This is what Bioperl-db does. There isn't any guesswork. If
> Bio::Species has lineage information it will also insert the lineage
> information, though.
>
>
> > - and then running an improved load_ncbi_taxonomy.pl at a later date
> > would sort out the proper taxonomy?
>
> If I remember correctly, the script makes (and hence expects) the
> primary key and the NCBI taxonomy ID to be identical. If your loading
> procedure can achieve that already then load_ncbi_taxonomy.pl should
> pick them up and fix them. You can try that by loading the taxonomy
> through the script, then arbitrarily choose a taxon, create a stub
> bioentry for it and set its taxon_id foreign key to the chosen
> taxon, change its taxon_name.name to some bogus value (for the
> 'scientific name' class, for example) (and feel free to change the
> left_id and right_id values in taxon too), and rerun the script. It
> should fix the change you made, and your bioentry should still point
> to the same taxon (because its primary key did not change, and did
> not get deleted either; otherwise the bioentry would now have a null
> value in the foreign key).
>
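
The check described above can be sketched roughly as follows, in Python
against an in-memory SQLite database. This does not run
load_ncbi_taxonomy.pl: refresh_taxonomy is a hypothetical stand-in that
assumes, as stated above, that the primary key equals the NCBI taxonomy
ID and that rows are updated in place rather than deleted and
re-inserted. The tables are a cut-down imitation of the BioSQL ones, and
taxon id 999999, the names and the accession are made up.

```python
import sqlite3

SCI = "scientific name"


def make_db():
    """Create a cut-down, in-memory stand-in for three BioSQL tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE taxon (
            taxon_id      INTEGER PRIMARY KEY,
            ncbi_taxon_id INTEGER UNIQUE
        );
        CREATE TABLE taxon_name (
            taxon_id   INTEGER NOT NULL REFERENCES taxon(taxon_id),
            name       TEXT NOT NULL,
            name_class TEXT NOT NULL
        );
        CREATE TABLE bioentry (
            bioentry_id INTEGER PRIMARY KEY,
            accession   TEXT NOT NULL,
            taxon_id    INTEGER REFERENCES taxon(taxon_id)
        );
    """)
    return conn


def refresh_taxonomy(conn, dump):
    """Hypothetical stand-in for a non-disruptive taxonomy reload.

    dump maps NCBI taxon id -> scientific name. Rows are inserted with
    taxon_id equal to the NCBI id if missing, and updated in place
    otherwise, so existing primary keys never change.
    """
    for ncbi_id, name in dump.items():
        conn.execute(
            "INSERT OR IGNORE INTO taxon (taxon_id, ncbi_taxon_id) "
            "VALUES (?, ?)",
            (ncbi_id, ncbi_id),
        )
        cur = conn.execute(
            "UPDATE taxon_name SET name = ? "
            "WHERE taxon_id = ? AND name_class = ?",
            (name, ncbi_id, SCI),
        )
        if cur.rowcount == 0:
            conn.execute(
                "INSERT INTO taxon_name (taxon_id, name, name_class) "
                "VALUES (?, ?, ?)",
                (ncbi_id, name, SCI),
            )


def run_check():
    """Load, attach a stub bioentry, corrupt the name, reload, report."""
    conn = make_db()
    dump = {999999: "Examplus correctus"}
    refresh_taxonomy(conn, dump)
    conn.execute(
        "INSERT INTO bioentry (bioentry_id, accession, taxon_id) "
        "VALUES (1, 'STUB0001', 999999)"
    )
    conn.execute(
        "UPDATE taxon_name SET name = 'bogus' "
        "WHERE taxon_id = 999999 AND name_class = ?",
        (SCI,),
    )
    refresh_taxonomy(conn, dump)
    name = conn.execute(
        "SELECT name FROM taxon_name WHERE taxon_id = 999999 "
        "AND name_class = ?",
        (SCI,),
    ).fetchone()[0]
    entry_taxon = conn.execute(
        "SELECT taxon_id FROM bioentry WHERE bioentry_id = 1"
    ).fetchone()[0]
    return name, entry_taxon


if __name__ == "__main__":
    print(run_check())
```

If the reload behaves as assumed, the bogus name is replaced and the
stub bioentry still points to the same taxon.
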
> The Bioperl-db way of storing things does not give control over
> primary key assignment to Bioperl-db, so the database will assign it.
>
> > [...]
>
> >>  For the SymAtlas project we had this situation (new species in
> >>  sequence updates that the last NCBI taxonomy update hadn't yet
> >>  brought in) quite regularly. I wrote a SQL script that would fix those
> >>  'haphazard' additions such that load_ncbi_taxonomy would update them
> >>  to their correct values come the next NCBI taxonomy update. I can
> >>  send you the script (it would be for the Oracle version), but I'm not
> >>  sure this is a widely viable strategy.
> >
> > So this wasn't integrated with load_ncbi_taxonomy.pl at all?
>
> No, but now that you say it I don't see any reason why I couldn't.
> Maybe that's just what I should do.
>
>        -hilmar
>
> --
> ===========================================================
> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> ===========================================================