[Bioperl-l] [BioSQL-l] Loading sequences with novel NCBI taxon id

Mark Schreiber markjschreiber at gmail.com
Sat Mar 15 00:56:37 UTC 2008


I agree. A regular update would be best.

Of course, if your BioSQL db is limited to one or a few organisms, you can
just keep a fragment of the db.

- Mark

On Fri, Mar 14, 2008 at 10:31 PM, Chris Fields <cjfields at uiuc.edu> wrote:

> The counter to that perspective (using new sequences with old tax
> info) would be to regularly update NCBI taxonomy, particularly in
> circumstances prior to adding new sequences.  Hilmar mentioned that
> once tax is loaded it doesn't take as long to update, so you could set
> up a cron job to update regularly.
>
> I remember someone mentioning weekly or monthly updates on the list
> quite a while ago, but I'm unsure how often NCBI updates tax
> information (i.e. with every release, monthly, weekly, etc).  I can
> see instances popping up where you used an up-to-date taxonomy but
> a new sequence contains a tax ID not present.  I think bioperl-db
> handles these but I'm not sure what the other Bio* projects do.
>
> chris
>
> On Mar 14, 2008, at 8:48 AM, Mark Schreiber wrote:
>
> > From memory BioJava will add it if it is not already in there. If the
> > taxid can be found then the system connects you with whatever is in
> > that taxid, it doesn't overwrite it.
> >
> > This has two curious side effects. Because the details associated with
> > a taxid sometimes change (e.g. the common name changes a lot) you can get
> > connected to an outdated version (if your record is newer than your
> > NCBI taxonomy) or you can get connected with a version that is newer
> > than your record which means when you round-trip you don't get
> > complete identity.
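To make the behaviour described above concrete, here is a minimal Python sketch of a "connect if present, add if absent, never overwrite" lookup. It is not BioJava code: get_or_create_taxon, the dict-based store and the example common names are all assumptions made up for illustration.

```python
def get_or_create_taxon(store, ncbi_taxon_id, record_names):
    """Return the stored names for ncbi_taxon_id.

    If the ID is unknown, the names from the record are stored first.
    If the ID is already present, the stored names win and are not overwritten.
    """
    if ncbi_taxon_id not in store:
        store[ncbi_taxon_id] = dict(record_names)
    return store[ncbi_taxon_id]


# Hypothetical local taxonomy, loaded some time ago.
taxonomy = {9606: {"scientific name": "Homo sapiens", "common name": "man"}}

# A newer record for a known taxon: it is connected to the stored (older) names.
newer_record = {"scientific name": "Homo sapiens", "common name": "human"}
known = get_or_create_taxon(taxonomy, 9606, newer_record)
round_trip_identical = known == newer_record  # the common names differ

# A record with a novel taxon ID: its own names are added as-is.
novel_record = {"scientific name": "Mus musculus"}
novel = get_or_create_taxon(taxonomy, 10090, novel_record)
```

The same mismatch shows up in the other direction if the stored entry is newer than the record, which is why the round trip may not give complete identity.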
> >
> > For compatibility across the projects some kind of consensus would
> > be good.
> >
> > - Mark
> > On Fri, Mar 14, 2008 at 7:41 AM, Hilmar Lapp <hlapp at gmx.net> wrote:
> >>
> >>
> >> On Mar 13, 2008, at 7:13 PM, Peter wrote:
> >>
> >>> On Thu, Mar 13, 2008 at 10:51 PM, Hilmar Lapp <hlapp at gmx.net> wrote:
> >>>> [...]
> >>
> >>>> The load_ncbi_taxonomy.pl script is designed to update the taxon
> >>>> tables in a non-disruptive way, and if there weren't many changes
> >>>> shouldn't actually take that long (except that recalculating the
> >>>> nested set values may take a couple of minutes).
> >>>
> >>> Do you think when faced with a novel taxon id, Biopython/BioPerl/...
> >>> could write some minimal taxonomy entry (without any guesswork
> >>> based on the species name), in order to record the sequence's taxon
> >>
> >> This is what Bioperl-db does. There isn't any guesswork. If
> >> Bio::Species has lineage information it will also insert the lineage
> >> information, though.
> >>
> >>
> >>> - and then running an improved load_ncbi_taxonomy.pl at a later
> >>> date would sort out the proper taxonomy?
> >>
> >> If I remember correctly, the script makes (and hence expects) the
> >> primary key and the NCBI taxonomy ID to be identical. If your loading
> >> procedure can achieve that already then load_ncbi_taxonomy.pl should
> >> pick them up and fix them. You can test that as follows: load the
> >> taxonomy through the script, arbitrarily choose a taxon, create a
> >> stub bioentry for it with its taxon_id foreign key set to the chosen
> >> taxon, change the taxon's taxon_name.name to some bogus value (for
> >> the 'scientific name' class, for example; feel free to change the
> >> left_id and right_id values in taxon too), and rerun the script. It
> >> should fix the change you made, and your bioentry should still point
> >> to the same taxon (because its primary key did not change, and it
> >> did not get deleted either; otherwise the bioentry would now have a
> >> null value in the foreign key).
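The check described above can be rehearsed on a toy database before trying it on a real one. The following is only a sketch under stated assumptions: it uses an in-memory SQLite database with a heavily simplified subset of the taxon, taxon_name and bioentry tables, and load_taxonomy is a hypothetical stand-in for load_ncbi_taxonomy.pl that updates rows in place with the primary key equal to the NCBI taxonomy ID. It does not reproduce what the real script does internally.

```python
import sqlite3

# Simplified stand-in for the relevant tables (assumption: not the full schema).
SCHEMA = """
CREATE TABLE taxon (
    taxon_id      INTEGER PRIMARY KEY,
    ncbi_taxon_id INTEGER UNIQUE
);
CREATE TABLE taxon_name (
    taxon_id   INTEGER NOT NULL REFERENCES taxon(taxon_id),
    name       TEXT NOT NULL,
    name_class TEXT NOT NULL
);
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    taxon_id    INTEGER REFERENCES taxon(taxon_id)
);
"""


def load_taxonomy(conn, records):
    """Hypothetical loader: insert or update in place, primary key = NCBI taxon ID."""
    for ncbi_id, sci_name in records:
        # Existing taxon rows are left alone, so their primary keys never change.
        conn.execute(
            "INSERT OR IGNORE INTO taxon (taxon_id, ncbi_taxon_id) VALUES (?, ?)",
            (ncbi_id, ncbi_id),
        )
        # Overwrite the scientific name if present, otherwise add it.
        cur = conn.execute(
            "UPDATE taxon_name SET name = ? "
            "WHERE taxon_id = ? AND name_class = 'scientific name'",
            (sci_name, ncbi_id),
        )
        if cur.rowcount == 0:
            conn.execute(
                "INSERT INTO taxon_name (taxon_id, name, name_class) "
                "VALUES (?, ?, 'scientific name')",
                (ncbi_id, sci_name),
            )


def run_check():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    dump = [(9606, "Homo sapiens"), (10090, "Mus musculus")]

    load_taxonomy(conn, dump)  # 1. load the taxonomy
    conn.execute(  # 2. stub bioentry pointing at a chosen taxon
        "INSERT INTO bioentry (name, taxon_id) VALUES ('stub', 9606)"
    )
    conn.execute(  # 3. set the name to a bogus value
        "UPDATE taxon_name SET name = 'bogus' WHERE taxon_id = 9606"
    )
    load_taxonomy(conn, dump)  # 4. rerun the loader

    name = conn.execute(
        "SELECT name FROM taxon_name WHERE taxon_id = 9606"
    ).fetchone()[0]
    entry_taxon = conn.execute(
        "SELECT taxon_id FROM bioentry WHERE name = 'stub'"
    ).fetchone()[0]
    return name, entry_taxon
```

If the loader behaves as hoped, run_check() returns the restored scientific name together with the unchanged taxon_id of the stub bioentry; a None in the second position would mean the taxon row had been deleted and the foreign key nulled.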
> >>
> >> The Bioperl-db way of storing things does not give control over
> >> primary key assignment to Bioperl-db, so the database will assign it.
> >>
> >>> [...]
> >>
> >>>> For the SymAtlas project we had this situation (new species in
> >>>> sequence updates that the last NCBI taxonomy update hadn't yet
> >>>> brought in) quite regularly. I wrote a SQL script would fix those
> >>>> 'haphazard' additions such that load_ncbi_taxonomy would update
> >>>> them
> >>>> to their correct values come the next NCBI taxonomy update. I can
> >>>> send you the script (it would be for the Oracle version), but I'm
> >>>> not
> >>>> sure this is a widely viable strategy.
> >>>
> >>> So this wasn't integrated with load_ncbi_taxonomy.pl at all?
> >>
> >> No, but now that you say it I don't see any reason why I couldn't.
> >> Maybe that's just what I should do.
> >>
> >>       -hilmar
> >>
> >> --
> >> ===========================================================
> >> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> >> ===========================================================
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Robert Switzer
> Dept of Biochemistry
> University of Illinois Urbana-Champaign
>


