[Bioperl-l] [BioSQL-l] Loading sequences with novel NCBI taxon id

Fri Mar 14 14:31:09 UTC 2008

The counter to that perspective (using new sequences with old tax
info) would be to update the NCBI taxonomy regularly, particularly
just before adding new sequences.  Hilmar mentioned that once the
taxonomy is loaded an update doesn't take as long, so you could set
up a cron job to update regularly.

I remember someone mentioning weekly or monthly updates on the list
quite a while ago, but I'm unsure how often NCBI updates tax
information (i.e. with every release, monthly, weekly, etc).  I can
see instances popping up where you used an up-to-date taxonomy but
a new sequence contains a tax ID not present in it.  I think bioperl-db
handles these but I'm not sure what the other Bio* projects do.
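
For what it's worth, here is a rough sketch of the "add a stub if the
taxid is missing, otherwise use what is there" behaviour discussed
below, in Python with sqlite3.  The two tables are a cut-down stand-in
for the BioSQL taxon and taxon_name tables, and get_or_create_taxon is
a made-up helper name, not anything taken from bioperl-db or BioJava.

```python
import sqlite3

# Simplified stand-in for the BioSQL taxon tables (assumption: the real
# schema has more columns and constraints than shown here).
SCHEMA = """
CREATE TABLE taxon (
    taxon_id      INTEGER PRIMARY KEY,
    ncbi_taxon_id INTEGER UNIQUE
);
CREATE TABLE taxon_name (
    taxon_id   INTEGER NOT NULL REFERENCES taxon(taxon_id),
    name       TEXT NOT NULL,
    name_class TEXT NOT NULL,
    UNIQUE (taxon_id, name, name_class)
);
"""


def get_or_create_taxon(conn, ncbi_taxon_id, scientific_name=None):
    """Return the primary key for ncbi_taxon_id, adding a stub row if needed.

    Hypothetical helper: an existing row is returned untouched, so
    nothing already stored for that taxid is overwritten.
    """
    row = conn.execute(
        "SELECT taxon_id FROM taxon WHERE ncbi_taxon_id = ?", (ncbi_taxon_id,)
    ).fetchone()
    if row is not None:
        return row[0]
    # Novel taxid: insert a minimal entry and let the database assign the key.
    cur = conn.execute(
        "INSERT INTO taxon (ncbi_taxon_id) VALUES (?)", (ncbi_taxon_id,)
    )
    taxon_pk = cur.lastrowid
    if scientific_name is not None:
        conn.execute(
            "INSERT INTO taxon_name (taxon_id, name, name_class) VALUES (?, ?, ?)",
            (taxon_pk, scientific_name, "scientific name"),
        )
    return taxon_pk


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
first = get_or_create_taxon(conn, 9606, "Homo sapiens")
# Same taxid again, but the record carries a different name.
second = get_or_create_taxon(conn, 9606, "some other name")
```

Both calls return the same key and the stored name stays as it was
first loaded, which is exactly the round-trip mismatch Mark describes
when the record and the loaded taxonomy are of different ages.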

chris

On Mar 14, 2008, at 8:48 AM, Mark Schreiber wrote:

> From memory BioJava will add it if it is not already in there. If the
> taxid can be found then the system connects you with whatever is in
> that taxid, it doesn't overwrite it.
>
> This has two curious side effects. Because the details associated with
> a taxid sometimes change (eg common name changes a lot) you can get
> connected to an outdated version (if your record is newer than your
> NCBI taxonomy) or you can get connected with a version that is newer
> than your record which means when you round-trip you don't get
> complete identity.
>
> For compatibility across the projects some kind of consensus would be
> good.
>
> - Mark
> On Fri, Mar 14, 2008 at 7:41 AM, Hilmar Lapp <hlapp at gmx.net> wrote:
>>
>>
>> On Mar 13, 2008, at 7:13 PM, Peter wrote:
>>
>>> On Thu, Mar 13, 2008 at 10:51 PM, Hilmar Lapp <hlapp at gmx.net> wrote:
>>>> [...]
>>
>>>> The load_ncbi_taxonomy.pl script is designed to update the taxon
>>>> tables in a non-disruptive way, and if there weren't many changes
>>>> shouldn't actually take that long (except that recalculating the
>>>> nested set values may take a couple of minutes).
>>>
>>> Do you think when faced with a novel taxon id, Biopython/BioPerl/...
>>> could write some minimal taxonomy entry (without any guesswork based
>>> on the species name), in order to record the sequence's taxon
>>
>> This is what Bioperl-db does. There isn't any guesswork. If
>> Bio::Species has lineage information it will also insert the lineage
>> information, though.
>>
>>
>>> - and then running an improved load_ncbi_taxonomy.pl at a later date
>>> would sort out the proper taxonomy?
>>
>> If I remember correctly, the script makes (and hence expects) the
>> primary key and the NCBI taxonomy ID to be identical. If your loading
>> procedure can achieve that already then load_ncbi_taxonomy.pl should
>> pick them up and fix them. You can try that by loading the taxonomy
>> through the script, then arbitrarily choose a taxon, create a stub
>> bioentry for it and set its taxon_id foreign key to the chosen
>> taxon,  change its taxon_name.name to some bogus value (for the
>> 'scientific name' class, for example) (and feel free to change the
>> left_id and right_id values in taxon too), and rerun the script. It
>> should fix the change you made, and your bioentry should still point
>> to the same taxon (because its primary key did not change, and did
>> not get deleted either; otherwise the bioentry would now have a null
>> value in the foreign key).
>>
>> The Bioperl-db way of storing things does not give control over
>> primary key assignment to Bioperl-db, so the database will assign it.
>>
>>> [...]
>>
>>>> For the SymAtlas project we had this situation (new species in
>>>> sequence updates that the last NCBI taxonomy update hadn't yet
>>>> brought in) quite regularly. I wrote a SQL script that would fix
>>>> those 'haphazard' additions such that load_ncbi_taxonomy would
>>>> update them to their correct values come the next NCBI taxonomy
>>>> update. I can send you the script (it would be for the Oracle
>>>> version), but I'm not sure this is a widely viable strategy.
>>>
>>> So this wasn't integrated with load_ncbi_taxonomy.pl at all?
>>
>> No, but now that you say it I don't see any reason why I couldn't.
>> Maybe that's just what I should do.
>>
>>       -hilmar
>>
>> --
>> ===========================================================
>> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
>> ===========================================================
>>

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign