[BioSQL-l] Loading sequences with novel NCBI taxon id

Thu Mar 13 23:41:43 UTC 2008

On Mar 13, 2008, at 7:13 PM, Peter wrote:

> On Thu, Mar 13, 2008 at 10:51 PM, Hilmar Lapp <hlapp at gmx.net> wrote:
>> [...]
>>  The load_ncbi_taxonomy.pl script is designed to update the taxon
>>  tables in a non-disruptive way, and if there weren't many changes
>>  shouldn't actually take that long (except that recalculating the
>>  nested set values may take a couple of minutes).
>
> Do you think when faced with a novel taxon id, Biopython/BioPerl/...
> could write some minimal taxonomy entry (without any guess work based
> on the species name), in order to record the sequence's taxon

This is what Bioperl-db does. There isn't any guesswork. If  
Bio::Species has lineage information it will also insert the lineage  
information, though.

> - and then running an improved load_ncbi_taxonomy.pl at a later date
> would sort out the proper taxonomy?

If I remember correctly, the script makes (and hence expects) the
primary key and the NCBI taxonomy ID to be identical. If your loading
procedure already achieves that, then load_ncbi_taxonomy.pl should
pick up those entries and fix them. You can test this as follows: load
the taxonomy through the script, arbitrarily choose a taxon, create a
stub bioentry for it and set its taxon_id foreign key to the chosen
taxon, change the taxon's taxon_name.name to some bogus value (for the
'scientific name' class, for example), feel free to change the left_id
and right_id values in taxon too, and rerun the script. It should undo
the changes you made, and your bioentry should still point to the same
taxon (because the taxon's primary key did not change and the row was
not deleted either; otherwise the bioentry would now have a null value
in the foreign key).
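
The experiment described above can be sketched roughly as follows. This
is only an illustration using Python's sqlite3 module and a cut-down,
hypothetical set of tables (the nested-set columns are named left_value
and right_value here); it is not the real BioSQL schema, and
refresh_taxon() is an invented stand-in, not what
load_ncbi_taxonomy.pl actually executes. It only shows why an update
keyed on a primary key equal to the NCBI taxon ID leaves the bioentry's
foreign key intact.

```python
import sqlite3


def build_demo_db():
    """Create a tiny, hypothetical subset of taxon-related tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE taxon (
            taxon_id      INTEGER PRIMARY KEY,
            ncbi_taxon_id INTEGER UNIQUE,
            left_value    INTEGER,
            right_value   INTEGER
        );
        CREATE TABLE taxon_name (
            taxon_id   INTEGER NOT NULL
                       REFERENCES taxon(taxon_id) ON DELETE CASCADE,
            name       TEXT NOT NULL,
            name_class TEXT NOT NULL
        );
        CREATE TABLE bioentry (
            bioentry_id INTEGER PRIMARY KEY,
            taxon_id    INTEGER
                        REFERENCES taxon(taxon_id) ON DELETE SET NULL,
            name        TEXT NOT NULL
        );
    """)
    return conn


def refresh_taxon(conn, ncbi_id, sci_name, left, right):
    """Invented stand-in for a taxonomy refresh.

    Assumes primary key == NCBI taxon ID, and updates the row in place
    instead of deleting and re-inserting it.
    """
    cur = conn.execute(
        "UPDATE taxon SET ncbi_taxon_id = ?, left_value = ?, "
        "right_value = ? WHERE taxon_id = ?",
        (ncbi_id, left, right, ncbi_id),
    )
    if cur.rowcount == 0:  # taxon not present yet: insert it
        conn.execute(
            "INSERT INTO taxon (taxon_id, ncbi_taxon_id, left_value, "
            "right_value) VALUES (?, ?, ?, ?)",
            (ncbi_id, ncbi_id, left, right),
        )
    # Replace the scientific name with the authoritative one.
    conn.execute(
        "DELETE FROM taxon_name WHERE taxon_id = ? "
        "AND name_class = 'scientific name'",
        (ncbi_id,),
    )
    conn.execute(
        "INSERT INTO taxon_name (taxon_id, name, name_class) "
        "VALUES (?, ?, 'scientific name')",
        (ncbi_id, sci_name),
    )


def run_experiment():
    conn = build_demo_db()
    # 1. Initial taxonomy load.
    refresh_taxon(conn, 9606, "Homo sapiens", 1, 2)
    # 2. Stub bioentry pointing at the chosen taxon.
    conn.execute(
        "INSERT INTO bioentry (bioentry_id, taxon_id, name) "
        "VALUES (1, 9606, 'stub')"
    )
    # 3. Tamper with the name and the nested-set values.
    conn.execute("UPDATE taxon_name SET name = 'bogus' WHERE taxon_id = 9606")
    conn.execute(
        "UPDATE taxon SET left_value = NULL, right_value = NULL "
        "WHERE taxon_id = 9606"
    )
    # 4. Rerun the refresh and inspect what the bioentry points to.
    refresh_taxon(conn, 9606, "Homo sapiens", 1, 2)
    return conn.execute(
        "SELECT b.taxon_id, n.name, t.left_value, t.right_value "
        "FROM bioentry b "
        "JOIN taxon t ON t.taxon_id = b.taxon_id "
        "JOIN taxon_name n ON n.taxon_id = t.taxon_id "
        "AND n.name_class = 'scientific name' "
        "WHERE b.bioentry_id = 1"
    ).fetchone()
```

Calling run_experiment() returns the bioentry's taxon_id together with
the restored name and nested-set values. Had the refresh deleted and
re-created the taxon row instead, the ON DELETE SET NULL rule in this
sketch would have left the bioentry with a null taxon_id.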

Note that the way Bioperl-db stores things does not give it control
over primary key assignment, so the database will assign the key.

> [...]
>>  For the SymAtlas project we had this situation (new species in
>>  sequence updates that the last NCBI taxonomy update hadn't yet
>>  brought in) quite regularly. I wrote a SQL script that would fix those
>>  'haphazard' additions such that load_ncbi_taxonomy would update them
>>  to their correct values come the next NCBI taxonomy update. I can
>>  send you the script (it would be for the Oracle version), but I'm not
>>  sure this is a widely viable strategy.
>
> So this wasn't integrated with load_ncbi_taxonomy.pl at all?

No, but now that you say it I don't see any reason why I couldn't.  
Maybe that's just what I should do.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================