[Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage

Tue Sep 30 14:32:45 UTC 2008

http://bugzilla.open-bio.org/show_bug.cgi?id=2475

------- Comment #35 from biopython-bugzilla at maubp.freeserve.co.uk  2008-09-30 10:32 EST -------
BioSQL/BioSeqDatabase.py revision 1.18 and BioSQL/Loader.py revision 1.35 in
CVS include what I think is a working version of the BioSQL loader which can
fetch taxonomy from the NCBI via Bio.Entrez.  This is based in part on Eric's
code but includes several additional features (e.g. recording the genetic code
which the NCBI provides with the taxonomy data).
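
For anyone wanting to see what the NCBI returns, here is a rough sketch of
pulling the genetic code and lineage out of a taxonomy record fetched with
Bio.Entrez. This is not the Loader.py code itself. The helper names
(summarize_taxon, fetch_taxon) are made up for illustration, and the key names
(TaxId, ScientificName, Rank, LineageEx, GeneticCode/GCId,
MitoGeneticCode/MGCId) are what I understand the parsed taxonomy XML to
contain.

```python
def summarize_taxon(record):
    """Extract the fields of interest from one parsed taxonomy record.

    `record` is assumed to be a dict-like object of the kind returned by
    Bio.Entrez.read() for db="taxonomy" (hypothetical helper, not Loader.py).
    """
    genetic_code = record.get("GeneticCode", {}).get("GCId")
    mito_code = record.get("MitoGeneticCode", {}).get("MGCId")
    return {
        "ncbi_taxon_id": int(record["TaxId"]),
        "scientific_name": record["ScientificName"],
        "rank": record.get("Rank"),
        # The genetic codes are only present for the taxon that was fetched,
        # not for the ancestors listed in LineageEx.
        "genetic_code": int(genetic_code) if genetic_code is not None else None,
        "mito_genetic_code": int(mito_code) if mito_code is not None else None,
        # Root-first list of (NCBI taxon ID, name, rank) for each ancestor.
        "lineage": [
            (int(t["TaxId"]), t["ScientificName"], t.get("Rank"))
            for t in record.get("LineageEx", [])
        ],
    }


def fetch_taxon(ncbi_taxon_id):
    """Fetch one taxonomy record from the NCBI (needs network access)."""
    # Imported here so summarize_taxon() can be used without Biopython.
    from Bio import Entrez

    # Entrez.email should be set by the caller before making requests.
    handle = Entrez.efetch(db="taxonomy", id=str(ncbi_taxon_id), retmode="xml")
    try:
        records = Entrez.read(handle)
    finally:
        handle.close()
    return summarize_taxon(records[0])
```

Calling fetch_taxon(9606) would make a single efetch request for that taxon.
The ancestors come back in the lineage list with their names and ranks only.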

When NCBI fetching is disabled but an NCBI taxon ID is known, only a minimal
taxon entry is stored (without the lineage).  This entry can then be
completed by running the BioSQL load_ncbi_taxonomy.pl script.

There is still scope for improvement, e.g.

* _get_taxon_id_from_ncbi_lineage doesn't really need to be recursive.

* When there is no NCBI taxon ID present in the SeqRecord, this code will not
attempt to look up the taxonomy from the species name.  I'm not sure whether
doing this search is a good idea or not...

* We could make an Entrez.efetch call for each row added to the table (rather
than just one call per lineage, as at present), which should allow us to fetch
the genetic code for every entry.  On balance I think this is not needed, since
the missing genetic codes can be populated by the BioSQL load_ncbi_taxonomy.pl
script anyway.
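
On the first point above, here is a minimal sketch of how the lineage could be
walked with a plain loop rather than recursion, reusing any taxon entries that
already exist. It is not the code in Loader.py. The function name
get_or_create_lineage is made up, an in-memory dict keyed on NCBI taxon ID
stands in for the taxon table, and the lineage is assumed to be given root
first.

```python
def get_or_create_lineage(taxon_table, lineage):
    """Walk a root-first lineage iteratively, reusing existing entries.

    taxon_table: dict mapping NCBI taxon ID -> row dict (stand-in for the
                 real taxon table; hypothetical structure).
    lineage:     list of (ncbi_taxon_id, name, rank) tuples, root first.

    Returns the internal taxon_id of the last entry in the lineage, or
    None if the lineage is empty.
    """
    parent_id = None
    for ncbi_id, name, rank in lineage:
        row = taxon_table.get(ncbi_id)
        if row is None:
            # Not seen before: add a new row linked to the previous level.
            row = {
                "taxon_id": len(taxon_table) + 1,
                "parent_taxon_id": parent_id,
                "name": name,
                "rank": rank,
            }
            taxon_table[ncbi_id] = row
        # Existing or new, this row is the parent of the next level down.
        parent_id = row["taxon_id"]
    return parent_id


table = {}
human = [
    (131567, "cellular organisms", "no rank"),
    (2759, "Eukaryota", "superkingdom"),
    (9606, "Homo sapiens", "species"),
]
mouse = human[:2] + [(10090, "Mus musculus", "species")]

human_id = get_or_create_lineage(table, human)
mouse_id = get_or_create_lineage(table, mouse)
```

Loading the second lineage adds only one new row here, because the two shared
ancestors are found in the table and reused.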

This has passed the unit tests and my own initial testing, and I intend to use
this code a lot more this week/next week.  However, it would be great to have
some additional testing of this as is.
