[BioSQL-l] left_value and right_value in taxon table

Peter biopython at maubp.freeserve.co.uk
Wed Apr 9 10:02:27 UTC 2008


On Wed, Apr 9, 2008 at 12:57 AM, Hilmar Lapp <hlapp at gmx.net> wrote:
>
>  The [load_ncbi_taxonomy.pl] script does recompute *all*
>  nested set values, though...
>  Though you can have very cheap cases indeed, in reality it
>  turns out that on average you still need to traverse and update
>  at least half of the nodes, so personally I really doubt you
>  would save any significant amount of time by not just redoing
>  all of them. And it's not that time-intensive either;  typically it
>  takes about 10-20mins, depending on CPU etc.

This does mean that, in general, fully updating the taxon table when
adding a new sequence with a novel NCBI taxon id would take at least
10 minutes (on top of the drawback of having each Bio* project
reimplement much of the load_ncbi_taxonomy.pl script's logic).
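
To make the cost concrete, here is a rough Python sketch of what a full
recompute of nested set values involves: one depth-first walk over every
node, handing out a counter value on the way down (left) and on the way
back up (right). This is not the code from load_ncbi_taxonomy.pl; the
function name and the in-memory parent map are assumptions made purely
for illustration.

```python
from collections import defaultdict


def recompute_nested_sets(parent_of):
    """Return {node: (left, right)} for a forest given as node -> parent.

    parent_of maps each node id to its parent id, or None for a root.
    (Hypothetical helper; a real loader would read and write the taxon table.)
    """
    children = defaultdict(list)
    roots = []
    for node, parent in parent_of.items():
        if parent is None:
            roots.append(node)
        else:
            children[parent].append(node)

    left = {}
    values = {}
    counter = 0
    # Explicit stack instead of recursion, since taxonomies are large.
    # Each entry is (node, leaving), where leaving marks the way back up.
    stack = [(root, False) for root in sorted(roots, reverse=True)]
    while stack:
        node, leaving = stack.pop()
        counter += 1
        if leaving:
            values[node] = (left[node], counter)
        else:
            left[node] = counter
            stack.append((node, True))
            for child in sorted(children[node], reverse=True):
                stack.append((child, False))
    return values


# Tiny example tree: 1 is the root, 2 and 3 are its children, 4 is under 2.
example = recompute_nested_sets({1: None, 2: 1, 3: 1, 4: 2})
```

Every node is visited twice, so the work grows with the size of the whole
tree rather than with the size of the change.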

This probably helps explain why, when the NCBI taxon id wasn't already
defined, the old Biopython code would create new taxon table entries
for the entire lineage (based on the species lineage names in a
GenBank file) without linking to any existing taxon table entries
that might have matched.  Because these new entries were independent
of everything else, their left/right values could be calculated
trivially (starting above the largest existing left/right value).
This had the advantage of recording as much information as possible
(without having to use load_ncbi_taxonomy.pl at all), but left the
taxon table full of redundant entries.
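
The "trivial" calculation for such a standalone lineage can be sketched
as follows. This is an illustration only, not the old Biopython code:
the function name is hypothetical, and it assumes the lineage is a
simple chain from the top-level name down to the species.

```python
def chain_nested_set_values(lineage, max_existing=0):
    """Assign (name, left, right) to a standalone chain of taxa.

    lineage is ordered from the top of the lineage down to the species.
    max_existing is the largest left/right value already in the table,
    so the new values start just above it. (Hypothetical helper.)
    """
    n = len(lineage)
    return [
        (name, max_existing + 1 + depth, max_existing + 2 * n - depth)
        for depth, name in enumerate(lineage)
    ]


rows = chain_nested_set_values(["Eukaryota", "Metazoa", "Homo sapiens"], 10)
```

Each entry's interval encloses those of the entries below it, and none of
the values overlap with anything already in the table, which is why no
existing rows need to be touched.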

I think that in this case, when trying to load a sequence with a novel
NCBI taxon id, the best solution may be to add just a single minimal
taxon table entry with NULL left/right values (and let
load_ncbi_taxonomy.pl fill in the lineage later).
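
A minimal sketch of that idea, using Python's sqlite3 module and a
cut-down stand-in for the taxon table (the real BioSQL table has more
columns, and the function name here is hypothetical):

```python
import sqlite3


def get_or_create_taxon(conn, ncbi_taxon_id):
    """Return the taxon_id for an NCBI taxon id, adding a stub row if needed.

    A new row gets NULL left/right values, leaving the lineage and the
    nested set values to be filled in by a later taxonomy load.
    """
    row = conn.execute(
        "SELECT taxon_id FROM taxon WHERE ncbi_taxon_id = ?",
        (ncbi_taxon_id,),
    ).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute(
        "INSERT INTO taxon (ncbi_taxon_id, left_value, right_value) "
        "VALUES (?, NULL, NULL)",
        (ncbi_taxon_id,),
    )
    return cur.lastrowid


# Simplified table for illustration only, not the full BioSQL schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxon ("
    " taxon_id INTEGER PRIMARY KEY,"
    " ncbi_taxon_id INTEGER UNIQUE,"
    " parent_taxon_id INTEGER,"
    " left_value INTEGER,"
    " right_value INTEGER)"
)
first = get_or_create_taxon(conn, 9606)
second = get_or_create_taxon(conn, 9606)  # reuses the existing row
```

Loading a second sequence with the same NCBI taxon id then reuses the
stub rather than adding a duplicate.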

Peter


