[Bioperl-l] Bio::*Taxonomy* changes
bix at sendu.me.uk
Tue Jul 18 13:20:34 UTC 2006
I thought I'd post this here in case anyone wants to discuss the points
Nadeem brings up. As far as I can see it is acceptable to remove the <>
bits, so I still plan to do so.
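
For anyone who wants to see what removing the <> bits amounts to, here
is a minimal sketch in Python rather than Perl. It is not the actual
bioperl change; the function name strip_qualifier is made up for
illustration, and it assumes the qualifier always appears as a single
trailing '<...>' group, as in the examples quoted below.

```python
import re

# Assumption: the qualifier is one trailing "<...>" group, optionally
# preceded by whitespace, e.g. 'Gnathostomata <vertebrate>'.
_QUALIFIER = re.compile(r"\s*<[^<>]*>\s*$")


def strip_qualifier(unique_name):
    """Return the name with any trailing '<...>' qualifier removed.

    Hypothetical helper, not part of bioperl.
    """
    return _QUALIFIER.sub("", unique_name)


if __name__ == "__main__":
    for name in ("Gnathostomata <vertebrate>", "Rhodotorula"):
        print(repr(name), "->", repr(strip_qualifier(name)))
```

Names without a qualifier pass through unchanged, so the function is
safe to apply to every name in the dump.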
Nadeem Faruque wrote: [off-list, posted here with permission]
> In case you didn't realise, odd node names such as 'Gnathostomata
> <vertebrate>' are created to uniquify some tax nodes that have identical
> scientific names, e.g. there are 8 entries for Rhodotorula.
> When we parse the ncbi tax dump we store this column as UNIQUE_NAME but
> I don't think that we actually use it for anything within the EMBL
> nucleotide sequence bank.
> Also, I note that there are 548 non-unique NAME_TXT of class 'scientific
> name', so the UNIQUE_NAME column may be of use to someone (though given
> the strength of using a taxid directly I don't see why you'd want to).
Indeed. And given that we are building a taxonomy with nodes, it doesn't
matter that two different nodes in the taxonomy tree share the same
name: each node's position in the tree is implicitly unique. So if you
find yourself with a node called 'Rhodotorula' you can find out which
one it is by looking at the closest ranked parent.
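
To make the "closest ranked parent" idea concrete, here is a rough
sketch, again in Python and not the bioperl API. The Node class and
closest_ranked_parent function are hypothetical, the second
'Rhodotorula' lineage is invented, and it assumes unranked nodes carry
the rank string 'no rank'.

```python
class Node:
    """Hypothetical taxonomy node; not a bioperl class."""

    def __init__(self, name, rank="no rank", parent=None):
        self.name = name
        self.rank = rank
        self.parent = parent


def closest_ranked_parent(node):
    """Walk up the tree and return the first ancestor that has a rank.

    Returns None if no ancestor is ranked.
    """
    parent = node.parent
    while parent is not None and parent.rank == "no rank":
        parent = parent.parent
    return parent


# Two nodes with the same name in different parts of the tree.
# 'HypotheticalOrder' and the unranked intermediate node are invented.
order_a = Node("Sporidiobolales", rank="order")
unranked = Node("unranked intermediate", parent=order_a)
rhodo_a = Node("Rhodotorula", rank="genus", parent=unranked)

order_b = Node("HypotheticalOrder", rank="order")
rhodo_b = Node("Rhodotorula", rank="genus", parent=order_b)

if __name__ == "__main__":
    for node in (rhodo_a, rhodo_b):
        print(node.name, "under", closest_ranked_parent(node).name)
```

The two nodes compare equal by name, but the ranked ancestor tells them
apart without needing any <> qualifier.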
That said, for 'Rhodotorula <Sporidiobolaceae>' the closest ranked
parent is 'Sporidiobolales' and not 'Sporidiobolaceae'. Is that a
problem? Do we need to care about this word 'Sporidiobolaceae' that is
effectively just a synonym of 'Sporidiobolales'?
[Nadeem later replied "...I can't imagine the <> value to be of any
use.". He also clarified that if species have identical names and you
store only those names, you can't work out what the corresponding taxid
is. Without the <> bit you need some other information, like the
classification. I think this other information will be present in input
file formats, and it must be up to the user to store the extra
information when outputting from bioperl]
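
A small sketch of that last point, with the same caveats as before:
Python rather than Perl, nothing here is bioperl code, and the taxids
and the second lineage are invented placeholders. It shows that a name
alone cannot be mapped back to one taxid when names collide, while the
name plus its classification can.

```python
from collections import defaultdict

# (taxid, name, lineage) records. The taxids and 'HypotheticalOrder'
# are made up for illustration; they are not real NCBI values.
RECORDS = [
    (900001, "Rhodotorula", ("Sporidiobolales",)),
    (900002, "Rhodotorula", ("HypotheticalOrder",)),
]

by_name = defaultdict(list)
by_name_and_lineage = {}
for taxid, name, lineage in RECORDS:
    by_name[name].append(taxid)
    by_name_and_lineage[(name, lineage)] = taxid


def resolve(name, lineage=None):
    """Return the one matching taxid, or None if there is no unique match."""
    if lineage is not None:
        return by_name_and_lineage.get((name, tuple(lineage)))
    candidates = by_name.get(name, [])
    return candidates[0] if len(candidates) == 1 else None


if __name__ == "__main__":
    print(resolve("Rhodotorula"))
    print(resolve("Rhodotorula", ["Sporidiobolales"]))
```

So a user who writes out only the bare name loses the taxid, whereas
one who also keeps the classification can still recover it.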