[Bioperl-l] Bio::*Taxonomy* changes

Sendu Bala bix at sendu.me.uk
Tue Jul 18 13:20:34 UTC 2006


I thought I'd post this here in case anyone wants to discuss the points 
Nadeem brings up. As far as I can see it is acceptable to remove the <> 
bits, so I still plan to do so.

Nadeem Faruque wrote: [off-list, posted here with permission]
> In case you didn't realise, odd node names such as 'Gnathostomata 
> <vertebrate>' are created to uniquify some tax nodes that have identical 
> scientific names, eg there are 8 entries for Rhodotorula.
> 
> When we parse the ncbi tax dump we store this column as UNIQUE_NAME but 
> I don't think that we actually use it for anything within the EMBL 
> nucleotide sequence bank.
[...]
> Also, I note that there are 548 non-unique NAME_TXT of class 'scientific 
> name', so the UNIQUE_NAME column may be of use to someone (though given 
> the strength of using a taxid directly I don't see why you'd want to).

Indeed. And given that we are building a taxonomy out of nodes, it
doesn't matter that two different nodes in the taxonomy tree share the
same name: a node's position in the tree is itself unique. So if you
find yourself with a node called 'Rhodotorula' you can find out which
one it is by looking at its closest ranked parent.
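
To make the "closest ranked parent" idea concrete, here is a rough
sketch in Python rather than Perl, deliberately avoiding any real
bioperl calls. The Node class, the closest_ranked_parent function, the
unranked intermediate node and 'Hypothetical order B' are all made up
for illustration; only 'Rhodotorula' and 'Sporidiobolales' come from
the discussion above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """A toy taxonomy node (hypothetical, not a bioperl class)."""
    name: str
    rank: str  # 'no rank' marks an unranked node
    parent: Optional["Node"] = None


def closest_ranked_parent(node: Node) -> Optional[Node]:
    """Walk up the tree and return the first ancestor with a real rank."""
    current = node.parent
    while current is not None:
        if current.rank != "no rank":
            return current
        current = current.parent
    return None


# Two nodes share the name 'Rhodotorula' but sit in different places.
order_a = Node("Sporidiobolales", "order")
unranked = Node("unranked intermediate (hypothetical)", "no rank", order_a)
rhodo_a = Node("Rhodotorula", "genus", unranked)

order_b = Node("Hypothetical order B", "order")  # invented for the example
rhodo_b = Node("Rhodotorula", "genus", order_b)

for genus in (rhodo_a, rhodo_b):
    print(genus.name, "->", closest_ranked_parent(genus).name)
```

The names are identical, but the walk up the tree gives a different
ranked parent for each node, and that is what tells them apart.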

That said, for 'Rhodotorula <Sporidiobolaceae>' the closest ranked
parent is 'Sporidiobolales' and not 'Sporidiobolaceae'. Is that a
problem? Do we need to care about this word 'Sporidiobolaceae' that is
effectively just a synonym of 'Sporidiobolales'?

[Nadeem later replied "...I can't imagine the <> value to be of any 
use.". He also clarified that if species have identical names and you 
store only those names, you can't work out what the corresponding taxid 
is. Without the <> bit you need some other information, like the 
classification. I think this other information will be present in input 
file formats, and it must be up to the user to store that extra 
information when outputting from bioperl.]
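
A small sketch of that last point, again in Python and not using
bioperl itself. The taxids (1001 to 1003), the second Rhodotorula
qualifier and the parent names other than 'Sporidiobolales' are
invented placeholders, and strip_unique_suffix is a hypothetical
helper, not an existing function. It assumes the <> bit always comes
at the end of the name.

```python
import re
from collections import defaultdict

# Matches a trailing ' <...>' qualifier such as ' <vertebrate>'.
_SUFFIX = re.compile(r"\s*<[^<>]*>\s*$")


def strip_unique_suffix(name: str) -> str:
    """Remove a trailing <...> qualifier; other names pass through as-is."""
    return _SUFFIX.sub("", name)


# (taxid, unique name, closest ranked parent); the ids are made up.
records = [
    (1001, "Rhodotorula <Sporidiobolaceae>", "Sporidiobolales"),
    (1002, "Rhodotorula <hypothetical qualifier>", "Hypothetical order B"),
    (1003, "Gnathostomata <vertebrate>", "Hypothetical parent C"),
]

by_name = defaultdict(list)   # stripped name -> every taxid using it
by_name_and_parent = {}       # (stripped name, ranked parent) -> taxid
for taxid, unique_name, ranked_parent in records:
    name = strip_unique_suffix(unique_name)
    by_name[name].append(taxid)
    by_name_and_parent[(name, ranked_parent)] = taxid

# The bare name is ambiguous once the <> bit has gone...
print(by_name["Rhodotorula"])
# ...but the name plus some classification information is not.
print(by_name_and_parent[("Rhodotorula", "Sporidiobolales")])
```

So the name alone maps to more than one taxid, while the name together
with a piece of the classification maps to exactly one.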


