[Bioperl-l] taxonomy ID

Thu Apr 2 08:17:02 UTC 2009

You may find the attached Perl module useful. It handles the difficult 
parts of getting the taxonomy for a GI identifier or a taxID, and it is 
designed to process a large number of GIs quickly and with low memory 
usage.

An example of usage would be:

use taxbuild;
#Build the taxonomyDB
my $taxDB = taxbuild->new(
    nodes => $nodes_file_from_taxonomyDB,
    names => $names_file_from_taxonomyDB,
    dict  => $dictFile,
    save_mem => 1
  );

# Get the taxonomy given a GI identifier
my @tax = $taxDB->get_taxonomy_from_gi("35961124");

# Get the taxonomy term of a GI identifier at a given level
my $term_at_level = $taxDB->get_term_at_level_from_gi("35961124","family");

# Get the taxid of a GI identifier
my $taxid = $taxDB->get_taxid("35961124");

# Get the taxonomy given a taxid
# (separate variable name, so @tax above is not redeclared)
my @tax_from_taxid = $taxDB->get_taxonomy($taxid);

# Get the taxonomy term at a given level given a taxid
my $term_at_genus = $taxDB->get_term_at_level($taxid,"genus");

# Get the level of a given taxonomical name
my $level = $taxDB->get_level_from_name("Proteobacteria");

The "dict file" is a processed version of the gi_taxid file from 
taxonomyDB. You can generate it by running the tax2bin2.pl script, which 
is also attached:

$ perl tax2bin2.pl gi_taxid_prot.dmp > gi_taxid_prot.bin
or, if you are working with nucleotide sequences instead of proteins:
$ perl tax2bin2.pl gi_taxid_nucl.dmp > gi_taxid_nucl.bin

You may consult the documentation of the module for a full description.
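
The format that tax2bin2.pl actually writes is not described in this 
message. Purely as an illustration of why a binary dictionary can give 
fast lookups without loading the whole gi_taxid file into memory, here is 
a minimal sketch under an assumed layout: one fixed-width 32-bit record 
per GI, so the taxid for a GI sits at byte offset GI * 4. The names 
build_dict and lookup_taxid are hypothetical and are not part of the 
attached module or script.

```perl
use strict;
use warnings;

# ASSUMPTION: this layout is illustrative only, not the tax2bin2.pl format.
# Record number = GI; each record is one 32-bit big-endian unsigned taxid.
my $RECORD_SIZE = 4;

# Convert "gi<TAB>taxid" lines into a fixed-width binary file.
# $out_fh must be a seekable handle opened for writing.
sub build_dict {
    my ($in_fh, $out_fh) = @_;
    binmode $out_fh;
    while (my $line = <$in_fh>) {
        chomp $line;
        my ($gi, $taxid) = split /\t/, $line;
        next unless defined $taxid;
        # Jump straight to the slot for this GI; skipped slots stay zero.
        seek $out_fh, $gi * $RECORD_SIZE, 0 or die "seek failed: $!";
        print {$out_fh} pack('N', $taxid);
    }
    return;
}

# Look up one GI with a single seek and a 4-byte read.
# Returns undef when the GI has no entry.
sub lookup_taxid {
    my ($fh, $gi) = @_;
    seek $fh, $gi * $RECORD_SIZE, 0 or return;
    my $read = read $fh, my $buf, $RECORD_SIZE;
    return unless defined $read && $read == $RECORD_SIZE;
    my $taxid = unpack 'N', $buf;
    return $taxid || undef;    # 0 marks an empty slot
}
```

With a layout like this, each lookup costs one seek regardless of how many 
GIs the file holds, and only four bytes are held in memory at a time. The 
trade-off is a file whose size depends on the largest GI rather than on 
the number of entries.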

A possible solution to the original post using this module would be 
something like:

# Initialize the taxonomyDB once.
my $taxDB = taxbuild->new(
   nodes => $nodes_file_from_taxonomyDB,
   names => $names_file_from_taxonomyDB,
   dict  => $dictFile,
   save_mem => 1
);

#For each GI in your blast result:
my $superkingdom = $taxDB->get_term_at_level_from_gi($gi,"superkingdom");
if ($superkingdom eq "Bacteria") {
   # Do whatever you want
} elsif ($superkingdom eq "Eukaryota") {
   # Do whatever you want
}
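
To make the "for each GI" step concrete, the following sketch tallies 
superkingdoms over a list of blast hit names. It assumes the hit names 
follow the NCBI "gi|NUMBER|..." convention, and it takes the lookup as a 
code reference so that it does not depend on the module being loaded. 
The names gi_from_hit_name and count_superkingdoms are hypothetical 
helpers, not part of taxbuild.

```perl
use strict;
use warnings;

# Pull the GI number out of a hit name such as "gi|35961124|ref|...|".
# Returns undef if the name carries no GI.
sub gi_from_hit_name {
    my ($name) = @_;
    return $name =~ /\bgi\|(\d+)/ ? $1 : undef;
}

# Count hits per superkingdom. $lookup is a code reference that maps a GI
# to a superkingdom name; hits that cannot be resolved count as "unknown".
sub count_superkingdoms {
    my ($hit_names, $lookup) = @_;
    my %count;
    for my $name (@$hit_names) {
        my $gi = gi_from_hit_name($name);
        my $superkingdom = defined $gi ? $lookup->($gi) : undef;
        $superkingdom = 'unknown'
            unless defined $superkingdom && length $superkingdom;
        $count{$superkingdom}++;
    }
    return \%count;
}

# Possible use with the module from this message (not run here):
# my $counts = count_superkingdoms(\@hit_names,
#     sub { $taxDB->get_term_at_level_from_gi($_[0], "superkingdom") });
```

Prokaryotes would then be the sum of the "Bacteria" and "Archaea" counts, 
and eukaryotes the "Eukaryota" count.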

The module has been tested mainly on Linux systems, but it should run 
without problems on Windows and Mac too. If you encounter any problems 
while using it, don't hesitate to contact me.

Hope this helps,

M;

Florent Angly wrote:
> FYI, the gi_taxid_nucl.dmp.gz is very large, thus it's likely that you 
> won't be able to put its information in a hash (unless you have a lot of 
> memory).
> Florent
> 
> Smithies, Russell wrote:
>> The taxonomy information isn't in the blast output unless you created 
>> custom fasta headers for your blast database.
>> The easiest way to get the tax_id for your accessions would be to 
>> download the gi->tax_id list from 
>> ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz.
>> If you load that file into a hash, parse the accessions out of the 
>> blast hits then lookup the tax_id from that hash, I think it should be 
>> fairly fast.
>> Checking which are prokaryotes and which are eukaryotes based on 
>> tax_id is a separate problem  :-)
>> If you grab the taxdump.tar.gz file from the same site, the nodes.dmp 
>> file contained within lists what division each tax_id belongs to 
>> (Bacteria, Invertebrates, Mammals, Phages, Plants, etc) so you can 
>> probably work it out from that.
>>
>> It's not a very BioPerly solution but sometimes just looking up the 
>> answer from a file/table/hash is the simplest way.
>> Hope this helps,
>>
>> Russell Smithies
>> Bioinformatics Applications Developer
>> T +64 3 489 9085  E russell.smithies at agresearch.co.nz
>> Invermay Research Centre, Puddle Alley, Mosgiel, New Zealand
>> T +64 3 489 3809  F +64 3 489 9174  www.agresearch.co.nz
>>
>>> -----Original Message-----
>>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
>>> bounces at lists.open-bio.org] On Behalf Of shalabh sharma
>>> Sent: Wednesday, 1 April 2009 7:43 a.m.
>>> To: bioperl-l
>>> Subject: [Bioperl-l] taxonomy ID
>>>
>>> Hi All,
>>> I am writing a script, and for one part of it I have to parse a blast
>>> report (refseq blast) and check how many organisms are eukaryotes and how
>>> many of them are prokaryotes.
>>> I am using the Bio::DB::Taxonomy module:
>>> http://www.bioperl.org/wiki/Module:Bio::DB::Taxonomy
>>>
>>> But for this I need a taxonomy ID (like '33090'), as given in the example.
>>> So is it possible to get a taxonomy ID from a refseq blast report?
>>> If not, how can I deal with this problem?
>>>
>>> I would really appreciate it if anyone can help me out.
>>>
>>> Thanks
>>> Shalabh
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>     