[Bioperl-l] taxonomy ID

Miguel Pignatelli miguel.pignatelli at uv.es
Wed Apr 1 21:56:36 UTC 2009


You may find the attached Perl module useful. It handles the difficult
parts of getting the taxonomy for a GI identifier or a taxid. It is
designed to process a large number of GIs quickly and with low memory
usage.

An example of usage would be:

use taxbuild;

# Build the taxonomyDB
my $taxDB = taxbuild->new(
    nodes    => $nodes_file_from_taxonomyDB,
    names    => $names_file_from_taxonomyDB,
    dict     => $dictFile,
    save_mem => 1,
);

# Get the taxonomy given a GI identifier
my @tax = $taxDB->get_taxonomy_from_gi("35961124");

# Get the taxonomy term of a GI identifier at a given level
my $term_at_level = $taxDB->get_term_at_level_from_gi("35961124", "family");

# Get the taxid of a GI identifier
my $taxid = $taxDB->get_taxid("35961124");

# Get the taxonomy given a taxid
my @tax_from_taxid = $taxDB->get_taxonomy($taxid);

# Get the taxonomy term at a given level given a taxid
my $taxid_at_level = $taxDB->get_term_at_level($taxid, "genus");

# Get the level of a given taxonomical name
my $level = $taxDB->get_level_from_name("Proteobacteria");

The "dict file" is a processed version of the gi_taxid file from
taxonomyDB. You can generate it by running the tax2bin2.pl script,
also attached:

$ perl tax2bin2.pl gi_taxid_prot.dmp > gi_taxid_prot.bin

or, if you are working with nucleotide sequences instead of proteins:

$ perl tax2bin2.pl gi_taxid_nucl.dmp > gi_taxid_nucl.bin
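
The attached script was scrubbed from this archive, so what follows is only a sketch of the general idea behind such a dict file, not the format tax2bin2.pl actually writes. It assumes the two-column (GI, taxid) layout of the gi_taxid dump files. Each taxid is stored as a 4-byte integer at byte offset GI * 4, so a lookup is one seek rather than a hash held in memory. The names write_dict and lookup_taxid are made up for this illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write each taxid as a 4-byte big-endian integer at byte offset GI * 4.
# GIs absent from the dump are left as zero bytes (holes in the file).
sub write_dict {
    my ($dmp_fh, $bin_fh) = @_;
    binmode $bin_fh;
    while (my $line = <$dmp_fh>) {
        my ($gi, $taxid) = $line =~ /^(\d+)\s+(\d+)/ or next;
        seek($bin_fh, $gi * 4, 0) or die "seek failed: $!";   # 0 = from start
        print {$bin_fh} pack('N', $taxid);
    }
    return;
}

# Return the taxid stored for a GI, or undef if the slot is empty or
# lies beyond the end of the file.
sub lookup_taxid {
    my ($bin_fh, $gi) = @_;
    seek($bin_fh, $gi * 4, 0) or return;
    my $n = read($bin_fh, my $buf, 4);
    return unless defined $n && $n == 4;
    my $taxid = unpack('N', $buf);
    return $taxid || undef;
}

# Tiny made-up dump (GI <tab> taxid) standing in for gi_taxid_prot.dmp.
my $dmp = "5\t9606\n12\t562\n";
open my $dmp_fh, '<', \$dmp or die "cannot open in-memory dump: $!";

my ($bin_fh, $bin_name) = tempfile(UNLINK => 1);
write_dict($dmp_fh, $bin_fh);

for my $gi (5, 12, 7) {
    my $taxid = lookup_taxid($bin_fh, $gi);
    print "GI $gi -> ", (defined $taxid ? $taxid : 'not found'), "\n";
}
```

With a layout like this, a lookup reads only four bytes from disk regardless of how large the dump is.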

A possible solution to the original question using this module would be
something like:

# Initialize the taxonomyDB once.
my $taxDB = taxbuild->new(
    nodes    => $nodes_file_from_taxonomyDB,
    names    => $names_file_from_taxonomyDB,
    dict     => $dictFile,
    save_mem => 1,
);

# For each blast result
# Extract the GI
my $superkingdom = $taxDB->get_term_at_level_from_gi($gi, "superkingdom");
if ($superkingdom eq "Bacteria") {
    # Do whatever you want
} elsif ($superkingdom eq "Eukaryota") {
    # Do whatever you want
}
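
The "for each blast result, extract the GI" step could be sketched roughly as below. This assumes the hit identifiers carry the usual NCBI "gi|NNN|..." prefix. The helper names extract_gi and tally_superkingdoms are invented for the example, and the lookup is passed in as a code reference so the sketch runs without the attached module.

```perl
use strict;
use warnings;

# Pull the GI number out of an identifier such as "gi|35961124|ref|X|".
# Returns undef when the identifier has no "gi|NNN" field.
sub extract_gi {
    my ($hit_id) = @_;
    return $hit_id =~ /(?:^|\|)gi\|(\d+)/ ? $1 : undef;
}

# Count hits per superkingdom. $lookup is any code reference mapping a
# GI to a superkingdom name (or undef when the GI is unknown).
sub tally_superkingdoms {
    my ($lookup, @hit_ids) = @_;
    my %count;
    for my $hit_id (@hit_ids) {
        my $gi = extract_gi($hit_id);
        next unless defined $gi;                  # skip hits without a GI
        my $superkingdom = $lookup->($gi);
        $superkingdom = 'unknown'
            unless defined $superkingdom && length $superkingdom;
        $count{$superkingdom}++;
    }
    return %count;
}

# Stand-in lookup table with made-up GIs, used here in place of taxbuild.
my %fake_db = (111 => 'Bacteria', 222 => 'Eukaryota');
my @hit_ids = ('gi|111|ref|A|', 'gi|222|ref|B|', 'gi|111|gb|C|', 'no_gi_here');

my %count = tally_superkingdoms(sub { $fake_db{ $_[0] } }, @hit_ids);
printf "%s\t%d\n", $_, $count{$_} for sort keys %count;
```

With the module loaded, the lookup would presumably be sub { $taxDB->get_term_at_level_from_gi($_[0], "superkingdom") }. Note that prokaryotes cover both the "Bacteria" and "Archaea" superkingdoms, so both counts should be added together when comparing against "Eukaryota".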


The module has been tested mainly on Linux systems, but should run
without problems on Windows and Mac too. If you encounter any problem
with it, don't hesitate to contact me.

Hope this helps,

M;

-------------- next part --------------
A non-text attachment was scrubbed...
Name: tax2bin2.pl
Type: text/x-perl-script
Size: 400 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/bioperl-l/attachments/20090401/4ba27276/attachment-0008.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: taxbuild.pm
Type: text/x-perl-script
Size: 10599 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/bioperl-l/attachments/20090401/4ba27276/attachment-0009.bin>

On 01/04/2009, at 19:03, Florent Angly wrote:

> FYI, the gi_taxid_nucl.dmp.gz is very large, thus it's likely that  
> you won't be able to put its information in a hash (unless you have  
> a lot of memory).
> Florent
>
> Smithies, Russell wrote:
>> The taxonomy information isn't in the blast output unless you  
>> created custom fasta headers for your blast database.
>> The easiest way to get the tax_id for your accessions would be to
>> download the gi->tax_id list from
>> ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz.
>> If you load that file into a hash, parse the accessions out of the  
>> blast hits then lookup the tax_id from that hash, I think it should  
>> be fairly fast.
>> Checking which are prokaryotes and which are eukaryotes based on  
>> tax_id is a separate problem  :-)
>> If you grab the taxdump.tar.gz file from the same site, the  
>> nodes.dmp file contained within lists what division each tax_id  
>> belongs to (Bacteria, Invertebrates, Mammals, Phages, Plants, etc)  
>> so you can probably work it out from that.
>>
>> It's not a very BioPerly solution but sometimes just looking up the  
>> answer from a file/table/hash is the simplest way.
>> Hope this helps,
>>
>> Russell Smithies
>> Bioinformatics Applications Developer T +64 3 489 9085 E  russell.smithies at agresearch.co.nz
>> Invermay  Research Centre Puddle Alley, Mosgiel, New Zealand T  +64  
>> 3 489 3809   F  +64 3 489 9174  www.agresearch.co.nz
>>
>>> -----Original Message-----
>>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
>>> bounces at lists.open-bio.org] On Behalf Of shalabh sharma
>>> Sent: Wednesday, 1 April 2009 7:43 a.m.
>>> To: bioperl-l
>>> Subject: [Bioperl-l] taxonomy ID
>>>
>>> Hi All,
>>> I am writing a script; for one part of it I have to parse a blast
>>> report (refseq blast) and check how many organisms are eukaryotes
>>> and how many of them are prokaryotes.
>>> I am using the Bio::DB::Taxonomy module:
>>> http://www.bioperl.org/wiki/Module:Bio::DB::Taxonomy
>>>
>>> But for this I need a taxonomy ID (like '33090'), as given in the
>>> example.
>>> So is it possible to get a taxonomy ID from a refseq blast report?
>>> If not, how can I deal with this problem?
>>>
>>> I would really appreciate it if anyone can help me out.
>>>
>>> Thanks
>>> Shalabh
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>


