[Bioperl-l] taxonomy ID

Fri Apr 10 15:51:45 UTC 2009

The only difference from the Bio::DB::Taxonomy module that I can see is  
that we don't specifically have the dictionary part (gi -> taxid), but I  
just build a local DBHash index of that when I need it.
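
A minimal sketch of such a local index, assuming the standard DB_File module and the two-column, tab-separated layout of the gi_taxid_*.dmp files. The function names and the sample GI/taxid pairs are made up for illustration; this is not code from BioPerl or from taxbuild.

```perl
use strict;
use warnings;
use DB_File;
use Fcntl qw(O_RDWR O_CREAT O_RDONLY);
use File::Temp qw(tempdir);

# Build an on-disk hash index (GI -> taxid) from a gi_taxid_*.dmp file,
# which is assumed to hold two tab-separated columns per line: GI, taxid.
sub build_gi_index {
    my ($dmp_file, $index_file) = @_;
    my %index;
    tie %index, 'DB_File', $index_file, O_RDWR | O_CREAT, 0644, $DB_HASH
        or die "Cannot create index $index_file: $!";
    open my $in, '<', $dmp_file or die "Cannot open $dmp_file: $!";
    while (my $line = <$in>) {
        chomp $line;
        my ($gi, $taxid) = split /\t/, $line;
        next unless defined $gi && defined $taxid;   # skip blank/short lines
        $index{$gi} = $taxid;
    }
    close $in;
    untie %index;                                    # flush to disk
    return $index_file;
}

# Reopen an existing index read-only; returns a reference to the tied hash.
sub open_gi_index {
    my ($index_file) = @_;
    my %index;
    tie %index, 'DB_File', $index_file, O_RDONLY, 0644, $DB_HASH
        or die "Cannot open index $index_file: $!";
    return \%index;
}

# Demo on a tiny made-up dump (these GI/taxid pairs are illustrative only).
my $dir  = tempdir(CLEANUP => 1);
my $dump = "$dir/gi_taxid_sample.dmp";
open my $out, '>', $dump or die "Cannot write $dump: $!";
print {$out} "1001\t9606\n1002\t562\n";
close $out;

my $index    = build_gi_index($dump, "$dir/gi_taxid_sample.idx");
my $gi2taxid = open_gi_index($index);
print "GI 1002 -> taxid $gi2taxid->{1002}\n";
```

Because the hash lives on disk, only the records you look up are read into memory, which sidesteps the size problem with the full gi_taxid_nucl file.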
-jason
On Apr 10, 2009, at 6:32 AM, Chris Fields wrote:

> I don't know if this has been pointed out, but Bio::DB::Taxonomy is  
> also capable of indexing and using the NCBI tax flat files.
>
>  use Bio::DB::Taxonomy;
>
>  my $db = Bio::DB::Taxonomy->new(-source => 'flatfile',
>                                 -nodesfile => $nodesfile,
>                                 -namesfile => $namefile);
>
>  # use other Bio::DB::Taxonomy methods
>
> chris
>
> On Apr 1, 2009, at 4:56 PM, Miguel Pignatelli wrote:
>
>> You may find the attached Perl module useful. It solves the  
>> difficult parts of getting the taxonomy given a GI identifier or a  
>> taxID. It is designed to be able to process a high number of GIs  
>> very fast and with low memory usage.
>>
>> An example of usage would be:
>>
>> use taxbuild;
>> #Build the taxonomyDB
>> my $taxDB = taxbuild->new(
>>     nodes    => $nodes_file_from_taxonomyDB,
>>     names    => $names_file_from_taxonomyDB,
>>     dict     => $dictFile,
>>     save_mem => 1,
>> );
>>
>> # Get the taxonomy given a GI identifier
>> my @tax = $taxDB->get_taxonomy_from_gi("35961124");
>>
>> # Get the taxonomy term of a GI identifier at a given level
>> my $term_at_level = $taxDB->get_term_at_level_from_gi("35961124", "family");
>>
>> # Get the taxid of a GI identifier
>> my $taxid = $taxDB->get_taxid("35961124");
>>
>> # Get the taxonomy given a taxid
>> @tax = $taxDB->get_taxonomy($taxid);
>>
>> # Get the taxonomy term at a given level given a taxid
>> my $taxid_at_level = $taxDB->get_term_at_level($taxid, "genus");
>>
>> # Get the level of a given taxonomic name
>> my $level = $taxDB->get_level_from_name("Proteobacteria");
>>
>> The "dict file" is a processed version of the gi_taxid file from  
>> taxonomyDB. You can get this file by running the tax2bin2.pl script  
>> also attached:
>>
>> $ perl tax2bin2.pl gi_taxid_prot.dmp > gi_taxid_prot.bin
>> or, if you are working with nucleotide sequences instead of proteins:
>> $ perl tax2bin2.pl gi_taxid_nucl.dmp > gi_taxid_nucl.bin
>>
>> A possible solution to the original post using this module would be  
>> something like:
>>
>> # Initialize the taxonomyDB once.
>> my $taxDB = taxbuild->new(
>>     nodes    => $nodes_file_from_taxonomyDB,
>>     names    => $names_file_from_taxonomyDB,
>>     dict     => $dictFile,
>>     save_mem => 1,
>> );
>>
>> # For each BLAST result:
>> # extract the GI, then
>> my $superkingdom = $taxDB->get_term_at_level_from_gi($gi, "superkingdom");
>> if ($superkingdom eq "Bacteria") {
>>     # Do whatever you want
>> } elsif ($superkingdom eq "Eukaryota") {
>>     # Do whatever you want
>> }
>>
>>
>> The module has been tested mainly on Linux systems, but it should run  
>> without problems on Windows and Mac too. If you encounter any  
>> problem with it, don't hesitate to contact me.
>>
>> Hope this helps,
>>
>> M;
>>
>> <tax2bin2.pl><taxbuild.pm>
>>
>>
>>
>> On 01/04/2009, at 19:03, Florent Angly wrote:
>>
>>> FYI, the gi_taxid_nucl.dmp.gz is very large, thus it's likely that  
>>> you won't be able to put its information in a hash (unless you  
>>> have a lot of memory).
>>> Florent
>>>
>>> Smithies, Russell wrote:
>>>> The taxonomy information isn't in the blast output unless you  
>>>> created custom fasta headers for your blast database.
>>>> The easiest way to get the tax_id for your accessions would be to  
>>>> download the gi->tax_id list from:
>>>> ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz
>>>> If you load that file into a hash, parse the accessions out of  
>>>> the blast hits, then look up the tax_id in that hash, I think it  
>>>> should be fairly fast.
>>>> Checking which are prokaryotes and which are eukaryotes based on  
>>>> tax_id is a separate problem  :-)
>>>> If you grab the taxdump.tar.gz file from the same site, the  
>>>> nodes.dmp file contained within lists what division each tax_id  
>>>> belongs to (Bacteria, Invertebrates, Mammals, Phages, Plants,  
>>>> etc) so you can probably work it out from that.
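
A rough sketch of the nodes.dmp side of this, assuming the usual taxdump layout (fields separated by tab-pipe-tab, with tax_id, parent tax_id and rank as the first three columns). Instead of the division column it climbs the parent links until it reaches a node of the wanted rank. The helper names are hypothetical, and the sample rows are a toy excerpt with the intermediate ranks left out, not real nodes.dmp content.

```perl
use strict;
use warnings;

# Read nodes.dmp-style lines from a filehandle and return two hash
# references: taxid -> parent taxid, and taxid -> rank.
sub parse_nodes {
    my ($fh) = @_;
    my (%parent, %rank);
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/\t\|$//;                       # drop the trailing "\t|"
        my ($taxid, $parent_id, $rank) = split /\t\|\t/, $line;
        next unless defined $rank;                # skip malformed lines
        $parent{$taxid} = $parent_id;
        $rank{$taxid}   = $rank;
    }
    return (\%parent, \%rank);
}

# Climb from a taxid towards the root; return the first ancestor (or the
# node itself) with the wanted rank, or nothing if none is found.
sub ancestor_at_rank {
    my ($taxid, $wanted, $parent, $rank) = @_;
    while (defined $taxid && exists $parent->{$taxid}) {
        return $taxid if $rank->{$taxid} eq $wanted;
        last if $parent->{$taxid} == $taxid;      # the root is its own parent
        $taxid = $parent->{$taxid};
    }
    return;
}

# Toy excerpt: real lineages have many more intermediate nodes.
my $sample = join '', map { join("\t|\t", @$_) . "\t|\n" } (
    [1,    1,    'no rank'],
    [2,    1,    'superkingdom'],   # Bacteria
    [2759, 1,    'superkingdom'],   # Eukaryota
    [562,  2,    'species'],
    [9606, 2759, 'species'],
);
open my $fh, '<', \$sample or die "Cannot read sample data: $!";
my ($parent, $rank) = parse_nodes($fh);
close $fh;

my %kingdom_name = (2 => 'Bacteria', 2759 => 'Eukaryota');
for my $taxid (562, 9606) {
    my $top = ancestor_at_rank($taxid, 'superkingdom', $parent, $rank);
    my $label = defined $top ? ($kingdom_name{$top} // $top) : 'unknown';
    print "taxid $taxid belongs to $label\n";
}
```

Combined with a gi -> tax_id lookup, counting the labels returned here gives the eukaryote/prokaryote tally asked about in the original post.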
>>>>
>>>> It's not a very BioPerly solution but sometimes just looking up  
>>>> the answer from a file/table/hash is the simplest way.
>>>> Hope this helps,
>>>>
>>>> Russell Smithies
>>>> Bioinformatics Applications Developer T +64 3 489 9085 E  russell.smithies at agresearch.co.nz
>>>> Invermay  Research Centre Puddle Alley, Mosgiel, New Zealand T   
>>>> +64 3 489 3809   F  +64 3 489 9174  www.agresearch.co.nz
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
>>>>> bounces at lists.open-bio.org] On Behalf Of shalabh sharma
>>>>> Sent: Wednesday, 1 April 2009 7:43 a.m.
>>>>> To: bioperl-l
>>>>> Subject: [Bioperl-l] taxonomy ID
>>>>>
>>>>> Hi All,
>>>>>        I am writing a script, and for one part of it I have to  
>>>>> parse a BLAST report (RefSeq BLAST) and check how many organisms  
>>>>> are eukaryotes and how many of them are prokaryotes.
>>>>> I am using the Bio::DB::Taxonomy module:
>>>>> http://www.bioperl.org/wiki/Module:Bio::DB::Taxonomy
>>>>>
>>>>> But for this I need a taxonomy ID (like '33090'), as given in the  
>>>>> example.
>>>>> So is it possible to get a taxonomy ID from a RefSeq BLAST report?
>>>>> If not, how can I deal with this problem?
>>>>>
>>>>> I would really appreciate it if anyone can help me out.
>>>>>
>>>>> Thanks
>>>>> Shalabh
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>
>

Jason Stajich
jason at bioperl.org