[Bioperl-l] taxonomy ID

Chris Fields cjfields at illinois.edu
Fri Apr 10 13:32:00 UTC 2009


I don't know if this has been pointed out, but Bio::DB::Taxonomy is  
also capable of indexing and using the NCBI tax flat files.

   use Bio::DB::Taxonomy;

   my $db = Bio::DB::Taxonomy->new(-source    => 'flatfile',
                                   -nodesfile => $nodesfile,
                                   -namesfile => $namefile);

   # use other Bio::DB::Taxonomy methods
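
For the original question (eukaryote vs. prokaryote), a minimal sketch of what those other methods might look like is below. The helper name name_at_rank is hypothetical and not part of BioPerl. It only assumes that the database object provides get_taxon(-taxonid => ...) and ancestor($taxon), and that each taxon provides id, rank and scientific_name, which is what Bio::DB::Taxonomy and Bio::Taxon offer. The rank name to ask for depends on what your nodes.dmp actually contains.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): climb the lineage of a taxon
# until a node with the requested rank is found.
# $db is expected to behave like a Bio::DB::Taxonomy object.
sub name_at_rank {
    my ($db, $taxid, $rank) = @_;
    my $taxon = $db->get_taxon(-taxonid => $taxid);
    while (defined $taxon) {
        my $r = $taxon->rank;
        return $taxon->scientific_name if defined $r && $r eq $rank;
        my $parent = $db->ancestor($taxon);
        # Stop at the root: no parent, or a root that points at itself.
        last if !defined $parent || $parent->id eq $taxon->id;
        $taxon = $parent;
    }
    return;   # unknown taxid, or no ancestor with that rank
}

# Possible use with the $db built above (assumed, untested against real data):
#   my $sk = name_at_rank($db, $taxid, 'superkingdom');
```

Walking the ancestors one at a time keeps the helper independent of how the database is backed (flat files or Entrez), at the cost of one lookup per level.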

chris

On Apr 1, 2009, at 4:56 PM, Miguel Pignatelli wrote:

> You may find the attached Perl module useful. It solves the  
> difficult parts of getting the taxonomy given a GI identifier or a  
> taxID. It is designed to be able to process a high number of GIs  
> very fast and with low memory usage.
>
> An example of usage would be:
>
> use taxbuild;
> #Build the taxonomyDB
> my $taxDB = taxbuild->new(
>     nodes    => $nodes_file_from_taxonomyDB,
>     names    => $names_file_from_taxonomyDB,
>     dict     => $dictFile,
>     save_mem => 1,
> );
>
> # Get the taxonomy given a GI identifier
> my @tax = $taxDB->get_taxonomy_from_gi("35961124");
>
> # Get the taxonomy term of a GI identifier at a given level
> my $term_at_level = $taxDB->get_term_at_level_from_gi("35961124", "family");
>
> # Get the taxid of a GI identifier
> my $taxid = $taxDB->get_taxid("35961124");
>
> # Get the taxonomy given a taxid
> @tax = $taxDB->get_taxonomy($taxid);
>
> # Get the taxonomy term at a given level given a taxid
> my $taxid_at_level = $taxDB->get_term_at_level($taxid, "genus");
>
> # Get the level of a given taxonomic name
> my $level = $taxDB->get_level_from_name("Proteobacteria");
>
> The "dict file" is a processed version of the gi_taxid file from  
> taxonomyDB. You can get this file by running the tax2bin2.pl script  
> also attached:
>
> $ perl tax2bin2.pl gi_taxid_prot.dmp > gi_taxid_prot.bin
> or, if you are working with genes instead of proteins:
> $ perl tax2bin2.pl gi_taxid_nucl.dmp > gi_taxid_nucl.bin
>
> A possible solution to the original post using this module would be  
> something like:
>
> # Initialize the taxonomyDB once.
> my $taxDB = taxbuild->new(
>     nodes    => $nodes_file_from_taxonomyDB,
>     names    => $names_file_from_taxonomyDB,
>     dict     => $dictFile,
>     save_mem => 1,
> );
>
> # For each blast result:
> # extract the GI, then
> my $superkingdom = $taxDB->get_term_at_level_from_gi($gi, "superkingdom");
> if ($superkingdom eq "Bacteria") {
>     # Do whatever you want
> } elsif ($superkingdom eq "Eukaryota") {
>     # Do whatever you want
> }
>
>
> The module has been tested mainly on Linux systems, but should run  
> without problems on Windows and Mac too. If you encounter any  
> problem with it, don't hesitate to contact me.
>
> Hope this helps,
>
> M;
>
> <tax2bin2.pl><taxbuild.pm>
>
>
>
> On 01/04/2009, at 19:03, Florent Angly wrote:
>
>> FYI, the gi_taxid_nucl.dmp.gz is very large, thus it's likely that  
>> you won't be able to put its information in a hash (unless you have  
>> a lot of memory).
>> Florent
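
One way around the memory problem, sketched here under the assumption that the dump is a plain two-column file (GI, a tab, then the tax_id), is to read it once and keep only the GIs that actually occur in the blast hits. The function name taxids_for_gis is made up for illustration.

```perl
use strict;
use warnings;

# Sketch: scan a "gi<TAB>taxid" dump once and keep only the GIs we need,
# so memory grows with the number of blast hits, not with the dump size.
sub taxids_for_gis {
    my ($fh, @gis) = @_;
    return {} unless @gis;
    my %wanted = map { $_ => 1 } @gis;
    my %taxid_of;
    while (my $line = <$fh>) {
        chomp $line;
        my ($gi, $taxid) = split /\t/, $line;
        next unless defined $taxid && $wanted{$gi};
        $taxid_of{$gi} = $taxid;
        # Every requested GI has been seen: no need to read further.
        last if keys(%taxid_of) == keys(%wanted);
    }
    return \%taxid_of;
}

# Possible use on the compressed dump (assumes a gzip binary on the PATH):
#   open my $fh, '-|', 'gzip', '-dc', 'gi_taxid_nucl.dmp.gz' or die $!;
#   my $taxid_of = taxids_for_gis($fh, @gis_from_blast_hits);
```

This is still a full pass over the file in the worst case, so it suits a batch of hits collected up front rather than one lookup per hit.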
>>
>> Smithies, Russell wrote:
>>> The taxonomy information isn't in the blast output unless you  
>>> created custom fasta headers for your blast database.
>>> The easiest way to get the tax_id for your accessions would be to  
>>> download the gi->tax_id list from
>>> ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz.
>>> If you load that file into a hash, parse the accessions out of the  
>>> blast hits, then look up the tax_id from that hash, I think it  
>>> should be fairly fast.
>>> Checking which are prokaryotes and which are eukaryotes based on  
>>> tax_id is a separate problem  :-)
>>> If you grab the taxdump.tar.gz file from the same site, the  
>>> nodes.dmp file contained within lists what division each tax_id  
>>> belongs to (Bacteria, Invertebrates, Mammals, Phages, Plants, etc)  
>>> so you can probably work it out from that.
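
A rough sketch of that division lookup follows. It assumes the usual taxdump layout: fields separated by tab, pipe, tab; each line ending in tab, pipe; the division id in the fifth column of nodes.dmp; and the id-to-name mapping in division.dmp from the same archive. All three function names are invented for this example.

```perl
use strict;
use warnings;

# Split one line of a taxdump file into its fields.
# Assumed layout: fields separated by "<TAB>|<TAB>", line ends with "<TAB>|".
sub dmp_fields {
    my ($line) = @_;
    chomp $line;
    $line =~ s/\t\|$//;
    return split /\t\|\t/, $line, -1;
}

# Build a map of division id -> division name from division.dmp.
sub read_divisions {
    my ($fh) = @_;
    my %name_of;
    while (my $line = <$fh>) {
        my ($id, $code, $name) = dmp_fields($line);
        $name_of{$id} = $name if defined $name;
    }
    return \%name_of;
}

# Build a map of tax_id -> division name for the requested tax_ids,
# reading nodes.dmp once.
sub divisions_for_taxids {
    my ($nodes_fh, $name_of, @taxids) = @_;
    my %wanted = map { $_ => 1 } @taxids;
    my %division_of;
    while (my $line = <$nodes_fh>) {
        my ($taxid, $parent, $rank, $embl, $div_id) = dmp_fields($line);
        next unless defined $div_id && $wanted{$taxid};
        $division_of{$taxid} = $name_of->{$div_id};
    }
    return \%division_of;
}
```

Reading the names from division.dmp rather than hard-coding them keeps the sketch in step with whatever the downloaded archive defines.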
>>>
>>> It's not a very BioPerly solution but sometimes just looking up  
>>> the answer from a file/table/hash is the simplest way.
>>> Hope this helps,
>>>
>>> Russell Smithies
>>> Bioinformatics Applications Developer T +64 3 489 9085 E  russell.smithies at agresearch.co.nz
>>> Invermay  Research Centre Puddle Alley, Mosgiel, New Zealand T   
>>> +64 3 489 3809   F  +64 3 489 9174  www.agresearch.co.nz
>>>
>>>
>>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
>>>> bounces at lists.open-bio.org] On Behalf Of shalabh sharma
>>>> Sent: Wednesday, 1 April 2009 7:43 a.m.
>>>> To: bioperl-l
>>>> Subject: [Bioperl-l] taxonomy ID
>>>>
>>>> Hi All,
>>>> I am writing a script, and for one part of it I have to parse a blast
>>>> report (RefSeq blast) and check how many organisms are eukaryotes and
>>>> how many are prokaryotes.
>>>> I am using the Bio::DB::Taxonomy module:
>>>> http://www.bioperl.org/wiki/Module:Bio::DB::Taxonomy
>>>>
>>>> But for this I need a taxonomy ID (like '33090', given in the example).
>>>> So is it possible to get a taxonomy ID from a RefSeq blast report?
>>>> If not, how can I deal with this problem?
>>>>
>>>> I would really appreciate it if anyone could help me out.
>>>>
>>>> Thanks
>>>> Shalabh
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>