[Bioperl-l] taxonomy ID
Jason Stajich
jason at bioperl.org
Fri Apr 10 15:51:45 UTC 2009
The only difference from the Bio::DB::Taxonomy module that I can see is
that we don't specifically have the dictionary part (the gi -> taxid
mapping), but I just build a local DBHash index of that when I need it.
-jason
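
A minimal sketch of the kind of local index described above, assuming that
"DBHash" means a DB_File hash tied to a file on disk, and that the input is
NCBI's two-column, tab-separated gi_taxid dump (GI, then taxid). The sub
names build_gi_index and open_gi_index are hypothetical and are not part of
BioPerl.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT O_RDONLY);
use DB_File;

# Build an on-disk hash index (gi -> taxid) from a gi_taxid dump file.
# Assumption: each input line is "<gi>\t<taxid>".
sub build_gi_index {
    my ($dmp_file, $index_file) = @_;

    tie my %gi2taxid, 'DB_File', $index_file, O_RDWR | O_CREAT, 0644, $DB_HASH
        or die "Cannot create index $index_file: $!";

    open my $in, '<', $dmp_file or die "Cannot open $dmp_file: $!";
    while (my $line = <$in>) {
        chomp $line;
        my ($gi, $taxid) = split /\t/, $line;
        next unless defined $taxid;    # skip blank or malformed lines
        $gi2taxid{$gi} = $taxid;
    }
    close $in;

    untie %gi2taxid;                   # flush the index to disk
    return;
}

# Open an existing index read-only and return a reference to the tied hash.
sub open_gi_index {
    my ($index_file) = @_;

    tie my %gi2taxid, 'DB_File', $index_file, O_RDONLY, 0644, $DB_HASH
        or die "Cannot open index $index_file: $!";

    return \%gi2taxid;
}
```

After build_gi_index('gi_taxid_nucl.dmp', 'gi_taxid_nucl.idx') has run once,
my $idx = open_gi_index('gi_taxid_nucl.idx') returns a hash reference, and
$idx->{$gi} is the taxid (or undef if the GI is not in the dump). Because the
hash lives on disk, memory use should stay low, at the cost of slower lookups
than an in-memory hash.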
On Apr 10, 2009, at 6:32 AM, Chris Fields wrote:
> I don't know if this has been pointed out, but Bio::DB::Taxonomy is
> also capable of indexing and using the NCBI tax flat files.
>
> use Bio::DB::Taxonomy;
>
> my $db = Bio::DB::Taxonomy->new(-source    => 'flatfile',
>                                 -nodesfile => $nodesfile,
>                                 -namesfile => $namefile);
>
> # use other Bio::DB::Taxonomy methods
>
> chris
>
> On Apr 1, 2009, at 4:56 PM, Miguel Pignatelli wrote:
>
>> You may find the attached Perl module useful. It handles the
>> difficult parts of getting the taxonomy for a GI identifier or a
>> taxID. It is designed to process a large number of GIs quickly and
>> with low memory usage.
>>
>> An example of usage would be:
>>
>> use taxbuild;
>> #Build the taxonomyDB
>> my $taxDB = taxbuild->new(
>>     nodes    => $nodes_file_from_taxonomyDB,
>>     names    => $names_file_from_taxonomyDB,
>>     dict     => $dictFile,
>>     save_mem => 1
>> );
>>
>> # Get the taxonomy given a GI identifier
>> my @tax = $taxDB->get_taxonomy_from_gi("35961124");
>>
>> # Get the taxonomy term of a GI identifier at a given level
>> my $term_at_level = $taxDB->get_term_at_level_from_gi("35961124", "family");
>>
>> # Get the taxid of a GI identifier
>> my $taxid = $taxDB->get_taxid("35961124");
>>
>> # Get the taxonomy given a taxid
>> @tax = $taxDB->get_taxonomy($taxid);
>>
>> # Get the taxonomy at a given level given a taxid
>> my $taxid_at_level = $taxDB->get_term_at_level($taxid, "genus");
>>
>> # Get the level of a given taxonomical name
>> my $level = $taxDB->get_level_from_name("Proteobacteria");
>>
>> The "dict file" is a processed version of the gi_taxid file from
>> taxonomyDB. You can get this file by running the tax2bin2.pl script
>> also attached:
>>
>> $ perl tax2bin2.pl gi_taxid_prot.dmp > gi_taxid_prot.bin
>> or, if you are working with genes instead of proteins:
>> $ perl tax2bin2.pl gi_taxid_nucl.dmp > gi_taxid_nucl.bin
>>
>> A possible solution to the original post using this module would be
>> something like:
>>
>> # Initialize the taxonomyDB once.
>> my $taxDB = taxbuild->new(
>>     nodes    => $nodes_file_from_taxonomyDB,
>>     names    => $names_file_from_taxonomyDB,
>>     dict     => $dictFile,
>>     save_mem => 1
>> );
>>
>> # For each blast result
>> # Extract the GI
>> my $superkingdom = $taxDB->get_term_at_level_from_gi($gi, "superkingdom");
>> if ($superkingdom eq "Bacteria") {
>>     # Do whatever you want
>> } elsif ($superkingdom eq "Eukaryota") {
>>     # Do whatever you want
>> }
>>
>>
>> The module has been tested mainly on Linux systems, but should run
>> without problems on Windows and Mac too. If you encounter any
>> problem with it, don't hesitate to contact me.
>>
>> Hope this helps,
>>
>> M;
>>
>> <tax2bin2.pl><taxbuild.pm>
>>
>>
>>
>> On 01/04/2009, at 19:03, Florent Angly wrote:
>>
>>> FYI, the gi_taxid_nucl.dmp.gz is very large, thus it's likely that
>>> you won't be able to put its information in a hash (unless you
>>> have a lot of memory).
>>> Florent
>>>
>>> Smithies, Russell wrote:
>>>> The taxonomy information isn't in the blast output unless you
>>>> created custom fasta headers for your blast database.
>>>> The easiest way to get the tax_id for your accessions would be to
>>>> download the gi->tax_id list from
>>>> ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz
>>>> If you load that file into a hash, parse the accessions out of
>>>> the blast hits, then look up the tax_id from that hash, I think it
>>>> should be fairly fast.
>>>> Checking which are prokaryotes and which are eukaryotes based on
>>>> tax_id is a separate problem :-)
>>>> If you grab the taxdump.tar.gz file from the same site, the
>>>> nodes.dmp file contained within lists what division each tax_id
>>>> belongs to (Bacteria, Invertebrates, Mammals, Phages, Plants,
>>>> etc) so you can probably work it out from that.
>>>>
>>>> It's not a very BioPerly solution but sometimes just looking up
>>>> the answer from a file/table/hash is the simplest way.
>>>> Hope this helps,
>>>>
>>>> Russell Smithies
>>>> Bioinformatics Applications Developer T +64 3 489 9085 E russell.smithies at agresearch.co.nz
>>>> Invermay Research Centre Puddle Alley, Mosgiel, New Zealand T
>>>> +64 3 489 3809 F +64 3 489 9174 www.agresearch.co.nz
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
>>>>> bounces at lists.open-bio.org] On Behalf Of shalabh sharma
>>>>> Sent: Wednesday, 1 April 2009 7:43 a.m.
>>>>> To: bioperl-l
>>>>> Subject: [Bioperl-l] taxonomy ID
>>>>>
>>>>> Hi All,
>>>>> I am writing a script, and for one part of it I have to parse a
>>>>> BLAST report (RefSeq BLAST) and check how many organisms are
>>>>> eukaryotes and how many of them are prokaryotes.
>>>>> I am using the Bio::DB::Taxonomy module:
>>>>> http://www.bioperl.org/wiki/Module:Bio::DB::Taxonomy
>>>>>
>>>>> But for this I need a taxonomy ID (like '33090'), as given in the
>>>>> example.
>>>>> So is it possible to get a taxonomy ID from a RefSeq BLAST report?
>>>>> If not, then how can I deal with this problem?
>>>>>
>>>>> I would really appreciate it if anyone can help me out.
>>>>>
>>>>> Thanks
>>>>> Shalabh
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>>
>>>
>>>
>>
>
>
Jason Stajich
jason at bioperl.org