[Bioperl-l] can't retrieve description using Bio::DB::EntrezGene
jmi k
bi_my_heart at hotmail.com
Sun Jul 17 02:35:56 EDT 2011
Carnë,
Thank you for your reply.
Actually, I have pasted my code with proper formatting, but it seems like hotmail didn't do a good job of keeping those newlines :P I was afraid the code to my whole program would be too long to read. Anyway, sorry for any inconvenience caused.
After reading your reply, I tried my code with a different gene id (2), and realized that desc() does retrieve something, but it's not what I expected. Specifically, I want to retrieve the gene description from this entrezgene record, along with other genes from M. smegmatis: (hope it shows properly...or please see http://www.ncbi.nlm.nih.gov/gene?term=4535615)
dnaN DNA polymerase III subunit beta[Mycobacterium smegmatis str. MC2 155]
Gene ID: 4535615, created on 30-Nov-2006
SUMMARY
-------------------------------------------------------------------------------------------------
Gene description: DNA polymerase III subunit beta
Locus tag: MSMEG_0001
Gene type: protein coding
RefSeq status: VALIDATED
Organism: Mycobacterium smegmatis str. MC2 155
Lineage: Bacteria; Actinobacteria; Actinobacteridae; Actinomycetales; Corynebacterineae;
Mycobacteriaceae; Mycobacterium
(In my previous email, I used gene id 1234567 because I substituted a variable to avoid confusion, as I didn't paste the entire code.)I discovered by using gene id 2 that desc() retrieves the Summary (under the bigger SUMMARY heading) entry of entrezgene records:
SUMMARY
-------------------------------------------------------------------------------------------------
Official Symbol: A2M (provided by HGNC)
Official full name: alpha-2-macroglobulin (provided by HGNC)
Primary source: HGNC:7
See related: Ensembl:ENSG00000175899; HPRD:00072; MIM:103950
Gene type: protein coding
RefSeq status: REVIEWED
Organism: Homo sapiens
Lineage: Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Homo
Also known as: CPAMD5; FWP007; S863-7; DKFZp779B086
Summary: Alpha-2-macroglobulin is a protease inhibitor and cytokine transporter. It
inhibits many proteases, including trypsin, thrombin and collagenase. A2M is
implicated in Alzheimer disease (AD) due to its ability to mediate the clearance
and degradation of A-beta, the major component of beta-amyloid deposits.
[provided by RefSeq]
but what I wanted is the Gene description (DNA polymerase III subunit beta).So I modified my code to look like this: (Now this is the code for my entire program...please ignore the commented-out code near the end...I am painstakingly working on it. The code most relevant to my question is in green.)
#!/usr/bin/perl#mutation-finder#by: jamieuse warnings;use strict;use lib "/usr/share/perl5/";use Bio::Seq;use Bio::SeqIO;use Getopt::Long;use Bio::DB::EntrezGene;use Bio::ASN1::EntrezGene; #not sure if I have to include thismy %options;GetOptions(\%options, 'i=s', 'o=s', 'headline', 'cutoff=i');# declare the perl command line flagsprint "What is the threshold for detecting mutations? Enter a fraction (0-1): ";my $threshold = <STDIN>;my ($Filein, $Fileout, $Flag, $cutoff);my $headline = '';if ($options{i}) { if ($options{i} =~ /(.*)/) {# input file: /home/jmi/Documents/E5_NC_008596.complete.pileup open IN, $+ or die "Can't open the file: $!"; }}my $seqin = Bio::SeqIO->new(-format => 'GenBank', -file=>'/home/jmi/Documents/NC_008596.gb');my $seqobj = $seqin->next_seq();my $dbentrezgene = Bio::DB::EntrezGene->new();my @feat = ($seqobj->get_SeqFeatures()); #get feature objects NOT subfeatures...tag?if ($options{o}) { if ($options{o} =~ /(.*)/) {# output file: /home/jmi/Documents/E5_NC_008596.complete.pileup.output.?percent die "Can't open the file: $!" unless open OUT, ">$+"; }}if ($options{cutoff}) { if ($options{cutoff} =~ /(.*)/) { $cutoff = $+; print OUT "Cutoff: ",$cutoff,"\t"; }}if ($options{headline}) { print OUT "Threshold (0-1): ",$threshold,"\nCoordinate\tDepth\t\tReference Base\t# of ref. base\t% ref. base\tTop 1 base\t# of top 1 base\t% top 1 base\tTop 2 base\t# of top 2 base\t% top 2 base\tGene name\tGene ID\t\tGene full name\tStrand\t\tGene begin\tGene end\n";}my ($refbase, $refbasecount, $mutbase, $mut2base, $coordinate, $vMax,$v2Max, $k, $v, $length, $tbd, $top1percentage, $top2percentage, $refpercentage, $start, $end, @name, @db_xref, $strand, $strandsign, @genefullname, $isgene);my $readbases = "";my $geneid = 0; #preset for programming reasonswhile (my $in = <IN>) {#store coordinate if ($in =~ /\|\t([0-9]*)\t/) { $coordinate = $+; if (defined($cutoff)) { last if $coordinate > $cutoff; } }#store reference base in $refbase if ($in =~ /\t([A-Z])\t/) { $refbase = $+; }# store and process read bases in $readbases if ($in =~ /\t[ATCG]\t[0-9]*\t([[:punct:][:alnum:]]*)\t/) { $readbases = $+; $readbases =~ s/\^.|\*|\$//g; while ($readbases =~ m/(?<=[\+\-])\d+/g) { my $pos = pos($readbases); $tbd = substr $readbases, $pos, $&; $readbases =~ s/[\+\-]\d+$tbd//; } } my $count1 = $readbases =~ tr/,.//; $length = length($readbases); my $count3 = $length - $count1; $readbases = uc($readbases); unless (($count3 <= $threshold*$length) || ($length <= 1)) { # if reach threshold, identify mutation $readbases =~ s/[,\.]/$refbase/g; my @readbasesarray = split //,$readbases; my %count2 = (); map{ $count2{$_}++ } @readbasesarray; $vMax = 0; $refbasecount = 0; while(($k,$v) = each %count2) { #store base in key and count in value if ($k eq $refbase) { $refbasecount = $v; $refpercentage = $refbasecount/$length*100; $refpercentage = sprintf '%.2f', $refpercentage; } if ($v > $vMax) { $mutbase = $k; $vMax = $v; #number of occurences of top 1 base } $top1percentage = $vMax/$length*100; $top1percentage = sprintf '%.2f', $top1percentage; } $v2Max = 0; while(($k,$v) = each %count2) { if ($v > $v2Max && $k ne $mutbase) { $mut2base = $k; $v2Max = $v; #number of occurences of top 2 base } $top2percentage = $v2Max/$length*100; $top2percentage = sprintf '%.2f', $top2percentage; } for my $f (@feat) { if ($f->primary_tag eq 'gene') { if ($coordinate>=$f->start && $coordinate<=$f->end) { $isgene = 1; @db_xref = $f->get_tag_values("db_xref"); if (($db_xref[-1] =~ /([0-9]+)/) && ($+ != $geneid)) { $geneid = $+; print $geneid, " yay got geneid \n";# my $geneobj = $dbentrezgene->get_Seq_by_id($geneid); ## This is the code I showed you in my last email. I have replaced this with the following code in green.# $genefullname = $geneobj->desc(); my $genein = $dbentrezgene->get_Stream_by_id([$geneid]) or die; #I used this function because I need a Bio::SeqIO object in between, in order... my ($gene,$genestructure,$uncaptured) = $genein->next_seq; #to get all the data from the record, and I hope Gene description falls into $uncaptured. reference: http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/SeqIO/entrezgene.pm#DESCRIPTION @genefullname = @$uncaptured; #Please scroll to end of code.# my @firstarray = $genefullname[1];# print $firstarray[$_],"first!\n" for (@firstarray);# my @secondarray = $genefullname[3];# print $secondarray[$_],"second!\n" for (@secondarray);# my @thirdarray = $genefullname[4];# print $thirdarray[$_],"third!\n" for (@thirdarray);# for my $i (@genefullname) {# if (ref($i)) {# if (ref($genefullname[$i]) eq 'SCALAR') {# print ${$i}," SCALAR ref\n";# } elsif (ref($genefullname[$i]) eq 'ARRAY') {# for my $j(@{$genefullname[$i]}) {# print $j,"yayayay\n";# }# my @array = @{$_};# foreach my $a (@array) {# print length(@array)," LENGTH\n";# print $array[$a]," from ARRAY ref\n";# }# } elsif (ref($_) eq 'HASH') {# my %hash = %{$genefullname[$_]};# foreach my $key(keys %hash) {# print $key," => ",$hash{$key}," from HASH ref\n";# }# }# } else {# print $i," anything else\n";# }# print "going to top of loop again\n"; if ($f->has_tag("gene")) { @name = $f->get_tag_values("gene"); } $strand = $f->strand(); if ($strand == 1) {$strandsign = '+';} elsif ($strand == -1) {$strandsign = '-';} $start = $f->start; $end = $f->end; } } } } unless ($isgene) {undef @name; undef @db_xref; undef $strand; undef $strandsign; undef $start; undef $end; $geneid = 1; undef @genefullname;} #don't undef $geneid!# store results in output file print OUT "$coordinate\t\t$length\t\t$refbase\t\t$refbasecount\t\t$refpercentage\t\t$mutbase\t\t$vMax\t\t$top1percentage\t\t$mut2base\t\t$v2Max\t\t$top2percentage\t\t at name\t\t$geneid\t\t at genefullname\t$strandsign\t\t$start\t\t$end\n"; }}
# print "Common name: ",$seqobj->display_id,"\nUnique implementation key: ",$seqobj->primary_id,"\ndescription: ",$seqobj->desc,"\naccession #: ",$seqobj->accession_number,"\nalphabet: ",$seqobj->alphabet(),"\n";
The problem now is that I can't print the elements of @genefullname directly, because it is a mixed array:
>From genefullname: From genefullname: ARRAY(0x3ddd8a0)From genefullname: 1From genefullname: ARRAY(0x3ddbfd8)From genefullname: ARRAY(0x3ddbcc0)From genefullname: CP000480From genefullname: ARRAY(0x3ddd8a0)From genefullname: 1From genefullname: ARRAY(0x3ddbcf0)From genefullname: HASH(0x3de0808)From genefullname: ARRAY(0x3de4788)From genefullname: Bio::SeqIO::entrezgene=HASH(0x3db8b00)
Please, if you would let me know how to display the Gene description properly, I would appreciate it very much. And thanks for the help so far.
Regards,Jamie
> From: carandraug+dev at gmail.com
> Date: Fri, 15 Jul 2011 03:48:03 +0100
> Subject: Re: [Bioperl-l] can't retrieve description using Bio::DB::EntrezGene
> To: bi_my_heart at hotmail.com
> CC: bioperl-l at bioperl.org
>
> On 14 July 2011 08:54, jmi k <bi_my_heart at hotmail.com> wrote:
> >
> > Hi,
> > I'm trying to retrieve the description of a gene's file using Bio::DB::EntrezGene. Here is the relevant code from my program:
> > use Bio::DB::EntrezGene;use Bio::ASN1::EntrezGene; # I think I have to include this
> > I've found a similar discussion at http://old.nabble.com/Bio::DB::EntrezGene-or-Bio::DB::Query::GenBank-to-obtain-sequence-metadata-without-sequence-td25816381.html but I don't understand why they can't use Bio::DB::EntrezGene directly.
> > Thanks in advance!
> > Regards,Jamie
>
> Hi Jamie,
>
> Your code works fine for me. Please always paste ALL of your code, not
> only the part where you think the error is. It's doesn't take long to
> do so, you get the answer to your problem faster and everyone wastes
> less time in the end. If it's a long piece of code use pastebin
> http://pastebin.com/ Also, please paste it properly formatted, not all
> in one line.
>
> Despite the fact that works, you're doing some unnecessary things such
> as creating a Bio::Seq object that you then write over it with the
> get_seq_by_id method. This should already return the sequence object,
> no need to create it first.
>
> Anyway, here's how to do it with Bio::DB::EUtilities (it's not the
> answer to your question but if you're having trouble with EntrezGene
> and don't mind which module to use to get the job done...)
>
> use Bio::DB::EUtilities;
> my $eutil = Bio::DB::EUtilities->new(
> -eutil => 'esummary',
> -db => 'gene',
> -id => $ref_to_array_with_uids
> );
>
> while (my $docsum = $eutil->next_DocSum) {
> my ($description) = $docsum->get_contents_by_name('Description');
> my ($summary) = $docsum->get_contents_by_name('Summary');
> say $description;
> say $summary;
> }
>
> You can also use the methods to_string on $docsum to get a nice view
> of it and what contents you have retrieved. The get_Item_by_name
> method is also handy.
>
> Also, replace say with print and a newline if you're not using an up
> to date version of perl or upgrade your versoin of perl ;)
>
> Carnë
More information about the Bioperl-l
mailing list