[Bioperl-l] fetch gene sequence with EUtilities.pm

Chris Fields cjfields at illinois.edu
Wed Jun 10 13:56:46 EDT 2009


Adam,

That's really odd that they do that (both the duplication of ChrStart  
and the coordinates being off-by-one, which means they appear to be 0- 
based).  It's possible that the second ChrStart is meant to represent  
the actual first base for the gene irrespective of start/end.  My  
example is on the opposite strand, so the second ChrStart == end.

The fact that they use the same element name is slightly annoying (and  
seemingly redundant), but there is a workaround.  We grab only the  
layered information specifically; in this case we want everything  
below 'GenomicInfoType':

GenomicInfo
     GenomicInfoType
         ChrLoc      :17
         ChrAccVer   :NC_000083.5
         ChrStart    :32303796
         ChrStop     :32257837

So, we can do this in the DocSum loop (that appears to work for your  
example):

############################

for my $docsum ($eutil->next_DocSum) {
     # to ensure we grab the right ChrStart information, we grab the  
Item above
     # it in the Item hierarchy (visible via print_all from the eutil  
instance)
     my ($item) = $docsum->get_Items_by_name('GenomicInfoType');

     my %item_data = map {$_ => 0} qw(ChrAccVer ChrStart ChrStop);

     while (my $sub_item = $item->next_subItem) {
         if (exists $item_data{$sub_item->get_name}) {
             $item_data{$sub_item->get_name} = $sub_item->get_content;
         }
     }
     # check to make sure everything is set
     for my $check (qw(ChrAccVer ChrStart ChrStop)) {
         die "$check not set" unless $item_data{$check};
     }

     my $strand = $item_data{ChrStart} > $item_data{ChrStop} ? 2 : 1;
     $fetcher->set_parameters(-id => $item_data{ChrAccVer},
                              -seq_start => $item_data{ChrStart} + 1,
                              -seq_stop  => $item_data{ChrStop} + 1,
                              -strand    => $strand);
     print $fetcher->get_Response->content;
}

############################

That's to retain compatibility with 1.6; I'll update the wiki.  I can  
add some common Item container methods to grab information for any  
Items contained in the current instance (be it a DocSum or another  
Item).  I'll add that in bioperl-live.

chris

On Jun 10, 2009, at 9:10 AM, Adam Witney wrote:

> Thanks for the pointers Chris.
>
> The new example on the Cookbook doesn't quite work for me as  
> ChrStart seems to appear in the DocSum twice, thus  
> get_contents_by_name('ChrStart') returns a list of two values (which  
> writes the second ChrStart into $end). Also the $start and $end seem  
> to be out by 1, so I needed to change it to this:
>
> my ($acc) = ($docsum->get_contents_by_name('ChrAccVer'));
> my ($start) = ($docsum->get_contents_by_name('ChrStart'));
> my ($end) = ($docsum->get_contents_by_name('ChrStop'));
>
> $start += 1;
> $end += 1;
>
> Ah, looking at this further there appears to be something going on  
> in the response from Entrez. Compare these two gene records:
>
> http://www.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&id=18131 
> 		(your example below)
> http://www.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&id=2861733 
> 		(my gene)
>
> In both cases you can see that ChrStart appears twice, once as part  
> of the GenomicInfo list and once on its own at the bottom. In my  
> example above the two ChrStart values match, but in the Notch3  
> example you posted the 2nd ChrStart seems to be the same as the  
> ChrStop in the GenomicInfo list. Do you know if the second ChrStart  
> has a separate meaning?
>
> I guess in the Cookbook example we would need to make sure that the  
> get_contents_by_name('ChrStart') picks up the value from the  
> GenomicInfo list, is this possible?
>
> thanks again
>
> adam
>
>
> On 10 Jun 2009, at 14:20, Chris Fields wrote:
>
>> EntrezGene doesn't contain the sequence information; I believe it  
>> just links to the sequence in a specified nuc record with given  
>> coordinates.  You can get to it, but it takes a little trickery; in  
>> essence you need to use the UID to get the gene summary  
>> information, extract that, then grab the sequence record using  
>> seqstart, seqend, and seqstrand.
>>
>> A dump of esummary info for UID 18131, for instance, (using $eutil- 
>> >print_all) gives this info (abbreviated somewhat):
>>
>> UID                 :18131
>> Name                :Notch3
>> Description         :Notch gene homolog 3 (Drosophila)
>> Orgname             :Mus musculus
>> ...
>> GenomicInfo
>>   GenomicInfoType
>>       ChrLoc      :17
>>       ChrAccVer   :NC_000083.5
>>       ChrStart    :32303796
>>       ChrStop     :32257837
>> GeneWeight          :23049
>>
>> The genomic info section gives the accession.version, start, end,  
>> and (implicitly) the strand (ChrStop is less that ChrStart). I have  
>> added an example to the cookbook:
>>
>> http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook#How_do_I_retrieve_the_DNA_sequence_using_EntrezGene_IDs.3F
>>
>> chris
>>
>> On Jun 9, 2009, at 6:20 AM, Adam Witney wrote:
>>
>>> Hi,
>>>
>>> I have been experimenting with the Bio::DB::EUtilities module,  
>>> with help from the Cookbook. But I can't seem to figure out how to  
>>> get the DNA sequence of a gene; all the examples seem to be  
>>> fetching protein sequence.
>>>
>>> How would i go about fetching a sequence using an Entrez GeneID?
>>>
>>> thanks for any help
>>>
>>> adam
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l




More information about the Bioperl-l mailing list