[Bioperl-l] fetch gene sequence with EUtilities.pm

Adam Witney awitney at sgul.ac.uk
Wed Jun 10 14:10:21 UTC 2009


Thanks for the pointers Chris.

The new example on the Cookbook doesn't quite work for me as ChrStart  
seems to appear in the DocSum twice, thus  
get_contents_by_name('ChrStart') returns a list of two values (which  
writes the second ChrStart into $end). Also the $start and $end seem  
to be out by 1, so I needed to change it to this:

my ($acc)   = $docsum->get_contents_by_name('ChrAccVer');
my ($start) = $docsum->get_contents_by_name('ChrStart');
my ($end)   = $docsum->get_contents_by_name('ChrStop');

$start += 1;
$end   += 1;
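
A minimal sketch of that adjustment, for reference. It assumes the DocSum coordinates are 0-based (which is what the off-by-one suggests) and that the values end up as efetch's seq_start, seq_stop and strand parameters (strand 1 = plus, 2 = minus). The helper name docsum_to_efetch_params is made up for illustration and is not part of BioPerl:

```perl
use strict;
use warnings;

# Hypothetical helper, not a BioPerl function.
# Assumption: ChrStart/ChrStop from the DocSum are 0-based, so both are
# shifted by one; ChrStart > ChrStop is read as "minus strand".
sub docsum_to_efetch_params {
    my ($acc, $chr_start, $chr_stop) = @_;

    # Strand is implied by the order of the two coordinates.
    my $strand = $chr_start > $chr_stop ? 2 : 1;

    # efetch wants seq_start <= seq_stop, so order the pair numerically.
    my ($low, $high) = sort { $a <=> $b } ($chr_start, $chr_stop);

    return (
        id        => $acc,
        seq_start => $low + 1,
        seq_stop  => $high + 1,
        strand    => $strand,
    );
}

# Coordinates taken from the Notch3 dump quoted further down.
my %params = docsum_to_efetch_params('NC_000083.5', 32303796, 32257837);
print join(', ', map { "$_=$params{$_}" } sort keys %params), "\n";
```

Taking only the first element of each list, as in the three assignments above, is what keeps the second ChrStart from leaking into $end.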

Ah, looking at this further there appears to be something going on in  
the response from Entrez. Compare these two gene records:

http://www.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&id=18131
		(your example below)
http://www.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&id=2861733
		(my gene)

In both cases you can see that ChrStart appears twice, once as part of  
the GenomicInfo list and once on its own at the bottom. In my example  
above the two ChrStart values match, but in the Notch3 example you  
posted the 2nd ChrStart seems to be the same as the ChrStop in the  
GenomicInfo list. Do you know if the second ChrStart has a separate  
meaning?

I guess in the Cookbook example we would need to make sure that the  
get_contents_by_name('ChrStart') picks up the value from the  
GenomicInfo list, is this possible?
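
To make the question concrete, here is a rough sketch of the lookup I have in mind. It does not use the BioPerl DocSum objects at all: the nested hashes are only a stand-in for the parsed DocSum, laid out like the print_all dump quoted below, and the function names are invented for the example. The trailing top-level ChrStart is set to the ChrStop value, as observed for Notch3.

```perl
use strict;
use warnings;

# Stand-in for a parsed DocSum (an assumption, not the BioPerl structure):
# each item is a hash with a name and either a content or nested items.
my $docsum = [
    { name => 'Name', content => 'Notch3' },
    { name => 'GenomicInfo', items => [
        { name => 'GenomicInfoType', items => [
            { name => 'ChrLoc',    content => '17' },
            { name => 'ChrAccVer', content => 'NC_000083.5' },
            { name => 'ChrStart',  content => 32303796 },
            { name => 'ChrStop',   content => 32257837 },
        ] },
    ] },
    # The second, stand-alone ChrStart at the bottom of the DocSum.
    { name => 'ChrStart', content => 32257837 },
];

# Collect the content of every item called $name at or below @$items.
sub contents_by_name {
    my ($items, $name) = @_;
    my @found;
    for my $item (@$items) {
        push @found, $item->{content}
            if $item->{name} eq $name && defined $item->{content};
        push @found, contents_by_name($item->{items}, $name)
            if $item->{items};
    }
    return @found;
}

# Same lookup, but restricted to the GenomicInfo subtree.
sub genomic_info_value {
    my ($items, $name) = @_;
    my ($ginfo) = grep { $_->{name} eq 'GenomicInfo' } @$items;
    return unless $ginfo;
    my ($value) = contents_by_name($ginfo->{items} || [], $name);
    return $value;
}

my @all   = contents_by_name($docsum, 'ChrStart');
my $start = genomic_info_value($docsum, 'ChrStart');
print "all ChrStart values: @all\n";
print "GenomicInfo ChrStart: $start\n";
```

The unrestricted search returns both values, which is the behaviour I am seeing; scoping the search to the GenomicInfo item first gives a single, unambiguous start.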

thanks again

adam


On 10 Jun 2009, at 14:20, Chris Fields wrote:

> EntrezGene doesn't contain the sequence information; I believe it  
> just links to the sequence in a specified nuc record with given  
> coordinates.  You can get to it, but it takes a little trickery; in  
> essence you need to use the UID to get the gene summary information,  
> extract that, then grab the sequence record using seqstart, seqend,  
> and seqstrand.
>
> A dump of esummary info for UID 18131, for instance (using
> $eutil->print_all), gives this info (abbreviated somewhat):
>
> UID                 :18131
> Name                :Notch3
> Description         :Notch gene homolog 3 (Drosophila)
> Orgname             :Mus musculus
> ...
> GenomicInfo
>    GenomicInfoType
>        ChrLoc      :17
>        ChrAccVer   :NC_000083.5
>        ChrStart    :32303796
>        ChrStop     :32257837
> GeneWeight          :23049
>
> The genomic info section gives the accession.version, start, end,  
> and (implicitly) the strand (ChrStop is less than ChrStart). I have  
> added an example to the cookbook:
>
> http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook#How_do_I_retrieve_the_DNA_sequence_using_EntrezGene_IDs.3F
>
> chris
>
> On Jun 9, 2009, at 6:20 AM, Adam Witney wrote:
>
>> Hi,
>>
>> I have been experimenting with the Bio::DB::EUtilities module, with  
>> help from the Cookbook. But I can't seem to figure out how to get  
>> the DNA sequence of a gene; all the examples seem to be fetching  
>> protein sequence.
>>
>> How would I go about fetching a sequence using an Entrez GeneID?
>>
>> thanks for any help
>>
>> adam
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l