[Bioperl-l] fetch gene sequence with EUtilities.pm
Adam Witney
awitney at sgul.ac.uk
Wed Jun 10 10:10:21 EDT 2009
Thanks for the pointers, Chris.
The new example in the Cookbook doesn't quite work for me because
ChrStart appears in the DocSum twice, so
get_contents_by_name('ChrStart') returns a list of two values (and the
second ChrStart ends up in $end). Also, $start and $end seem to be out
by 1, so I needed to change it to this:
    my ($acc) = ($docsum->get_contents_by_name('ChrAccVer'));
    my ($start) = ($docsum->get_contents_by_name('ChrStart'));
    my ($end) = ($docsum->get_contents_by_name('ChrStop'));
    $start += 1;
    $end += 1;
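The change above relies on two things: a Perl list assignment keeps only as many values as there are variables on the left, and the coordinates are shifted by one. A minimal, self-contained sketch of both follows. fake_contents_by_name is a hypothetical stand-in for $docsum->get_contents_by_name so the snippet runs without BioPerl or network access, the accession and coordinates are invented placeholders, the flattened lookup is only one guess at how a second ChrStart could land in $end, and the 0-based reading of the summary coordinates is an assumption drawn from the off-by-one seen here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a DocSum in which ChrStart occurs twice.
# All values are invented placeholders, not real Entrez data.
my %fake_docsum = (
    ChrAccVer => ['ACC_000001.1'],
    ChrStart  => [100, 999],    # nested value first, stray second value after it
    ChrStop   => [200],
);

# Hypothetical replacement for $docsum->get_contents_by_name($name):
# returns every value stored under that name, as a list.
sub fake_contents_by_name {
    my ($name) = @_;
    return @{ $fake_docsum{$name} || [] };
}

# Assumed failure mode: flattening all lookups into one list shifts the
# values, so the second ChrStart lands in $bad_end.
my ($bad_acc, $bad_start, $bad_end) =
    map { fake_contents_by_name($_) } qw(ChrAccVer ChrStart ChrStop);

# One list assignment per name keeps only the first value returned.
my ($acc)   = fake_contents_by_name('ChrAccVer');
my ($start) = fake_contents_by_name('ChrStart');
my ($end)   = fake_contents_by_name('ChrStop');

# Assumed 0-based summary coordinates, shifted to 1-based.
$start += 1;
$end   += 1;

print "flattened lookup: start=$bad_start end=$bad_end\n";
print "per-name lookup:  $acc $start $end\n";
```

Taking the first value only works if the nested GenomicInfo entry is always returned before the stray one, which is what the behaviour described here suggests but is not guaranteed by this sketch.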
Ah, looking at this further, there appears to be something odd in the
response from Entrez. Compare these two gene records:
http://www.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&id=18131
(your example below)
http://www.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&id=2861733
(my gene)
In both cases you can see that ChrStart appears twice, once as part of
the GenomicInfo list and once on its own at the bottom. In my example
above the two ChrStart values match, but in the Notch3 example you
posted the 2nd ChrStart seems to be the same as the ChrStop in the
GenomicInfo list. Do you know if the second ChrStart has a separate
meaning?
I guess in the Cookbook example we would need to make sure that
get_contents_by_name('ChrStart') picks up the value from the
GenomicInfo list. Is this possible?
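One way to picture the selection asked about here is to treat the DocSum as a nested structure and read the coordinates only from the GenomicInfo branch, ignoring the top-level ChrStart. The sketch below uses a plain Perl hash as a hypothetical stand-in for the parsed DocSum (it is not the Bio::DB::EUtilities API), filled with the Notch3 values from the dump quoted below; the top-level ChrStart is set to the value described above as matching ChrStop. The strand test and the +1 shift follow the observations in this thread and are assumptions, not confirmed Entrez behaviour.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a parsed esummary DocSum (not the BioPerl API).
my %docsum = (
    UID         => 18131,
    Name        => 'Notch3',
    GenomicInfo => {
        GenomicInfoType => {
            ChrLoc    => 17,
            ChrAccVer => 'NC_000083.5',
            ChrStart  => 32303796,
            ChrStop   => 32257837,
        },
    },
    ChrStart => 32257837,    # second, top-level ChrStart; deliberately ignored
);

# Read the coordinates only from the nested GenomicInfo branch.
my $info = $docsum{GenomicInfo}{GenomicInfoType};
my ($acc, $start, $stop) = @{$info}{qw(ChrAccVer ChrStart ChrStop)};

# ChrStop smaller than ChrStart is taken to mean the minus strand.
my $strand = $start > $stop ? 'minus' : 'plus';

# Shift to 1-based coordinates (assumed from the off-by-one seen above).
$_ += 1 for $start, $stop;

printf "%s %d-%d (%s strand)\n", $acc, $start, $stop, $strand;
```

Addressing the nested entry by its parent, rather than by the bare name ChrStart, avoids depending on the order in which duplicate names are returned.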
thanks again
adam
On 10 Jun 2009, at 14:20, Chris Fields wrote:
> EntrezGene doesn't contain the sequence information; I believe it
> just links to the sequence in a specified nuc record with given
> coordinates. You can get to it, but it takes a little trickery; in
> essence you need to use the UID to get the gene summary information,
> extract that, then grab the sequence record using seqstart, seqend,
> and seqstrand.
>
> A dump of esummary info for UID 18131, for instance (using
> $eutil->print_all), gives this info (abbreviated somewhat):
>
> UID :18131
> Name :Notch3
> Description :Notch gene homolog 3 (Drosophila)
> Orgname :Mus musculus
> ...
> GenomicInfo
>     GenomicInfoType
>         ChrLoc :17
>         ChrAccVer :NC_000083.5
>         ChrStart :32303796
>         ChrStop :32257837
> GeneWeight :23049
>
> The genomic info section gives the accession.version, start, end,
> and (implicitly) the strand (ChrStop is less than ChrStart). I have
> added an example to the cookbook:
>
> http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook#How_do_I_retrieve_the_DNA_sequence_using_EntrezGene_IDs.3F
>
> chris
>
> On Jun 9, 2009, at 6:20 AM, Adam Witney wrote:
>
>> Hi,
>>
>> I have been experimenting with the Bio::DB::EUtilities module, with
>> help from the Cookbook. But I can't seem to figure out how to get
>> the DNA sequence of a gene; all the examples seem to be fetching
>> protein sequence.
>>
>> How would I go about fetching a sequence using an Entrez GeneID?
>>
>> thanks for any help
>>
>> adam
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l