[Bioperl-l] fetch gene sequence with EUtilities.pm
Chris Fields
cjfields at illinois.edu
Wed Jun 10 17:56:46 UTC 2009
Adam,
That's really odd that they do that (both the duplication of ChrStart
and the coordinates being off-by-one, which means they appear to be 0-
based). It's possible that the second ChrStart is meant to represent
the actual first base for the gene irrespective of start/end. My
example is on the opposite strand, so the second ChrStart == end.
The fact that they use the same element name is slightly annoying (and
seemingly redundant), but there is a workaround. We grab only the
layered information specifically; in this case we want everything
below 'GenomicInfoType':
GenomicInfo
GenomicInfoType
ChrLoc :17
ChrAccVer :NC_000083.5
ChrStart :32303796
ChrStop :32257837
So, we can do this in the DocSum loop (that appears to work for your
example):
############################
for my $docsum ($eutil->next_DocSum) {
# to ensure we grab the right ChrStart information, we grab the
Item above
# it in the Item hierarchy (visible via print_all from the eutil
instance)
my ($item) = $docsum->get_Items_by_name('GenomicInfoType');
my %item_data = map {$_ => 0} qw(ChrAccVer ChrStart ChrStop);
while (my $sub_item = $item->next_subItem) {
if (exists $item_data{$sub_item->get_name}) {
$item_data{$sub_item->get_name} = $sub_item->get_content;
}
}
# check to make sure everything is set
for my $check (qw(ChrAccVer ChrStart ChrStop)) {
die "$check not set" unless $item_data{$check};
}
my $strand = $item_data{ChrStart} > $item_data{ChrStop} ? 2 : 1;
$fetcher->set_parameters(-id => $item_data{ChrAccVer},
-seq_start => $item_data{ChrStart} + 1,
-seq_stop => $item_data{ChrStop} + 1,
-strand => $strand);
print $fetcher->get_Response->content;
}
############################
That's to retain compatibility with 1.6; I'll update the wiki. I can
add some common Item container methods to grab information for any
Items contained in the current instance (be it a DocSum or another
Item). I'll add that in bioperl-live.
chris
On Jun 10, 2009, at 9:10 AM, Adam Witney wrote:
> Thanks for the pointers Chris.
>
> The new example on the Cookbook doesn't quite work for me as
> ChrStart seems to appear in the DocSum twice, thus
> get_contents_by_name('ChrStart') returns a list of two values (which
> writes the second ChrStart into $end). Also the $start and $end seem
> to be out by 1, so I needed to change it to this:
>
> my ($acc) = ($docsum->get_contents_by_name('ChrAccVer'));
> my ($start) = ($docsum->get_contents_by_name('ChrStart'));
> my ($end) = ($docsum->get_contents_by_name('ChrStop'));
>
> $start += 1;
> $end += 1;
>
> Ah, looking at this further there appears to be something going on
> in the response from Entrez. Compare these two gene records:
>
> http://www.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&id=18131
> (your example below)
> http://www.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&id=2861733
> (my gene)
>
> In both cases you can see that ChrStart appears twice, once as part
> of the GenomicInfo list and once on its own at the bottom. In my
> example above the two ChrStart values match, but in the Notch3
> example you posted the 2nd ChrStart seems to be the same as the
> ChrStop in the GenomicInfo list. Do you know if the second ChrStart
> has a separate meaning?
>
> I guess in the Cookbook example we would need to make sure that the
> get_contents_by_name('ChrStart') picks up the value from the
> GenomicInfo list, is this possible?
>
> thanks again
>
> adam
>
>
> On 10 Jun 2009, at 14:20, Chris Fields wrote:
>
>> EntrezGene doesn't contain the sequence information; I believe it
>> just links to the sequence in a specified nuc record with given
>> coordinates. You can get to it, but it takes a little trickery; in
>> essence you need to use the UID to get the gene summary
>> information, extract that, then grab the sequence record using
>> seqstart, seqend, and seqstrand.
>>
>> A dump of esummary info for UID 18131, for instance, (using $eutil-
>> >print_all) gives this info (abbreviated somewhat):
>>
>> UID :18131
>> Name :Notch3
>> Description :Notch gene homolog 3 (Drosophila)
>> Orgname :Mus musculus
>> ...
>> GenomicInfo
>> GenomicInfoType
>> ChrLoc :17
>> ChrAccVer :NC_000083.5
>> ChrStart :32303796
>> ChrStop :32257837
>> GeneWeight :23049
>>
>> The genomic info section gives the accession.version, start, end,
>> and (implicitly) the strand (ChrStop is less that ChrStart). I have
>> added an example to the cookbook:
>>
>> http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook#How_do_I_retrieve_the_DNA_sequence_using_EntrezGene_IDs.3F
>>
>> chris
>>
>> On Jun 9, 2009, at 6:20 AM, Adam Witney wrote:
>>
>>> Hi,
>>>
>>> I have been experimenting with the Bio::DB::EUtilities module,
>>> with help from the Cookbook. But I can't seem to figure out how to
>>> get the DNA sequence of a gene; all the examples seem to be
>>> fetching protein sequence.
>>>
>>> How would i go about fetching a sequence using an Entrez GeneID?
>>>
>>> thanks for any help
>>>
>>> adam
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
More information about the Bioperl-l
mailing list