[Bioperl-l] Getting genomic coordinates for a list of genes
Mark A. Jensen
maj at fortinbras.us
Wed Jul 22 00:29:18 UTC 2009
Hi Emanuele--
Well, that script looks a bit hackish to me. Here's another hack, maybe less
hackish, that seems to work for me. Gory details are at the end of the post.
Here's the output:
Possibilities:
chr start end src info
1 15691382 15723971 UCSC CASP9 (uc001awq.1) at
chr1:15691382-15723971
1 15691382 15723377 UCSC CASP9 (uc001awn.1) at
chr1:15691382-15723377
1 15691382 15723377 Ref NM_001229 at
chr1:15691382-15723377
Here's the hack:
my @info = genome_coords(1, $db, $ua);
print "Possibilities:\n";
print join("\t", qw( chr start end src info )), "\n";
foreach (@info) {
print join("\t", @{$_}{qw( chr start end src info )}), "\n";
}
# work done here...
sub genome_coords {
my ($id, $db, $ua) = @_;
my $seq = $db->get_Seq_by_id($id);
my $ac = $seq->annotation;
for my $ann ($ac->get_Annotations('dblink')) {
if ($ann->database eq "UCSC") {
my $resp = $ua->get($ann->url);
my @a = $resp->content =~
m{position=chr([0-9]+):([0-9]+)-([0-9]+)\&.*\&(known|ref)Gene.*?\">(.*?)</A>}g;
my @ret;
while (@a) {
push @ret, {
'chr' => shift @a,
'start' => shift @a,
'end' => shift @a,
'src' => shift(@a) eq 'known' ? 'UCSC' : 'Ref',
'info' => shift @a
};
}
return unless @ret;
return @ret;
}
}
return; # parse error, no UCSC link on page
}
Here are the details:
The script on the wiki page gets coordinates by looking at a url
under a link on the page: the database "ModelMaker" link, whose
url is (after the 842 query):
'http://www.ncbi.nlm.nih.gov/mapview/modelmaker.cgi?taxid=9606&contig=NT_004610.19&from=2498878&to=2530877&gene=CASP9&lid=842'
The script reads the 'from' and 'to' values directly from the text
of this url to deliver the coordinates. This is somewhat hacky,
since the assumption is the coordinates that ModelMaker wants
(were you to actually visit the link, which the script doesn't do)
are the ones you want. The hack above is slightly better, in that
it finds a database url link and visits it, then parses the page that the
link returns -- the UCSC page for geneid 842, more likely to
have what you want. It's still hacky, in that the format of that page
may change, and that may break the regexp. But by then, you'll
be able to hack yourself out of that situation!
cheers,
Mark
----- Original Message -----
From: "Emanuele Osimo" <e.osimo at gmail.com>
To: "perl bioperl ml" <bioperl-l at lists.open-bio.org>
Sent: Friday, July 17, 2009 8:49 AM
Subject: [Bioperl-l] Getting genomic coordinates for a list of genes
> Hello everyone,
> I'm new to programming, I'm a biologist, so please forgive my ignorance, but
> I've been trying this for 2 weeks, now I have to ask you.
> I'm trying the script I found at
> http://bio.perl.org/wiki/HOWTO:Getting_Genomic_Sequences#Using_Bio::DB::EntrezGene_to_get_genomic_coordinates
> because I need to have some variables (like $from and $to) assigned to the
> start and end of a gene.
> The script works fine, but gives me the wrong coordinates: for example if I
> try it with the gene 842 (CASP9), it prints:
> NT_004610.19 2498878 2530877
>
> I found out that in Entrez, for each gene (for CASP9, for example, at
> http://www.ncbi.nlm.nih.gov/gene/842?ordinalpos=1&itool=EntrezSystem2.PEntrez.Gene.Gene_ResultsPanel.Gene_RVDocSum#refseq
> ) under "Genome Reference Consortium Human Build 37 (GRCh37),
> Primary_Assembly" there are two different sets of coordinates. The first is
> called "NC_000001.10 Genome Reference Consortium Human Build 37 (GRCh37),
> Primary_Assembly", and is the one I need, and the second one is called just
> "NT_004610.19" and it's the one that the script prints.
> This is valid for all the genes I tried.
>
> DO you know how to make the script print the "right" coordinates (at least,
> the one I need)?
> Thanks a lot in advance,
> Emanuele
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>
More information about the Bioperl-l
mailing list