[Bioperl-l] Getting genomic coordinates for a list of genes

Mark A. Jensen maj at fortinbras.us
Tue Jul 21 20:41:16 EDT 2009


Yikes! Left off the includes: Put these at the beginning of the script...

 use strict;
 use Bio::DB::EntrezGene;
 use Bio::WebAgent;
 my $ua = Bio::WebAgent->new();
 my $db = new Bio::DB::EntrezGene;

Ugh.
MAJ


----- Original Message ----- 
From: "Mark A. Jensen" <maj at fortinbras.us>
To: "Emanuele Osimo" <e.osimo at gmail.com>; "perl bioperl ml" 
<bioperl-l at lists.open-bio.org>
Sent: Tuesday, July 21, 2009 8:29 PM
Subject: Re: [Bioperl-l] Getting genomic coordinates for a list of genes


> Hi Emanuele--
>
> Well, that script looks a bit hackish to me. Here's another hack, maybe less
> hackish, that seems to work for me. Gory details are at the end of the post.
>
> Here's the output:
>
> Possibilities:
> chr     start           end             src    info
> 1       15691382        15723971        UCSC    CASP9 (uc001awq.1) at 
> chr1:15691382-15723971
> 1       15691382        15723377        UCSC    CASP9 (uc001awn.1) at 
> chr1:15691382-15723377
> 1       15691382        15723377        Ref     NM_001229 at 
> chr1:15691382-15723377
>
> Here's the hack:
>
> my @info = genome_coords(1, $db, $ua);
>
> print "Possibilities:\n";
> print join("\t", qw( chr start end src info )), "\n";
> foreach (@info) {
>     print join("\t", @{$_}{qw( chr start end src info )}), "\n";
> }
>
> # work done here...
> sub genome_coords {
>     my ($id, $db, $ua) = @_;
>     my $seq = $db->get_Seq_by_id($id);
>     my $ac = $seq->annotation;
>     for my $ann ($ac->get_Annotations('dblink')) {
>        if ($ann->database eq "UCSC") {
>            my $resp = $ua->get($ann->url);
>            my @a = $resp->content =~ 
> m{position=chr([0-9]+):([0-9]+)-([0-9]+)\&.*\&(known|ref)Gene.*?\">(.*?)</A>}g;
>            my @ret;
>            while (@a) {
>                push @ret, {
>                    'chr' => shift @a,
>                    'start' => shift @a,
>                    'end' => shift @a,
>                    'src' => shift(@a) eq 'known' ? 'UCSC' : 'Ref',
>                    'info' => shift @a
>                };
>            }
>            return unless @ret;
>            return @ret;
>        }
>    }
>    return; # parse error, no UCSC link on page
> }
>
> Here are the details:
> The script on the wiki page gets coordinates by looking at a url
> under a link on the page: the database "ModelMaker" link, whose
> url is (after the 842 query):
>
> 'http://www.ncbi.nlm.nih.gov/mapview/modelmaker.cgi?taxid=9606&contig=NT_004610.19&from=2498878&to=2530877&gene=CASP9&lid=842'
>
> The script reads the 'from' and 'to' values directly from the text
> of this url to deliver the coordinates. This is somewhat hacky,
> since the assumption is the coordinates that ModelMaker wants
> (were you to actually visit the link, which the script doesn't do)
> are the ones you want. The hack above is slightly better, in that
> it finds a database url link and visits it, then parses the page that the
> link returns -- the UCSC page for geneid 842, more likely to
> have what you want. It's still hacky, in that the format of that page
> may change, and that may break the regexp. But by then, you'll
> be able to hack yourself out of that situation!
>
> cheers,
> Mark
>
>
>
> ----- Original Message ----- 
> From: "Emanuele Osimo" <e.osimo at gmail.com>
> To: "perl bioperl ml" <bioperl-l at lists.open-bio.org>
> Sent: Friday, July 17, 2009 8:49 AM
> Subject: [Bioperl-l] Getting genomic coordinates for a list of genes
>
>
>> Hello everyone,
>> I'm new to programming, I'm a biologist, so please forgive my ignorance, but
>> I've been trying this for 2 weeks, now I have to ask you.
>> I'm trying the script I found at
>> http://bio.perl.org/wiki/HOWTO:Getting_Genomic_Sequences#Using_Bio::DB::EntrezGene_to_get_genomic_coordinates
>> because I need to have some variables (like $from and $to) assigned to the
>> start and end of a gene.
>> The script works fine, but gives me the wrong coordinates: for example if I
>> try it with the gene  842 (CASP9), it prints:
>> NT_004610.19    2498878    2530877
>>
>> I found out that in Entrez, for each gene (for CASP9, for example, at
>> http://www.ncbi.nlm.nih.gov/gene/842?ordinalpos=1&itool=EntrezSystem2.PEntrez.Gene.Gene_ResultsPanel.Gene_RVDocSum#refseq
>> ) under "Genome Reference Consortium Human Build 37 (GRCh37),
>> Primary_Assembly" there are two different sets of coordinates. The first is
>> called "NC_000001.10 Genome Reference Consortium Human Build 37 (GRCh37),
>> Primary_Assembly", and is the one I need, and the second one is called just
>> "NT_004610.19" and it's the one that the script prints.
>> This is valid for all the genes I tried.
>>
>> DO you know how to make the script print the "right" coordinates (at least,
>> the one I need)?
>> Thanks a lot in advance,
>> Emanuele
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> 



More information about the Bioperl-l mailing list