[Bioperl-l] Fetching genomic sequences based on HUGO names or GeneIDs

Thu Feb 16 22:45:52 UTC 2006

If I know the start, end, and strand info for a list of features (personal
preference, since I use Bio::SeqFeature::Generic with the RNAMotif I drew
up), couldn't I try pulling out the surrounding region?  My thought is this,
though I haven't coded it yet:

1)  Draw up a list of Seqfeatures, with accession, start, stop coordinates
(array of hashes) based off what I get from RNAMotif objects.
2)  Pull the sequence from NCBI using Bio::DB::GenBank with x bp upstream
and downstream, one at a time, using get_Seq_by_id().  I could add a sleep
in there somewhere so as not to tick off the NCBI curators.
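
A rough, untested sketch of those two steps, in case it helps.  It assumes
the -seq_start, -seq_stop and -strand options of Bio::DB::GenBank->new()
work as documented in current BioPerl, and it uses get_Seq_by_acc() since
step 1 stores accessions.  The accessions, the hash keys, the 200 bp flank,
the 3 second delay and the helper names are all placeholders I made up.

```perl
use strict;
use warnings;

# Hypothetical feature list (array of hashes); accessions are placeholders.
my @features = (
    { acc => 'ACC_0001', start => 1200, end => 1260, strand =>  1 },
    { acc => 'ACC_0002', start =>   40, end =>   95, strand => -1 },
);

# Widen a feature by $flank bp on each side, clamping the start at 1.
# The end is not clamped because the record length is not known up front.
sub flank_window {
    my ($feat, $flank) = @_;
    my $from = $feat->{start} - $flank;
    $from = 1 if $from < 1;
    return ($from, $feat->{end} + $flank);
}

# Fetch one flanked region as a GenBank-format sequence object.
sub fetch_region {
    my ($feat, $flank) = @_;
    require Bio::DB::GenBank;    # loaded lazily; only needed for real fetches
    my ($from, $to) = flank_window($feat, $flank);
    my $gb = Bio::DB::GenBank->new(
        -format    => 'genbank',
        -seq_start => $from,
        -seq_stop  => $to,
        -strand    => ($feat->{strand} < 0 ? 2 : 1),    # 1 = plus, 2 = minus
    );
    return $gb->get_Seq_by_acc($feat->{acc});
}

# Fetch every region one at a time, sleeping $delay seconds between requests.
sub fetch_all {
    my ($feats, $flank, $delay) = @_;
    my @seqs;
    for my $i (0 .. $#$feats) {
        push @seqs, fetch_region($feats->[$i], $flank);
        sleep $delay if $i < $#$feats;
    }
    return @seqs;
}
```

Calling fetch_all(\@features, 200, 3) would then return one sequence object
per feature; nothing touches the network until that call is made.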

The reason I'm interested in this is that I want to know where the RNA motif
sits relative to surrounding features. If it is very close to a coding region,
then the motif likely indicates translational regulation.  Further away may
indicate transcriptional termination or another mechanism.
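
To make the "how close" part concrete, here is a minimal sketch that finds
the nearest coding region to a motif and the gap between them in bp.  The
coordinates, names and hash layout are made up for illustration; with a real
record the CDS list could come from something like
grep { $_->primary_tag eq 'CDS' } $seq->get_SeqFeatures.  What counts as
"very close" is left to the reader, since I don't have a cutoff in mind.

```perl
use strict;
use warnings;

# Number of bases between two intervals (hashes with start/end keys);
# returns 0 if they overlap or are directly adjacent.
sub gap_between {
    my ($x, $y) = @_;
    return $y->{start} - $x->{end} - 1 if $x->{end} < $y->{start};
    return $x->{start} - $y->{end} - 1 if $y->{end} < $x->{start};
    return 0;
}

# Return the CDS closest to the motif together with its gap in bp.
sub nearest_cds {
    my ($motif, @cds) = @_;
    my ($best, $best_gap);
    for my $c (@cds) {
        my $gap = gap_between($motif, $c);
        ($best, $best_gap) = ($c, $gap)
            if !defined $best_gap || $gap < $best_gap;
    }
    return ($best, $best_gap);
}

# Hypothetical motif and coding regions on the same sequence.
my $motif = { start => 480, end => 520 };
my @cds   = (
    { name => 'cdsA', start => 100, end => 400 },
    { name => 'cdsB', start => 560, end => 900 },
);

my ($hit, $gap) = nearest_cds($motif, @cds);
printf "nearest CDS: %s, %d bp from the motif\n", $hit->{name}, $gap;
```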

The files returned should have the features included as long as they are in
the full length GenBank record.  I tried it out using the web form but not
through Bio::DB::GenBank yet.  If I can get it to work I'll add it to the
page.  

Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign 

> -----Original Message-----
> From: Brian Osborne [mailto:osborne1 at optonline.net]
> Sent: Thursday, February 16, 2006 4:19 PM
> To: Chris Fields
> Cc: Harry Mangalam; bioperl-l
> Subject: Re: [Bioperl-l] Fetching genomic sequences based on HUGO names or
> GeneIDs
> 
> Chris,
> 
> Yes. The question now is where to easily get the coordinates.
> 
> Brian O.
> 
> 
> On 2/16/06 7:52 AM, "Chris Fields" <cjfields at uiuc.edu> wrote:
> 
> > I think a method was recently implemented in Bio::DB::GenBank to
> > retrieve a segment of DNA given start and end coordinates in GenBank
> > format; that should contain the features you need.  I requested it
> > ~Nov-Dec in the mailing list but didn't get a chance to test it.
> > Would that help?
> >
> > On Feb 15, 2006, at 11:16 PM, Brian Osborne wrote:
> >
> >> Harry,
> >>
> >> It's not clear to me that NCBI's eutils offers this capability directly.
> >> You can probably download Entrez Gene entries and parse them for
> >> coordinates but I know of no way to remotely retrieve genomic sequences
> >> like this from NCBI (ENSEMBL API perhaps?). What I had in mind uses the
> >> local approach that some of us favor and to prove to myself that this is
> >> simple to do I wrote a script that I just added to examples/tools, it's
> >> called extract_genes.pl and it's based on Bio::DB::Fasta. Download the
> >> sequence files for a given species to some dir, download Entrez Gene's
> >> gene2accession file, and run. It creates and stores a hash for lookups, it
> >> won't read gene2accession each time it runs.
> >>
> >> Brian O.
> >>
> >>
> >> On 2/14/06 12:15 PM, "Harry Mangalam" <hjm at tacgi.com> wrote:
> >>
> >>> Hi Brian,
> >>>
> >>> Thanks very much for the pointers and the speed of your reply and
> >>> apologies for the speed of mine.
> >>>
> >>> This looks good, but what I was looking for was a bioP approach for
> >>> hooking to an API at NCBI or EBI so I could get this info and seqs from
> >>> them.  In this case, speed of retrieval is not critical and I'd rather
> >>> not download the entirety of the sequences to a local disk to hack at
> >>> them.
> >>>
> >>> I've determined a screen-scraping approach to get them and could script
> >>> that, but I thought that bioP had a method for using NCBI's external
> >>> APIs, tho it may be that my memory is faulty or the approach is no
> >>> longer supported due to overload.
> >>>
> >>> Does NCBI make such APIs available anymore?  I searched a bit for docs
> >>> on them but couldn't find anything (unless it's buried in the NCBI
> >>> toolkit, which I haven't started to excavate).
> >>>
> >>> Failing that, would SEALS provide such a service? Any PerlPinipeds
> >>> listening?
> >>>
> >>> Harry
> >>>
> >>> On Sunday 12 February 2006 08:37, Brian Osborne wrote:
> >>>> Harry,
> >>>>
> >>>> Hope you're doing well. The approach could be based on
> >>>> Bio::DB::Fasta. So,
> >>>> from its documentation:
> >>>>
> >>>>   use Bio::DB::Fasta;
> >>>>
> >>>>   # create database from directory of fasta files
> >>>>   my $db      = Bio::DB::Fasta->new('/path/to/fasta/files');
> >>>>
> >>>>   # simple access (for those without Bioperl)
> >>>>   my $seq      = $db->seq('CHROMOSOME_I',4_000_000 => 4_100_000);
> >>>>   my $revseq   = $db->seq('CHROMOSOME_I',4_100_000 => 4_000_000);
> >>>>   my @ids     = $db->ids;
> >>>>   my $length   = $db->length('CHROMOSOME_I');
> >>>>   my $alphabet = $db->alphabet('CHROMOSOME_I');
> >>>>   my $header   = $db->header('CHROMOSOME_I');
> >>>>
> >>>>   # Bioperl-style access
> >>>>   my $db      = Bio::DB::Fasta->new('/path/to/fasta/files');
> >>>>
> >>>>   my $obj     = $db->get_Seq_by_id('CHROMOSOME_I');
> >>>>   my $seq     = $obj->seq;
> >>>>   my $subseq  = $obj->subseq(4_000_000 => 4_100_000);
> >>>>
> >>>> Do you already have the offsets?
> >>>>
> >>>> Brian O.
> >>>>
> >>>> On 2/12/06 1:46 AM, "Harry Mangalam" <hjm at tacgi.com> wrote:
> >>>>> Hi All,
> >>>>>
> >>>>> After perusing the tutorial and other docs for an evening, I still
> >>>>> can't find the answer to this.  Forgive me if I've missed something
> >>>>> obvious.
> >>>>>
> >>>>> This should not be a novel request, but I've not found it answered.
> >>>>> If bioperl isn't the best way to do this, I'd be grateful for a
> >>>>> pointer to a better way, especially if it includes an illuminating
> >>>>> bit of code.
> >>>>>
> >>>>> The problem is to retrieve genomic sequences plus & minus some offset
> >>>>> from a locus determined by HUGO keyword or GeneID.  This would be a
> >>>>> common followup chore for some extra analysis from a gene expression
> >>>>> expt.  Or maybe this is in the DBFetch routines, but I've missed the
> >>>>> sequence type to specify...?
> >>>>>
> >>>>>
> >>>>> TIA!
> >>
> >>
> >> _______________________________________________
> >> Bioperl-l mailing list
> >> Bioperl-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
> > Christopher Fields
> > Postdoctoral Researcher
> > Lab of Dr. Robert Switzer
> > Dept of Biochemistry
> > University of Illinois Urbana-Champaign