[Bioperl-l] Fetching genomic sequences based on HUGO names or GeneIDs

Thu Feb 16 17:59:37 UTC 2006

Chris and Harry,

I'm writing a Wiki page on this; it's linked to the FAQ because the Wiki is
complaining that the FAQ is getting too big. I'll fill in the ENSEMBL API
and Bio::DB::Fasta approaches. If you would comment on the BioPerl/eutils
approach at some point, that would be superb:

http://bioperl.open-bio.org/wiki/Getting_Genomic_Sequences

Brian O.

On 2/16/06 11:23 AM, "Harry Mangalam" <hjm at tacgi.com> wrote:

> Yes, I'm going to try this first.  Also, the pointer to the NCBI eutils page
> was helpful.  They describe the same thing and I think that API will give me
> what I need.  I'll post back to report.
> 
> Sorry for the delay in answering - this is a side project and as such is going
> slowly.
> 
> Many thanks to you guys, especially Brian for the example code - much more
> than I had a right to expect.  Virtual Beers all round and real ones should
> we ever meet up.
> 
> Harry
> 
> 
> On Thursday 16 February 2006 04:52, Chris Fields wrote:
>> I think a method was recently implemented in Bio::DB::GenBank to
>> retrieve a segment of DNA given start and end coordinates in GenBank
>> format; that should contain the features you need.  I requested it
>> ~Nov-Dec on the mailing list but didn't get a chance to test it.
>> Would that help?
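For the remote route discussed here, a segment request can be sketched directly against NCBI's efetch service, which accepts seq_start, seq_stop and strand parameters. The snippet below only builds the URL and makes no network call. The helper name efetch_segment_url, the choice of the nuccore database and the example accession are assumptions for illustration; this is not the Bio::DB::GenBank method mentioned above.

```perl
use strict;
use warnings;

# Base URL of NCBI's efetch E-utility.
my $EFETCH = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi';

# Hypothetical helper: build an efetch URL for a sub-range of a nucleotide
# record. $strand is 1 (plus) or 2 (minus), following the efetch convention.
sub efetch_segment_url {
    my ($acc, $start, $stop, $strand) = @_;
    $strand = 1 unless defined $strand;
    my @params = (
        db        => 'nuccore',   # assumption: a nucleotide record
        id        => $acc,
        rettype   => 'gb',        # GenBank flat file, so features are included
        retmode   => 'text',
        seq_start => $start,
        seq_stop  => $stop,
        strand    => $strand,
    );
    my @pairs;
    while (my ($key, $value) = splice(@params, 0, 2)) {
        # Percent-encode anything outside the unreserved URL characters.
        $value =~ s/([^A-Za-z0-9_.~-])/sprintf('%%%02X', ord($1))/ge;
        push @pairs, "$key=$value";
    }
    return $EFETCH . '?' . join('&', @pairs);
}

# The accession here is illustrative only.
print efetch_segment_url('NC_000001', 4_000_000, 4_100_000), "\n";
```

The resulting URL could then be fetched with any HTTP client and the GenBank text parsed with Bio::SeqIO, subject to NCBI's usage guidelines.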
>> 
>> On Feb 15, 2006, at 11:16 PM, Brian Osborne wrote:
>>> Harry,
>>> 
>>> It's not clear to me that NCBI's eutils offers this capability directly.
>>> You can probably download Entrez Gene entries and parse them for
>>> coordinates, but I know of no way to remotely retrieve genomic sequences
>>> like this from NCBI (ENSEMBL API perhaps?). What I had in mind uses the
>>> local approach that some of us favor, and to prove to myself that this is
>>> simple to do I wrote a script that I just added to examples/tools. It's
>>> called extract_genes.pl and it's based on Bio::DB::Fasta. Download the
>>> sequence files for a given species to some dir, download Entrez Gene's
>>> gene2accession file, and run. It creates and stores a hash for lookups, so
>>> it won't read gene2accession each time it runs.
>>> 
>>> Brian O.
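The lookup step Brian describes might be sketched as follows. This is not the code of extract_genes.pl. The column positions (GeneID in column 2, genomic accession in column 8, start and end in columns 10 and 11, orientation in column 12) and the zero-based coordinates are assumptions about the gene2accession layout that should be checked against NCBI's README, and load_lookup is a hypothetical helper name.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# Build (or reload) a GeneID => location hash from a gene2accession file.
# Column positions are assumptions; verify them against NCBI's README.
sub load_lookup {
    my ($g2a_file, $cache_file) = @_;

    # Reuse the stored hash so gene2accession is parsed only once.
    return retrieve($cache_file) if -e $cache_file;

    my %lookup;
    open my $fh, '<', $g2a_file or die "Cannot open $g2a_file: $!";
    while (my $line = <$fh>) {
        next if $line =~ /^#/;    # skip the header line
        chomp $line;
        my @f = split /\t/, $line;
        my ($gene_id, $acc, $start, $end, $strand) = @f[1, 7, 9, 10, 11];

        # Rows without a genomic placement carry '-' in these columns.
        next unless defined $strand;
        next if grep { !defined($_) || $_ eq '-' } ($acc, $start, $end);

        # Simplification: keep only the first placement seen for each GeneID.
        $lookup{$gene_id} ||= {
            acc    => $acc,
            start  => $start + 1,    # assumed 0-based in the file, stored 1-based
            end    => $end + 1,
            strand => $strand,
        };
    }
    close $fh;

    store(\%lookup, $cache_file);
    return \%lookup;
}

# Usage (file names are placeholders):
#   my $lookup = load_lookup('gene2accession', 'gene2accession.cache');
#   my $loc    = $lookup->{$gene_id};   # undef if the gene has no placement
```

Storable is used here only because it ships with Perl; a DBM file would serve equally well for a hash this size.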
>>> 
>>> On 2/14/06 12:15 PM, "Harry Mangalam" <hjm at tacgi.com> wrote:
>>>> Hi Brian,
>>>> 
>>>> Thanks very much for the pointers and the speed of your reply, and
>>>> apologies for the speed of mine.
>>>> 
>>>> This looks good, but what I was looking for was a bioP approach for
>>>> hooking to an API at NCBI or EBI so I could get this info and seqs from
>>>> them.  In this case, speed of retrieval is not critical and I'd rather
>>>> not download the entirety of the sequences to a local disk to hack at
>>>> them.
>>>> 
>>>> I've determined a screen-scraping approach to get them and could script
>>>> that, but I thought that bioP had a method for using NCBI's external
>>>> APIs, tho it may be that my memory is faulty or the approach is no longer
>>>> supported due to overload.
>>>> 
>>>> Does NCBI make such APIs available anymore?  I searched a bit for docs on
>>>> them but couldn't find anything (unless it's buried in the NCBI toolkit,
>>>> which I haven't started to excavate).
>>>> 
>>>> Failing that, would SEALS provide such a service? Any PerlPinipeds
>>>> listening?
>>>> 
>>>> Harry
>>>> 
>>>> On Sunday 12 February 2006 08:37, Brian Osborne wrote:
>>>>> Harry,
>>>>> 
>>>>> Hope you're doing well. The approach could be based on Bio::DB::Fasta.
>>>>> So, from its documentation:
>>>>> 
>>>>>   use Bio::DB::Fasta;
>>>>> 
>>>>>   # create database from directory of fasta files
>>>>>   my $db       = Bio::DB::Fasta->new('/path/to/fasta/files');
>>>>> 
>>>>>   # simple access (for those without Bioperl)
>>>>>   my $seq      = $db->seq('CHROMOSOME_I',4_000_000 => 4_100_000);
>>>>>   # start > stop returns the reverse complement
>>>>>   my $revseq   = $db->seq('CHROMOSOME_I',4_100_000 => 4_000_000);
>>>>>   my @ids      = $db->ids;
>>>>>   my $length   = $db->length('CHROMOSOME_I');
>>>>>   my $alphabet = $db->alphabet('CHROMOSOME_I');
>>>>>   my $header   = $db->header('CHROMOSOME_I');
>>>>> 
>>>>>   # Bioperl-style access, reusing the $db handle created above
>>>>>   my $obj      = $db->get_Seq_by_id('CHROMOSOME_I');
>>>>>   my $objseq   = $obj->seq;
>>>>>   my $subseq   = $obj->subseq(4_000_000 => 4_100_000);
>>>>> 
>>>>> Do you already have the offsets?
>>>>> 
>>>>> Brian O.
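Once the offsets are known, the "plus & minus some offset" arithmetic can be sketched independently of any database. flank_window is a hypothetical helper; it assumes 1-based inclusive coordinates, as used by the seq method of Bio::DB::Fasta, and clamps the window to the ends of the sequence.

```perl
use strict;
use warnings;

# Hypothetical helper: widen a 1-based, inclusive [start, end] interval by
# $offset bases on each side, clamped to the range 1 .. $seq_length.
sub flank_window {
    my ($start, $end, $offset, $seq_length) = @_;
    ($start, $end) = ($end, $start) if $start > $end;    # normalise order
    my $from = $start - $offset;
    my $to   = $end + $offset;
    $from = 1           if $from < 1;
    $to   = $seq_length if $to > $seq_length;
    return ($from, $to);
}

my ($from, $to) = flank_window(4_000_000, 4_100_000, 5_000, 15_000_000);
print "$from..$to\n";    # prints 3995000..4105000
```

With a Bio::DB::Fasta handle, the result could be passed as $db->seq($id, $from => $to), with $db->length($id) supplying the sequence length. Swapping the two coordinates for a minus-strand gene would return the reverse complement.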
>>>>> 
>>>>> On 2/12/06 1:46 AM, "Harry Mangalam" <hjm at tacgi.com> wrote:
>>>>>> Hi All,
>>>>>> 
>>>>>> After perusing the tutorial and other docs for an evening, I still
>>>>>> can't find the answer to this.  Forgive me if I've missed something
>>>>>> obvious.
>>>>>> 
>>>>>> This should not be a novel request, but I've not found it answered.  If
>>>>>> bioperl isn't the best way to do this, I'd be grateful for a pointer to
>>>>>> a better way, especially if it includes an illuminating bit of code.
>>>>>> 
>>>>>> The problem is to retrieve genomic sequences plus & minus some offset
>>>>>> from a locus determined by HUGO keyword or GeneID.  This would be a
>>>>>> common followup chore for some extra analysis from a gene expression
>>>>>> expt.  Or maybe this is in the DBFetch routines, but I've missed the
>>>>>> sequence type to specify...?
>>>>>> 
>>>>>> 
>>>>>> TIA!
>>> 
>> 
>> Christopher Fields
>> Postdoctoral Researcher
>> Lab of Dr. Robert Switzer
>> Dept of Biochemistry
>> University of Illinois Urbana-Champaign