[Bioperl-l] Getting sequences by base pair locations

Fri Jul 28 14:15:38 UTC 2006

Yutal,

You can also do this remotely if the file you want is in GenBank (and you
don't want to store the data locally).  The nice thing about using this is
any seqfeatures in the GenBank file within the region requested is also
returned.

Note that if data is stored in a RefSeq file you'll need to add the
parameter '-no_redirect => 1,' to the Bio::DB::GenBank object.  I would NOT
recommend this for huge numbers of sequences (>2000) as you would be
spamming NCBI with thousands of repeated requests; if you did have a
relatively large number you could run this overnight, which is what I do.
Bio::DB::Fasta would be better if you have tons of hits.  

Use this in a loop to grab the sequences one at a time based on the start,
stop positions, (and strand, if you need it):

# Below is from Bio::DB::GenBank POD, with some modifications
my $factory = Bio::DB::GenBank->new(
                -seq_start => $start,
                -seq_stop => $end,
                -strand => $strand # 1=plus, 2=minus
                );
my $seq_obj;
eval {
    $seq_obj = $factory->get_Seq_by_acc($sf->seq_id);
};
if( $@ ) {
    print STDERR "Unable to retrieve from $start to $end.\n";
    print STDERR "Error!\n$@";
    print STDERR "Attempting to move on...\n";
    next;
}
print STDERR "Got sequence: ",$seq_obj->description,"\n";
print STDERR "\tLength: ",$seq_obj->length,"\n";
my $sf_len = $sf->length;

The eval{} block is needed to make sure retrieval worked via network
connections and to not end based on a network error (the object throws an
error which eval catches, logs it to STDERR, thus allowing you to continue
on). 

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Sean Davis
> Sent: Friday, July 28, 2006 8:31 AM
> To: Yuval Itan
> Cc: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Getting sequences by base pair locations
> 
> Yuval Itan wrote:
> > Hello all,
> >
> > I was BLATing a few hundred human genes against the chimp genome, and
> > kept the best chimp hits for every human gene.
> > I have the base pair start and end location for every chimp hit, and I
> > need to get the sequence for each of these chimp hits. Here is an
> > example for a few chimp hits bp locations:
> >
> > Start End*
> > *142854 144504
> > 154479 155198
> > 153066 167370
> > 163146 163559
> >
> > I have one chimp genome file (about 3GB) including all chromosomes, but
> > I could also get one file per chromosome if that would make things
> > easier. Does anyone have a script or a link for an interface that can do
> > the job?
> 
> See this module:
> 
> http://doc.bioperl.org/releases/bioperl-current/bioperl-
> live/Bio/DB/Fasta.html
> 
> Sean
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l