[Bioperl-l] automation of translation based on alignment

Mon Mar 22 01:30:06 EDT 2010

Dear Chris,

It seems that Bioperl is "clever" enough to "rectify" my start and stop by
reversing the order.

e.g.
start = 2300
stop = 1600

It will reverse back to 1600 and then 2300.
What else to tell that I'm now working on a circular genome?

-----Original Message-----
From: Chris Fields [mailto:cjfields at illinois.edu] 
Sent: Monday, March 22, 2010 11:41 AM
To: Ross KK Leung
Cc: 'Florent Angly'; 'bioperl-l List'
Subject: Re: [Bioperl-l] automation of translation based on alignment

On Mar 21, 2010, at 8:22 PM, Ross KK Leung wrote:

> Dear Florent,
> 
> Sorry for mis-clicking "reply" instead of "reply-all". Here are my problem
> details:
> 
> Input:
> 
> 1000 multiple aligned DNA sequences
> One of them has Genbank file
> http://www.ncbi.nlm.nih.gov/nuccore/DQ089804.1?ordinalpos=1
> 
> the remaining 999 ones only have genomic sequences.
> 
> Objective: to derive the cognate protein aligned sequences. (here have 4
> sets as there are 4 overlapping genes)
> 
> Difficulties: 
> 1) circular genome
> 2) there may be in-dels

To preface this, any reason you're not translating the alignment sequences
using the above sequence's features as a reference?  One could try
converting the reference sequence's feature coordinates to alignment
column-based positions, pull sub-alignments out from there, then translate
each sequence.  There would be no need to re-retrieve sequences which are
already present in the alignment, unless there is something not mentioned
above that I'm missing.

Re: circular genomes: recent commits to bioperl should allow handling
circular genomes with features and subsequence extraction.  If not I would
consider that a serious bug that needs to be reported.

If you need to grab remote sequences from a larger set of sequences (either
locally or remotely) and translate them, you can use Bio::DB::GenBank, which
will directly return a Bio::Seq object.  Note you would obviously have to
reset these per ID based on the start/end/strand:

my $gb = Bio::DB::GenBank->new(-format     => 'Fasta',
                               -seq_start  => 100,
                               -seq_stop   => 200,
                               -strand     => 1);

my $seqobj = $gb->get_Seq_by_id($id); # or get_Seq_by_acc($acc)
# do any preprocessing here...
my $protein_seqobj = $seq->translate;

If you want you could also download the sequences and use one of the various
flatfile database classes to work with them (I believe Bio::DB::Fasta
extracts subsequences very rapidly).  It might be faster.  For those regions
that cross the origin you may need to pull two sequences and join them
somehow, as the sequences likely won't run a join automatically.

> Hope now the problem has been clarified, Ross

Hope this helps.

chris