[Bioperl-l] automation of translation based on alignment

Sun Mar 21 23:40:34 EDT 2010

On Mar 21, 2010, at 8:22 PM, Ross KK Leung wrote:

> Dear Florent,
> 
> Sorry for mis-clicking "reply" instead of "reply-all". Here are my problem
> details:
> 
> Input:
> 
> 1000 multiple aligned DNA sequences
> One of them has Genbank file
> http://www.ncbi.nlm.nih.gov/nuccore/DQ089804.1?ordinalpos=1
> 
> the remaining 999 ones only have genomic sequences.
> 
> Objective: to derive the cognate protein aligned sequences. (here have 4
> sets as there are 4 overlapping genes)
> 
> Difficulties: 
> 1) circular genome
> 2) there may be in-dels

To preface this, any reason you're not translating the alignment sequences using the above sequence's features as a reference?  One could try converting the reference sequence's feature coordinates to alignment column-based positions, pull sub-alignments out from there, then translate each sequence.  There would be no need to re-retrieve sequences which are already present in the alignment, unless there is something not mentioned above that I'm missing.

Re: circular genomes: recent commits to bioperl should allow handling circular genomes with features and subsequence extraction.  If not I would consider that a serious bug that needs to be reported.

If you need to grab remote sequences from a larger set of sequences (either locally or remotely) and translate them, you can use Bio::DB::GenBank, which will directly return a Bio::Seq object.  Note you would obviously have to reset these per ID based on the start/end/strand:

my $gb = Bio::DB::GenBank->new(-format     => 'Fasta',
                               -seq_start  => 100,
                               -seq_stop   => 200,
                               -strand     => 1);

my $seqobj = $gb->get_Seq_by_id($id); # or get_Seq_by_acc($acc)
# do any preprocessing here...
my $protein_seqobj = $seq->translate;

If you want you could also download the sequences and use one of the various flatfile database classes to work with them (I believe Bio::DB::Fasta extracts subsequences very rapidly).  It might be faster.  For those regions that cross the origin you may need to pull two sequences and join them somehow, as the sequences likely won't run a join automatically.

> Hope now the problem has been clarified, Ross

Hope this helps.

chris