[Bioperl-l] Bio::SimpleAlign constructor?

Mon Aug 24 13:36:32 UTC 2009

Dan, all,

Bio::SimpleAlign doesn't align anything for you.  It makes no
assumptions about the data being added, beyond possibly checking that
the seqs are flush (all the same length) prior to analyses.

Here's the reason why:

The object doesn't 'know' how the seqs map to one another, as in the
example below:

> ...
> ## REF  tacattaaagacccg
> ## SEQ1 taca.taaa......
> ## SEQ2 .....taaaga.ccg
>
> my $aln = Bio::SimpleAlign->new();
>
> $aln->gap_char('.');
>
> my $r  = Bio::LocatableSeq->new( -id=>'r', -seq=>'tacattaaagacccg' );
> my $s1 = Bio::LocatableSeq->new( -id=>'s1', -start=>1, -seq=>'taca.taaa' );
> my $s2 = Bio::LocatableSeq->new( -id=>'s2', -start=>6, -seq=>'taaaga.ccg' );
>
> $aln->add_seq( $r );
> $aln->add_seq( $s1 );
> $aln->add_seq( $s2 );

Above, you are assuming that SimpleAlign 'knows' where to match the
start of $s1 and $s2 to the ref sequence $r.  LocatableSeq::start()
does NOT indicate that (the LocatableSeq docs, and their usage,
should make that clear).

Think about HSP alignments in a BLAST report; the start/end/strand  
coordinates are where the sequence in the alignment maps to the  
original query or hit sequence.  They don't indicate where the hit  
maps to the query (the alignment itself does that in a column-wise  
fashion).
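
To make the coordinate point concrete, here is a minimal sketch in
plain Perl.  It makes no BioPerl calls; the helper name
maps_to_source and the sample strings are made up for illustration.
It checks that a gapped fragment, once its gaps are stripped, matches
its own source sequence at the stated start/end.  That is all the
coordinates promise; they say nothing about which alignment column
the fragment begins in.

```perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl method): true if the gapped fragment,
# with gap characters removed, equals residues start..end (1-based,
# inclusive) of its own ungapped source sequence.
sub maps_to_source {
    my ($source, $fragment, $start, $end) = @_;
    (my $ungapped = $fragment) =~ tr/.-//d;    # strip '.' and '-' gaps
    return 0 if $end - $start + 1 != length $ungapped;
    return substr($source, $start - 1, length $ungapped) eq $ungapped;
}

# Illustrative data only: the fragment covers residues 3..11 of the source.
print maps_to_source('ggtaaagaccgtt', 'taaaga.ccg', 3, 11)
    ? "coordinates match the source\n"
    : "coordinates do not match the source\n";
```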

Maybe this needs to be more explicit in the documentation, but
SimpleAlign does not align the sequences for you (and it shouldn't
be expected to).  There are much better (faster, more accurate) ways
to do that.

> if($CLUDGE){
> foreach(($r, $s1, $s2)){
>   $_->seq( '.' x ($_->start - 1) . $_->seq )
> }
> }
>
> ## Prepare an 'output stream' for the alignment:
> my $aliWriter = Bio::AlignIO->
>  new( -fh     => \*STDOUT,
>       -format => 'clustalw',
>     );
>
> warn "\nOUTPUT:\n";
> $aliWriter->write_aln($aln);

...

> I was calling the "fill in the gaps yourself" step a CLUDGE because I
> had expected the alignment object to take care of this for me.  Is
> there any reason that it couldn't do this 'CLUDGE' automatically? It
> seems strange that it insists on being passed locatable sequence
> objects, but then largely ignores the given location.
>
> Would it not be possible to have this happen when the sequences are
> written out from the alignment? I think it should still be possible to
> index the column number via the (gapless) sequence number... or did I
> get confused? There are two levels of confusion here (on my part), 1)
> the concepts behind the objects and 2) the implementation details.

As mentioned above, SimpleAlign makes no assumptions about how
LocatableSeqs map to one another.  WYSIWYG.  There is nothing
precluding you from writing code to do that, though it doesn't belong
in SimpleAlign.  Maybe Bio::Align::Utilities for post-processing
padding, or Bio::Tools::PurePerlAlign for a pure perl alignment
implementation (there are, believe it or not, pure perl
implementations of Smith-Waterman and Needleman-Wunsch).
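
For what it's worth, post-processing padding of that kind could look
roughly like the sketch below.  It is plain Perl working on strings,
not an existing BioPerl function; pad_rows and its argument layout
are hypothetical.  The key assumption is that the caller supplies the
alignment column where each row begins, since nothing in a
LocatableSeq records that.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BioPerl.  Each row is
# [ $gapped_seq, $column ], where $column is the 1-based alignment column
# at which the row begins.  The caller must supply that column; it is NOT
# what LocatableSeq::start() reports.
sub pad_rows {
    my ($gap, @rows) = @_;
    # Add leading gaps so each row begins at its stated column.
    my @padded = map {
        my ($seq, $col) = @$_;
        ($gap x ($col - 1)) . $seq;
    } @rows;
    # Add trailing gaps so every row is as long as the longest one (flush).
    my ($width) = sort { $b <=> $a } map { length } @padded;
    return map { $_ . ($gap x ($width - length $_)) } @padded;
}

# The three rows from the example earlier in this thread.
my @flush = pad_rows('.',
    [ 'tacattaaagacccg', 1 ],
    [ 'taca.taaa',       1 ],
    [ 'taaaga.ccg',      6 ],
);
print "$_\n" for @flush;
```

The padded strings are all the same length, so they could then be
handed to LocatableSeq/SimpleAlign as rows that are already flush.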

> Thanks for any hints on how to understand or potentially how to fix
> these problems.
>
> Cheers,
> Dan.

Not that SimpleAlign and LocatableSeqs don't have their share of  
problems.  However, I don't think you can expect this behavior to  
change with the refactors.

chris