[Bioperl-l] Bio::SimpleAlign constructor?

Mon Aug 24 19:13:26 UTC 2009

Thanks for these clarifications Chris.

Basically I'm looking for an object that will easily let me edit a
multiple sequence alignment, including: adding sequences (with given
alignments), opening gaps, extracting columns (with linked sequences),
transferring features, etc. etc.

For example, I may want to analyse a set of short reads aligned
against the human genome. Somehow it felt natural to represent the
position of the aligned read as a Bio::LocatableSeq (with the
alignment details being captured by a sequence string (including gaps)
representing the read and the reference sequence - basically because
that is what the aligner gives me). Now, you're saying
Bio::LocatableSeq is not suitable for that purpose, which is fine. But
the question is, how should I be doing this? Adding megabases of gaps
to thousands of short reads feels wrong... is there a 'correct' way to
do this currently in BioPerl?

I think the source of my confusion was that SimpleAlign takes
Bio::LocatableSeq as input, and I thought that was 'the way' to
represent sequences in the MSA.

I'll keep hacking at what I need to get done and I'll post the code.
I'm just wondering how much 'alignment editing' could be usefully done
by a suitable object within BP?

Thanks again for your help,
Dan.

2009/8/24 Chris Fields <cjfields at illinois.edu>:
> Dan, all,
>
> Bio::SimpleAlign doesn't align anything for you.  It makes no assumptions
> about the data being added, beyond possibly checking for the seqs to be
> flush prior to analyses.
>
> Here's the reason why:
>
> The object doesn't 'know' the seqs map across from one to the other as
> below:
>
>> ...
>> ## REF  tacattaaagacccg
>> ## SEQ1 taca.taaa......
>> ## SEQ2 .....taaaga.ccg
>>
>> my $aln = Bio::SimpleAlign->new();
>>
>> $aln->gap_char('.');
>>
>> my $r  = Bio::LocatableSeq->new( -id=>'r', -seq=>'tacattaaagacccg' );
>> my $s1 = Bio::LocatableSeq->new( -id=>'s1', -start=>1, -seq=>'taca.taaa'
>> );
>> my $s2 = Bio::LocatableSeq->new( -id=>'s2', -start=>6, -seq=>'taaaga.ccg'
>> );
>>
>> $aln->add_seq( $r );
>> $aln->add_seq( $s1 );
>> $aln->add_seq( $s2 );
>
> Above, you are making the assumption that SimpleAlign 'knows' where to match
> the start of $s1 and $s2 to the ref sequence $r.  LocatableSeq::start() does
> NOT indicate that (the LocatableSeq docs, and their usage, should indicate
> that).
>
> Think about HSP alignments in a BLAST report; the start/end/strand
> coordinates are where the sequence in the alignment maps to the original
> query or hit sequence.  They don't indicate where the hit maps to the query
> (the alignment itself does that in a column-wise fashion).
>
> I'm not sure, maybe it needs to be more explicit in the documentation, but
> SimpleAlign does not align the sequences for you (and it shouldn't be
> expected to).  There are much better (faster, more accurate) ways to do
> that.
>
>> if($CLUDGE){
>> foreach(($r, $s1, $s2)){
>>  $_->seq( '.' x ($_->start - 1) . $_->seq )
>> }
>> }
>>
>> ## Prepare an 'output stream' for the alignment:
>> my $aliWriter = Bio::AlignIO->
>>  new( -fh     => \*STDOUT,
>>      -format => 'clustalw',
>>    );
>>
>> warn "\nOUTPUT:\n";
>> $aliWriter->write_aln($aln);
>
> ...
>
>> I was calling the "fill in the gaps yourself" step a CLUDGE because I
>> had expected the alignment object to take care of this for me.  Is
>> there any reason that it couldn't do this 'CLUDGE' automatically? It
>> seems strange that it insists on being passed locatable sequence
>> objects, but then largely ignore the given location.
>>
>> Would it not be possible to have this happen when the sequences are
>> written out from the alignment? I think it should still be possible to
>> index the column number via the (gapless) sequence number... or did I
>> get confused? There are two levels of confusion here (on my part), 1)
>> the concepts behind the objects and 2) the implementation details.
>
> Mentioned above (no assumptions on how locatableseqs map to one another).
>  WYSIWYG.  There is nothing precluding you from writing up code to do that,
> though it doesn't belong in SimpleAlign.  Maybe Bio::Align::Utilities for
> post-processing padding, or Bio::Tools::PurePerlAlign for a pure perl
> alignment implementation (there are, believe it or not, pure perl
> implementations of Smith-Waterman and Needleman-Wunsch.
>
>> Thanks for any hints on how to understand or potentially how to fix
>> these problems.
>>
>> Cheers,
>> Dan.
>
>
> Not that SimpleAlign and LocatableSeqs don't have their share of problems.
>  However, I don't think you can expect this behavior to change with the
> refactors.
>
> chris
>