[Bioperl-l] Bio::SimpleAlign constructor?

Kevin Brown Kevin.M.Brown at asu.edu
Mon Aug 24 21:48:35 UTC 2009


You can use Bio::SimpleAlign for those tasks, but you, the programmer,
have to remember that you didn't front pad the sequence and so can't
utilize certain functions blindly. I've used SimpleAlign with
LocatableSeq objects and wrote a few custom methods that did things like
creating slices from the simplealign for each locatableseq.

-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Dan Bolser
Sent: Monday, August 24, 2009 12:13 PM
To: Chris Fields
Cc: bioperl-l at lists.open-bio.org; Mark A. Jensen; Paolo Pavan
Subject: Re: [Bioperl-l] Bio::SimpleAlign constructor?

Thanks for these clarifications Chris.

Basically I'm looking for an object that will easily let me edit a
multiple sequence alignment, including: adding sequences (with given
alignments), opening gaps, extracting columns (with linked sequences),
transferring features, etc. etc.

For example, I may want to analyse a set of short reads aligned
against the human genome. Somehow it felt natural to represent the
position of the aligned read as a Bio::LocatableSeq (with the
alignment details being captured by a sequence string (including gaps)
representing the read and the reference sequence - basically because
that is what the aligner gives me). Now, you're saying
Bio::LocatableSeq is not suitable for that purpose, which is fine. But
the question is, how should I be doing this? Adding megabases of gaps
to thousands of short reads feels wrong... is there a 'correct' way to
do this currently in BioPerl?

I think the source of my confusion was that SimpleAlign takes
Bio::LocatableSeq as input, and I thought that was 'the way' to
represent sequences in the MSA.

I'll keep hacking at what I need to get done and I'll post the code.
I'm just wondering how much 'alignment editing' could be usefully done
by a suitable object within BP?


Thanks again for your help,
Dan.

2009/8/24 Chris Fields <cjfields at illinois.edu>:
> Dan, all,
>
> Bio::SimpleAlign doesn't align anything for you.  It makes no
assumptions
> about the data being added, beyond possibly checking for the seqs to
be
> flush prior to analyses.
>
> Here's the reason why:
>
> The object doesn't 'know' the seqs map across from one to the other as
> below:
>
>> ...
>> ## REF  tacattaaagacccg
>> ## SEQ1 taca.taaa......
>> ## SEQ2 .....taaaga.ccg
>>
>> my $aln = Bio::SimpleAlign->new();
>>
>> $aln->gap_char('.');
>>
>> my $r  = Bio::LocatableSeq->new( -id=>'r', -seq=>'tacattaaagacccg' );
>> my $s1 = Bio::LocatableSeq->new( -id=>'s1', -start=>1,
-seq=>'taca.taaa'
>> );
>> my $s2 = Bio::LocatableSeq->new( -id=>'s2', -start=>6,
-seq=>'taaaga.ccg'
>> );
>>
>> $aln->add_seq( $r );
>> $aln->add_seq( $s1 );
>> $aln->add_seq( $s2 );
>
> Above, you are making the assumption that SimpleAlign 'knows' where to
match
> the start of $s1 and $s2 to the ref sequence $r.
LocatableSeq::start() does
> NOT indicate that (the LocatableSeq docs, and their usage, should
indicate
> that).
>
> Think about HSP alignments in a BLAST report; the start/end/strand
> coordinates are where the sequence in the alignment maps to the
original
> query or hit sequence.  They don't indicate where the hit maps to the
query
> (the alignment itself does that in a column-wise fashion).
>
> I'm not sure, maybe it needs to be more explicit in the documentation,
but
> SimpleAlign does not align the sequences for you (and it shouldn't be
> expected to).  There are much better (faster, more accurate) ways to
do
> that.
>
>> if($CLUDGE){
>> foreach(($r, $s1, $s2)){
>>  $_->seq( '.' x ($_->start - 1) . $_->seq )
>> }
>> }
>>
>> ## Prepare an 'output stream' for the alignment:
>> my $aliWriter = Bio::AlignIO->
>>  new( -fh     => \*STDOUT,
>>      -format => 'clustalw',
>>    );
>>
>> warn "\nOUTPUT:\n";
>> $aliWriter->write_aln($aln);
>
> ...
>
>> I was calling the "fill in the gaps yourself" step a CLUDGE because I
>> had expected the alignment object to take care of this for me.  Is
>> there any reason that it couldn't do this 'CLUDGE' automatically? It
>> seems strange that it insists on being passed locatable sequence
>> objects, but then largely ignore the given location.
>>
>> Would it not be possible to have this happen when the sequences are
>> written out from the alignment? I think it should still be possible
to
>> index the column number via the (gapless) sequence number... or did I
>> get confused? There are two levels of confusion here (on my part), 1)
>> the concepts behind the objects and 2) the implementation details.
>
> Mentioned above (no assumptions on how locatableseqs map to one
another).
>  WYSIWYG.  There is nothing precluding you from writing up code to do
that,
> though it doesn't belong in SimpleAlign.  Maybe Bio::Align::Utilities
for
> post-processing padding, or Bio::Tools::PurePerlAlign for a pure perl
> alignment implementation (there are, believe it or not, pure perl
> implementations of Smith-Waterman and Needleman-Wunsch.
>
>> Thanks for any hints on how to understand or potentially how to fix
>> these problems.
>>
>> Cheers,
>> Dan.
>
>
> Not that SimpleAlign and LocatableSeqs don't have their share of
problems.
>  However, I don't think you can expect this behavior to change with
the
> refactors.
>
> chris
>
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l




More information about the Bioperl-l mailing list