[Bioperl-l] Bio::SimpleAlign constructor?
Chris Fields
cjfields at illinois.edu
Mon Aug 24 13:36:32 UTC 2009
Dan, all,
Bio::SimpleAlign doesn't align anything for you. It makes no
assumptions about the data being added, beyond possibly checking for
the seqs to be flush prior to analyses.
Here's the reason why:
The object doesn't 'know' the seqs map across from one to the other as
below:
> ...
> ## REF tacattaaagacccg
> ## SEQ1 taca.taaa......
> ## SEQ2 .....taaaga.ccg
>
> my $aln = Bio::SimpleAlign->new();
>
> $aln->gap_char('.');
>
> my $r = Bio::LocatableSeq->new( -id=>'r', -seq=>'tacattaaagacccg' );
> my $s1 = Bio::LocatableSeq->new( -id=>'s1', -start=>1, -
> seq=>'taca.taaa' );
> my $s2 = Bio::LocatableSeq->new( -id=>'s2', -start=>6, -
> seq=>'taaaga.ccg' );
>
> $aln->add_seq( $r );
> $aln->add_seq( $s1 );
> $aln->add_seq( $s2 );
Above, you are making the assumption that SimpleAlign 'knows' where to
match the start of $s1 and $s2 to the ref sequence $r.
LocatableSeq::start() does NOT indicate that (the LocatableSeq docs,
and their usage, should indicate that).
Think about HSP alignments in a BLAST report; the start/end/strand
coordinates are where the sequence in the alignment maps to the
original query or hit sequence. They don't indicate where the hit
maps to the query (the alignment itself does that in a column-wise
fashion).
I'm not sure, maybe it needs to be more explicit in the documentation,
but SimpleAlign does not align the sequences for you (and it shouldn't
be expected to). There are much better (faster, more accurate) ways
to do that.
> if($CLUDGE){
> foreach(($r, $s1, $s2)){
> $_->seq( '.' x ($_->start - 1) . $_->seq )
> }
> }
>
> ## Prepare an 'output stream' for the alignment:
> my $aliWriter = Bio::AlignIO->
> new( -fh => \*STDOUT,
> -format => 'clustalw',
> );
>
> warn "\nOUTPUT:\n";
> $aliWriter->write_aln($aln);
...
> I was calling the "fill in the gaps yourself" step a CLUDGE because I
> had expected the alignment object to take care of this for me. Is
> there any reason that it couldn't do this 'CLUDGE' automatically? It
> seems strange that it insists on being passed locatable sequence
> objects, but then largely ignore the given location.
>
> Would it not be possible to have this happen when the sequences are
> written out from the alignment? I think it should still be possible to
> index the column number via the (gapless) sequence number... or did I
> get confused? There are two levels of confusion here (on my part), 1)
> the concepts behind the objects and 2) the implementation details.
Mentioned above (no assumptions on how locatableseqs map to one
another). WYSIWYG. There is nothing precluding you from writing up
code to do that, though it doesn't belong in SimpleAlign. Maybe
Bio::Align::Utilities for post-processing padding, or
Bio::Tools::PurePerlAlign for a pure perl alignment implementation
(there are, believe it or not, pure perl implementations of Smith-
Waterman and Needleman-Wunsch.
> Thanks for any hints on how to understand or potentially how to fix
> these problems.
>
> Cheers,
> Dan.
Not that SimpleAlign and LocatableSeqs don't have their share of
problems. However, I don't think you can expect this behavior to
change with the refactors.
chris
More information about the Bioperl-l
mailing list