[Bioperl-l] alignIO::fasta bug
Chris Fields
cjfields at illinois.edu
Thu Jan 22 22:54:09 EST 2009
On Jan 22, 2009, at 3:33 PM, Jason Stajich wrote:
> FYI - it appears that if the last sequence in a FASTA MSA is all
> gaps (which happens in some whole genome synteny+multiple alignment
> chunking like Mercator) no alignment is returned.
>
> yuck. It basically comes down to this bit of code where $end would
> equal sequence length.
>
> # If $end <= 0, we have either reached the end of
> # file in <> or we have encountered some other error
> if ( $end <= 0 && ) {
> undef $aln;
> return $aln;
> }
>
> This start/end requirement of locatable seq is nice but kind of a
> pain where I am managing the map of sequences outside of alignment
> chunk.
I faced this issue with Mauve parsing (Bio::SeqIO::xmfa). I set it up
so LocatableSeqs can now have all gaps, but the alphabet needs to be
set (get a warning otherwise) and start and end need to be initiated
to 0, which is how I believe Mauve defined these. However, should 0-0
be a valid start/end for such a sequence? Should we change that to
automatically allow start = end = X (any position including 0) if a
sequence is all gaps or empty?
If we come up with some rough ideas of how to handle this we can add
some examples to the test suite and try getting LocatableSeq to do the
right thing. We can always mark them as TODO.
> Why not just check to see that the number of seq characters is 0 -
> an all-gapped sequence as the last sequence of the file should still
> be legal.
>
> Instead:
> if ( length($seqchar) == 0 ) {
> undef $aln;
> return $aln;
> }
>
> Although that would invalidate an empty alignment like this -- do we
> want to still permit these?
> >A
>
> >B
>
> >C
These could be zero-length, empty seqs (start = end = undef). I
thought something was added to PrimarySeq recently for empty seqs.
> Also I've locally implemented possibility of parsing start/end from
> the header line that is part of Mercator output and I think a
> variant of UCSC headers that
> look like this for MFA.
> >Cp scaffold_1.3086:644980-660265+
Is there a way we can have a callback option for parsing out the
data? e.g. pass in everything after '>' and the LocatableSeq
instance, the callback parses the string for whatever info and sets
the LocatableSeq attributes accordingly. We could default to some
built-in coderefs for common regexes if a callback isn't defined.
> I can commit this to main-trunk so as to not interfere with branch
> release. Do any of the LocatableSeq/AlignIO::Fasta maintainers want
> to comment?
>
> -jason
+1. I think we should incorporate this into 1.6.1 along with any
other LocatableSeq fixes and tests.
chris
More information about the Bioperl-l
mailing list