[Bioperl-l] alignIO::fasta bug
Chris Fields
cjfields at illinois.edu
Sat Jan 24 00:25:16 EST 2009
On Jan 22, 2009, at 10:54 PM, Mark A. Jensen wrote:
> couple of thoughts...
>
> ----- Original Message ----- From: "Chris Fields" <cjfields at illinois.edu
> >
> To: "Jason Stajich" <jason at bioperl.org>
> Cc: "BioPerl List" <bioperl-l at bioperl.org>
> Sent: Thursday, January 22, 2009 10:54 PM
> Subject: Re: [Bioperl-l] alignIO::fasta bug
>
>> On Jan 22, 2009, at 3:33 PM, Jason Stajich wrote:
>>
>>> FYI - it appears that if the last sequence in a FASTA MSA is all
>>> gaps (which happens in some whole genome synteny+multiple
>>> alignment chunking like Mercator) no alignment is returned.
>>>
>>> yuck. It basically comes down to this bit of code where $end
>>> would equal sequence length.
>>>
>>> # If $end <= 0, we have either reached the end of
>>> # file in <> or we have encountered some other error
>>> if ( $end <= 0 && ) {
>>> undef $aln;
>>> return $aln;
>>> }
> [haven't looked at the code, but] On the surface, this looks like a
> bit more
> responsibility than $end should be expected to handle, so I like
> Jason's
> solution below better, which is only masquerading as a kludge.
>
>>>
>>> This start/end requirement of locatable seq is nice but kind of a
>>> pain where I am managing the map of sequences outside of
>>> alignment chunk.
>>
>> I faced this issue with Mauve parsing (Bio::SeqIO::xmfa). I set it
>> up so LocatableSeqs can now have all gaps, but the alphabet needs
>> to be set (get a warning otherwise) and start and end need to be
>> initiated to 0, which is how I believe Mauve defined these.
>> However, should 0-0 be a valid start/end for such a sequence?
>> Should we change that to automatically allow start = end = X (any
>> position including 0) if a sequence is all gaps or empty?
>
> I don't like any old start==end implying zero length, even under the
> condition
> that the underlying sequence is empty, since in the "1-origin,
> endpoints" model
> that pervades BP (as opposed to the "0-origin, length" model of,
> say, substr),
> the pair ($start, $end) has the strong connotation of "the residue
> at $start"
> if $start==$end. (At least, it does now that we've fixed
> LocatableSeq...)
It could just as easily be undef (no start/end). However, see below...
> What if we consider 0 to be special, the 'sequence anchor', that
> takes up no
> real space? I'm thinking of 'point' and 'mark' in emacs, that
> actually point at
> the interstices between characters, and not the characters
> themselves. Or \G
> for something perly. Then $start means not just the coordinate, but
> the
> 'space' before the residue at $start, and $end means the 'space'
> after the
> residue at $end. If $start==$end > 0, how many residues between
> $start and $end?
> One. If $start == $end == 0, how many residues? None, because the
> anchor
> is special, it doesn't take up residues.
>
> If a sequence is all gaps, what's its length? It has an understood
> anchor at 0,
> then the gap symbols are removed, so its length is the length of the
> anchor
> alone, which is zero, and $start = 0 is the 'space' before the
> anchor, and
> $end = 0 is the 'space' after the anchor.
Where this becomes sticky is taking alignment sections, a problem
encountered when attempting to create interleaved output, for
instance. Let's say (for example) you have a long alignment that has
large gaps (think genome alignment ala MUMmer or Mauve). If you take
a subsection of that alignment, you could very possibly end up with a
sequence that is all gaps. Here is an example from Rfam (note
AE013109.1):
AL596170.1/181975-181872
AG................................................
BA000016.3/1419453-1419329
UU..........................................AAUUAU
AE001437.1/2206791-2206643
AGGAAUUAUUUUGAAGAACUUUAUAGACGUAGAA..........GAAAAA
BA000016.3/1419328-1419245
GG................................................
AE013183.1/26-124
AG.............................................GAU
AE013109.1/12950-12813
GAGA........................................AUAUAG
AE013109.1
/
13583
-13495
..................................................
AF044978.1/176-335
AAUC........................................AUGCAA
X84262.1/148-332
CAGCUCAAAAUCCAUAUAUAA.......................ACAAGA
CR954253.1/898444-898260
UCUCUCAAAGUCUA..............................CUUAAA
So let's say we take the slice above from the alignment and create a
new SimpleAlign. We know the original coordinates for each
LocatableSeq; should we override them and mark the start = end = 0?
Or would we want to have a sequence that is indicated as between the
end of the last known position and the beginning of the next
(500^501)? This also doesn't take into account start/end gaps; maybe
for beginning gaps start = end = 0 and end gaps start = end = length.
For no sequence at all (undef) start = end = undef as well.
> Now this would mean say, $s->subseq(0, $n) eq $s->subseq(1, $n). I
> don't
> think this is too troubling, since 1) it's consistent with the
> concept above, and
> 2) zeros would only show up when the empty sequences are encountered.
Okay.
>> If we come up with some rough ideas of how to handle this we can
>> add some examples to the test suite and try getting LocatableSeq
>> to do the right thing. We can always mark them as TODO.
>>
>>> Why not just check to see that the number of seq characters is 0
>>> - an all-gapped sequence as the last sequence of the file should
>>> still be legal.
>>>
>>> Instead:
>>> if ( length($seqchar) == 0 ) {
>>> undef $aln;
>>> return $aln;
>>> }
>>>
>>> Although that would invalidate an empty alignment like this -- do
>>> we want to still permit these?
>>> >A
>>>
>>> >B
>>>
>>> >C
>>
>> These could be zero-length, empty seqs (start = end = undef). I
>> thought something was added to PrimarySeq recently for empty seqs.
>
> Maybe the above concept would encompass empty sequences (gapped
> or ungapped) without recourse to undef.
>
> cheers, Mark
From this discussion I think it's worth coming up with some kind of
idea on what we expect LocatableSeq to do under various circumstances;
the wiki is a good place to start something up. I don't think that
refactoring everything is necessary, but it might be worth putting out
some variations for consideration.
For instance I have had the idea of making a new RangeI-based
Bio::SeqI that would just delegate the RangeI methods to a 'source'
SeqFeature containing a Bio::LocationI. The PrimarySeqI would be a
simplified LocatableSeq capable of dealing with subseq issues as
discussed previously, but simplified. In fact, it should probably
recognize any PrimarySeqI.
Advantages: can add annotation and features specific for the sequence,
and (beyond simple decorated methods used to make access simple) it is
a Bio::Seq and could be persisted in a BioSQL. Disadvantages: lots of
objects and thus slower.
chris
More information about the Bioperl-l
mailing list