[Bioperl-l] alignIO::fasta bug

Sat Jan 24 05:25:16 UTC 2009

On Jan 22, 2009, at 10:54 PM, Mark A. Jensen wrote:

> couple of thoughts...
>
> ----- Original Message ----- From: "Chris Fields" <cjfields at illinois.edu 
> >
> To: "Jason Stajich" <jason at bioperl.org>
> Cc: "BioPerl List" <bioperl-l at bioperl.org>
> Sent: Thursday, January 22, 2009 10:54 PM
> Subject: Re: [Bioperl-l] alignIO::fasta bug
>
>> On Jan 22, 2009, at 3:33 PM, Jason Stajich wrote:
>>
>>> FYI - it appears that if the last sequence in a FASTA MSA is all   
>>> gaps (which happens in some whole genome synteny+multiple  
>>> alignment  chunking like Mercator) no alignment is returned.
>>>
>>> yuck. It basically comes down to this bit of code where $end  
>>> would  equal sequence length.
>>>
>>> #  If $end <= 0, we have either reached the end of
>>> #  file in <> or we have encountered some other error
>>> if ( $end <= 0 &&  ) {
>>> undef $aln;
>>> return $aln;
>>> }
> [haven't looked at the code, but] On the surface, this looks like a  
> bit more
> responsibility than $end should be expected to handle, so I like  
> Jason's
> solution below better, which is only masquerading as a kludge.
>
>>>
>>> This start/end requirement of locatable seq is nice but kind of a   
>>> pain where I am managing the map of sequences outside of  
>>> alignment  chunk.
>>
>> I faced this issue with Mauve parsing (Bio::SeqIO::xmfa).  I set it  
>> up  so LocatableSeqs can now have all gaps, but the alphabet needs  
>> to be  set (get a warning otherwise) and start and end need to be  
>> initiated  to 0, which is how I believe Mauve defined these.   
>> However, should 0-0  be a valid start/end for such a sequence?   
>> Should we change that to  automatically allow start = end = X (any  
>> position including 0) if a  sequence is all gaps or empty?
>
> I don't like any old start==end implying zero length, even under the  
> condition
> that the underlying sequence is empty, since in the "1-origin,  
> endpoints" model
> that pervades BP (as opposed to the "0-origin, length" model of,  
> say, substr),
> the pair ($start, $end) has the strong connotation of "the residue  
> at $start"
> if $start==$end. (At least, it does now that we've fixed  
> LocatableSeq...)

It could just as easily be undef (no start/end).  However, see below...

> What if we consider 0 to be special, the 'sequence anchor', that  
> takes up no
> real space? I'm thinking of 'point' and 'mark' in emacs, that  
> actually point at
> the interstices between characters, and not the characters  
> themselves. Or \G
> for something perly. Then $start means not just the coordinate, but  
> the
> 'space' before the residue at $start,  and $end means the 'space'  
> after the
> residue at $end. If $start==$end > 0, how many residues between  
> $start and $end?
> One. If $start == $end == 0, how many residues? None, because the  
> anchor
> is special, it doesn't take up residues.
>
> If a sequence is all gaps, what's its length? It has an understood  
> anchor at 0,
> then the gap symbols are removed, so its length is the length of the  
> anchor
> alone, which is zero, and $start = 0 is the 'space' before the  
> anchor, and
> $end = 0 is the 'space' after the anchor.

Where this becomes sticky is taking alignment sections, a problem  
encountered when attempting to create interleaved output, for  
instance.  Let's say (for example) you have a long alignment that has  
large gaps (think genome alignment ala MUMmer or Mauve).  If you take  
a subsection of that alignment, you could very possibly end up with a  
sequence that is all gaps.  Here is an example from Rfam (note  
AE013109.1):

AL596170.1/181975-181872                  
AG................................................
BA000016.3/1419453-1419329                
UU..........................................AAUUAU
AE001437.1/2206791-2206643                
AGGAAUUAUUUUGAAGAACUUUAUAGACGUAGAA..........GAAAAA
BA000016.3/1419328-1419245                
GG................................................
AE013183.1/26-124                         
AG.............................................GAU
AE013109.1/12950-12813                    
GAGA........................................AUAUAG
AE013109.1 
/ 
13583 
-13495 
                    ..................................................
AF044978.1/176-335                        
AAUC........................................AUGCAA
X84262.1/148-332                          
CAGCUCAAAAUCCAUAUAUAA.......................ACAAGA
CR954253.1/898444-898260                  
UCUCUCAAAGUCUA..............................CUUAAA

So let's say we take the slice above from the alignment and create a  
new SimpleAlign.  We know the original coordinates for each  
LocatableSeq; should we override them and mark the start = end = 0?   
Or would we want to have a sequence that is indicated as between the  
end of the last known position and the beginning of the next  
(500^501)?  This also doesn't take into account start/end gaps; maybe  
for beginning gaps start = end = 0 and end gaps start = end = length.

For no sequence at all (undef) start = end = undef as well.

> Now this would mean say, $s->subseq(0, $n) eq $s->subseq(1, $n). I  
> don't
> think this is too troubling, since 1) it's consistent with the  
> concept above, and
> 2) zeros would only show up when the empty sequences are encountered.

Okay.

>> If we come up with some rough ideas of how to handle this we can  
>> add  some examples to the test suite and try getting LocatableSeq  
>> to do the  right thing.  We can always mark them as TODO.
>>
>>> Why not just check to see that the number of seq characters is 0  
>>> -  an all-gapped sequence as the last sequence of the file should  
>>> still  be legal.
>>>
>>> Instead:
>>> if ( length($seqchar) == 0 ) {
>>> undef $aln;
>>> return $aln;
>>> }
>>>
>>> Although that would invalidate an empty alignment like this -- do  
>>> we  want to still permit these?
>>> >A
>>>
>>> >B
>>>
>>> >C
>>
>> These could be zero-length, empty seqs (start = end = undef).  I   
>> thought something was added to PrimarySeq recently for empty seqs.
>
> Maybe the above concept would encompass empty sequences (gapped
> or ungapped) without recourse to undef.
>
> cheers, Mark

 From this discussion I think it's worth coming up with some kind of  
idea on what we expect LocatableSeq to do under various circumstances;  
the wiki is a good place to start something up.  I don't think that  
refactoring everything is necessary, but it might be worth putting out  
some variations for consideration.

For instance I have had the idea of making a new RangeI-based  
Bio::SeqI that would just delegate the RangeI methods to a 'source'  
SeqFeature containing a Bio::LocationI.  The PrimarySeqI would be a  
simplified LocatableSeq capable of dealing with subseq issues as  
discussed previously, but simplified. In fact, it should probably  
recognize any PrimarySeqI.

Advantages: can add annotation and features specific for the sequence,  
and (beyond simple decorated methods used to make access simple) it is  
a Bio::Seq and could be persisted in a BioSQL.  Disadvantages: lots of  
objects and thus slower.

chris