[Bioperl-l] alignIO::fasta bug
Chris Fields
cjfields at illinois.edu
Sun Jan 25 22:51:33 EST 2009
Not to mention the wiki's malleable and editable by anyone (so we can
all work on ideas for the interface and tests, then start coding
towards a decent solution).
Jason, any additional ideas? I would like to hammer out the specifics
on how to deal with various symbols, how we extract a subsequence via
subseq(), etc.
-c
On Jan 25, 2009, at 9:10 PM, Mark A. Jensen wrote:
> Chris--
> Yes, I see what you mean. Trickier than I thought. Sounds like its
> time to 'mature' LocatableSeq.
> Wiki's a great place for that-
> MAJ
> ----- Original Message ----- From: "Chris Fields" <cjfields at illinois.edu
> >
> To: "Mark A. Jensen" <maj at fortinbras.us>
> Cc: "Jason Stajich" <jason at bioperl.org>; "BioPerl List" <bioperl-l at bioperl.org
> >
> Sent: Saturday, January 24, 2009 12:25 AM
> Subject: Re: [Bioperl-l] alignIO::fasta bug
>
>
>>
>> On Jan 22, 2009, at 10:54 PM, Mark A. Jensen wrote:
>>
>>> couple of thoughts...
>>>
>>> ----- Original Message ----- From: "Chris Fields" <cjfields at illinois.edu
>>> >
>>> To: "Jason Stajich" <jason at bioperl.org>
>>> Cc: "BioPerl List" <bioperl-l at bioperl.org>
>>> Sent: Thursday, January 22, 2009 10:54 PM
>>> Subject: Re: [Bioperl-l] alignIO::fasta bug
>>>
>>>> On Jan 22, 2009, at 3:33 PM, Jason Stajich wrote:
>>>>
>>>>> FYI - it appears that if the last sequence in a FASTA MSA is
>>>>> all gaps (which happens in some whole genome synteny+multiple
>>>>> alignment chunking like Mercator) no alignment is returned.
>>>>>
>>>>> yuck. It basically comes down to this bit of code where $end
>>>>> would equal sequence length.
>>>>>
>>>>> # If $end <= 0, we have either reached the end of
>>>>> # file in <> or we have encountered some other error
>>>>> if ( $end <= 0 && ) {
>>>>> undef $aln;
>>>>> return $aln;
>>>>> }
>>> [haven't looked at the code, but] On the surface, this looks like
>>> a bit more
>>> responsibility than $end should be expected to handle, so I like
>>> Jason's
>>> solution below better, which is only masquerading as a kludge.
>>>
>>>>>
>>>>> This start/end requirement of locatable seq is nice but kind of
>>>>> a pain where I am managing the map of sequences outside of
>>>>> alignment chunk.
>>>>
>>>> I faced this issue with Mauve parsing (Bio::SeqIO::xmfa). I set
>>>> it up so LocatableSeqs can now have all gaps, but the alphabet
>>>> needs to be set (get a warning otherwise) and start and end
>>>> need to be initiated to 0, which is how I believe Mauve defined
>>>> these. However, should 0-0 be a valid start/end for such a
>>>> sequence? Should we change that to automatically allow start =
>>>> end = X (any position including 0) if a sequence is all gaps or
>>>> empty?
>>>
>>> I don't like any old start==end implying zero length, even under
>>> the condition
>>> that the underlying sequence is empty, since in the "1-origin,
>>> endpoints" model
>>> that pervades BP (as opposed to the "0-origin, length" model of,
>>> say, substr),
>>> the pair ($start, $end) has the strong connotation of "the
>>> residue at $start"
>>> if $start==$end. (At least, it does now that we've fixed
>>> LocatableSeq...)
>>
>> It could just as easily be undef (no start/end). However, see
>> below...
>>
>>> What if we consider 0 to be special, the 'sequence anchor', that
>>> takes up no
>>> real space? I'm thinking of 'point' and 'mark' in emacs, that
>>> actually point at
>>> the interstices between characters, and not the characters
>>> themselves. Or \G
>>> for something perly. Then $start means not just the coordinate,
>>> but the
>>> 'space' before the residue at $start, and $end means the 'space'
>>> after the
>>> residue at $end. If $start==$end > 0, how many residues between
>>> $start and $end?
>>> One. If $start == $end == 0, how many residues? None, because the
>>> anchor
>>> is special, it doesn't take up residues.
>>>
>>> If a sequence is all gaps, what's its length? It has an
>>> understood anchor at 0,
>>> then the gap symbols are removed, so its length is the length of
>>> the anchor
>>> alone, which is zero, and $start = 0 is the 'space' before the
>>> anchor, and
>>> $end = 0 is the 'space' after the anchor.
>>
>> Where this becomes sticky is taking alignment sections, a problem
>> encountered when attempting to create interleaved output, for
>> instance. Let's say (for example) you have a long alignment that
>> has large gaps (think genome alignment ala MUMmer or Mauve). If
>> you take a subsection of that alignment, you could very possibly
>> end up with a sequence that is all gaps. Here is an example from
>> Rfam (note AE013109.1):
>>
>> AL596170.1/181975-181872
>> AG................................................
>> BA000016.3/1419453-1419329
>> UU..........................................AAUUAU
>> AE001437.1/2206791-2206643
>> AGGAAUUAUUUUGAAGAACUUUAUAGACGUAGAA..........GAAAAA
>> BA000016.3/1419328-1419245
>> GG................................................
>> AE013183.1/26-124 AG.............................................GAU
>> AE013109.1/12950-12813
>> GAGA........................................AUAUAG
>> AE013109.1 / 13583
>> -13495 ..................................................
>> AF044978.1/176-335 AAUC........................................AUGCAA
>> X84262.1/148-332 CAGCUCAAAAUCCAUAUAUAA.......................ACAAGA
>> CR954253.1/898444-898260
>> UCUCUCAAAGUCUA..............................CUUAAA
>>
>> So let's say we take the slice above from the alignment and create
>> a new SimpleAlign. We know the original coordinates for each
>> LocatableSeq; should we override them and mark the start = end =
>> 0? Or would we want to have a sequence that is indicated as
>> between the end of the last known position and the beginning of
>> the next (500^501)? This also doesn't take into account start/end
>> gaps; maybe for beginning gaps start = end = 0 and end gaps start
>> = end = length.
>>
>> For no sequence at all (undef) start = end = undef as well.
>>
>>> Now this would mean say, $s->subseq(0, $n) eq $s->subseq(1, $n).
>>> I don't
>>> think this is too troubling, since 1) it's consistent with the
>>> concept above, and
>>> 2) zeros would only show up when the empty sequences are
>>> encountered.
>>
>> Okay.
>>
>>>> If we come up with some rough ideas of how to handle this we can
>>>> add some examples to the test suite and try getting
>>>> LocatableSeq to do the right thing. We can always mark them as
>>>> TODO.
>>>>
>>>>> Why not just check to see that the number of seq characters is
>>>>> 0 - an all-gapped sequence as the last sequence of the file
>>>>> should still be legal.
>>>>>
>>>>> Instead:
>>>>> if ( length($seqchar) == 0 ) {
>>>>> undef $aln;
>>>>> return $aln;
>>>>> }
>>>>>
>>>>> Although that would invalidate an empty alignment like this --
>>>>> do we want to still permit these?
>>>>> >A
>>>>>
>>>>> >B
>>>>>
>>>>> >C
>>>>
>>>> These could be zero-length, empty seqs (start = end = undef).
>>>> I thought something was added to PrimarySeq recently for empty
>>>> seqs.
>>>
>>> Maybe the above concept would encompass empty sequences (gapped
>>> or ungapped) without recourse to undef.
>>>
>>> cheers, Mark
>>
>> From this discussion I think it's worth coming up with some kind
>> of idea on what we expect LocatableSeq to do under various
>> circumstances; the wiki is a good place to start something up. I
>> don't think that refactoring everything is necessary, but it might
>> be worth putting out some variations for consideration.
>>
>> For instance I have had the idea of making a new RangeI-based
>> Bio::SeqI that would just delegate the RangeI methods to a
>> 'source' SeqFeature containing a Bio::LocationI. The PrimarySeqI
>> would be a simplified LocatableSeq capable of dealing with subseq
>> issues as discussed previously, but simplified. In fact, it should
>> probably recognize any PrimarySeqI.
>>
>> Advantages: can add annotation and features specific for the
>> sequence, and (beyond simple decorated methods used to make access
>> simple) it is a Bio::Seq and could be persisted in a BioSQL.
>> Disadvantages: lots of objects and thus slower.
>>
>> chris
>>
>
More information about the Bioperl-l
mailing list