[Bioperl-l] Bio::Location::{Simple,Fuzzy} and "IN-BETWEEN"

George Hartzell hartzell at alerce.com
Thu Mar 19 17:54:12 UTC 2009


Heikki Lehvaslaiho writes:
 > George,
 > 
 > Chris is right.
 > 
 > You are not supposed to use fuzzy ever! It was introduced only because
 > in the olden times sequencing was difficult and you knew that your
 > sequence feature starts before your actual sequence. The early
 > EMBL/GenBank design decision was to mark that with like "CDS <1..2344"
 > when you knew that your sequence did not start from the start of the
 > coding region.
 > 
 > You annotate something always in relation to the reference sequence.
 > If there is something, like an insertion in Chris' example, you use
 > IN-BETWEEN notation where the start and end have to be adjacent
 > residues. There is nothing fuzzy in that location, so do not try to
 > add it.

Thanks for the feedback.  Sorry for the delay in following up, I was
(yay) skiing for the weekend and then working on a presentation and
digging myself back out (of work, not a snowbank).

I'm kind of stuck with some of this: I need to work with the way my
community thinks about it (e.g. annotating changes 6 bases into an
intron by putting features on a cDNA with positions like 33+6).  I'm
trying to make it more logical going forward, but....

The larger thing that I'm trying to do is track observed changes in
sample sequences relative to some reference (e.g. at position 77 the C
changed to a T, or between position 99 and 100 there was an insert of
TTTTATT, or deletion of the region between 223 and 256).  These
reference sequences are then aligned to a reference genome, frequently
with gaps.  There is also a set of transcripts/genes aligned to the
genome, frequently with gaps.
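
To make the three kinds of observation concrete, here is a minimal
sketch of how they might be recorded.  It is Python rather than Perl,
it is not BioPerl code, and the class and field names are hypothetical.
Treating the deletion as an inclusive range is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObservedChange:
    """One observed difference between a sample and its reference (hypothetical record)."""
    kind: str       # "substitution", "insertion", or "deletion"
    start: int      # 1-based position on the reference
    end: int        # same as start for one base; start + 1 for an insertion
    ref: str = ""   # reference bases, if recorded
    alt: str = ""   # observed bases, if any

    def describe(self) -> str:
        if self.kind == "insertion":
            # An insertion sits between two adjacent residues: start^end.
            return f"{self.start}^{self.end} ins {self.alt}"
        if self.kind == "deletion":
            # Assumption: both endpoints are part of the deleted region.
            return f"{self.start}..{self.end} del"
        return f"{self.start} {self.ref}>{self.alt}"


changes = [
    ObservedChange("substitution", 77, 77, ref="C", alt="T"),
    ObservedChange("insertion", 99, 100, alt="TTTTATT"),
    ObservedChange("deletion", 223, 256),
]
for change in changes:
    print(change.describe())
```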

If everything aligned perfectly, then something that was at 8 in the
target might be at 80 in the genome and then at 18 in the transcript.
Ditto for 8^9 in the target, 80^81 in the genome, and 18^19 in the
transcript.
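
In the perfectly aligned case the mapping is just a constant shift.  A
small Python sketch (function names are hypothetical, and the offsets
are simply the ones implied by the numbers above):

```python
def shift_point(pos: int, offset: int) -> int:
    """Map a single position through an ungapped alignment."""
    return pos + offset


def shift_between(left: int, right: int, offset: int) -> tuple:
    """Map an IN-BETWEEN location (left^right); the two sides must be adjacent."""
    if right != left + 1:
        raise ValueError("IN-BETWEEN needs adjacent residues")
    return (left + offset, right + offset)


# Offsets implied by the example: 8 -> 80 -> 18.
TARGET_TO_GENOME = 80 - 8
GENOME_TO_TRANSCRIPT = 18 - 80

on_genome = shift_point(8, TARGET_TO_GENOME)
on_transcript = shift_point(on_genome, GENOME_TO_TRANSCRIPT)

between_genome = shift_between(8, 9, TARGET_TO_GENOME)
between_transcript = shift_between(*between_genome, GENOME_TO_TRANSCRIPT)
```

With no gaps, an IN-BETWEEN location stays IN-BETWEEN at every step,
because adjacency is preserved by a constant shift.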

If there's an insertion of bases 5-10 in the target relative to the
genome, then with Chris's "don't do that" solution (intentionally
overstated, sorry...) none of these features could be
attached/localized to the transcript.

If I mark them as "after 5" and "before 6" with some indication that
they're not really well located and then map that data up to the
transcript then I can still do things like "Tell me all of the
insertions that were observed in exon 3".  It's more problematic than
usual to do things like applying the mutations to the transcript
sequence and assuming that the resulting protein is "correct" or
"real", but that's another story.
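
A rough sketch of that idea, in Python and not tied to any BioPerl
class: positions that fall inside an alignment gap are bracketed by
the flanking aligned bases and flagged as inexact, which is still
enough for an exon query.  The block layout, the names, and the
coordinates (target bases 5-10 inserted between 80 and 81 of the other
sequence, an exon spanning 60-90) are all made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# (src_start, src_end, dst_start): an ungapped block, 1-based inclusive.
Block = Tuple[int, int, int]


@dataclass(frozen=True)
class Mapped:
    start: int
    end: int
    exact: bool  # False when the position fell in a gap and was only bracketed


def map_position(pos: int, blocks: List[Block]) -> Optional[Mapped]:
    """Map a source position through blocks sorted by src_start."""
    prev_dst_end = None
    for src_start, src_end, dst_start in blocks:
        if src_start <= pos <= src_end:
            dst = dst_start + (pos - src_start)
            return Mapped(dst, dst, exact=True)
        if pos < src_start:
            if prev_dst_end is None:
                return None  # lies before the first aligned base
            # In a gap: "after prev_dst_end" and "before dst_start".
            return Mapped(prev_dst_end, dst_start, exact=False)
        prev_dst_end = dst_start + (src_end - src_start)
    return None  # lies after the last aligned base


def in_exon(loc: Mapped, exon: Tuple[int, int]) -> bool:
    """True if the mapped (possibly bracketed) location lies within the exon."""
    exon_start, exon_end = exon
    return exon_start <= loc.start and loc.end <= exon_end


# Target bases 5-10 have no counterpart on the other sequence.
blocks = [(1, 4, 77), (11, 30, 81)]
exon3 = (60, 90)

hits = [p for p in (3, 7, 12, 40)
        if (loc := map_position(p, blocks)) is not None and in_exon(loc, exon3)]
```

Position 7 cannot be placed exactly, but its bracket still falls inside
the exon, so it is reported along with the exactly mapped positions.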

If I just use 5^6 then I have a hard time differentiating that from
something that was 5^6 in the target.

One way or the other it seems like I have to carry something around
out of band, designating when something's location is uncertain (I
know that it occurred at a position in a related/aligned sequence such
that it's after 5 and before 6 in this sequence) and keeping that
separated from the concept of "IN-BETWEEN".
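
One way to picture the out-of-band flag, as a Python sketch with
hypothetical names (this is not how Bio::Location stores anything): the
coordinates are kept as they are, and two independent flags record
whether the feature is a true IN-BETWEEN and whether its position was
only bracketed through an alignment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackedLocation:
    start: int
    end: int
    in_between: bool = False  # a real start^end location on this sequence
    bracketed: bool = False   # out of band: position only known as after/before

    def label(self) -> str:
        if self.bracketed:
            return f"after {self.start}, before {self.end}"
        if self.in_between:
            return f"{self.start}^{self.end}"
        return f"{self.start}..{self.end}"


# An insertion observed directly on this sequence.
real_insertion = TrackedLocation(5, 6, in_between=True)

# A change seen on an aligned sequence, landing somewhere between 5 and 6 here.
carried_over = TrackedLocation(5, 6, bracketed=True)

same_coordinates = (real_insertion.start, real_insertion.end) == (
    carried_over.start, carried_over.end)
```

The two records share coordinates but stay distinguishable, which is
the separation described above.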

Is Fuzzy deprecated?  It seems like it's useful for things like being
before the M, or after the end of this exon, or.....

g.


