[Bioperl-l] Bio::Location::Fuzzy, Bio::Location::Split

Mark Dalphin mdalphin@amgen.com
Thu, 25 Jan 2001 14:06:48 -0800


Hilmar Lapp wrote:

> > >    What about in Fuzzy, do we want to throw exceptions or do we just use
> > >    the best information we have and do some logic and coordinate
> > >    gymnastics to try and return a reasonable value or else throw an
> > >    exception?
> > My gut says to return the inner-most coordinate that is known but
> > provide API to get the full fuzzy coordinates out - so
> >
> > full loc           -> start..end : minStart..maxEnd
> > <50..100>          -> 50..100    : -INF..+INF
> > (78.90)..(100.107) -> 90..100    : 78..107
> >
>
> I think I am much more in favor of returning the outer-most
> coordinates as the default policy. David, Mark? I'm also not sure
> whether INF or NaN are good return values in perl (i.e., can you test
> for INF or NaN by numeric comparison? I figured that e.g. you can't
> obtain NaN by sqrt(-1), as would be the result in C).
>
>         Hilmar

My inclination is also to select the outer-most ranges for the defined regions. I
understand the reason for selecting the "certainty" of the inner ranges, but most
of the biologists here (it seems to me...) would rather have "weak data showing
some potential" than "more certain data which risks missing something".
This is a philosophical issue that involves many end-users.  I would end up
writing it to take the outer-most to please my customers, but I am not sure that
it doesn't just give them more noise to wade through.
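
A minimal sketch of the two policies, using the (78.90)..(100.107) example
quoted above. The hash layout and the coords() helper are hypothetical, for
illustration only; they are not the Bio::Location::Fuzzy API.

```perl
use strict;
use warnings;

# Hypothetical representation: each edge keeps both bounds of its
# uncertainty range, so either policy can be applied later.
my %loc = (
    min_start => 78,  max_start => 90,
    min_end   => 100, max_end   => 107,
);

# Return (start, end) under the chosen policy: 'inner' or 'outer'.
# Anything other than 'inner' falls back to the outer-most coordinates.
sub coords {
    my ($loc, $policy) = @_;
    return $policy eq 'inner'
        ? ($loc->{max_start}, $loc->{min_end})
        : ($loc->{min_start}, $loc->{max_end});
}

printf "outer: %d..%d\n", coords(\%loc, 'outer');   # outer: 78..107
printf "inner: %d..%d\n", coords(\%loc, 'inner');   # inner: 90..100
```

Keeping all four bounds means the default can be outer-most while a caller who
wants the "certain" region can still ask for the inner one.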

For the uncertain edges, i.e. '<' and '>', I am not certain how best to handle
them in Perl.  There are really several cases here:
    1) The most common in GenBank, I believe, is where you just don't have more
sequence, so you end up with:
        CDS   <1..>$Seq_Len
    Here we are saying that we don't even have sequence to go with.  Displaying
it is not really a problem, usually.

    2) An uglier problem is when a gene-prediction program predicts an initial
"exon".  This "exon" is really only part of an exon, as the program only predicts
coding sequence and ignores the 5'-UTR. This might lead to:
        exon     <105..300
        CDS   join(105..300, 405..1004)
    Here we have the upstream sequence (5'UTR) and know the exon extends directly
upstream of position 105, but we don't really know where it starts.

I don't really know what to do with these.  I think the best we can do is
indicate it with a flag, similar to '<' or '>', whether we are drawing a picture
or trying to extract an "interesting" sequence from a genomic fragment.  I don't
think returning NaN or INF is correct; we have an uncertainty, but we certainly
don't have INF or even NaN. We need to pass on this "uncertainty" to the calling
program for it to express to the user.
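
One way the flag idea could look, as a rough sketch only: the known coordinate
stays an ordinary number and a separate boolean records that the edge is open.
The parse_range() and range_to_string() helpers and the hash keys are
hypothetical names, not an existing Bioperl interface, and the parser handles
only simple "start..end" ranges.

```perl
use strict;
use warnings;

# Hypothetical parser for simple GenBank-style ranges such as "<105..300".
# The '<' and '>' markers become flags; no INF or NaN is ever stored.
sub parse_range {
    my ($text) = @_;
    my ($lt, $start, $gt, $end) = $text =~ /^(<?)(\d+)\.\.(>?)(\d+)$/
        or die "cannot parse range '$text'";
    return {
        start      => $start,
        end        => $end,
        start_open => ($lt eq '<' ? 1 : 0),   # true: real start lies upstream
        end_open   => ($gt eq '>' ? 1 : 0),   # true: real end lies downstream
    };
}

# Rebuild the textual form so the caller can show the uncertainty to the user.
sub range_to_string {
    my ($r) = @_;
    return ($r->{start_open} ? '<' : '') . $r->{start} . '..'
         . ($r->{end_open}   ? '>' : '') . $r->{end};
}

my $exon = parse_range('<105..300');
print "start=$exon->{start} open=$exon->{start_open}\n";   # start=105 open=1
print range_to_string($exon), "\n";                        # <105..300
```

The calling program can then compare coordinates numerically as usual and
check the flags when it draws a picture or extracts sequence.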

Mark

--
Mark Dalphin                          email: mdalphin@amgen.com
Mail Stop: 29-2-A                     phone: +1-805-447-4951 (work)
One Amgen Center Drive                       +1-805-375-0680 (home)
Thousand Oaks, CA 91320                 fax: +1-805-499-9955 (work)