[Bioperl-l] split seq feature and fuzzy feature proposal

Hilmar Lapp lapp@gnf.org
Thu, 18 Jan 2001 17:28:09 -0800

Ewan Birney wrote:
> >
> > min_start()/max_start() etc should also be included. start() and
> > end() in an implementation are overridden and throw exceptions,
> > depending on which end is uncertain (at least they should be
> > expected to throw exceptions). A certain end can be determined by
> > min_start() == max_start() (or .._end(), resp.).
> I would be in favour of min_start/max_start but against letting start
> throw an exception. The implementation has to decide how to "become a hard
> feature" from being Fuzzy. It is up to the implementation. As long as this
> is documented, this is no more arbitrary than letting the client decide.

I think it is more arbitrary, and I'll tell you why. There is more
than one interpretation of fuzzy locations. I'll name two for which I
think the BioPerl core is not in a position to take the decision away
from the client, which is why it shouldn't pretend that it is:
1) Uncertainty about the real location, that is, it is clear that the
described feature sits at a particular position, but for one reason or
another the producer of the feature can only give an estimated range
for start and/or end. Now, we can implement (and document) the rule
that in such cases $feature->start() and $feature->end() will always
return the widest (or smallest, or average, make your choice) possible
range. A client is then free to rely on it, thinking that what the
BioPerl developers decided on is probably the wisest choice you can
make. That's already catch #1. Catch #2 happens if there is a user of
the client program who, because he's a good user, read the
documentation of the client program, but not that of BioPerl. Do we
require users of programs that use BioPerl to read through the BioPerl
documentation as well?
2) The location is undefined. A location saying <1..100 is undefined
for that feature in its biological meaning. You're not supposed to
make up a value for an undefined value. If you had an interface
dividing two integers and returning an integer (to prevent you from
responding NAN or INF), and the denominator is zero, what do you do?

I strongly believe that every client that does something sensible with
the feature coordinates should know what type of coordinates it is
dealing with, and should be required to check if it wants to be safe
from an exception. It is not the task of BioPerl to relieve the
client from thinking, but it is its task to provide all the information
the client needs to make an educated decision.

You can always divide by a number without checking for zero, but by
doing so you accept the risk that some day you might get an exception.
The same holds for clients calling $feature->start() instead of
obtaining the location object and examining it for its capabilities.
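
To make the comparison concrete, here is a minimal sketch of what I
have in mind. The package name My::FuzzyLocation and its hash layout
are made up for illustration and are not existing BioPerl code; only
min_start()/max_start() and the throwing start()/end() come from the
proposal itself.

```perl
use strict;
use warnings;

# Hypothetical fuzzy location (not an existing BioPerl class):
# each end is described by a minimum and a maximum coordinate.
package My::FuzzyLocation;

sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}

sub min_start { $_[0]{min_start} }
sub max_start { $_[0]{max_start} }
sub min_end   { $_[0]{min_end} }
sub max_end   { $_[0]{max_end} }

# An end is certain only if its minimum and maximum coincide;
# otherwise asking for a hard coordinate throws.
sub start {
    my $self = shift;
    die "start is uncertain\n" unless $self->min_start == $self->max_start;
    return $self->min_start;
}

sub end {
    my $self = shift;
    die "end is uncertain\n" unless $self->min_end == $self->max_end;
    return $self->min_end;
}

package main;

my $loc = My::FuzzyLocation->new(
    min_start => 1, max_start => 10, min_end => 100, max_end => 100);

# Careful client: examine the location before asking for a hard value.
my $start_is_certain = $loc->min_start == $loc->max_start;
my $end_is_certain   = $loc->min_end   == $loc->max_end;

# Careless client: call start() directly and accept the risk,
# just like dividing without checking the denominator.
my $start = eval { $loc->start };
my $error = $@;
```

The careful client never sees an exception; the careless one gets
undef from the eval and finds the reason in $error.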

Maybe I'm missing an important point in having $feature->start()
guaranteed to be exception-free.

> >
> > I indeed like the decoupled approach much better.
> >
> If we go for a decoupled approach I am keen on it being justified by more
> than just "it feels good". We are increasing the complexity here a lot and
> we need justification...

First, for clarification: I thought we agreed that we have different
interfaces, that is, SeqFeatureI (ISA RangeI) and LocationI (ISA
RangeI), don't we?

Regarding complexity, the question is whether we should have
subinterfaces for each of FuzzyLocation, CompoundLocation, etc (what
is etc?), or whether we pack everything into one interface. I have a
preference for the first, because it lets you find out the type of
location by checking $loc->isa('Bio::SomeLocationInterface'). I may be
missing another equally elegant way if everything's in one interface.
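
As a rough sketch of that isa() check: all package names below are
placeholders I made up (the real interface names are still to be
decided), and the "interfaces" are just empty packages wired together
through @ISA.

```perl
use strict;
use warnings;

# Placeholder interfaces; the names are hypothetical.
package My::LocationI;
sub new { my ($class, %args) = @_; return bless { %args }, $class }

package My::FuzzyLocationI;      # would add min_start()/max_start() etc.
our @ISA = ('My::LocationI');

package My::CompoundLocationI;   # would add access to sub-locations
our @ISA = ('My::LocationI');

# Placeholder implementations, one per location type.
package My::Simple;
our @ISA = ('My::LocationI');

package My::Fuzzy;
our @ISA = ('My::FuzzyLocationI');

package My::Compound;
our @ISA = ('My::CompoundLocationI');

package main;

# A client finds out what it is dealing with by asking which
# interface the location implements, most specific first.
sub describe {
    my ($loc) = @_;
    return 'fuzzy'    if $loc->isa('My::FuzzyLocationI');
    return 'compound' if $loc->isa('My::CompoundLocationI');
    return 'simple'   if $loc->isa('My::LocationI');
    return 'unknown';
}

my @kinds = map { describe($_) }
    (My::Fuzzy->new, My::Compound->new, My::Simple->new);
```

A client that does not care can treat all three as My::LocationI; one
that does care has a cheap way to tell them apart.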

The increase in complexity is fairly little I think. All interfaces
can be put into their own subdirectory (Bio::Loc?). Only those people
are really concerned with it who want to deal with the coordinates in
a very reliable way (that is, avoid exceptions and deal with any
possible sort of location type). And these people really should care
which types of location they could encounter, and what they mean.
Everyone else could simply use LocationI which in essence is probably
the same as RangeI.

Regarding your point that there can be many implementations of an
interface, sure that's true. In principle I have no problem with
$feature->location() returning $self, assuming that the SeqFeature
object implements LocationI itself. But I do think it's bad if a
SeqFeature implements every type of location interface itself, because
if I wanted to change the type of a feature's location I would end up
instantiating a SeqFeature passed to a SeqFeature as its location
object, which is weird, isn't it? I say weird because it's not
lightweight. No more of those beast-like classes, please. I don't
think the reduction in hierarchy complexity achieved by beast classes
makes them easier to learn, or to use.

You may ask why I wish to change the type of a location. Consider a
client program that draws features. When it encounters a feature with
a FuzzyLocation, it may want to ask the user what to do. The user may
even be able to set a preference like 'always take the widest possible
range'. Then the client program simply replaces the FuzzyLocation with
a Range object denoting the widest possible range and passes the
feature on to the drawing module. No code change necessary there. And
the user knows what he's doing; it's not just an arbitrary decision of
a backend library.
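
A small sketch of that scenario, assuming a feature whose location()
is a plain get/set accessor. My::Range, My::FuzzyLocation, My::Feature,
widest_range() and draw() are all invented for this example and do not
refer to existing BioPerl code.

```perl
use strict;
use warnings;

package My::Range;            # hypothetical hard range
sub new   { my ($class, %args) = @_; return bless { %args }, $class }
sub start { $_[0]{start} }
sub end   { $_[0]{end} }

package My::FuzzyLocation;    # hypothetical fuzzy location
sub new       { my ($class, %args) = @_; return bless { %args }, $class }
sub min_start { $_[0]{min_start} }
sub max_start { $_[0]{max_start} }
sub min_end   { $_[0]{min_end} }
sub max_end   { $_[0]{max_end} }

package My::Feature;          # feature holding a replaceable location
sub new { my ($class, %args) = @_; return bless { %args }, $class }
sub location {
    my $self = shift;
    $self->{location} = shift if @_;
    return $self->{location};
}

package main;

# The user's preference: 'always take the widest possible range'.
sub widest_range {
    my ($fuzzy) = @_;
    return My::Range->new(start => $fuzzy->min_start,
                          end   => $fuzzy->max_end);
}

# The drawing code only ever asks for hard coordinates.
sub draw {
    my ($feature) = @_;
    my $loc = $feature->location;
    return sprintf '%d..%d', $loc->start, $loc->end;
}

my $feature = My::Feature->new(location => My::FuzzyLocation->new(
    min_start => 1, max_start => 10, min_end => 90, max_end => 100));

# The client, not the library, decides how to harden the location.
$feature->location(widest_range($feature->location))
    if $feature->location->isa('My::FuzzyLocation');

my $drawn = draw($feature);
```

Note that draw() knows nothing about fuzziness; the decision was made
once, by the client, on behalf of the user.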

So, I still think that having not only individual interfaces, but also
individual implementations for the different location types is
justified, doesn't add too much complexity (in fact, it reduces hidden
complexity), and provides a clear API for programmers.

Long mail, sorry for taking up your time reading it, but you asked.


Hilmar Lapp                            email: lapp@gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757