[Bioperl-l] sequence filtering

Ewan Birney birney@ebi.ac.uk
Wed, 9 Oct 2002 08:13:56 +0100 (BST)


On Tue, 8 Oct 2002, Hilmar Lapp wrote:

> Apparently (unfortunately) it didn't ring a lot of bells for many people.
> I'm still looking forward how Biojava does this exactly, even though I've
> now looked through some of their interfaces.
>
> It seems to me what they do is not terribly different from the following
> interface Bio::Factory::ObjectBuilderI that I propose as a solution. There
> would be an implementation Bio::Seq::SeqBuilder.

I think this is a pretty sane process - in some way the "builder"
configuration object has to expose what it wants (and what it wants is set
by the user), with things like

(in the middle of SeqIO)

    if( $seqbuilder->want_slot('features') ) {
      # feature table parsing
    }


I think if you do it in an event based way, you insert a filter into the
EventStream, ie, there is an interface:

     SeqIOEventGeneratorI

which the parsers comply to, producing Event::SeqIO objects of pre
digested chunks of data, but not objectfied or fully parsed.


One then puts in something with Is-A SeqIOEventGeneratorI and has-a
SeqIOEventGeneratorI called

    SeqIO::Event::Filter

      methods - keep_event, discard_event


which will discard "non relevant events"



I don't know what the best pattern is here - the event pipeline is nice,
but the low level parser has to parse the whole file and because of
formatting differences, is probably going to have a significant overhead
so maybe your slot method is better...



>
> =head2 want_slot
>
>  Title   : want_slot
>  Usage   :
>  Function: Whether or not the object builder wants to populate the
>            specified slot of the object to be built.
>
>            The slot can be specified either as the name of the
>            respective method, or the initialization parameter that
>            would be otherwise passed to new() of the object to be
>            built.
>
>  Example :
>  Returns : TRUE if the object builder wants to populate the slot, and
>            FALSE otherwise.
>  Args    : the name of the slot (a string)
>
>
> =cut
>
> =head2 add_slot_value
>
>  Title   : add_slot_value
>  Usage   :
>  Function: Adds one or more values to the specified slot of the object
>            to be built.
>
>            Naming the slot is the same as for want_slot().
>
>            The object builder may further filter the content to be
>            set, or even completely ignore the request.
>
>            If this method reports failure, the caller should not add
>            more values to the same slot. In addition, the caller may
>            find it appropriate to abandon the object being built
>            altogether.
>
>  Example :
>  Returns : TRUE on success, and FALSE otherwise
>  Args    : the name of the slot (a string)
>            parameters determining the value to be set
>
>
> =cut
>
> =head2 want_object
>
>  Title   : want_object
>  Usage   :
>  Function: Whether or not the object builder is still interested in
>            continuing with the object being built.
>
>            If this method returns FALSE, the caller should not add any
>            more values to slots, or otherwise risks that the builder
>            throws an exception. In addition, make_object() is likely
>            to return undef after this method returned FALSE.
>
>  Example :
>  Returns : TRUE if the object builder wants to continue building
>            the present object, and FALSE otherwise.
>  Args    : none
>
>
> =cut
>
> =head2 make_object
>
>  Title   : make_object
>  Usage   :
>  Function: Get the built object.
>
>            This method is allowed to return undef if no value has ever
>            been added since the last call to make_object(), or if
>            want_object() returned FALSE (or would have returned FALSE)
>            before calling this method.
>
>  Example :
>  Returns : the object that was built
>  Args    : none
>
>
> =cut
>
> What do people think?
>
> 	-hilmar
> --
> -------------------------------------------------------------
> Hilmar Lapp                            email: lapp@gnf.org
> GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
> -------------------------------------------------------------
>
> On Tue, 8 Oct 2002, Hilmar Lapp wrote:
>
> > I'm trying to pull the daily full RefSeq cumulative update through bioperl. Before even getting my hands dirty, I realized that this can't work because there are full chromosomes in there, and their sequences will choke perl. OTOH, I'm not interested in those anyway and ideally I can just skip over sequences some property of which match some pattern.
> >
> > Like always, there is more than one way to make this work, and I'm wondering what could be the (subjectively :) 'best' way in the absence of event-based parsing. Some options that crossed my mind:
> >
> > a) pass an optional additional parameter to next_seq() which is a closure returning TRUE if the entry is to be parsed and returned and FALSE otherwise. For this option the questions would be, when to call this function (every line, every 'item', before feature table, before sequence, any combination of those?), and what to pass to the closure as argument (a hash map with properties? an instantiated Bio::SeqI object? the current line? the current slot that was parsed and its value? something else?).
> >
> > b) create a SeqFilterI interface and pass an object implementing it. This is really just a more OO-form of a) and the same kind of questions need to be answered.
> >
> > c) sending events to an event listener, and skipping over the sequence if any of the listeners returns FALSE (i.e., join by AND). This is again very similar to a) but more flexible but also more heavy-weight (more method calls). Again, similar kinds of questions would need to be answered in order to define SeqParseEventI or a similar interface.
> >
> > I'd be glad to hear anyone's thoughts on this. Also, I'm sure there are better ways. If you know one, I'd be glad to learn.
> >
> > My preference goes for simplicity, and so far I don't think a) is that bad, although it does lack some flexibility.
> >
> > 	-hilmar
> >
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
>