[Bioperl-l] sequence filtering

Hilmar Lapp lapp@gnf.org
Tue, 8 Oct 2002 17:13:05 -0700 (PDT)


Apparently (and unfortunately) this didn't ring many bells with people.
I'm still looking forward to learning exactly how Biojava does this, even
though I've now looked through some of their interfaces.

It seems to me that what they do is not terribly different from the
following interface, Bio::Factory::ObjectBuilderI, which I propose as a
solution. There would be an implementation, Bio::Seq::SeqBuilder.

=head2 want_slot

 Title   : want_slot
 Usage   :
 Function: Whether or not the object builder wants to populate the
           specified slot of the object to be built.

           The slot can be specified either as the name of the
           respective method, or the initialization parameter that
           would be otherwise passed to new() of the object to be
           built.

 Example :
 Returns : TRUE if the object builder wants to populate the slot, and
           FALSE otherwise.
 Args    : the name of the slot (a string)


=cut

=head2 add_slot_value

 Title   : add_slot_value
 Usage   :
 Function: Adds one or more values to the specified slot of the object
           to be built.

           Naming the slot is the same as for want_slot().

           The object builder may further filter the content to be
           set, or even completely ignore the request.

           If this method reports failure, the caller should not add
           more values to the same slot. In addition, the caller may
           find it appropriate to abandon the object being built
           altogether.

 Example :
 Returns : TRUE on success, and FALSE otherwise
 Args    : the name of the slot (a string)
           parameters determining the value to be set


=cut

=head2 want_object

 Title   : want_object
 Usage   :
 Function: Whether or not the object builder is still interested in
           continuing with the object being built.

           If this method returns FALSE, the caller should not add any
           more values to slots, or else risks the builder throwing an
           exception. In addition, make_object() is likely to return
           undef once this method has returned FALSE.

 Example :
 Returns : TRUE if the object builder wants to continue building
           the present object, and FALSE otherwise.
 Args    : none


=cut

=head2 make_object

 Title   : make_object
 Usage   :
 Function: Get the built object.

           This method is allowed to return undef if no value has been
           added since the last call to make_object(), or if
           want_object() returned FALSE (or would have returned FALSE)
           before this method was called.

 Example :
 Returns : the object that was built, or undef in the cases above
 Args    : none


=cut
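
To make the intended call sequence concrete, here is a minimal sketch of
how a parser might drive such a builder. Only the four method names come
from the proposal above. Everything else is an assumption for
illustration: the package name My::SimpleBuilder, the -wanted and
-max_length options, the plain hash ref standing in for the built object,
and the list of [slot, value] pairs standing in for a parser. It is not
the Bio::Seq::SeqBuilder implementation.

```perl
use strict;
use warnings;

# Hypothetical toy builder implementing the four proposed methods.
package My::SimpleBuilder;

sub new {
    my ($class, %args) = @_;
    return bless {
        wanted     => { map { $_ => 1 } @{ $args{-wanted} || [] } },
        max_length => $args{-max_length},    # assumed filter criterion
        slots      => {},
        want       => 1,
    }, $class;
}

# TRUE only for slots the builder was configured to populate.
sub want_slot {
    my ($self, $slot) = @_;
    return $self->{want} && $self->{wanted}{$slot} ? 1 : 0;
}

# Stores the values, or reports failure and loses interest in the object.
sub add_slot_value {
    my ($self, $slot, @values) = @_;
    return 0 unless $self->want_slot($slot);
    if ($slot eq 'length' && defined $self->{max_length}
        && $values[0] > $self->{max_length}) {
        $self->{want} = 0;                   # too long: abandon this object
        return 0;
    }
    push @{ $self->{slots}{$slot} }, @values;
    return 1;
}

sub want_object { return $_[0]->{want} }

# Returns the collected slots (or undef) and resets for the next object.
sub make_object {
    my ($self) = @_;
    my $obj = ($self->{want} && %{ $self->{slots} }) ? $self->{slots} : undef;
    $self->{slots} = {};
    $self->{want}  = 1;
    return $obj;
}

package main;

# Stand-in for a parser: each record is a list of [slot, value] pairs.
sub build_all {
    my ($builder, @records) = @_;
    my @built;
    for my $record (@records) {
        for my $pair (@$record) {
            my ($slot, $value) = @$pair;
            last unless $builder->want_object;    # stop parsing this entry
            next unless $builder->want_slot($slot);
            $builder->add_slot_value($slot, $value);
        }
        my $obj = $builder->make_object;
        push @built, $obj if defined $obj;
    }
    return @built;
}

my $demo_builder = My::SimpleBuilder->new(
    -wanted     => [qw(display_id length seq)],
    -max_length => 1_000_000,
);
my @demo = build_all(
    $demo_builder,
    [ [ display_id => 'NM_0001' ], [ length => 1200 ],       [ seq => 'ACGT' ] ],
    [ [ display_id => 'NC_0001' ], [ length => 50_000_000 ], [ seq => 'NNNN' ] ],
);
print "$_->{display_id}[0]\n" for @demo;
```

The point of the design is that the parser never needs to know why an
entry is unwanted. It only asks want_object() and want_slot() before
doing the expensive work, so a chromosome-sized sequence line would be
skipped rather than loaded.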

What do people think?

	-hilmar
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp@gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------

On Tue, 8 Oct 2002, Hilmar Lapp wrote:

> I'm trying to pull the daily full RefSeq cumulative update through bioperl. Before even getting my hands dirty, I realized that this can't work because there are full chromosomes in there, and their sequences will choke perl. OTOH, I'm not interested in those anyway, and ideally I could just skip over sequences for which some property matches some pattern.
> 
> As always, there is more than one way to make this work, and I'm wondering what could be the (subjectively :) 'best' way in the absence of event-based parsing. Some options that crossed my mind:
> 
> a) pass an optional additional parameter to next_seq() which is a closure returning TRUE if the entry is to be parsed and returned and FALSE otherwise. For this option the questions would be, when to call this function (every line, every 'item', before feature table, before sequence, any combination of those?), and what to pass to the closure as argument (a hash map with properties? an instantiated Bio::SeqI object? the current line? the current slot that was parsed and its value? something else?).
> 
> b) create a SeqFilterI interface and pass an object implementing it. This is really just a more OO-form of a) and the same kind of questions need to be answered.
> 
> c) send events to an event listener, and skip over the sequence if any of the listeners returns FALSE (i.e., join by AND). This is again very similar to a), but more flexible and also more heavyweight (more method calls). Again, similar kinds of questions would need to be answered in order to define SeqParseEventI or a similar interface.
> 
> I'd be glad to hear anyone's thoughts on this. Also, I'm sure there are better ways. If you know one, I'd be glad to learn.
> 
> My preference goes for simplicity, and so far I don't think a) is that bad, although it does lack some flexibility.
> 
> 	-hilmar
>
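
For comparison, option a) from the quoted message could be sketched
roughly as follows. This is a toy reader over an invented "KEY value"
record format terminated by "//", not Bio::SeqIO, and next_record() is a
hypothetical name. It picks one possible answer to the open questions:
the closure is called after every parsed slot and receives the slot name
and its value.

```perl
use strict;
use warnings;

# Hypothetical toy reader: returns the next accepted record as a hash ref,
# or nothing at end of input. $accept is an optional closure.
sub next_record {
    my ($fh, $accept) = @_;
    my (%rec, $skip);
    while (defined(my $line = <$fh>)) {
        chomp $line;
        if ($line eq '//') {
            return \%rec if %rec && !$skip;
            # Rejected (or empty) record: start over with the next one.
            %rec  = ();
            $skip = 0;
            next;
        }
        next if $skip;    # cheap skip, no further parsing of this entry
        my ($slot, $value) = split ' ', $line, 2;
        next unless defined $value;
        # Ask the closure after each slot; FALSE rejects the whole entry.
        if ($accept && !$accept->($slot, $value)) {
            $skip = 1;
            next;
        }
        $rec{$slot} = $value;
    }
    return;
}

my $demo_data = join "\n",
    'ID NM_0001', 'LENGTH 1200',     'SEQ ACGT', '//',
    'ID NC_0001', 'LENGTH 50000000', 'SEQ NNNN', '//', '';
open my $demo_fh, '<', \$demo_data or die "cannot open in-memory handle: $!";

# Reject anything longer than one megabase.
my $small_only = sub {
    my ($slot, $value) = @_;
    return !($slot eq 'LENGTH' && $value > 1_000_000);
};

while (my $rec = next_record($demo_fh, $small_only)) {
    print "$rec->{ID}\n";
}
```

This only helps if the deciding property is seen before the sequence
itself, which is the "when to call this function" question in a).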