[Bioperl-l] Unigene proposal and basic implementation

Jason Stajich jason@cgt.mc.duke.edu
Sun, 21 Apr 2002 23:50:22 -0400 (EDT)


Cc-ing the list in case I am misstating anything or others have input.

On Mon, 22 Apr 2002, Andrew Macgregor wrote:

> Should this be UniGeneI.pm with functions like this:
>
> sub unigene_id{
>    my ($self) = @_;
>    $self->throw_not_implemented();
> }
>
Yeah, that is correct.  The question is always to what extent you are
writing methods in an interface that are exclusive to a certain
implementation.  This will probably come with time and a little tweaking.
I'm not 100% sure I remember all the nuances of your design, but we can
always add/remove methods from the UnigeneI interface, so go ahead and put
all of them in there, even if it seems silly, and we can discuss it in
more detail as the code unfolds.  An example of a similar decision is the
Location objects, where we wrote a Location::SplitLocationI interface
which is only implemented by the Location::Split object.  However, it is
conceivable that someone might want to reimplement this in their own
system, so separating the interface is good.  It is also conceivable that
someone else will dream up their own implementation of the UnigeneI
interface (one that might reside in a database rather than in memory, as
yours probably will), and having the interface allows user code to treat
both objects as the same.
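
To make that concrete, here is a minimal, self-contained sketch of the
idea.  All of the My::* package names and the 'Hs.2' value are made up for
illustration, and a plain die stands in for throw_not_implemented() so the
snippet runs without BioPerl installed; a hash plays the part of the
database.

```perl
use strict;
use warnings;

# Hypothetical interface: the method only complains until a subclass
# overrides it.  (In BioPerl this would call throw_not_implemented().)
package My::UnigeneI;

sub unigene_id {
    my ($self) = @_;
    die ref($self) . " does not implement unigene_id\n";
}

# In-memory implementation of the interface.
package My::Unigene;
our @ISA = ('My::UnigeneI');

sub new {
    my ($class, %args) = @_;
    return bless { id => $args{-unigene_id} }, $class;
}

sub unigene_id { return $_[0]->{id} }

# Stand-in for a database-backed implementation; a hash reference
# plays the role of the database handle here.
package My::DB::Unigene;
our @ISA = ('My::UnigeneI');

sub new {
    my ($class, %args) = @_;
    return bless { db => $args{-db}, key => $args{-key} }, $class;
}

sub unigene_id {
    my ($self) = @_;
    return $self->{db}{ $self->{key} };
}

package main;

# User code relies only on the interface, not on where the data lives.
sub describe {
    my ($cluster) = @_;
    return 'cluster ' . $cluster->unigene_id;
}

my $in_memory = My::Unigene->new(-unigene_id => 'Hs.2');
my $in_db     = My::DB::Unigene->new(-db => { row1 => 'Hs.2' }, -key => 'row1');
print describe($_), "\n" for ($in_memory, $in_db);
```

describe() never needs to know which of the two classes it was handed,
which is the whole point of keeping the interface separate.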

> for each function? Or should it be ClusterI and if so what should I put in
> that? I can't see how ClusterI could be kept generic because of all the
> unigene specific stuff that would have to be in it.
>

Hmm - you should design ClusterI such that we can write a
TIGRClusterIndexI or a GenericESTClusterI interface as well and not have
to know anything about the UnigeneI.

Going back to your proposed interface - let's try to abstract out what
the basics of a cluster are.  It will be a collection of sequences
(or sequence references -- accessions) which will likely share a common
set of annotations.  Anything else?

The UniGene part of the cluster is going to be additional annotation like
genes, expressed tissues (via libs), mapped genome and cyto location,
mapped STSs, protein similarity, etc.  So you might do better to use a
general annotation system of key/value pairs for the ClusterI - perhaps
even the AnnotationCollectionI interface, which would allow you to set
'Gene' and 'ExpressedLib' keys with associated values.  If you wanted to
do fancy things with sequences (like retrieve them on the fly using
Bio::DB::GenBank), that could be implemented in the UniGene.pm module.
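
A rough sketch of what I mean by a generic cluster with key/value
annotation.  Everything here is hypothetical: the My::Seq::Cluster name,
its methods, and the accessions and values are placeholders, and the
hash-of-arrays is only a stand-in for a real AnnotationCollectionI
implementation, not that interface's actual API.

```perl
use strict;
use warnings;

# Hypothetical generic cluster: a list of member accessions plus a
# key => [values] annotation store.  Nothing UniGene-specific in here.
package My::Seq::Cluster;

sub new {
    my ($class, %args) = @_;
    return bless {
        members     => [ @{ $args{-members} || [] } ],
        annotations => {},
    }, $class;
}

# Member accessions, and how many there are.
sub members { return @{ $_[0]->{members} } }
sub size    { return scalar @{ $_[0]->{members} } }

# A key may carry several values (e.g. many expressed libraries).
sub add_annotation {
    my ($self, $key, $value) = @_;
    push @{ $self->{annotations}{$key} }, $value;
    return $self;
}

# All values stored under a key, or an empty list if there are none.
sub get_annotations {
    my ($self, $key) = @_;
    return @{ $self->{annotations}{$key} || [] };
}

sub annotation_keys {
    my ($self) = @_;
    my @keys = sort keys %{ $self->{annotations} };
    return @keys;
}

package main;

# Placeholder accessions and annotation values, not real records.
my $cluster = My::Seq::Cluster->new(-members => [qw(ACC0001 ACC0002 ACC0003)]);
$cluster->add_annotation(Gene         => 'GENE1');
$cluster->add_annotation(ExpressedLib => 'liver');
$cluster->add_annotation(ExpressedLib => 'brain');

printf "%d members; expressed in: %s\n",
    $cluster->size, join(', ', $cluster->get_annotations('ExpressedLib'));
```

The UniGene-specific parts then become a matter of which keys you set,
rather than extra methods on the generic interface.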

This is just my first reaction without thinking about it too much - so
perhaps I'm oversimplifying.  Not sure if anyone else has tackled a
general representation of clusters like this in other projects.  If they
have, I'm sure someone will tell me where I'm being short-sighted.  Just
an aside - we might want to be more specific with the ClusterI name - like
Seq::ClusterI - so that people don't confuse it with a potential
gene expression Expression::ClusterI one day.  But that can be up for
debate if others want to throw it in the root of Bio::.

But go ahead and do what you can, and send out the specific methods and
concepts that you are having difficulty generalizing so we can all feed
back our opinions.  You get to make the choice on what you think is best
and makes sense for what you need done in the long run.  If we're calling
for too much generality for your liking, we can always retool things later
if/when someone wants to interface with the TIGRGeneIndexClusterI.

> > b) you have a good set of tests written for your objects, tests go in the
>
> No problems here, only what kind of tests should I have? Just ones that test
> new() and then each function i.e. next_unigene()? I've looked at the various
> *.t files but am not sure how detailed I should be etc.
>

Your call - the more tests the better, since that way you have basically
tested every function.  Extreme programming methodology encourages you to
write your tests before you finish writing your module and to write a test
for each method.  Your call again, but the more thorough the tests, the
better the end product will be IMHO.  You can of course go overboard, but
no one's been flamed on this list for having too many tests for a
module...  Of course I just expanded the SearchIO.t to 539 tests...
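
For a feel of the level of detail, here is a small sketch of a test
script that checks new() and then walks next_unigene() to exhaustion.
It is written against a toy My::UnigeneStream class defined inline
(hypothetical, as are the IDs) and uses Test::More so it runs on its
own; adapt it to whatever test module the existing *.t files use.

```perl
use strict;
use warnings;
use Test::More tests => 5;

# Toy stand-in for the real parser: hands back one ID per call.
package My::UnigeneStream;

sub new {
    my ($class, @ids) = @_;
    return bless { queue => [@ids] }, $class;
}

# Returns the next ID, or undef once the stream is exhausted.
sub next_unigene {
    my ($self) = @_;
    return shift @{ $self->{queue} };
}

package main;

# Made-up IDs, just to have something to iterate over.
my $in = My::UnigeneStream->new('Hs.1', 'Hs.2');

isa_ok($in, 'My::UnigeneStream');                       # constructor works
can_ok($in, 'next_unigene');                            # method exists
is($in->next_unigene, 'Hs.1', 'first record');          # normal case
is($in->next_unigene, 'Hs.2', 'second record');
is($in->next_unigene, undef,  'undef when exhausted');  # boundary case
```

One test per method for the normal case, plus one for each boundary
condition you can think of, is a reasonable place to start.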

> Cheers, Andrew.
>

-- 
Jason Stajich
Duke University
jason@cgt.mc.duke.edu