[Bioperl-l] Re: UniGene modules and Bio::Cluster

Sun, 5 May 2002 23:34:14 -0400 (EDT)

On Mon, 6 May 2002, Andrew Macgregor wrote:

> Hi Allen,
>
> Allen Day wrote:
>
> >> 1. rename UniGeneIO.pm --> IO.pm
> >
> > By convention of SeqIO, I think it should be called unigene (note
> > lowercase), and should be under ClusterIO.  Have a look at how the
> > SeqIO::game classes are laid out.  So I think it should go something like:
> >
> > Bio/
> >   Cluster/
> >     ...
> >   ClusterIO.pm
> >   ClusterIO/
> >     unigene.pm
> >     unigene/
> >       ...
>
> I'll put it in lowercase (like the format module is). The mixed case was
> used because it is called UniGene by NCBI.

The reason we made it lowercase was consistency, so that we can load the
appropriate format module by name (see the eval { require ... } in the
SeqIO/TreeIO/SearchIO classes).
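
A minimal sketch of that loading pattern, not the actual SeqIO code: the
My::ClusterIO namespace, the -format argument handling and the stub format
module are assumptions made up for illustration. The stub is registered in
%INC only so the example runs without a file on disk.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a format module that would live in
# ClusterIO/unigene.pm.
package My::ClusterIO::unigene;
sub new { my ($class, %args) = @_; return bless {%args}, $class }
# Mark the module as already loaded so require() succeeds without a file.
BEGIN { $INC{'My/ClusterIO/unigene.pm'} = __FILE__ }

package My::ClusterIO;

# Factory: lowercase the requested format, load the matching module,
# and hand construction over to it.
sub new {
    my ($class, %args) = @_;
    my $format = lc($args{'-format'} || '');
    die "bad format name '$format'\n" unless $format =~ /^\w+$/;
    my $module = "My::ClusterIO::$format";
    (my $file = "$module.pm") =~ s{::}{/}g;
    eval { require $file; 1 }
        or die "cannot load format module $module: $@";
    return $module->new(%args);
}

package main;

# Mixed case from the caller still resolves to the lowercase module.
my $parser = My::ClusterIO->new(-format => 'UniGene');
print ref($parser), "\n";
```

Because the name is lowercased before the lookup, callers can write UniGene,
unigene or UNIGENE and still reach the same module file.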

>
> Looking around bioperl, it does seem that the IO.pm approach used by
> Variation is less common than putting a ClusterIO.pm at the top level with
> the format modules in a directory, like SeqIO.pm and SeqIO/. If I change
> UniGeneIO.pm to be ClusterIO.pm at the top level, I'll just need the
> format module in the ClusterIO directory.
>
The difference in styles is due to the different preferences of the various
developers. I have a slight preference for ClusterIO over Cluster::IO, but
there really is no difference.

> So then for other clusters only a format module should need to be written,
> is that right? My layout then would be:
>
> Bio/
> |   ClusterIO.pm
> |   ClusterIO/
>     |   unigene.pm
>
> |   Cluster/
> ...
>
> >I'd like a next_component() method that can be
> Can you show me how that would be done? Are there examples of similar
> things elsewhere in bioperl?
>
Hang on - ClusterIO is the parser.  It is only going to have 1 or
2 methods - next_cluster() and possibly write_cluster() - which produce
or consume, respectively, Bio::Cluster::ClusterI compliant objects.
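
As a rough sketch of what a next_cluster() style parser could look like: the
My::ClusterIO::toy package, the -fh argument and the record layout (one
"KEY value" pair per line, records ending in "//") are all assumptions for
illustration, and it returns plain hash references where the real thing
would return ClusterI objects.

```perl
use strict;
use warnings;

package My::ClusterIO::toy;

sub new {
    my ($class, %args) = @_;
    return bless { fh => $args{'-fh'} }, $class;
}

# Return the next cluster as a hash reference, or undef at end of input.
sub next_cluster {
    my ($self) = @_;
    my $fh = $self->{fh};
    my %cluster;
    while (defined(my $line = <$fh>)) {
        chomp $line;
        if ($line eq '//') {              # end-of-record marker
            return \%cluster if %cluster;
            next;
        }
        my ($key, $value) = split ' ', $line, 2;
        next unless defined $key;         # skip blank lines
        $cluster{$key} = defined $value ? $value : '';
    }
    # Hand back a trailing record that lacks the terminator, if any.
    return %cluster ? \%cluster : undef;
}

package main;

# Made-up input, read through an in-memory filehandle.
my $data = "ID Hs.1\nTITLE first toy cluster\n//\n"
         . "ID Hs.2\nTITLE second toy cluster\n//\n";
open my $fh, '<', \$data or die "cannot open in-memory handle: $!";

my $in = My::ClusterIO::toy->new(-fh => $fh);
my @ids;
while (my $cluster = $in->next_cluster) {
    push @ids, $cluster->{ID};
}
print join(',', @ids), "\n";
```

The parser only ever hands out whole clusters; anything finer grained would
be a method on the object it returns.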

I realize it is hard to properly abstract the generic methods out of a
cluster object when you don't have a second cluster system to compare
against, but I would imagine that ClusterI would have the following methods:

primary_id()          - unique id number
name()                - unique name
overall_description() - the common description
length()              - length of the cluster

At some point, when we get the assembly objects going, we will probably want
ClusterI to inherit from some part of the Assembly objects, since
these clusters represent an assembly of sorts - so some methods
for accessing where overlaps start/end, etc. would need to be added.

The ClusterI object would inherit from Bio::Annotation::CollectionI so
that you could attach bibliographic and generic annotation to it
(like the list of all the annotations for the individual cDNAs, ESTs,
etc).

each_seq_name() - names of all the sequences in the cluster
each_seq_acc()  - accessions of all the sequences in the cluster

I'm sure there are more things; this is just off the top of my head w/o
looking at a UniGene file very closely.  Specific things like unigene_id
would be part of a specific Bio::Cluster::Unigene object, but the Unigene
object might alias the primary_id() method to unigene_id().

Hmm, I wonder if we should qualify some of these methods as specific to
sequence clusters, since gene clusters would be coming out of expression
data clustering....  That would mean creating a Bio::Cluster::SeqClusterI
for the methods dealing with length and all the assembly stuff.  This might
be trying to mix too much - but I don't want to write ourselves into a
corner either.

-j

> Cheers, Andrew.

-- 
Jason Stajich
Duke University
jason at cgt.mc.duke.edu