[Bioperl-l] Tree refactor? was Re: Bootstrap, root, reroot...

Thu Jul 16 05:44:40 UTC 2009

(Warning, longish response, I'll probably add to my blog at some point)

Robert,

I agree with you, but we've had this discussion before. Repeatedly,  
actually. I have a page in the wiki dedicated to it, having first  
raised the issue myself:

http://www.bioperl.org/wiki/Proposed_core_modules_changes

It also has a mention on the Core page:

http://www.bioperl.org/wiki/Core_package

In fact, I was planning on writing up a blog entry this week on this  
very thing to get the ball rolling again, but it probably should go  
here first anyway...

First: the problem we have consistently run into is exactly how to  
deliver a core set of modules in a way that works both for users and  
for release managers.   We have settled on one of the original  
proposals noted above, starting by roughly splitting up the current  
'core' into something based on similar functions and level of  
development/support.  bioperl-dev was part of that, for instance, and  
represents code we consider 'developer-only' or experimental.  The  
true 'core' would be a base set of modules with minimal additional  
dependencies (see below for how nebulous this becomes).

If you haven't already noticed, prior to 1.6.0 Bio::Graphics basically  
started the process (it's now an independent release on CPAN) and we  
already have a bioperl-dev.  As you've noted we can't split everything  
up right from the beginning, but we have started down that path.

Second: Bio::Tree seems independent of the other modules, but that's  
highly misleading. Bio::Species and Bio::Taxon (and thus anything that  
will use said objects, like Bio::Seqs, which are very much core) are  
now completely dependent on Bio::Tree code.  Both are-a  
Bio::Tree::NodeI, I believe since 1.5.2.  If we split that code off it  
then creates a circular dependency (Bio::Species, in core, requires  
Bio::Tree in the bio-tree package, which in turn requires  
Bio::Root::Root in the core package).  Bio::Tree code also has a  
Bio::DB::Taxonomy dependency, thus expanding core a little bit more.  
Similarly, Bio::Ontology classes are used by several key modules  
(Bio::Annotation::OntologyTerm comes to mind).  In other words, there are some parts  
of core that can't easily be split off w/o repercussions (and thus  
probably won't be).
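
To make the circularity concrete, here is a rough sketch (in Python, only because it is quick to run anywhere) of checking a distribution-level dependency map for cycles.  The maps are my simplification of the situation described above, not real package metadata, and the function name is made up for illustration.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of names, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    state = {}
    path = []

    def visit(node):
        state[node] = GREY
        path.append(node)
        for nxt in deps.get(node, ()):
            if state.get(nxt, WHITE) == GREY:
                # nxt is already on the current path: we have closed a loop
                return path[path.index(nxt):] + [nxt]
            if state.get(nxt, WHITE) == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for node in deps:
        if state.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical layout if Bio::Tree were split into its own "bio-tree" package.
split_layout = {
    "core": ["bio-tree"],   # Bio::Species / Bio::Taxon are-a Bio::Tree::NodeI
    "bio-tree": ["core"],   # Bio::Tree needs Bio::Root::Root
}

# Hypothetical layout where Bio::Tree stays in core.
merged_layout = {
    "core": [],
    "bio-graphics": ["core"],
}

print(find_cycle(split_layout))
print(find_cycle(merged_layout))
```

The first layout reports the core / bio-tree loop; the second has none, which is the argument for leaving Bio::Tree where it is.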

Third, and the largest issue in my opinion: what really  
constitutes 'core', not just to us but to current bioperl users?  To  
me, the idea of a true 'core' is the bare essentials (Seq, Features,  
Annotations, and some basic IO modules, the most common interfaces).   
Should 'core' include SearchIO, or AlignIO?  Remote and/or local DB  
functionality?  Bio::Tools?  All of those are feasibly independent  
sets of modules, and I would definitely support moving them into their  
own subdistributions, which would make it easier to fix bugs and  
release updates.  But I may be in the minority, as they are extremely  
popular and many users still consider them 'core'.  We need a  
workaround for that.

Finally (a wrap-up of bits and pieces): a) how are the various bio-*  
packages to be maintained?  Would there be several release pumpkins,  
one for each release?  b) How do we sort out versioning?  For  
instance, would bio-foo have a separate version (like Bio::Graphics  
now does) and require a specific core version?  c) I'm sure I have  
forgotten a few things, but I've rambled on enough already.

</breather>

Now, my suggestions.  We have settled on a general layout, so...

* Each subdistribution would have a separate version and require a  
specific core (Bio::Root::Root) version.  Note that Bio::Graphics is  
using a different versioning scheme than BioPerl, but we may want to  
stick to a tripartite numbering scheme similar to core's.  Whatever  
happens, this must be decided on first, as there will be no turning  
back.
* We repurpose Bundle::BioPerl (or a similar Bundle::* package) or  
make the BioPerl distribution itself a bundle-like installation.  This  
would be for packaging up an old-style 'everything and the kitchen  
sink' core package from the various distributions.  Anytime we split  
something off into its own distribution, we release a newly  
trimmed-down core and add the new distribution to the bundle or  
BioPerl.  Point everyone to the bundle if they want the old-style  
installation.
* Other current subdistributions (run, db, network, etc) follow the  
same pattern as the above.  Releases for non-core distributions do not  
have to be tied together with core except where needed.
* Avoid any circular dependencies (Bio::ASN1::EntrezGene, I'm staring  
at you).
* As you mention, work these out on branches to test things out.
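
As a rough illustration of the first point (each subdistribution carrying its own version plus a minimum core version), here is a small Python sketch.  The version numbers and helper names are hypothetical, and it assumes plain tripartite versions compared numerically part by part; it is not how Perl's own version handling works.

```python
def parse_version(text):
    """Turn a tripartite version string like '1.6.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def satisfies(installed, minimum):
    """True if the installed version is at least the required minimum."""
    return parse_version(installed) >= parse_version(minimum)


# Hypothetical subdistribution metadata: its own version, plus the
# minimum core (Bio::Root::Root) version it needs.
subdists = {
    "bio-foo": {"version": "0.1.0", "requires_core": "1.6.1"},
}


def unmet_requirements(core_version, subdists):
    """List the subdistributions whose core requirement is not satisfied."""
    return [
        name
        for name, meta in subdists.items()
        if not satisfies(core_version, meta["requires_core"])
    ]


print(unmet_requirements("1.6.0", subdists))
print(unmet_requirements("1.10.0", subdists))
```

Comparing the parts as numbers rather than as strings matters once any part reaches two digits, which is one reason to settle the numbering scheme up front.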

Finally, and I say this with the utmost respect and  
sincerest thanks for everything Sendu is doing and has done for  
BioPerl: I'm not convinced we should keep using Bio::Root::Build.  
It does make some things convenient, but at the cost of additional  
bugs (2-3 at last count), some API breakage (some methods conflict  
with Module::Build), and a bit of a chicken-and-egg dilemma that  
particularly impacts subdistributions (attempting to fall back to  
Module::Build doesn't work due to API issues).  I can elaborate on  
that more if asked, but I think this post is already long enough, so  
I'll leave that to later.

chris

On Jul 15, 2009, at 8:05 PM, Robert Buels wrote:

> Rather than putting this in bioperl-dev, perhaps this would be a  
> nice opportunity to make a new distribution called something  
> standard like "Bio-Tree", with a standard directory structure, and a  
> sane number of modules in it.
>
> I hadn't planned to start an actual battle about this yet, but I  
> would just like to get it out there that the current 'huge  
> monolithic distributions' model of BioPerl is completely insane.   
> Talking to people about BioPerl at YAPC::NA last month, I saw that  
> this is quite puzzling to the wider Perl community.  I was going to  
> say it was a laughingstock, but that's not actually the case.  They  
> are mostly puzzled and strongly suspect that it's not right.  Well,  
> the diplomatic ones do, anyway. Matt Trout (of DBIx::Class and  
> Catalyst fame) would probably yell and curse about it in a very  
> entertaining way.
>
> If things were in smaller distributions, making and testing releases  
> would be a lot easier, because the pieces of code you're testing and  
> releasing are smaller, and the dependencies among the pieces are  
> characterized, codified, and enforced via the Build.PL files of each  
> distribution.
>
> There, I said it.
>
> But aside from my inflammatory remarks above, this sort of thing  
> need not happen all at once.  The "Bio-Tree" distribution is a nice  
> example of how things could be extracted from or begun outside the  
> bioperl-* distributions, with the bioperl-* monolithic balls of mud  
> getting smaller as things are moved from them into their own  
> distributions. This needs to be done carefully, so things like  
> this should probably be done only with major releases, and with lots  
> of notifications and release notes and things like that.
>
> OK, now that I've said "this sucks and needs to change", I now go on  
> to volunteer to do work to make it happen.  I will take and execute  
> orders from you core developers saying things like "make a branch,  
> take this list of modules, copy them into a new distribution, move  
> their tests over, and write a Build.PL with the correct  
> dependencies", and later "merge the moved_thing_somewhere branch  
> into the some_other_branch and test it".  I bet somebody whose name  
> rhymes with "Jay Hannah" would probably do grunt work to help with  
> this also, but of course he would have to volunteer first.  ;-)  I  
> also volunteer to help teach others how to do this, but they have to  
> figure out how to use IRC.
>
> Oh, and I also volunteer to keep writing inflammatory emails.
>
> Rob