[Bioperl-l] bioperl reorganization

Fri Jul 17 21:23:01 UTC 2009

I was going to write a longer post, but Jay wrote everything I was going 
to write, plus more, and did a better job.

Jason Stajich wrote:
> For some other obvious modules that can be split off and self-contained, 
> each of these could be a package.  I would estimate more than 20 
> packages depending on how Bio::Tools are carved up.
>  - I think Bio::DB::SeqFeature needs to be split off for sure; this is a 
> nice logical peeling off.  Could be another test case since it is a 
> Gbrowse dependency.
>  -  Bio::DB::GFF as well for the same reasons.
>  -  Bio::PopGen - self contained for the most part, but depends on 
> Bio::Tree and Bio::Align objects
>  -  Bio::Variation
>  -  Bio::Map and Bio::MapIO
>  -  Bio::Cluster and Bio::ClusterIO
>  -  Bio::Assembly
>  - Bio::Coordinate

Oh, this is a nice list.  <saves it>

> What do you want to do about bioperl-run?  Do we make a set of 
> parallel splits from all of these?  I think at the outset we need to 
> coordinate the applications supported here in some sort of loose 
> ontology - the namespaces were not consistently applied so we have some 
> alignment tools in different directories, etc.  So the namespace sort of 
> classifies them but it could be better.  One of the challenges of 
> multiple developers without a totally shared vision on how it should be 
> done.

I would say that alignment tools (for example) should probably not all 
go into the same distribution.  If Alice wrote one alignment tool and 
Bob wrote another, and the two are not really related beyond doing 
similar things and possibly sharing some dependencies, they should go 
in separate distributions.

> I'm not convinced that the Bio::Graphics splitoff has been painless so 
> we should take stock of how that is working.

Yes, let's.  I would like to hear more about that.

> It seems like this split off would be a way to better streamline things 
> in bioperl so that modern versions of bioperl might be able to better 
> interface with things like Ensembl again too.

Once things are less monolithic, developing and releasing *should* be a 
LOT easier.  As Jay also mentioned, it becomes more like this: on Tuesday 
Charlie notices a bug in Bio::Foo::Bar and fixes it.  He pushes the fix 
to CPAN (with a small version bump) immediately afterward.  Users pick it 
up via Task::BioPerl.  That's it.

Or, how about a slightly longer case study:
Say on Wednesday Charlie notices that the design of Bio::Foo::Bar sucks 
and it really needs some work.  He codes furiously for however long it 
takes, makes Bio::Fooer::Bar or something like that, in a new 
distribution, and pushes it to CPAN.  Initially, no other modules are 
going to be using it, but then say Jason, the maintainer of 
Bio::SeqIO::fasta, notices that hey, Bio::Fooer::Bar is a lot better 
than Bio::Foo::Bar.  Then he can just use it, test his new 
Bio::SeqIO::fasta with it, put it in his dist's Build.PL as a 
dependency, and push to CPAN.  Now it's getting pulled in with 
Task::BioPerl and *USERS* now have been given that improvement, probably 
in only a matter of days.  There are automated tests at every step of 
the process to ensure quality throughout.
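
To make the Build.PL step concrete, here is a rough sketch of what 
Jason's Build.PL might look like after he switches over.  It assumes 
Bio::SeqIO::fasta has become its own distribution built with 
Module::Build; Bio::Fooer::Bar is the made-up module from the story, and 
the version numbers are placeholders, not real releases.

```perl
# Build.PL for a hypothetical stand-alone Bio::SeqIO::fasta distribution
use strict;
use warnings;
use Module::Build;

my $build = Module::Build->new(
    module_name => 'Bio::SeqIO::fasta',   # assumed distribution layout
    license     => 'perl',
    requires    => {
        # the improved (hypothetical) module replaces Bio::Foo::Bar here
        'Bio::Fooer::Bar' => '0.01',      # placeholder minimum version
    },
    build_requires => {
        'Test::More' => 0,                # needed to run the test suite
    },
);

# Writes the ./Build script used to build, test and install the dist
$build->create_build_script;
```

Because the dependency is declared in "requires", a CPAN client 
installing this distribution would fetch Bio::Fooer::Bar automatically.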

Or for larger changes, coordination among several distros may be 
necessary, but the nice thing is, exactly which ones those are is 
codified in all their Build.PL files!  Much less guessing and worrying 
about unintended consequences.  Things are abstracted into smaller 
chunks, which are much easier for developers to wrap their minds around, 
which means developing is easier, which leads to more contributors and 
accelerated development.

> How much of this effort is worth triaging on the current code versus the 
> efforts we want to make on a cleaner, simpler bioperl system that 
> appears to scare so many users (and potential developers) off.

If there were not so many person-years of development time already in 
BioPerl, I would probably be pushing for a ground-up rewrite to simplify 
things.  But as chromatic frequently says (he's fantastic, look him up), 
ground-up rewrites of large projects almost never work.  You lose a year 
(or multiple years) of person time rewriting instead of adding features, 
or if you also add features to the old version in parallel, you have to 
also port those features to the new version (over a really long time 
period).  It's theoretically possible to do, but in practice it almost 
never works, he says.  I don't know, I've never been involved in an 
attempt like that from start to finish.

> Okay I rambled, hope that was helpful.

Quite helpful!  Please keep it up if you can!

Rob