[Bioperl-l] Splits again

Thu Jun 28 08:23:10 UTC 2007

Sendu Bala wrote:
> Chris Fields wrote:
>> On Jun 27, 2007, at 5:43 PM, Sendu Bala wrote:
>>> What advantage is there of these defined splits instead of 
>>> individual modules? As I see it you lose some of the potential 
>>> benefits of breaking Bioperl up completely, whilst also suffering 
>>> the maintenance problems I outlined in my objection to Steve's post.
>>>
>>> Being able to work on all Bioperl from a single cvs (ne svn) check 
>>> out/ archive, whilst distributing it as individual modules on CPAN 
>>> seems like the best of both worlds to me. What am I missing?
>>
>> Okay, forewarned, but here's my long-winded reasoning.  The short and 
>> sweet version: I (very) respectfully don't agree with you, at least 
>> re: the idea we should commit all modules to CPAN independently. It 
>> doesn't make any sense to me, but maybe you can elaborate more?  
>> Maybe I'm misinterpreting what you mean?
> 
> The short and sweet version: my proposal has all the benefits of yours,
> but none of the disadvantages. What's not to like?
> 
> 
>> Finally, all of this should wait until later.  Much later, like after 
>> a decent release, after svn, etc kind of 'later'.  I think we can 
>> agree on that.
> 
> Hmm, not really. If it can be implemented by a change in just Build.PL
> and ModuleBuildBioperl, it's really independent of everything else.
> That's the beauty of it: the only thing that changes is how things are
> uploaded to and downloaded from CPAN. The only person that normally
> deals with that issue is the pumpkin for a release, and he only cares
> about it at release time.
> 
> In fact, if we're going to do it at all it makes sense to try it out on
> a minor release like 1.5.3. We've already got experience of doing it
> split-style from 1.5.2. (And let me tell you: splits at the code-base
> level suck.)
> 
> 
>> Individual CPAN modules:
>>
>> CPAN is not our personal versioning system; it may be if a 
>> distribution consists of only a few modules, but not when it's one of 
>> the largest distros present.  If someone wants to update an 
>> individual bioperl module for a quick bug fix they are more than 
>> welcome to download it via cvs, svn, or even using a web browser, and 
>> replace the one they have.
> 
> And where is the harm in letting them do it via CPAN as well? In fact,
> there are significant benefits:
> 
> 
>> I'm trying to reason how one could break up the individual SeqIO/
>> SearchIO/otherIO modules into single module distributions.  They are 
>> intrinsically tied together (SeqIO::genbank won't work w/o SeqIO, 
>> which relies on the various interfaces, RootIO, and on down).  How 
>> would tests be run off CPAN when the modules are distributed 
>> independently?
> 
> Bio::SeqIO::genbank would have a dependency on the latest version of
> Bio::SeqIO (etc.), and Bio::SeqIO would have its own dependencies.
> 
> So when a user wants to get the latest version of Bio::SeqIO::genbank,
> they no longer have to worry about what other modules in its dependency
> hierarchy they should also install.
> 
> Instead they just request Bio::SeqIO::genbank which itself ensures you
> have the latest version of all its dependencies before installing itself
> and running its tests.

This was my thinking when I first brought this up at the
beginning/splitting of this thread. Thinking of modules as the
constituent parts of a larger package should make it far easier for
people to define dependencies, and users would only need to install
the parts they require. As Sendu points out, if the user wants to
convert seqs from genbank to fasta they could simply install
Bio::SeqIO::genbank and Bio::SeqIO::fasta and they would get all the
other modules that are the dependencies of Bio::SeqIO::genbank and
Bio::SeqIO::fasta.
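
A minimal sketch of that resolution step, written in Python purely for
illustration. The dependency table is hypothetical and much smaller
than the real Bioperl dependency tree, and install_set is a made-up
helper, not part of CPAN or Bioperl.

```python
# Hypothetical dependency table: illustrative only, not the real
# Bioperl dependency tree.
DEPS = {
    "Bio::SeqIO::genbank": ["Bio::SeqIO"],
    "Bio::SeqIO::fasta": ["Bio::SeqIO"],
    "Bio::SeqIO": ["Bio::Root::IO"],
    "Bio::Root::IO": ["Bio::Root::Root"],
    "Bio::Root::Root": [],
}


def install_set(requested, deps=DEPS):
    """Return every module needed to satisfy the requested modules."""
    needed = set()
    stack = list(requested)
    while stack:
        module = stack.pop()
        if module in needed:
            continue  # already collected via another dependency path
        needed.add(module)
        stack.extend(deps.get(module, []))
    return needed


wanted = ["Bio::SeqIO::genbank", "Bio::SeqIO::fasta"]
for name in sorted(install_set(wanted)):
    print(name)
```

The user names only the two format modules; everything else in the
set comes from following the declared dependencies.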

> 
> When a dev makes a major bugfix to Bio::SeqIO::genbank that all genbank
> users should have, he could just call './Build dist Bio::SeqIO::genbank'
> which would generate a new package for Bio::SeqIO::genbank suitable for
> uploading to CPAN. No more long release cycles and having to constantly
> tell people to 'use CVS' to get working Bioperl code.

However, how would the test suite work with this? E.g. when someone
installs Bio::SeqIO::genbank they want the tests associated with
Bio::SeqIO::genbank to be run. Would some tests be run redundantly
if, for example, someone installed both Bio::SeqIO::genbank and
Bio::SeqIO::fasta?
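
One way to reason about the redundancy question, as a sketch only: if
the installer builds (and tests) each module once and skips anything
already handled, a shared dependency such as Bio::SeqIO is tested a
single time even when two modules require it. The table and the
install_order helper below are hypothetical, and whether a real
installer behaves this way is an assumption here, not something
established in this thread.

```python
# Hypothetical dependency table: illustrative only.
DEPS = {
    "Bio::SeqIO::genbank": ["Bio::SeqIO"],
    "Bio::SeqIO::fasta": ["Bio::SeqIO"],
    "Bio::SeqIO": ["Bio::Root::IO"],
    "Bio::Root::IO": ["Bio::Root::Root"],
    "Bio::Root::Root": [],
}


def install_order(requested, deps):
    """Depth-first post-order walk: dependencies come before the
    modules that need them, and each module appears exactly once."""
    order = []
    seen = set()

    def visit(module):
        if module in seen:
            return  # already scheduled, so its tests are not rerun
        seen.add(module)
        for dep in deps.get(module, []):
            visit(dep)
        order.append(module)

    for module in requested:
        visit(module)
    return order


order = install_order(["Bio::SeqIO::genbank", "Bio::SeqIO::fasta"], DEPS)
for position, module in enumerate(order, start=1):
    print(position, "build and test", module)
```

Under that assumption the only tests run are each module's own, once
per module, in dependency order.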

> 
> 
>> Would they also be individually distributed?  What  would you use to
>> tie all the individual modules together?  How would  you explain to
>> the CPAN maintainers that you want to split bioperl  into 990
>> individual modules, all updated independently, but intend on  bundling
>> them afterwards anyway?
> 
> They would be tied together by a CPAN bundle. You don't have to
> 'explain' anything to the CPAN maintainers because you're not doing
> anything wrong. In fact, you're using it the way you're supposed to.

Yep. Real modules are released as modules, each with their own set of
dependencies. Then use CPAN bundles the way they were supposed to be
used: for distributing a set of CPAN modules that make a coherent set
of functionality. You "could" also bundle in other authors' modules,
e.g. Bio::ASN1::EntrezGene?

> 
> 
>> Splitting up core:
>>
>> As I see it, here are the advantages of a defined split as Steve and 
>> I see it (off the top of my head).  Some of this probably reiterates 
>> my previous points, as well as Steve's, so apologies in advance.
> 
> Below I answer with how it would be with my single-module approach
> compared to the defined splits.
> 
> 
>> - A lean, mean, focused set of bioperl base modules (core) w/o or 
>> with very few external deps, minimal installation issues, etc.  The 
>> very basic stuff to get up and running.
> 
> Even leaner, even more focused.
> 
> 
>> - BioPerl bundled modules (Nathan's 'cliques') with defined, focused 
>> functionality, code, and tests, which add a bit more 'sugar' to the 
>> base functionality of the core.  If you only care about parsing BLAST 
>> reports, get SearchIO, which requires core and optionally other 
>> modules (XML::SAX).  If you want additional DB functionality apart 
>> from the very basic ones in core, install DB (with its additional 
>> requirements, including core, DBI, and so on).  Same with Graphics, 
>> Tools, Tree/Phylo, etc.  We just need to define and limit the number 
>> of splits.
> 
> The same can be achieved with CPAN bundles for each kind of functional
> grouping you can think of. And since it's just a single text file that
> defines such a grouping, it's easy to change or add new ones as you feel
> like it, as opposed to the rather more permanent and substantial effort
> of creating one of your splits on the code-base level.
> 
> Also, the world doesn't have to rely on /our/ ideas of what a useful
> functional split is. If someone just wants to parse Blast results, they
> can just use CPAN to install Bio::SearchIO::blast_pull instead of having
> to install all of SearchIO.
> 
> 
>> - Easier to add additional bundled modules.  For instance, I could 
>> focus all of my RNA work into a discrete set of modules (say, bioperl-
>> rna) which I maintain, I ensure works with the latest core code, I 
>> ensure also plays well with the other children =) , and I distribute 
>> via CPAN.  Same with EUtilities, which could go into a separated DB-
>> related set or stay in core.
> 
> And if you lose interest in them? They eventually die because they no
> longer have someone looking after them by default (the pumpkin and other
> devs). Alternatively you could just make a CPAN bundle. One text file!
> Easy! No duplication of modules in CPAN, no new hassle for you or the
> Bioperl 'core' pumpkin to ensure that the latest version of each work
> with each other and other splits.

Hmm, how would module versions be handled? Wouldn't this approach
require each module to have its own independent version number, which
could then be used for building the dependencies? Each new release of
that module would only bump that module's version number.

Bundles can specify the minimum version of a module to be installed,
such that bug fixes to individual modules can be released to CPAN and
would automatically get picked up when installing bundles etc.
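
A rough sketch of that minimum-version check, in Python for
illustration only. The module versions are invented, needs_upgrade is
a hypothetical helper, and plain dotted-integer comparison is a
simplification of how Perl actually compares version numbers.

```python
def parse_version(text):
    """Turn '1.5.3' into (1, 5, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


# Hypothetical bundle: module name -> minimum version required.
BUNDLE = {
    "Bio::SeqIO": "1.5.3",
    "Bio::SeqIO::fasta": "1.5.2",
    "Bio::SeqIO::genbank": "1.5.4",
}

# Hypothetical local installation; fasta is not installed at all.
INSTALLED = {
    "Bio::SeqIO": "1.5.3",
    "Bio::SeqIO::genbank": "1.5.2",
}


def needs_upgrade(bundle, installed):
    """List modules that are missing or older than the bundle minimum."""
    outdated = []
    for module, minimum in sorted(bundle.items()):
        have = installed.get(module)
        if have is None or parse_version(have) < parse_version(minimum):
            outdated.append(module)
    return outdated


for module in needs_upgrade(BUNDLE, INSTALLED):
    print("would install or upgrade", module)
```

Bumping only the genbank minimum in the bundle is enough for the fix
to be picked up, while the already current Bio::SeqIO is left alone.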

I'm not quite sure how the current stable/dev releases would work. I
assume bug fixes would have to be made on a branch, e.g. branch 1.6, and
released to CPAN from there. Then when the next stable release is made,
all module versions would be bumped and released to CPAN, along with
any modifications to the content of the bundle. Is it possible to have
stable and developer release bundles that specify the minimum stable
and developer module versions respectively?

> 
> 
>> - If we want a full-fledged 'install everything', the CPAN Bundle 
>> system is available.  I think it's easier to use a Bundle for 4-5, 
>> even 10 groups of modules as opposed to over 900.
> 
> No, it isn't any easier. It's /equally/ easy to install a bundle of 900
> packages of 900 modules as it is to install 5 packages of 900 modules.
> 
> When not installing absolutely everything, but perhaps 'most' things,
> there's the additional benefit that it would be easier to skip a
> particular Bio::module because you didn't want to install its external
> dependencies and weren't that interested in it anyway.
> 
> 
>> - A Bundle or a build file where discrete distributions are listed 
>> (Bio::SearchIO, etc) wouldn't need to be updated every time a new 
>> module is added to a distribution.  I suppose this could be 
>> automated, but why have the additional headache?
> 
> Yes, it would be automated, and no, it wouldn't at all be any kind of
> additional headache. I'm proposing a fully-automated system that the
> pumpkin wouldn't even have to think about. Much /less/ of a headache
> than dealing with splits. Orders of magnitude easier to deal with.
> 
> 
>> - A chance to cut out some cruft.  We all know that particular areas 
>> need work or a complete overhaul (Restriction, Structure, maybe a few 
>> others).  Smaller, concentrated sets of modules I believe would be 
>> easier to maintain, and those that don't get use will eventually fall 
>> out of favor and may be lost or replaced from the more maintained 
>> group of modules.  Survival of the fittest.
> 
> And the smallest, most concentrated set of modules is the individual
> module.
> 
> 
>> - We already have had practice; bioperl-db, bioperl-run, bioperl-
>> network, and others.  Those that have been routinely maintained and 
>> enjoy wide use (db, run, network) have survived; others not so much 
>> (corba-related stuff, microarray, ext, etc., though the code is still 
>> available if someone else wants to take it up and revive it!).
> 
> The reason some of these existing splits (microarray, ext) have fallen by
> the way-side? /Because/ they're splits. If they had been part of
> bioperl-live all along, they'd have been kept in a working, compatible
> state and would have been released along with everything else in 1.5.2
> 
> 
>> Disadvantages of a defined split:
>>
>> - The initial headache of identifying which groups go where, 
>> coordinating with those who rely on bioperl (GMOD, etc) on how this 
>> will be set up, so on...
> 
> No need to worry about this with individual modules.
> 
> 
>> - Separate groups of modules require testing together to ensure 
>> functionality is consistent and maintained (something I think you 
>> pointed out previously).
> 
> No need to worry.

Maybe need to worry about how the tests are run when installing
individual modules etc.?

> 
> 
>> - I think an increased possibility of branching is possible.
>>
>> - Extra headaches for devs, who have to keep track of the various 
>> critical distributions and make sure they work well together.
> 
> No headaches.
