[Bioperl-l] Speed bumps

Tue, 26 Nov 2002 11:59:14 -0800

Hi Folks

Ewan asked about the speed bumps I see in the newbie learning process, so
here goes.

The speed bumps I encountered fall mostly into the “embarrassment of riches”
category.  BioPerl is a very large system, and it has taken a lot of time to
get my head around it (a process that is far from complete).

The main solutions are pretty obvious.  (BTW, I’m happy to help with these,
though not right away as I have a big grant deadline Feb 1).

1.  Documentation.  BioPerl needs a User Guide to complement the excellent
tutorials that are available and the voluminous “reference manual” provided
by the POD-based docs.

2.  More working applications that go beyond example scripts and show how to
use the software to do real work.  In my case, Lincoln’s
process_ncbi_to_gff.pl was very helpful in getting me over the hump with
GFF.

Another fairly obvious way to help people get started is to provide wrappers
for more tools and databases.  The very first time I tried to use BioPerl, I
was stymied by the lack of a wrapper for FASTA.  This is a never-ending
task, of course.

I also encountered some problems with remote database access over the web.
The problem, as noted recently on the mailing list, was due to the vagaries
of the web and the remote databases.  But, as this was one of my first
BioPerl experiences, the tendency was to blame BioPerl.  Better diagnostics
would have helped.  Perhaps the system could spit out progress messages –
“connecting to remote resource foobar…” – and eventually time out with a
sensible message.
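
To make the suggestion concrete, here is a rough sketch of the behaviour I
have in mind.  It is written in Python rather than Perl purely for
illustration and is not BioPerl code: the function name
fetch_with_diagnostics, the 30-second default, and the injectable opener
argument are all assumptions of mine.

```python
import socket
import sys
import urllib.error
import urllib.request


def fetch_with_diagnostics(url, timeout_seconds=30.0, log=sys.stderr,
                           opener=urllib.request.urlopen):
    """Fetch `url`, printing progress and raising a readable error on failure.

    Hypothetical helper for illustration only. `opener` defaults to
    urllib.request.urlopen and is a parameter so it can be swapped out.
    """
    print(f"connecting to remote resource {url} ...", file=log)
    try:
        with opener(url, timeout=timeout_seconds) as response:
            print(f"connected to {url}; reading response ...", file=log)
            return response.read()
    except urllib.error.URLError as exc:
        # urlopen wraps connection-phase failures (including timeouts).
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            raise RuntimeError(
                f"remote resource {url} did not respond within "
                f"{timeout_seconds} seconds; the server may be down or slow"
            ) from exc
        raise RuntimeError(
            f"could not reach remote resource {url}: {exc.reason}"
        ) from exc
    except (socket.timeout, TimeoutError) as exc:
        # Timeouts while reading the body are raised directly.
        raise RuntimeError(
            f"remote resource {url} did not respond within "
            f"{timeout_seconds} seconds; the server may be down or slow"
        ) from exc
```

The point is only that the user sees which remote resource is being
contacted and gets an error that names it, so the blame lands on the
network or the remote database rather than on the library.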

I lost a fair bit of time exploring the many parts of the system that deal
with sequence features and related concepts.  SeqFeature and its partner,
Location, are the official homes for these concepts, but some of Location’s
basic capabilities are provided by Range, and more advanced capabilities are
added by Coordinate.  DB::GFF also provides feature-like concepts, including
some of the advanced capabilities in Coordinate.  Gene-related features are
supported by Gene and friends in the SeqFeature hierarchy, by a separate
group of classes in the LiveSeq family, and by GeneMapper in the Coordinate
hierarchy.

When I set out to use features for the first time, I had to explore all of
these classes to figure out which ones to use for my particular purpose.  I
ended up writing and trashing a lot of code before settling on a reasonable
compromise.

I found the installation process to be quite smooth.  I’ve installed the
software on one system where I have root access and two where I don’t.  The
only problem I had on the root system was getting user accounts set up in
MySQL: the MySQL installation guide has incorrect information on how to do
this, although the main reference docs explain it quite clearly.  I can’t
remember if there were any problems on the non-root systems.  (MySQL was
already installed on those systems.)  There was certainly no trouble with
any of the Perl code.  I should point out that I already had my environment
set up to look for private Perl libraries, so I didn’t have to fiddle with
environment variables, which may be a problem for some users.  I can’t
remember what I did to get the executables for the plug-in tools in the
right place, or whether this was a problem.

A final, and probably more controversial, issue is that the software design
has tilted quite far in the direction of modularity at the expense of
integration.  Modularity makes life easier for people who are developing the
library, while integration makes life easier for people who are using it.
To illustrate the tradeoff, consider a design in which _all_ sequence
methods were defined in SeqI, rather than being introduced incrementally as
you walk down the sequence class hierarchy.  In this design, a person
working with sequences could see all methods in one place.  And if SeqI or a
companion Seq class provided sensible default implementations for all
methods, a person could safely use all methods (at least all retrieval
methods) without having to agonize over exactly what kind of sequence object
he had in hand.
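
A toy sketch of what I mean, again in Python purely for illustration.  The
class and method names (SeqBase, RichSeq, summarize, and so on) are
hypothetical and are not BioPerl's actual classes; the only point is that
every retrieval method lives in one place with a harmless default.

```python
class SeqBase:
    """Hypothetical base class: all retrieval methods are declared here,
    each with a safe default, so callers never need to check the subclass."""

    def __init__(self, residues, display_id=""):
        self._residues = residues
        self._display_id = display_id

    def seq(self):
        return self._residues

    def length(self):
        return len(self._residues)

    def display_id(self):
        return self._display_id

    def description(self):
        return ""   # default: no description

    def features(self):
        return []   # default: a plain sequence has no features

    def annotations(self):
        return {}   # default: no annotations


class RichSeq(SeqBase):
    """Hypothetical subclass that actually stores features."""

    def __init__(self, residues, display_id="", features=None):
        super().__init__(residues, display_id)
        self._features = list(features or [])

    def features(self):
        return list(self._features)


def summarize(seq):
    # Works for any SeqBase without asking what kind of sequence it is.
    return (f"{seq.display_id()}: {seq.length()} residues, "
            f"{len(seq.features())} features")
```

With this arrangement, summarize() can be handed either kind of object; a
plain sequence simply reports zero features instead of failing on a missing
method.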

SeqIO and AlignIO are examples of classes that are quite easy to use.
Making all of BioPerl this easy would be a significant challenge, but one
you may want to undertake as you become satisfied with the core
functionality.

Best,
Nat