Notes Seq.pm Aln.pm

Mon, 17 Feb 1997 14:39:32 +0900 ()

[Note, these were copy-and-pasted from a wordprocessor, as I wrote the
note at home.  Some charaters, like apostrophes, didn't copy right.  I
hope no information is lost from this.]

Wow!  I cant believe how far this code has come since the drafts I read on
airplanes in December.  Both the Bio::Seq and Bio::Aln modules were really
a joy to read though.  The code was typically clear, well-documented, and
effectively implemented.  I cant thank you guys enough for doing such a
great job.

I spent most of my time on Bio::Seq, and will address that first.  As I
said, the code looks really great.  I found  small number of bugs and have
a couple of (hopefully) small design issues.  Once they are resolved, I
think that the code can be packaged and sent to CPAN as a public beta.

Most of the comments are written on paper and I dont have time to
transcribe them.  Who should I mail them to (and should I send a photocopy
to anyone else)?  Please give me a postal address.

The main nits were:
  [ \t] seem to be sometimes used as the opposite of \S when \s would be a
better option.
  Some of the package globals use hashes rather than the arrays that they
should
  There needs to be a validly-check bit somewhere in the code; is this
object valid or could it be bezerk?
  The alphabet checking idea and implementation both seem to have problems
  No mechanism of undefining values was provided
  Several functions return strings rather than Bio::Seq objects
  The layout functions should be autoloaded.  This can wait for later
releases, though.
  The numbering is limited to numerics which increment uniformly
  The parsing code seems to be a bit schizophrenic about strings versus
filenames

All of these issues can be dealt with pretty neatly, I think, except the
last one.  The actual code for doing the parsing was quite nice.  However,
there was the weird and poorly documented weird behavior regarding passing
around both $ent and $filename.  This seemed like duplication and a good
way for code to break in the future.  Moreover, right now the code is
quite inefficient in the way it reads in the entry (doing intrinsic
parsing) and then later tries to do the "real" parsing.  In addition to
causing a potentially problematic time delay, it can also use enormous
amount of memory if large files (e.g., a genome of 4.5Mb) were to be read
in.

I realized that this was the legacy of the way I had originally designed
the code and that you had simply tried to work-around my approach in a
general way.  Moreover, this half of one, half the other 
I think that now is the time to reconsider this approach and, perhaps, to
redesign (without, hopefully, too much re-coding) the whole file input.

On reconsideration, I think that it would be best for the object to take
an entirely file-based view of parsing data formats.  That is, all parsing
will rely on the data coming from a file.  If the data isnt in a file, it
will be written to a temporary file.  The only exceptions (at least
initially) would be raw and fasta, which would be parsed by entirely
different functions, separate from the ordinary parsing functions.

I think that while redesigning the code to approach things this way, we
should also address the issue of parsing (and writing) multiple sequences.
This requires some sort of Bio::Seq::Set object which simply contains a
set of sequences, probably keeping them in order (so it isnt really a set,
but a list).  Both Bio::Seq::Set and Bio::Seq could draw on the same
parsing routines, perhaps in Bio::Seq::Parse.   (This also opens a can of
worms, because some files with indices are expensive to open and close, so
it is worth trying to keep them open all the time if they will be
repeatedly accessed.  I have code to deal with such files, but that could
always be folded in later without affecting the interface to Bio::Seq.

In order to expedite matters, I suggest that Bio::Seq::Set be given an
entirely minimal interface initially, probably just the ability 1) to
produce a list of all included Bio::Seqs, NOT necessarily in order and 2)
to return a Bio::Seq with a requested id.

What do people think of this idea?

-=-=-

Bio::Seq::Aln was also very enjoyable to read, though I was able to spend
less time going through it in detail.  It seems simple yet powerful and
should be a very useful tool.   It does, however, focus very strongly on
the aspects of the alignment object which Georg required for his nucleic
acid alignments and essentially omits entirely those features which were
crucial for typical protein alignments.

Some of the missing features are:
  It has no support for non-numeric and non-contiguous numbering
  There is no linkage between the original sequence and the alignments,
especially for numeration.  (For example, there is no way to retrieve the
column corresponding to the 5th residue of protein foo_bar)
  It is inconvenient and inefficent at performing typical protein
alignment operations, such as extracting a column as a string
  No mechanism for dealing with structural alignments

Moreover, it seems that the internal design of the code makes
implementation of these features extremely difficult.  Lest there be any
doubt, I should make it clear that I dont blame Georg for this; I think
that it would have been hard, if not impossible, to reconcile the
different requirements.  That he only implemented the features he really
requires is also entirely reasonable and understandable; that he
implemented so much in the first place is totally commendable.

However, I have to ask if a module whose basic design effectively
precludes frequently-required operation should be permanently enshrined as
the default alignment object.  I dont think so, but I also think that
theres an easy solution.  Instead of calling the module Bio::Seq::Aln, why
dont we call it Bio::Seq::GFAln (to credit Georg) or Bio::Seq::NucAln (to
reflect its main use for nucleic acid alignments, even though it can be
used for proteins).  Then someone else might write a module more focusing
on protein needs called Bio::Seq::ProtAln.  Eventually, it may be possible
to merge the objects to form a Bio::Seq::Aln.  Alternatively, one of them
might grow dominant and its name could be changed to Bio::Seq::Aln (while
old code would go on unaffected by the name change).

What do people think?  In terms of code-change, it should just mean
changing the Makefile.PL and the package line in the module.

As noted, I didnt spend as much time combing through the code of
Bio::Seq::Aln.  All of my noted are on paper that Im mailing to Georg.

-=-=-

A general note for both pieces of code is that we should assume that
people will use the bioperl routines just like theyd use anything else
from CPAN.  That is, they will download it and install it in the default
way by just saying perl Makefile.PL; make all; make test; make install.
Users should never be expected to edit the files.  Additional
documentation about use lib is ok for people who might not have access to
the system site_perl directory, but the INSTALLDIR notes are likely to be
confusing and unhelpful. 

Further, since people should not ever edit the modules, code which relies
on external items (e.g., PGPLOT; ReadSeq) should be in a separate module
file.  My feeling is that there should be a separate ReadSeq module which
does nothing but provide a wrapper around the ReadSeq routines.

Steve

P.S.  The Bio::Seq module on Georgs site seems to be corrupted; I couldnt
untar, uncompress, or gunzip it.  The one from Chriss site worked fine.

P.P.S.  PLEASE, as soon as possible, send me the postal addresses to which
my written comments should be sent.