New Bio::Seq and Bio::Seq::Parse (.025 BETA)

Steven E. Brenner brenner@akamail.com
Tue, 18 Mar 1997 21:12:35 +0900 (JST)


Ok; I'm logged in over a pay phone line at 2400 baud, so these have to be
really quick.  But I'll have no access for another two weeks, so here
goes.  Apologies for brusqueness.


On Tue, 18 Mar 1997, Georg Fuellen wrote:

> 
> Hi,
> 
> I've collected everything into one long message...
> Read the first 100 lines, get a coffee, read the next 100 lines, etc :-)
> 
> Steve Br. wrote,
> > > >   A few other nits from a _very_ cursory look-through
> > > > 
> > > > @SeqForm appears never to be created
> > > > 
> > > > I would change [@%]SeqForm to [@%]SeqFmt, or even [@%]seq_fmt (to be
> > > > consistent with the rest of the naming). 
> > > 
> > > I think then we should have seq_ffmt.
> > > Then again, doesn't SeqForm hint at the fact that these variables are 
> > > very special ?
> > 
> > Good Point; these are supposed to be constants, after all.  Then how about
> > SeqFmt? 
> 
> Well, then we should leave it as is, i.e. SeqForm, and save us the hassle 
> of changing names. IMHO. Steve, Chris, do you want a change? 
> [ ] yes 

YES.  Just do a query/replace for SeqForm with SeqFmt.  Should take about
2 minutes, I hope.


> [ ] no
> If we change, SeqFfmt looks more consistent to me than SeqFmt.

It would be SeqFFmt, and I think that's just unnecessary typing and no
additional clarity (for me anyway).




> > > > There's no 'valid' field to indicate whether or not the object is indeed
> > > > valid for any operation.  For example, if setseq is used to set an invalid
> > > > sequence.  
> > > 
> > > What if we don't allow this to happen ?
> > > If we keep the object valid all the time ?
> > 
> > That means that we have to 'croak' on any error rather than carp.  
> 
> Why not just refuse to make changes that invalidate the object, AND carp.
> As you note yourself, 'croak' should be avoided.

I'm not sure this is always possible.  When it is possible, it may require
doing non-intuitive things.  What if someone initializes a bio::seq to be
DNA but puts in illegal bases? 
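
A refuse-and-carp setter would look something like the sketch below; the
field names and the allowed alphabet are only assumptions, not the real
Bio::Seq code:

  use Carp;

  sub setseq {
      my ($self, $seq) = @_;
      # Assumed check: for a DNA-typed object, allow only IUPAC DNA codes
      if ($self->{type} eq 'dna' && $seq =~ /[^ACGTUMRWSYKVHDBNX.-]/i) {
          carp "setseq: illegal base(s) for DNA; sequence left unchanged";
          return undef;        # refuse the change; the object stays as it was
      }
      $self->{seq} = $seq;
      return 1;
  }

The non-intuitive part is exactly the case above: if the illegal bases
arrive in the constructor, "refuse the change" leaves you with an object
that has a type but no sequence.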


> > > > Functions which can return an invalid result (such as parse_bad) should
> > > > return undef rather
> > > 
> > > You mean, rather than 0 ? I thought zero and the null string ("") 
> > > are interpreted as false, and returning 0 or "" seems the standard 
> > > convention, no ?
> > 
> > No.  undef, 0, and "" are all 'false' in Perl.  However undef is
> > qualitatively different in that it returns 'false' to the defined()
> > function.  The others don't.  undef is "more false" than the others. 
> > Therefore failures are always supposed to be indicated as undef 
> > (except from syscall() and system() )
> 
> I gather that severe failures should return ``undef'', and 
> ordinary ones should return 0/"". That would also be consistent w/ the 
> code samples I saw, which return 0 on failure quite often like UNIX does.
> Alternatively, I could gather from your statement that you're feeling
> strongly about returning undef everywhere, and that may be much easier to
> maintain - less uncertainty about return values. Pls reply - per default, I'll
> keep things as they are. (running out of time, I tend to be conservative:)

undef should always be returned if the operation fails.  I haven't looked
at UnivAln returns much; the case which I cited originally was where a
file read failed.  That is a failure.
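
That is, the caller can tell a real failure apart with defined(); a minimal
sketch (the function and filename are only for illustration):

  sub read_file {
      my ($file) = @_;
      open(FH, $file) or return undef;   # failure: undef, never 0 or ""
      my $data = join('', <FH>);
      close(FH);
      return $data;                      # success may legitimately be "" or "0"
  }

  my $seq = read_file('mus.fasta');
  unless (defined $seq) {
      warn "file read failed\n";         # only a genuine failure lands here
  }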


> > > The problem w/ comma-separated is that according to our current
> > > specs, comma is a legal component of an ID; we only carp on whitespace.
> > > In other words, ``Mus,musculus'' is a legal ID.
> > > Since non-whitespace is also a legal component of filenames on many systems 
> > > I believe, I'd like to keep the convention.
> > 
> > I thought ID's had to be in '\s'; if not, maybe they should be.  Further,
> 
> Do you mean ``\S'', i.e. everything but space and ``\t\n\r\f''  ??

Ooops.  Meant \w.  Sorry!!!
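
So an ID test would be something like this (illustrative only):

  # A \w-only rule for IDs: letters, digits and underscore
  sub valid_id {
      my ($id) = @_;
      return $id =~ /^\w+$/;
  }

  valid_id('Mus_musculus');   # true
  valid_id('Mus,musculus');   # false -- the comma would no longer be legal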


> > whitespace is a legal component of most filesystems.  (It is on Unix,
> > Macintosh, and Windows, for example). 
> 
> Space (`` '') may be OK, but newline (``\n'') certainly not ?!

On unix, at least, \n certainly is valid.  Typically the only illegal
characters are "\0" and "/".  Some filesystems even allow those.


> > An array seems to me to be the "right" way to do this, I think.  But I thought
> > we were talking about numeration (rather than identifiers anyway).
> 
> Misunderstanding. In the current implementation, identifiers ARE USED
> to support arbitrary numbering, I've merged both concepts into one !!!

Ahh.  Didn't realize this.  That would potentially lead to inefficiency,
but certainly has an elegant sound.  Would need to see the code.


> > > > an array of strings is probably even better still, as that's presumably
> > > > what you use inside the routines that deal with these things.
> > > 
> > > Arrays of integers are interpreted as index lists; since names may be
> > > integers as well, and Perl doesn't really distinguish integers and strings,
> > > how do you want to do this ?
> > > (Of course, the system under discussion can allow {string=>\$sting_of_names}
> > > as a parameter for seqs().)
> > 
> > I don't follow -- probably because I haven't spent enough time studying
> > UnivAln
> >
> > > > The numbering in the code still seems pretty poorly documented/determined.
> > > 
> > > Pls be more specific..
> > 
> > You sent an email saying that UnivAln supported arbitrary numbering
> > schemes.  I saw no documentation (even in comments) about this anywhere.
> > There was lots of code passing around 'numbering,' without ever
> > saying what it was supposed to be.
>  
> Misunderstanding, see above. Arbitrary numbering == arbitrary identifiers
> (aka ids, aka names) for the __columns__ !!!

Again; sounds neat, but need to see the code.


> > > > I agree that a hash permits many options.  But that potentially
> > > > just indicates lack of clear thinking and good design.  A tenet of OO
> > > > design is that you shouldn't have redundant interfaces; they raise the
> > > > learning curve (because there are more options to learn) and make the code
> > > > less efficient and more error-prone.
> > > 
> > > Since ARRAY, CODE and scalar are already taken as the possible type of the
> > > first real parameter of seq(), HASH seems ideal.
> > 
> > I'm not saying that using a HASH is bad (though I would tend to argue that
> > this means that we should reconsider the parameters to the seq()
> > function).  What I am saying is that allowing multiple ways of specifying
> > the same data via a hash is generally bad design.
> 
> In this specific case it looks like the ideal design...  IMO !!
> * It's not more redundant than allowing different named parameters for a 
> function, ``-seqs'',``-file'', etc, VERSUS ``ids=>'',``descs=>'',``string=>''.

Not really.  The purpose of having a hash with optional parameters is to
avoid having to make a function call with ZILLIONS of parameters.  The
code fills in the optional parameters with default values.  At least,
that was the purpose in CGI.pm.

The different parameters can also (very, very carefully) be used to allow
different purposes (like overloading in C++).
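
The pattern I have in mind is roughly the following sketch; the keys and
defaults are made up, this is not the real Bio::Seq constructor:

  package Bio::Seq;   # sketch only

  sub new {
      my ($class, %args) = @_;
      # defaults first, then whatever the caller actually supplied
      my %defaults = (seq => '', id => 'No_Id', type => 'unknown');
      my $self = { %defaults, %args };
      return bless $self, $class;
  }

  # The caller names only the few parameters it cares about:
  my $s = Bio::Seq->new(seq => 'ACGT', type => 'dna');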

> * Learning curve: In the spirit of Perl, what you don't know won't hurt 
> you much; it's all about added convenience.

No; it will hurt you if you need to figure out different routes.  Perl
does have lots of different ways of doing things, but they are typically
not trivial to interconvert.


> * less efficient: I had to add one check ``if (ref($...) eq 'HASH')''
> and if it's a hash, I will need to add a switch that takes care of
> the different possible key values. 

1) That switch will be inefficient.
2) The code inside will have to convert from one format to another,
   which is inefficient.
3) The final code which carries out the operation will be less optimized,
   because (a) points (1) and (2) mean that it isn't worth optimizing
   and (b) all the programming time will have been spent writing
   points (1) and (2).
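
For concreteness, the kind of dispatch being discussed looks roughly like
this (sketch only; the key names are invented, this is not the actual
UnivAln code):

  sub rows_from_arg {
      my ($arg) = @_;
      if (ref($arg) eq 'HASH') {
          # each alternative key needs its own conversion branch
          return @{ $arg->{indices} }       if exists $arg->{indices};
          return split(' ', $arg->{string}) if exists $arg->{string};
          return ();                        # unrecognised key combination
      }
      elsif (ref($arg) eq 'ARRAY') {
          return @$arg;                     # plain index list, no conversion
      }
      return ($arg);                        # a single scalar index
  }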


> In general, I value programmer+maintainer+user efficiency more than
> space+time efficiency, and
> I believe that you usually cannot predict where the space+time
> bottlenecks are - the better approach is to only make space+time efficiency 
> a big deal for double loops and square/cubic data structures, AND
> benchmark the code in real applications for everything else.
> E.g. I don't worry about the row/column ids since they're a linear
> (not square/cubic) data structure, but I do worry most about
> the square array of characters that represents the alignment, and that's
> why I feel that using a PDL structure for this would imply savings in
> time+space that are several magnitudes higher than anything else I could
> do; and such a change (_and_ giving the user the option to use the
> regular array of array of characters if s/he needs to) should be easy
> in the OO world... once PDL is stable.

I agree with most of the above, especially with programmer and maintainer
efficiency being paramount.  But adding additional parameters is simply
planting a time-bomb for maintainers and adding to the learning curve for
the programmer.



> * more error-prone: Adding convenience features like access by name to
> rows and columns makes the code more error-prone, naturally. But you
> get a big benefit, among them support for arbitrary numbering schemes.

Access-by-name is hardly a convenience matter.  Otherwise you'd need to
write complex separate code to deal with it.

> > > > I note that you're still using %FormUnivAln and %TypeUnivAln rather than
> > > > the arrays @UnivAlnType and @UnivAlnForm.  These should be arrays, not
> > > > hashes.
> > > 
> > > You mean, @UnivAlnType = ('Unknown','Dna','Rna','Amino','OtherSeq') and 
> > > @UnivAlnForm = ('unknown','raw','fasta','nexus') ? On second thoughts,
> > > I must admit I fail to remember the advantages, but can clearly see
> > > the disadvantages; given ``fasta'', how do you find out what the corresponding
> > > number is ? It's my feeling that this is a costly change on which I'll spend 
> > > hours, _or_ I just misunderstand.
> > 
> > The idea was that you would have
> > 
> > @UnivAlnType = ('unknown','dna','rna','amino','other'); #note lower case
> 
> Lower case? Please... I'm really running out of time! Such a change consumes
> a lot of time, updating docu, test scripts, my own research code, etc !!
> (Also, let's not completely forget the beta testers I mailed personally.)
> [ ] yes, I really think lower case is much better in this case as well
> [ ] let's keep things the way they are

We discussed this before with Fasta/FastA/FASTA/fasta.  Changing all of
these to lower case follows the same rationale (DNA/dna/Dna?
OtherSeq/Otherseq/otherseq).  Everything should be kept consistent, and
lower case is easy for this.

> > foreach $i (0..$#UnivAlnType) {
> >   $UnivAlnType{$UnivAlnType[$i]} = $i;
> > }
> > 
> > This way we can index from number to string with @UnivAlnType and from
> > string to number with %UnivAlnType.
> > 
> > The problem is that you have replaced @UnivAlnType with %TypeAlnUniv... 
> > and you're putting a number as the parameter to a hash.  This is
> > inefficient.  But worse, it can lead to problems because $foo = " 1" would
> > give the right results in $UnivAlnType[$foo] but not in
> > $TypeAlnUniv{$foo}.
> > 
> > To restate, to go from a string to a number use a  hash
> >             to go from a number to a string use an array
> 
> Good point. Now I think I understand; will change this asap.

Thanks!  Sorry if I was unclear initially.  (The original problem, again,
came from my prototype code.  Sorry.)
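
Once both structures exist, lookups in either direction are one-liners
(sketch of the intended usage):

  my @UnivAlnType = ('unknown','dna','rna','amino','other');
  my %UnivAlnType;
  foreach my $i (0..$#UnivAlnType) {
      $UnivAlnType{$UnivAlnType[$i]} = $i;
  }

  my $name   = $UnivAlnType[1];        # number -> string: 'dna'
  my $number = $UnivAlnType{'dna'};    # string -> number: 1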

> > > o Site-specific configuration issues.
> > > Right now, Seq.pm does not have to be edited by users but Parse.pm and the
> > > test scripts do. I'm going to hit the POD docs for MakeMaker, etc. and try
> > > to figure out how setup a system where users edit a ".config" file or
> > > somesuch and the resulting info is used to automatically tweak Parse.pm and
> > > Seq.pm during the 'make' process. Again, any help/suggestions on this would
> > > be appreciated.
> > 
> > Again, I'm not sure of the right thing to do here; I haven't worked with
> > MakeMaker much before.
> > 
> > Probably the right thing to do is to have a real make, which runs a
> > program which spits out a Parse.pm.  (i.e., there's no Parse.pm in the
> > distribution, but it is the output of a ParseMaker Perl script which
> > queries users for file locations, etc.)  One place to possibly look for
> > guidance are things like PGPLOT which require external programs and
> > libraries.
> 
> PGPLOT has C _and_ Fortran, I think we'll spend a long time figuring
> out what's going on there. I hope there's a better example somewhere,
> maybe Chris should post to c.l.p.m ?!
> 
> > If you are really pressed, I think it would be ok to simply set the
> > default to be for $OK to be false and force people to edit things (before
> > installation) to set them right.
> > 
> > > o Proposed validity markers
> > >   - A marker that would be set to 'false' whenever Seq.pm makes a call to carp()
> > >   - A marker to specify valid/invalid biosequence object
> > > Are these permutations of the same idea or two different things? I'm also
> > > not sure about how to implement.
> > 
> > Yes.  These are the same thing.  Basically, there should be a 'valid'
> > flag, and the code should carp() or croak() on any operation if the valid
> > flag is not set.
> 
> I really like the ``always valid'' approach, the more I think about the issue.
> Just refuse to have any invalid object ever created. But be very restrictive
> w/ the definition of ``invalid''; in a lot of cases carp() is enough,
> and after the carp() the user has to expect warnings (like ``use of
> uninitialized value'') and possibly fatal dies for certain operations.

Good point.  Always-valid is a good idea, but that may require croaking
when the user sets options, or having the object valid but containing
unexpected data.  Maybe carping about the unexpected data is the better
choice.  I could live with that.
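
That choice would look something like the sketch below (the field and
method names are assumptions): accept the setting, carp, and leave the
object formally valid.

  use Carp;

  sub settype {
      my ($self, $type) = @_;
      # accept the setting, but warn if the stored sequence looks surprising
      if ($type eq 'dna' && defined $self->{seq}
          && $self->{seq} =~ /[^ACGTN]/i) {
          carp "settype: sequence contains characters unexpected for DNA";
      }
      $self->{type} = $type;   # object stays 'valid'; the data may be odd
      return $self->{type};
  }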

> > Alternatively (as mentioned in the previous mail), croak()-ing on any
> > failure would always ensure that the object is valid.  It would
> > potentially cause programs to die often.
> 
> And that's not good. Perl itself usually makes the best of a situation;
> the spirit is to prefer warnings to ``die''. E.g. if I use ``=='' on string
> values, Perl will warn, but not die.

Ahh, but this is different.  == is a valid operation on strings; you just
probably didn't mean to use it.  "1" == " 1number" is true, while "1" eq "
1number" is false.  You might have intended to do the former, and it is a
valid operation.

On the other hand, perl will die if you try to do something which is
invalid.

> > > o Default constructor ID
> > > Steve commented that the default constructor ID should be changed from
> > > "No_Id_Given" to "No_Id" plus a unique number. Assigning a number is easy
> > > enough but how would you keep track of "unique" numbers assigned? Is there
> > > a way to save state or remember these numbers each time new() is called? I
> > > think I see the potential problems that objects with the same 'ID' field
> > > could cause but I'm unsure how a 'unique' naming process would work.
> > 
> > in the package have a package global something like
> > 
> > my $UniqNum = 1234;
> > 
> > and also have a function something like
> > 
> > sub uniq_num {
> >   return $UniqNum++;
> > }
> 
> Hm. What about ids that we inherit from somewhere ? E.g. from a file ?
> On a parallel machine, this won't work either I think. What about other
> distributed computation; CORBA may offer solutions, but it's another
> big can of worms although I feel that we'll have to open it at some time -
> does anyone know more about CORBA ? (I've just heard rumors! :)

Why would you inherit ids?  These ids are ONLY for setting names of
Bio::Seq's.  I don't see how parallel programs and/or CORBA have anything
to do with it.  We only need to guarantee that ids are unique within a
given program.
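
The intended use is only something like this inside the constructor
(sketch; the field name and default prefix are assumptions):

  my $UniqNum = 1234;                  # package-scoped counter

  sub uniq_num { return $UniqNum++; }

  sub new {
      my ($class, %args) = @_;
      my $self = { %args };
      # default ID: "No_Id_" plus a number unique within this program run
      $self->{id} = 'No_Id_' . uniq_num() unless defined $self->{id};
      return bless $self, $class;
  }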


> >... If possible, I would like to permit both cases as people sometimes use
> > them to mean different things.  We may want to add upcase() and downcase()  
> > [or something like that maybe toupper() and tolower()].
> 
> to_upper and to_lower ?

toupper and tolower are C functions, so it makes sense to follow that
convention, IMHO.  (Otherwise, I would probably choose something else.)
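
Under those names the methods would be trivial (sketch; assumes the
sequence is stored in $self->{seq}):

  sub toupper {
      my ($self) = @_;
      $self->{seq} = uc $self->{seq};   # fold the stored sequence to upper case
      return $self->{seq};
  }

  sub tolower {
      my ($self) = @_;
      $self->{seq} = lc $self->{seq};   # ...and to lower case
      return $self->{seq};
  }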

> > > > o Proposed validity markers
> > > >   - A marker that would be set to 'false' whenever Seq.pm makes a call to carp()
> > > >   - A marker to specify valid/invalid biosequence object
> > > > Are these permutations of the same idea or two different things? I'm also
> > > 
> > > They are both ways of defining what ``valid'' is. ..
> > > For me a valid object conforms to some requirements, like (for UnivAln), 
> > > that $self{type} is correct (especially that it reflects the fact that the 
> > > alignment is just a sequence bag, i.e.  the rows are of different length), 
> > > $self{id} has no whitespace, $self{desc} conforms to $self{descffmt}, 
> > > $self{row_ids}, etc, have the correct size.
> > > This is something I don't have time for right now, but it's needed eventually.
> > 
> > You don't want to have to check all those various things every time you do
> > an operation.  It would be much simpler to have a $valid flag which is set
> > or cleared after every operation which changes internal variables which
> > could affect validity.
> 
> As above, my current thinking is that ``always keep the object valid''
> is the cleanest approach. What are the downsides ? 
> 
> Steve Ch wrote,
> > >o Proposed validity markers
> > >   - A marker that would be set to 'false' whenever Seq.pm makes a call to carp()
> > >   - A marker to specify valid/invalid biosequence object
> > > Are these permutations of the same idea or two different things? I'm also
> > > not sure about how to implement.
> > 
> > I think this (and Steve B's recent comments on this issue) opens up an 
> > issue that could use some discussion: how to best handle errors and 
> > exceptions in Perl objects. I've created some modules that I use to help 
> > manage errors. See the "More advanced object" example at: 
> > 
> > http://genome-www.stanford.edu/~sac/perlOOP/examples/
> > 
> > This is my attempt to manage the wide variety of errors and 
> > exceptions that can occur in complex objects. The primary motivation 
> > for this work is to allow objects to handle error conditions without 
> > killing the script by calling die or croak. The code is at an early 
> > stage of development (it hasn't received much independent critiquing),
> > but it may inspire some useful ideas.
> 
> My first q. on this line is - doesn't Perl 5.004 (just out as a late beta)
> have much more support for exception handling than the current one ?

Some, I think; not much.  Perl does have very good (if slightly
inconvenient) support for exceptions via the eval structure.
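
The mechanism is a block eval: anything that die()s (or croak()s) inside
the block is caught and its message ends up in $@ (the sequence check here
is just an illustration):

  my $seq = 'ACGT-99';

  eval {
      die "bad sequence\n" if $seq =~ /[^ACGTUN]/i;   # raise an exception
      # ... further work that may also die ...
  };
  if ($@) {
      warn "caught exception: $@";     # handled; the script keeps running
  }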

> At least, it offers class SUPER, which gives you a way to check what
> methods a given object is capable of. If it's not in 5.004, are there
> plans for 5.005 ? More generally, sophisticated exception handling is a 
> complex subject - we at least need the independent critiquing of someone 
> who has experience with it. IF exception handling were trivial+easy,
> I suppose Perl would offer this already, no ? I remember that it was
> one of the things added last to C++ a few years back, and ppl weren't
> really happy with it.

5.005 has few or no user-visible changes.  It is solely for modifications
to aid implementation of the Perl-to-C translator.