[BioPython] bioperl idl

Tue, 21 Sep 1999 15:44:11 -0600

Told you no one would agree on numbers..

Jeffrey Chang <jchang@SMI.Stanford.EDU>:
> 1- based system
> I personally prefer that the end is exclusive.

Ewan Birney <birney@sanger.ac.uk>:
> I prefer knuth/C style numbering system, but the rest
> of bioinformatics has standardised on "biological" numbering
> schemes (ie, 1..3 is a 3 long string, starting at the first place).

Bradley Marshall <bradbioperl@yahoo.com>:
> Speaking from a biologists perspective, if I ask for
> 1-3 of the sequence 'atgtg' and get anything but 'atg'
> I'm gonna be pissed. (But not in the British sense.)

There are actually several different points here.

First, Python is 0-based with the end being exclusive.
Suppose someone starts working with Python for doing sequence
analysis, and reads a sequence into a string.  They are most
likely not going to put a dummy spacer value into the first
character.

Let seq1 = "GATTACA".  Then seq1[1:3] == "AT"

Biopython wants to make more powerful/useful classes which
describe the sequence, so I think we should build on acting
like a string, though with a few additional methods.

Let seq2 = Bio.Seq("GATTACA").  Then for consistency with
a string, seq2[1:3] must == "AT".
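
To make the "acts like a string, plus a few methods" idea concrete,
here is a minimal sketch.  SimpleSeq and its complement() method are
hypothetical names made up for this illustration; this is not the
actual Biopython class, just one way such a class could delegate
indexing to the underlying string.

```python
class SimpleSeq:
    """Hypothetical string-like sequence class (illustration only)."""

    _COMPLEMENT = str.maketrans("ACGT", "TGCA")

    def __init__(self, data):
        self.data = str(data)

    def __len__(self):
        return len(self.data)

    def __str__(self):
        return self.data

    def __getitem__(self, index):
        # Delegate to str so indices and slices follow Python's own
        # rules: 0-based, end exclusive.
        piece = self.data[index]
        return SimpleSeq(piece) if isinstance(index, slice) else piece

    def __eq__(self, other):
        if isinstance(other, SimpleSeq):
            return self.data == other.data
        if isinstance(other, str):
            return self.data == other
        return NotImplemented

    def __hash__(self):
        return hash(self.data)

    def complement(self):
        # One example of an "additional method" beyond plain str.
        return SimpleSeq(self.data.translate(self._COMPLEMENT))


seq1 = "GATTACA"
seq2 = SimpleSeq("GATTACA")
same_slice = seq1[1:3] == seq2[1:3]  # both sides are "AT"
```

Because __getitem__ hands the index straight to str, there is no
second numbering convention to remember.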

Many algorithms don't care if the underlying representation
is a string or a Seq class -- they expect to access a
range of characters.  If the class uses a notation different
from the underlying language, that code becomes more cumbersome:
you'll have to write a wrapper layer between one representation
and the other.

This is completely doable, but inefficient.  You want to reduce
the number of wrappers in the code.
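
As a rough sketch of an algorithm that doesn't care about the
representation: the windows() function below is a hypothetical
example, and collections.UserString from the standard library stands
in for a string-like sequence class (an assumption for illustration,
not Biopython's class).  Since both objects slice the same way, no
wrapper layer is needed.

```python
from collections import UserString


def windows(seq, size):
    """Yield every overlapping window of `size` characters from `seq`.

    Only len() and slicing are used, so any object that slices like a
    Python string will work.
    """
    for start in range(len(seq) - size + 1):
        yield seq[start:start + size]


plain = "GATTACA"
wrapped = UserString("GATTACA")  # stand-in for a string-like Seq class

plain_windows = list(windows(plain, 3))
wrapped_windows = list(windows(wrapped, 3))
```

If the class counted from 1 or used an inclusive end, windows() would
need two versions, or an adapter around one of the two types.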

That's the next point.  It is fully understood that different
implementations have different ideas of what "first" and "last"
mean, and that you'll have to write interface wrappers for it.
So something can be built to interface to Ewan's IDL and get
the numbering correct (both as a client and a server).

The key, I think, is to understand where the interfaces are.

Take Brad's case, where a biologist is asking for positions 1-3
of the sequence, and expects the 1st, 2nd and 3rd terms. This
is an interface; an interface between the biologist and the
software.  This is (to me) the natural place to put translation
code.
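
A minimal sketch of what that translation code could look like,
assuming the biologist-facing convention is 1-based with an inclusive
end.  The function names here are hypothetical, chosen only for this
example.

```python
def to_python_slice(first, last, length):
    """Convert a 1-based, end-inclusive range to a Python slice."""
    if not (1 <= first <= last <= length):
        raise ValueError(
            "expected 1 <= first <= last <= %d, got %d..%d"
            % (length, first, last)
        )
    # Shift the start down by one; the inclusive end is already equal
    # to the exclusive 0-based stop.
    return slice(first - 1, last)


def from_python_range(start, stop):
    """Convert a 0-based, end-exclusive range back for display."""
    return start + 1, stop


def biologist_subseq(seq, first, last):
    """Top-level entry point: accepts the biologist's numbering."""
    return seq[to_python_slice(first, last, len(seq))]


fragment = biologist_subseq("atgtg", 1, 3)  # Brad's 1-3 of 'atgtg'
```

Everything below biologist_subseq() sees only ordinary Python slices,
so the conversion happens in exactly one place.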

In another corner, you have someone writing a new library
package for the software, someone who is familiar with the
different ways to start/end ranges.  In that case, there is
still an interface between the old code and the new one, but there
is the expectation that the interface should be minimized, so
consistency with the underlying system is the best.

Additionally, this interface mismatch only needs to be solved once,
as compared to working with the biologist where it needs to be
done every time.

Therefore, the answer I've given to this argument before is that
only the top-level "talking to the biologist" (or the outside
world) layer does the translation to the internal representation.
Everything else, libraries included, should and must use the
system's representation.

Otherwise mismatches and extra translation layers will pop up and
make the code all ugly -- and not in the skin deep sense either.

So with Ewan's IDL (which is meant to be an interface between
systems), it doesn't matter what solution is picked, and there is
no best solution, so long as it is well defined.

As for the core Biopython system, the best solution is to stay
with Python's solution: "Andrew"[1:3] == "nd".

We now return you to your regularly scheduled email :)

						Andrew
						dalke@bioreason.com