[BioPython] sequence proposals (long)

Thu, 30 Mar 2000 14:24:09 -0700

> Sure.  I don't think it would be a failure if biopython were
> to make sequences classes that were biased (even heavily) toward
> python's way of doing things.  I'd rather have something that
> works well here, rather than sequences that suck equally on
> all languages!  ;)

Yes.  But if there are three "natural" ways to do something
in Python, and one of them is common with the Perl and Java
ways, then I would rather choose the common one.

> Do you mean sequences should support the same slicing semantics?

Yes.

> After python 1.6, strings will become objects, with their own
> methods, and how string objects and biological sequences act
> will diverge.

Ohh, good point.  Currently a string is little more than a byte
array, and I was thinking just of the list-like interfaces.

> [...  I'm liberally cutting things from the email, for length and
> relevance reasons.  I hope that I'm not leaving anything without
> the proper context.  I apologize if I do!]

No problem.  The extra text was commentary/justification meant
to back up my proposals.  I was thinking of resending just the
proposal part; thanks for doing so.

> It feels to me like the semantics of the indexes for a method
> call is less stringently enforced than that of subscripting,
> where the syntax is built into the language.

I hadn't thought of that before.  Sounds reasonable.

> What's a stride?

The step length in a slice.  The default is 1, so [1:5] returns
4 characters.  [1:5:2] returns 2 characters (those at positions 1 and 3).

>>> import string, Numeric
>>> a = Numeric.array("This is a test.")
>>> string.join(a[1:12:2], "")
'hsi  e'
>>> 
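
For readers without Numeric, the same stride arithmetic can be checked
on a plain string.  This is only a sketch and assumes a later Python
release in which built-in strings accept a third slice argument, which
was not the case when this was written.

```python
text = "This is a test."

# Default stride of 1: positions 1, 2, 3 and 4 (the end index is excluded).
plain = text[1:5]

# Stride of 2 over the same range: positions 1 and 3 only.
strided = text[1:5:2]

# The Numeric session above, repeated on a plain string.
every_other = text[1:12:2]

print(repr(plain))        # 'his '
print(repr(strided))      # 'hs'
print(repr(every_other))  # 'hsi  e'
```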

> I'm not sure if you've mentioned it explicitly, but
> we're going to need both mutable and immutable sequences.  

I hadn't mentioned it, but it is true.  I've got classes for
both types; the mutable one is based off of array.array.
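
To make the split concrete, here is a minimal sketch of what such a
pair might look like.  The class names Seq and MutableSeq, the
tostring() method and the one-byte-per-residue layout are assumptions
for illustration, not the classes mentioned above, and the code is
written for a modern Python 3.

```python
from array import array


class Seq:
    """Immutable sequence: wraps a str and defines no __setitem__."""

    def __init__(self, data):
        self._data = str(data)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        result = self._data[index]
        # A slice comes back as a new Seq; a single index is one character.
        return Seq(result) if isinstance(index, slice) else result

    def tostring(self):
        return self._data


class MutableSeq:
    """Mutable sequence backed by array.array, one byte per residue."""

    def __init__(self, data):
        self._data = array("B", data.encode("ascii"))

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        if isinstance(index, slice):
            return MutableSeq(self._data[index].tobytes().decode("ascii"))
        return chr(self._data[index])

    def __setitem__(self, index, residue):
        # Sketch only: supports assigning one residue at an integer index.
        self._data[index] = ord(residue)

    def tostring(self):
        return self._data.tobytes().decode("ascii")


fixed = Seq("ATGC")
editable = MutableSeq("ATGC")
editable[0] = "G"
print(fixed[1:3].tostring())  # TG
print(editable.tostring())    # GTGC
```

Assigning to an element of fixed raises TypeError, since Seq has no
__setitem__.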

> Because it would be so hairy otherwise, I propose that any
> annotated sequences must be immutable

Agreed.

> However, the Numeric way does save a lot of memory when accessing
> just a region of a large matrix (or DNA sequence).

I was thinking about that.  It's also possible to have subsequences
return a proxy object, which references back to the main sequence
only when needed.  There's a higher per-subsequence object cost,
but that matters little when the original object is really big.

It becomes rather more difficult to implement and use these.
What is a case where the subsequence copies must be nearly genome
sized?
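
A rough sketch of the proxy idea, for concreteness.  SubseqProxy and
its tostring() method are hypothetical names, the parent is assumed to
be a plain string, and only a stride of 1 is handled.  The proxy keeps
a reference plus two offsets, and copies data out of the parent only
when asked for a string.

```python
class SubseqProxy:
    """View onto a parent sequence; stores start/stop instead of a copy."""

    def __init__(self, parent, start, stop):
        self._parent = parent
        self._start = start
        self._stop = stop

    def __len__(self):
        return self._stop - self._start

    def __getitem__(self, index):
        if isinstance(index, slice):
            start, stop, step = index.indices(len(self))
            if step != 1:
                raise ValueError("this sketch only supports a stride of 1")
            stop = max(stop, start)
            # A slice of a proxy is another proxy onto the same parent.
            return SubseqProxy(self._parent,
                               self._start + start,
                               self._start + stop)
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError("subsequence index out of range")
        return self._parent[self._start + index]

    def tostring(self):
        # Only here is the region actually copied out of the parent.
        return self._parent[self._start:self._stop]


genome = "ACGT" * 1000
region = SubseqProxy(genome, 2, 10)
print(len(region))             # 8
print(region[0])               # G
print(region[1:3].tostring())  # TA
```

The extra bookkeeping in __getitem__ is the implementation cost
mentioned above; every method that a real sequence supports would need
the same offset translation.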

> Yes, but I'm not sure we need to allow this kind of flexibility.
> I believe str should just return a human-readable string, and
> leave specialized formatting to other functions.

You are right.  Using the stringification operator is not the
right choice.  Looking at the other character array objects
(Numeric.array and array.array), the proper method is "tostring()".

If Python 1.6 strings have a tostring() method that returns the
string itself, then I would be pleased.  I'll ask about/for that on
the Python list.
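
The difference between the two can be sketched as follows.  DemoSeq
and as_string() are made-up names, and the abbreviated str() output is
just one possible reading of "human-readable".  The helper shows why a
tostring() on plain strings would be convenient: without it, callers
have to special-case str.

```python
class DemoSeq:
    """Hypothetical sequence class separating display from raw data."""

    def __init__(self, data):
        self._data = data

    def __str__(self):
        # Human-readable form; long sequences are abbreviated (an assumption).
        if len(self._data) > 20:
            return "DemoSeq(%s..., length %d)" % (self._data[:10],
                                                  len(self._data))
        return "DemoSeq(%s)" % self._data

    def tostring(self):
        # The full, unformatted sequence data.
        return self._data


def as_string(obj):
    # Plain strings have no tostring(), so they are returned unchanged.
    if isinstance(obj, str):
        return obj
    return obj.tostring()


short = DemoSeq("ATGC")
print(str(short))         # DemoSeq(ATGC)
print(as_string(short))   # ATGC
print(as_string("ATGC"))  # ATGC
```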

> It depends on what you consider the sequence length.
> 
> I don't consider "A-T" to be a biological sequence.  

Right.  I've since changed my alphabet proposal so that
gaps are not types of physical alphabets, but are encodings
around alphabets.

> For example:
> >>> seq = GappedSequence("AT-G--C")
> >>> seq[1:3]
> 'TG'
> >>> seq.gapped[1:3]
> 'T-G'

I would have it the other way around, where the default subscript
contains the '-' and the ".ungapped" attribute yields the sequence.
This makes it easier to compare relative positions of a sequence
with a gapped sequence.
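
A minimal sketch of that arrangement, under the assumption that the
data is held as a plain string with '-' as the gap character.  The
class name GappedSeq is made up; only the ".ungapped" attribute comes
from the proposal above.

```python
class GappedSeq:
    """Hypothetical gapped sequence: subscripts count gap columns."""

    def __init__(self, data, gap="-"):
        self._data = data
        self._gap = gap

    def __len__(self):
        # Length in alignment columns, gaps included.
        return len(self._data)

    def __getitem__(self, index):
        # Default subscripting works in gapped coordinates.
        return self._data[index]

    @property
    def ungapped(self):
        # The underlying sequence with every gap character removed.
        return self._data.replace(self._gap, "")


seq = GappedSeq("AT-G--C")
print(seq[1:3])           # T-
print(seq.ungapped[1:3])  # TG
print(len(seq), len(seq.ungapped))  # 7 4
```

With this layout, position i of seq lines up with position i of any
other sequence padded to the same length, which is the comparison the
default subscript is meant to make easy.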

                    Andrew
                    dalke@acm.org