[BioPython] Making the Seq object act more like a string

Sun Sep 9 21:17:04 UTC 2007

Peter wrote:
> I think having SeqRecord subclass Seq is nicer than simply adding 
> annotation to the Seq class. Seq objects would (still) just have a 
> sequence and alphabet, the SeqRecord becomes a rich/annotated Seq object.
> 
> I think this would be close to BioPerl's Seq and RichSeq objects.
> 
> I have filed an enhancement on Bugzilla to hold any suggested patches 
> etc (I hope to upload something later tonight):
> 
> Bug 2351 - Make SeqRecord subclass Seq subclass string?
> http://bugzilla.open-bio.org/show_bug.cgi?id=2351

Going back over the mailing list archives, we discussed something 
similar on the dev mailing list back in early 2005.

I would like to make the following "small" change now, ready for the 
next release of Biopython:

(1) Make __str__ give the full sequence as a string for Seq and
     MutableSeq objects, allowing intuitive use of str(myseq), which
     used to give a truncated representation including the alphabet.
(2) Document tostring() as deprecated in favour of str(...).
(3) Leave __repr__ as is (giving the full string with an alphabet),
     which can be used with eval(repr(myseq)).
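
To make the three points concrete, here is a rough sketch of the intended
behaviour. ToySeq is a made-up stand-in class and not the real Bio.Seq
code; storing the alphabet as a plain string is an assumption made only
so that the example runs without Biopython installed.

```python
class ToySeq:
    """Hypothetical stand-in for a sequence class (not Biopython's Seq)."""

    def __init__(self, data, alphabet="SingleLetterAlphabet()"):
        self.data = data
        # Assumption: the alphabet is kept as a plain string here.
        self.alphabet = alphabet

    def __str__(self):
        # Point (1): str() gives the full sequence and nothing else.
        return self.data

    def tostring(self):
        """Deprecated in this sketch: use str(seq) instead (point 2)."""
        return str(self)

    def __repr__(self):
        # Point (3): full sequence plus alphabet, usable with eval().
        return "%s(%r, %r)" % (self.__class__.__name__, self.data, self.alphabet)


myseq = ToySeq("MLKILLATTMLIPTAFILKPQ")
print(str(myseq))
print(repr(myseq))
copy = eval(repr(myseq))  # rebuilds an equivalent ToySeq
```

With something like this, str(myseq) can be passed anywhere a plain string
is expected, while repr(myseq) still shows the alphabet.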

There will be some fallout from this - in particular, we'll need to go 
over the documentation and may need to fix a few things.

The only downside is the loss of a built-in method to get a "short seq 
string representation" (currently available as str(myseq) via __str__). 
Back in 2005, Frédéric Sohm suggested adding a short() method to do 
this. Personally I'd only use this when working at the command line, but 
it might be nice. One refinement over the current truncation is that I 
would include the last three letters - this is handy when looking at 
genes, as you might want to know whether a stop codon is present.

e.g.

Seq('MLKILLATTMLIPTAFILKPQILHQTMISYTFILTLFSLIFLKQNQYLKPLSNLYLN...LVL', 
SingleLetterAlphabet())

rather than:

Seq('MLKILLATTMLIPTAFILKPQILHQTMISYTFILTLFSLIFLKQNQYLKPLSNLYLNLDQ ...', 
SingleLetterAlphabet())

and similarly for nucleotides (which is why I suggest at least the last 
three trailing letters).
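
A possible short() could look roughly like the sketch below. It is written
as a standalone function working on str(seq); the name short, the head and
tail parameters and their defaults (57 and 3, chosen to match the example
above) are all assumptions for illustration, not an agreed API.

```python
def short(seq, head=57, tail=3):
    """Return a truncated text form of seq, keeping the last `tail` letters.

    Hypothetical helper: the name and default lengths are assumptions.
    """
    text = str(seq)
    ellipsis = "..."
    # Sequences that would not get any shorter are returned unchanged.
    if len(text) <= head + tail + len(ellipsis):
        return text
    # Slice from an explicit index so that tail=0 gives an empty suffix.
    return text[:head] + ellipsis + text[len(text) - tail:]


# A made-up coding sequence ending in a stop codon (TAA).
gene = "ATG" + "GCC" * 30 + "TAA"
print(short(gene))
```

Keeping the tail at three letters means the final codon of a nucleotide
sequence stays visible, and the same default works for proteins.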

Peter