[BioPython] Making the Seq object act more like a string
Peter
biopython at maubp.freeserve.co.uk
Sun Sep 9 21:17:04 UTC 2007
Peter wrote:
> I think having SeqRecord subclass Seq is nicer than simply adding
> annotation to the Seq class. Seq objects would (still) just have a
> sequence and alphabet, the SeqRecord becomes a rich/annotated Seq object.
>
> I think this would be close to BioPerl's Seq and RichSeq objects.
>
> I have filed an enhancement on Bugzilla to hold any suggested patches
> etc (I hope to upload something later tonight):
>
> Bug 2351 - Make SeqRecord subclass Seq subclass string?
> http://bugzilla.open-bio.org/show_bug.cgi?id=2351
Going back over the mailing list archives, we discussed something
similar on the dev mailing list back in early 2005.
I would like to make the following "small" changes now, ready for the
next release of Biopython:
(1) Make __str__ give the full sequence as a string for Seq and
MutableSeq objects, allowing intuitive use of str(myseq), which
used to give a truncated representation including the alphabet.
(2) tostring() will be documented as deprecated in favour of str(...)
(3) Leave __repr__ as is (giving the full string with an alphabet),
which can be used with eval(repr(myseq)).
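To make the three points concrete, here is a rough sketch of the intended behaviour. ToySeq is a made-up stand-in, not the real Bio.Seq.Seq class, and the alphabet is just a plain string here (an assumption to keep the example self-contained):

```python
class ToySeq:
    """Hypothetical stand-in for a Seq object, for illustration only."""

    def __init__(self, data, alphabet="generic"):
        self.data = data          # the sequence letters as a plain string
        self.alphabet = alphabet  # assumption: a simple label, not an Alphabet object

    def __str__(self):
        # (1) str(myseq) returns the full sequence, with no truncation
        # and no alphabet.
        return self.data

    def tostring(self):
        # (2) Kept for backwards compatibility; documented as deprecated
        # in favour of str(myseq).
        return str(self)

    def __repr__(self):
        # (3) Full sequence plus alphabet, in a form eval() can rebuild.
        return "%s(%r, %r)" % (self.__class__.__name__, self.data, self.alphabet)


myseq = ToySeq("ACGT", "generic")
copy = eval(repr(myseq))  # round trip through repr
```

With this, str(myseq) and myseq.tostring() return the same plain string, and the object rebuilt by eval(repr(myseq)) carries the same sequence and alphabet.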
There will be some fallout to this - in particular we'll need to go over
the documentation and may need to fix a few things.
The only downside is the loss of a built-in method to get a "short seq
string representation" (currently available as str(myseq) via __str__).
Back in 2005, Frédéric Sohm suggested adding a short() method to do
this. Personally I'd only use this when working at the command line, but
it might be nice. One refinement over the current truncation is that I
would include the last three letters - this is handy when looking at
genes, as you might want to know if there was a stop codon present.
e.g.
Seq('MLKILLATTMLIPTAFILKPQILHQTMISYTFILTLFSLIFLKQNQYLKPLSNLYLN...LVL',
SingleLetterAlphabet())
rather than:
Seq('MLKILLATTMLIPTAFILKPQILHQTMISYTFILTLFSLIFLKQNQYLKPLSNLYLNLDQ ...',
SingleLetterAlphabet())
and similarly for nucleotides (which is why I suggest at least the last
three trailing letters).
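As a sketch of the truncation I have in mind (the function name short and the head/tail defaults are only illustrative, nothing like this exists in Biopython yet):

```python
def short(sequence, head=54, tail=3):
    """Return a shortened view of a sequence string.

    Keeps the first `head` letters and the last `tail` letters, joined
    by "...". Sequences that already fit are returned unchanged.
    The default lengths are assumptions for illustration.
    """
    if len(sequence) <= head + tail:
        return sequence
    return sequence[:head] + "..." + sequence[-tail:]


# A coding sequence ending in a stop codon keeps that codon visible.
gene = "ATG" + "GCT" * 40 + "TAA"
preview = short(gene, head=12, tail=3)
```

Keeping tail at three (or a multiple of three) means the final codon of a nucleotide sequence is always shown in full.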
Peter