[Biopython-dev] Rethinking Seq objects
Michiel Jan Laurens de Hoon
mdehoon at ims.u-tokyo.ac.jp
Wed Apr 27 02:18:19 EDT 2005
Hi everybody,
For my research, I tend to work a lot with sequences, but I find myself not
using Bio.Seq much. I'd like to propose some changes to make sequence objects
more useful. I'd be happy to hear comments from the other developers, in
particular the original developers who probably thought this through much more
than I have.
There are five changes I'd like to propose:
1) Make Seq objects mutable, and get rid of MutableSeq. The Seq class and the
MutableSeq class basically describe the same thing, except that one is read-only
and the other one is not. If desired, we can add a readonly flag to the class to
describe if it is mutable or not. (Given that e.g. Numerical Python arrays don't
have such a flag, my feeling is that it is not really needed for Seq objects
either).
2) Make Seq objects a bit smarter about which type of sequence they contain. One
reason I don't use Bio.Seq much is that I have to write
>>> from Bio.Alphabet import IUPAC
>>> my_alpha = IUPAC.unambiguous_dna
>>> from Bio.Seq import Seq
>>> my_seq = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
which is too much typing. I am thinking about the following scheme when
initializing a Seq object:
- If the user specifies my_alpha, accept that alphabet. Raise an error if the
sequence is not consistent with the alphabet.
- Otherwise, start by assuming the sequence is an unambiguous DNA sequence
- If the sequence contains any characters other than ATCG, assume it is
unambiguous RNA, otherwise accept the sequence
- If the sequence contains any characters other than AUCG, assume it is
a protein, otherwise accept the sequence
- If the sequence contains any characters other than ACDEFGHIKLMNPQRSTVWY,
assume it is ambiguous DNA, otherwise accept the sequence
- If the sequence contains any characters other than GATCRYWSMKHBVDN, assume it
is ambiguous RNA, otherwise accept the sequence
- If the sequence contains any characters other than GAUCRYWSMKHBVDN, assume it
is an extended protein sequence, otherwise accept the sequence
- If the sequence contains any characters other than ACDEFGHIKLMNPQRSTVWYBXZ,
yell at the user.
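To make the cascade concrete, here is a rough sketch in plain Python. The names
guess_alphabet and CANDIDATES are made up for illustration and are not part of
Bio.Seq; folding the input to upper case is also my own assumption.

```python
# Hypothetical helper, not part of Bio.Seq. Candidate alphabets are tried in
# the order given in the list above; the first one that covers every letter
# of the sequence wins.
CANDIDATES = [
    ("unambiguous DNA", "ATCG"),
    ("unambiguous RNA", "AUCG"),
    ("protein", "ACDEFGHIKLMNPQRSTVWY"),
    ("ambiguous DNA", "GATCRYWSMKHBVDN"),
    ("ambiguous RNA", "GAUCRYWSMKHBVDN"),
    ("extended protein", "ACDEFGHIKLMNPQRSTVWYBXZ"),
]


def guess_alphabet(sequence):
    """Return the name of the first candidate alphabet that fits."""
    # Assumption: lower-case input is treated like upper-case input.
    letters = set(sequence.upper())
    for name, allowed in CANDIDATES:
        if letters <= set(allowed):
            return name
    # None of the alphabets fits: "yell at the user".
    raise ValueError("no known alphabet fits %r" % sequence)
```

Note that the order of the checks matters. With this order, a sequence such as
'GATN' is classified as a protein, because all four letters are also amino acid
codes and the protein check comes before the ambiguous DNA check.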
3) When changing a sequence, check if it is still consistent with the alphabet.
Right now, nothing stops us from assigning letters that are not in the alphabet:
>>> from Bio.Seq import *
>>> from Bio.Alphabet import IUPAC
>>> my_alpha = IUPAC.unambiguous_dna
>>> my_seq = MutableSeq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
>>> my_seq[:10] = "weirdstuff"
>>> my_seq
MutableSeq(array('c', 'weirdstuffCCTATTAGGATCGAAAATCGC'), IUPACUnambiguousDNA())
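As a rough sketch of the kind of check I have in mind, the class below validates
every assignment before changing the data. CheckedSeq is a made-up name for
illustration, and the alphabet is passed as a plain string of allowed letters
rather than as a Bio.Alphabet object.

```python
class CheckedSeq:
    """Hypothetical mutable sequence that validates every change."""

    def __init__(self, data, letters="GATC"):
        self.letters = set(letters)
        self._check(data)
        self.data = list(data)

    def _check(self, text):
        # Collect the letters that the alphabet does not allow.
        bad = set(text) - self.letters
        if bad:
            raise ValueError("letters not in alphabet: %s" % "".join(sorted(bad)))

    def __setitem__(self, index, value):
        # Validate first, so a failed assignment leaves the data untouched.
        self._check(value)
        if isinstance(index, slice):
            self.data[index] = list(value)
        else:
            if len(value) != 1:
                raise ValueError("expected a single letter")
            self.data[index] = value

    def __str__(self):
        return "".join(self.data)
```

With this, my_seq[:10] = "weirdstuff" raises a ValueError and the sequence
keeps its old contents.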
4) Make Seq objects understand circular genomes. Many bacterial genomes are
circular. It would be nice if we could take the indices [-1000:1000] from a
circular Seq object, or equivalently [3999000:4001000] if the sequence has
length 4000000.
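A minimal sketch of such wrap-around indexing, written as a stand-alone function
on plain strings. circular_slice is a made-up name, and I assume start <= stop
and a non-empty sequence.

```python
def circular_slice(sequence, start, stop):
    """Return the letters from start up to (not including) stop.

    Hypothetical helper: indices may be negative or larger than the
    sequence length; they wrap around the origin of the circle.
    """
    n = len(sequence)
    if n == 0:
        raise ValueError("cannot slice an empty circular sequence")
    if stop < start:
        raise ValueError("start must not be larger than stop")
    # The % operator maps every index, negative or not, into range(n).
    return "".join(sequence[i % n] for i in range(start, stop))
```

For a sequence of ten letters 'ABCDEFGHIJ', circular_slice(seq, -3, 3) gives
'HIJABC' and circular_slice(seq, 8, 12) gives 'IJAB'. A real implementation
would probably copy two ordinary slices instead of looping over each letter.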
5) Perhaps it would be a good idea to add transcribe and translate methods to
the Seq class. Currently, to translate a DNA sequence, we have to do
>>> from Bio.Seq import Seq
>>> from Bio import Translate
>>> from Bio.Alphabet import IUPAC
>>> my_alpha = IUPAC.unambiguous_dna
>>> my_seq = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
>>> standard_translator = Translate.unambiguous_dna_by_id[1]
>>> standard_translator.translate(my_seq)
Seq('AIVMGR*KGAR', IUPACProtein())
which is too much typing for my taste.
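To show how short this could be, here is a sketch of a sequence class with
transcribe and translate methods. SimpleSeq is a made-up stand-in, not the
existing Seq class. I assume the standard genetic code only, unambiguous
letters only, and that an incomplete codon at the end is silently dropped.

```python
from itertools import product

# Standard genetic code, with the codons enumerated in TCAG order
# (TTT, TTC, TTA, TTG, TCT, ...). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


class SimpleSeq:
    """Hypothetical stand-in for a Seq class with built-in methods."""

    def __init__(self, data):
        self.data = data.upper()

    def __str__(self):
        return self.data

    def transcribe(self):
        # DNA to RNA: replace every T by U.
        return SimpleSeq(self.data.replace("T", "U"))

    def translate(self):
        # Accept RNA as well by mapping U back to T first.
        dna = self.data.replace("U", "T")
        # Assumption: ignore a trailing partial codon.
        end = len(dna) - len(dna) % 3
        protein = "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, end, 3))
        return SimpleSeq(protein)
```

Translating then takes one import and one method call, for example
str(SimpleSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA').translate()), which gives
'AIVMGR*KGAR'. A codon with an ambiguous letter raises a KeyError in this
sketch.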
Any thoughts/comments/suggestions?
--Michiel.
--
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku
Tokyo 108-8639
Japan
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon