[Biopython-dev] Rethinking Seq objects

Wed Apr 27 02:18:19 EDT 2005

Hi everybody,

For my research, I tend to work a lot with sequences, but I find myself not 
using Bio.Seq much. I'd like to propose some changes to make sequence objects 
more useful. I'd be happy to hear comments from the other developers, in 
particular the original developers who probably thought this through much more 
than I have.

There are five changes I'd like to propose:

1) Make Seq objects mutable, and get rid of MutableSeq. The Seq class and the 
MutableSeq class basically describe the same thing, except that one is read-only 
and the other one is not. If desired, we can add a readonly flag to the class to 
indicate whether an object is mutable or not. (Given that, e.g., Numerical 
Python arrays don't have such a flag, my feeling is that it is not really needed 
for Seq objects either.)

2) Make Seq objects a bit smarter about which type of sequence they contain. One 
reason I don't use Bio.Seq much is that I have to write
 >>> from Bio.Alphabet import IUPAC
 >>> my_alpha = IUPAC.unambiguous_dna
 >>> from Bio.Seq import Seq
 >>> my_seq = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
which is too much typing. I am thinking about the following scheme when 
initializing a Seq object, where each step is tried only if the previous one 
did not accept the sequence:
- If the user specifies an alphabet, accept that alphabet. Raise an error if the 
sequence is not consistent with the alphabet.
- Otherwise, start by assuming the sequence is an unambiguous DNA sequence.
- If the sequence contains any characters other than ATCG, assume it is 
unambiguous RNA; otherwise accept the sequence.
- If the sequence contains any characters other than AUCG, assume it is
a protein; otherwise accept the sequence.
- If the sequence contains any characters other than ACDEFGHIKLMNPQRSTVWY, 
assume it is ambiguous DNA; otherwise accept the sequence.
- If the sequence contains any characters other than GATCRYWSMKHBVDN, assume it 
is ambiguous RNA; otherwise accept the sequence.
- If the sequence contains any characters other than GAUCRYWSMKHBVDN, assume it 
is an extended protein sequence; otherwise accept the sequence.
- If the sequence contains any characters other than ACDEFGHIKLMNPQRSTVWYBXZ, 
yell at the user.
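
To make the cascade concrete, here is a rough sketch in plain Python. The names 
guess_alphabet and CANDIDATES are made up for illustration and are not part of 
Bio.Seq; I also assume that letters are compared in upper case and that the 
result is just a descriptive string rather than an Alphabet object.

```python
# Hypothetical sketch of the proposed guessing scheme; not Biopython code.
# Candidate alphabets, in the order the proposal tries them.
CANDIDATES = [
    ("unambiguous DNA", "ATCG"),
    ("unambiguous RNA", "AUCG"),
    ("protein", "ACDEFGHIKLMNPQRSTVWY"),
    ("ambiguous DNA", "GATCRYWSMKHBVDN"),
    ("ambiguous RNA", "GAUCRYWSMKHBVDN"),
    ("extended protein", "ACDEFGHIKLMNPQRSTVWYBXZ"),
]


def guess_alphabet(sequence):
    """Return the name of the first candidate alphabet covering the sequence."""
    letters = set(sequence.upper())  # assumption: case is ignored
    for name, allowed in CANDIDATES:
        if letters <= set(allowed):
            return name
    # None of the candidates fit: "yell at the user".
    raise ValueError("sequence does not fit any known alphabet")
```

One consequence of this ordering is that a DNA sequence whose only ambiguous 
letter is N, such as "GATCN", would be guessed as a protein, because every 
one of its letters is also a valid amino acid code. A sequence is only guessed 
as ambiguous DNA if it contains a letter such as B that the protein alphabet 
lacks.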

3) When changing a sequence, check if it is still consistent with the alphabet. 
Right now, we can do
 >>> from Bio.Seq import *
 >>> from Bio.Alphabet import IUPAC
 >>> my_alpha = IUPAC.unambiguous_dna
 >>> my_seq = MutableSeq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
 >>> my_seq[:10] = "weirdstuff"
 >>> my_seq
MutableSeq(array('c', 'weirdstuffCCTATTAGGATCGAAAATCGC'), IUPACUnambiguousDNA())
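
As a rough illustration of the kind of check I have in mind, here is a small 
stand-alone class. CheckedSeq is a made-up name and not a proposal for the 
actual implementation; it takes the allowed letters as a plain string instead 
of an Alphabet object.

```python
class CheckedSeq:
    """Hypothetical mutable sequence that validates every assignment."""

    def __init__(self, data, letters):
        self.letters = set(letters)
        self._check(data)
        self.data = list(data)

    def _check(self, chars):
        # Collect any letters that are not part of the alphabet.
        bad = set(chars) - self.letters
        if bad:
            raise ValueError(
                "letters not in alphabet: %s" % "".join(sorted(bad))
            )

    def __setitem__(self, index, value):
        # Validate first, so a failed assignment leaves the data untouched.
        self._check(value)
        if isinstance(index, slice):
            self.data[index] = list(value)
        else:
            if len(value) != 1:
                raise ValueError("a single position needs a single letter")
            self.data[index] = value

    def __str__(self):
        return "".join(self.data)
```

With this, CheckedSeq('GATCGATGGG', 'GATC')[:10] = "weirdstuff" raises a 
ValueError instead of silently storing the letters.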

4) Make Seq objects understand circular genomes. Many bacterial genomes are 
circular. It would be nice if, for a circular Seq object of length 4000000, we 
could take the indices [-1000:1000], or equivalently [3999000:4001000], to get 
the 2000 letters spanning the origin.
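
A minimal sketch of the wrap-around indexing, written as a free function on a 
plain string (circular_slice is a made-up name, and a real implementation 
would live in the slicing code of the Seq class):

```python
def circular_slice(sequence, start, stop):
    """Return the letters from start up to stop, wrapping around the origin.

    Indices below 0 or beyond len(sequence) are mapped back onto the
    sequence with the modulo operator.
    """
    n = len(sequence)
    if n == 0 or stop <= start:
        return ""
    return "".join(sequence[i % n] for i in range(start, stop))
```

For example, circular_slice("ABCDEFGHIJ", -2, 3) gives "IJABC". Note that 
this differs from ordinary Python slicing, where a negative index counts from 
the end and "ABCDEFGHIJ"[-2:3] is empty, so the circular behavior would have 
to be opt-in.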

5) Perhaps it would be a good idea to add transcribe and translate methods to 
the Seq class. Currently, to translate a DNA sequence, we have to do
 >>> from Bio.Seq import Seq
 >>> from Bio import Translate
 >>> from Bio.Alphabet import IUPAC
 >>> my_alpha = IUPAC.unambiguous_dna
 >>> my_seq = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
 >>> standard_translator = Translate.unambiguous_dna_by_id[1]
 >>> standard_translator.translate(my_seq)
Seq('DRWAY*DRKS', IUPACProtein())
which is too much typing for my taste.
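
Here is a sketch of what the method-based interface could look like. 
SimpleSeq is a made-up stand-in for Seq that ignores alphabets; it hard-codes 
the standard genetic code, drops any incomplete trailing codon, and assumes 
upper-case unambiguous input.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


class SimpleSeq:
    """Hypothetical sequence class with transcribe/translate methods."""

    def __init__(self, data):
        self.data = data

    def transcribe(self):
        # DNA -> RNA: replace thymine by uracil.
        return SimpleSeq(self.data.replace("T", "U"))

    def translate(self):
        # Accept RNA as well by mapping it back to DNA letters first.
        dna = self.data.replace("U", "T")
        end = len(dna) - len(dna) % 3  # ignore an incomplete last codon
        return SimpleSeq(
            "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, end, 3))
        )
```

Translating a sequence then takes a single call, e.g. 
SimpleSeq('GATCGATGGGCCTATTAGGATCGAAAATCGC').translate().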

Any thoughts/comments/suggestions?

--Michiel.

-- 
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku
Tokyo 108-8639
Japan
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon