[BioPython] Rethinking Seq objects

Tue May 3 02:45:00 EDT 2005

Hi everybody,

Recently, there was a discussion on biopython-dev about changes to the Seq and 
MutableSeq classes. I'd like to ask whether any of the proposed changes would 
cause problems for you.

The current proposal is:

1) Make Seq objects mutable, and get rid of MutableSeq. The Seq class and the 
MutableSeq class basically describe the same thing, except that one is read-only 
and the other is not. If desired, we can add a readonly flag to the class to 
indicate whether an object is mutable. (Given that e.g. Numerical Python arrays 
don't have such a flag, my feeling is that it is not really needed for Seq 
objects either.) For performance reasons, the new Seq class will be implemented 
in C.

2) By default, a Seq object doesn't assume a particular alphabet. This is the 
same as the current behavior:
 >>> from Bio.Seq import *
 >>> Seq('ATCG')
Seq('ATCG', Alphabet())
However, if the user decides to specify the alphabet explicitly, input to the 
sequence will be checked for consistency with the alphabet. So
 >>> from Bio.Seq import *
 >>> from Bio.Alphabet import IUPAC
 >>> my_alpha = IUPAC.unambiguous_dna
 >>> s = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
 >>> s[:3] = "XYZ"
will raise an error.
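
To make the intended behavior concrete, here is a minimal pure-Python sketch of 
the check. The class name CheckedSeq and its "letters" argument are made up for 
this illustration (a plain string of allowed letters stands in for an Alphabet 
object); this is not the proposed C implementation.

```python
class CheckedSeq:
    """Hypothetical mutable sequence that validates input against an alphabet."""

    def __init__(self, data, letters=None):
        # letters=None means "no particular alphabet": nothing is checked.
        self.letters = letters
        self._check(data)
        self._data = list(data)

    def _check(self, text):
        """Raise ValueError if text contains letters outside the alphabet."""
        if self.letters is None:
            return
        bad = sorted(set(text) - set(self.letters))
        if bad:
            raise ValueError("letters not in alphabet: " + "".join(bad))

    def __setitem__(self, index, value):
        # Validate before modifying, so a failed assignment changes nothing.
        self._check(value)
        if isinstance(index, slice):
            self._data[index] = list(value)
        else:
            if len(value) != 1:
                raise ValueError("expected a single letter")
            self._data[index] = value

    def __str__(self):
        return "".join(self._data)


s = CheckedSeq("GATCGATGGGCC", letters="GATC")
s[:3] = "TTT"        # fine: all letters are in the alphabet
try:
    s[:3] = "XYZ"    # rejected: X, Y and Z are not DNA letters
except ValueError as err:
    print(err)
```

Without an alphabet, CheckedSeq("XYZ") is accepted as is, which matches the 
default behavior described above.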

3) Make Seq objects understand circular genomes. Many bacterial genomes are 
circular. It would be nice if, for a circular Seq object, we could take the 
slice [-1000:1000], or equivalently [3999000:4001000] for a circular sequence 
of length 4000000.
Circularity will likely be specified with an optional keyword (perhaps 
"topology") when creating the Seq object, with corresponding set_topology and 
get_topology methods.
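
As a rough sketch of what wrap-around slicing could mean, the function below 
works on a plain Python string. The name circular_slice is hypothetical and only 
for illustration; in the proposal this behavior would live inside Seq slicing, 
switched on by the topology setting.

```python
def circular_slice(seq, start, stop):
    """Return seq[start:stop], treating seq (a str) as circular.

    Indices may be negative or run past the end; they wrap around
    modulo len(seq). An empty string is returned if stop <= start.
    """
    n = len(seq)
    if n == 0 or stop <= start:
        return ""
    # Simple but not fast: one modulo operation per letter.
    return "".join(seq[i % n] for i in range(start, stop))


genome = "ABCDEFGHIJ"                  # toy "genome" of length 10
print(circular_slice(genome, -3, 3))   # last three letters, then first three
print(circular_slice(genome, 7, 13))   # same region, indexed past the end
```

Both calls select the same region, just as [-1000:1000] and [3999000:4001000] 
would for a circular sequence of length 4000000.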

4) Perhaps it would be a good idea to add transcribe and translate methods to 
the Seq class. Currently, to translate a DNA sequence, we have to do
 >>> from Bio.Seq import Seq
 >>> from Bio import Translate
 >>> from Bio.Alphabet import IUPAC
 >>> my_alpha = IUPAC.unambiguous_dna
 >>> my_seq = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
 >>> standard_translator = Translate.unambiguous_dna_by_id[1]
 >>> standard_translator.translate(my_seq)
Seq('DRWAY*DRKS', IUPACProtein())
which is too much typing for my taste.
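
For comparison, here is a small self-contained sketch of what such methods might 
look like. SketchSeq is a made-up stand-in class, not the real Seq; the codon 
table is the standard genetic code (table 1) written in the usual TCAG order, 
and ignoring trailing bases that don't form a full codon is my assumption.

```python
BASES = "TCAG"
# Standard genetic code (table 1), codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


class SketchSeq:
    """Hypothetical stand-in for Seq with transcribe/translate methods."""

    def __init__(self, data):
        self.data = data

    def transcribe(self):
        """Return the RNA version of a DNA coding strand (T becomes U)."""
        return SketchSeq(self.data.replace("T", "U"))

    def translate(self, table=STANDARD_TABLE):
        """Translate DNA codon by codon; leftover trailing bases are ignored."""
        n = len(self.data)
        codons = (self.data[i:i + 3] for i in range(0, n - n % 3, 3))
        return SketchSeq("".join(table[codon] for codon in codons))

    def __str__(self):
        return self.data


my_seq = SketchSeq("ATGGCCTAA")
print(my_seq.translate())    # one call instead of looking up a translator
print(my_seq.transcribe())
```

With methods like these, translation becomes a single call on the sequence 
object, and a different genetic code could be selected with an optional argument.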

Questions/comments/suggestions are welcome. None of this has actually been coded 
yet, so it's all still open to discussion.

--Michiel.

-- 
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku
Tokyo 108-8639
Japan
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon