[BioPython] Rethinking Seq objects

Tue May 3 05:27:39 EDT 2005

* Michiel Jan Laurens de Hoon <mdehoon at ims.u-tokyo.ac.jp> (20050503 15:45):
> Hi everybody,
> 
> Recently, there was a discussion on biopython-dev about changes to the Seq 
> and MutableSeq classes. I'd like to ask whether any of the proposed changes 
> would cause you any problems.
> 
> The current proposal is:
> 
> 1) Make Seq objects mutable, and get rid of MutableSeq. The Seq class and 
> the MutableSeq class basically describe the same thing, except that one is 
> read-only and the other one is not. If desired, we can add a readonly flag 
> to the class to describe if it is mutable or not. (Given that e.g. 
> Numerical Python arrays don't have such a flag, my feeling is that it is 
> not really needed for Seq objects either). For performance reasons, the new 
> Seq class will be implemented in C.
> 
> 2) By default, a Seq object doesn't assume a particular alphabet. This is 
> the same as the current behavior:
> >>> from Bio.Seq import *
> >>> Seq('ATCG')
> Seq('ATCG', Alphabet())
> However, if the user decides to specify the alphabet explicitly, input to 
> the sequence will be checked for consistency with the alphabet. So
> >>> from Bio.Seq import *
> >>> from Bio.Alphabet import IUPAC
> >>> my_alpha = IUPAC.unambiguous_dna
> >>> s = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
> >>> s[:3] = "XYZ"
> will raise an error.
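
A minimal sketch of what such a check could look like, in plain Python and
without Biopython. CheckedSeq and its "letters" argument are hypothetical
names used only for illustration; they are not part of Bio.Seq, and the
actual proposal is to implement the class in C.

```python
class CheckedSeq:
    """Hypothetical mutable sequence that validates letters against an alphabet."""

    def __init__(self, data, letters=None):
        # letters=None means "no particular alphabet": nothing is checked.
        self.letters = letters
        self._check(data)
        self.data = list(data)

    def _check(self, text):
        if self.letters is None:
            return
        bad = sorted(set(text) - set(self.letters))
        if bad:
            raise ValueError("letters not in alphabet: %s" % "".join(bad))

    def __setitem__(self, index, value):
        # Validate before touching the stored data, so a failed
        # assignment leaves the sequence unchanged.
        self._check(value)
        if isinstance(index, slice):
            self.data[index] = list(value)
        elif len(value) == 1:
            self.data[index] = value
        else:
            raise ValueError("a single position needs a single letter")

    def __str__(self):
        return "".join(self.data)


s = CheckedSeq("GATCGATGGG", letters="GATC")
s[:3] = "AAA"          # accepted: all letters are in the alphabet
try:
    s[:3] = "XYZ"      # rejected: X, Y and Z are not in the alphabet
except ValueError as err:
    print("rejected:", err)
```

With letters left at None, the same assignment is accepted, which matches
the "no particular alphabet" default described in the proposal.
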
> 
> 3) Make Seq objects understand circular genomes. Many bacterial genomes are 
> circular. For a circular Seq object of length 4000000, it would be nice if 
> we could take the indices [-1000:1000], or equivalently [3999000:4001000], 
> across the origin.
> Circular genomes will likely be implemented as an optional keyword (perhaps 
> "topology") when creating the Seq object, with corresponding set_topology 
> and get_topology methods.
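
The wrap-around indexing can be sketched with modular arithmetic on a plain
string. circular_slice is a hypothetical helper, not an existing Biopython
function, and a real implementation would probably copy whole blocks rather
than single letters.

```python
def circular_slice(seq, start, stop):
    """Return seq[start:stop], treating seq as circular (hypothetical helper)."""
    n = len(seq)
    if n == 0 or stop <= start:
        return seq[:0]
    # Visit stop - start positions, wrapping each index around the origin.
    return "".join(seq[i % n] for i in range(start, stop))


genome = "ABCDEFGHIJ"                    # toy circular sequence of length 10
print(circular_slice(genome, -2, 3))     # IJABC
print(circular_slice(genome, 8, 13))     # IJABC, the same region
```
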
> 
> 4) Perhaps it would be a good idea to add transcribe and translate methods 
> to the Seq class. Currently, to translate a DNA sequence, we have to do
> >>> from Bio.Seq import Seq
> >>> from Bio import Translate
> >>> from Bio.Alphabet import IUPAC
> >>> my_alpha = IUPAC.unambiguous_dna
> >>> my_seq = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
> >>> standard_translator = Translate.unambiguous_dna_by_id[1]
> >>> standard_translator.translate(my_seq)
> Seq('DRWAY*DRKS', HasStopCodon(IUPACProtein(), '*'))
> which is too much typing for my taste.
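
For illustration, here is a rough sketch of how transcribe and translate
could read as methods. SketchSeq is a hypothetical stand-in class, not the
proposed Seq implementation. The codon table is the standard genetic code
(table 1), built here by hand so that the example does not depend on
Bio.Translate.

```python
from itertools import product

# Standard genetic code, bases in T, C, A, G order, third base varying fastest.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


class SketchSeq:
    """Hypothetical sequence class with transcribe/translate as methods."""

    def __init__(self, data):
        self.data = data

    def transcribe(self):
        # DNA -> RNA: replace thymine with uracil.
        return SketchSeq(self.data.replace("T", "U"))

    def translate(self):
        # Accept RNA as well as DNA; a trailing partial codon is dropped.
        dna = self.data.replace("U", "T")
        codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
        return SketchSeq("".join(CODON_TABLE[c] for c in codons))

    def __str__(self):
        return self.data


my_seq = SketchSeq("GATCGATGGGCCTATTAGGATCGAAAATCGC")
print(my_seq.translate())    # DRWAY*DRKS
```
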
> 
> 
> Questions/comments/suggestions are welcome. None of this has actually been 
> coded yet, so it's all still open to discussion.
> 
> 
> --Michiel.
> 

I agree with the suggestions above, but I'd like to add a remark about the way
the Seq object handles the alphabet of the sequence, and more precisely
the case of the letters.
Just an example:

Python 2.3.4 (#1, Mar 11 2005, 17:34:27) 
[GCC 3.3.5  (Gentoo Linux 3.3.5-r1, ssp-3.3.2-3, pie-8.7.7.1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Seq import Seq
>>> from Bio import Translate
>>> from Bio.Alphabet import IUPAC
>>> my_alpha = IUPAC.unambiguous_dna
>>> my_seq_upper = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
>>> my_seq_lower = Seq('gatcgatgggcctattaggatcgaaaatcgc', my_alpha)
>>> standard_translator = Translate.unambiguous_dna_by_id[1]
>>> standard_translator.translate(my_seq_upper)
Seq('DRWAY*DRKS', HasStopCodon(IUPACProtein(), '*'))
>>> standard_translator.translate(my_seq_lower)
Seq('**********', HasStopCodon(IUPACProtein(), '*'))
>>> 

Obviously, lower case doesn't work in the Seq object.
But no exception is raised, either when the Seq is created or during the translation.
Worse, the translate method returns a value, but that value is meaningless.
(Transcription behaves in the same way.)

I think it would be a good thing to correct this behavior.
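
One possible fix, sketched in plain Python: either fold the case when the
sequence is created, or refuse letters that are not in the alphabet.
normalize_dna and its arguments are hypothetical names, not existing
Biopython functions; which of the two policies is better is open to
discussion.

```python
def normalize_dna(text, letters="GATC", fold_case=True):
    """Return text checked against letters (hypothetical helper).

    With fold_case=True, lower-case input is converted to upper case first;
    with fold_case=False, lower-case input is rejected like any other
    letter that is not in the alphabet.
    """
    if fold_case:
        text = text.upper()
    bad = sorted(set(text) - set(letters))
    if bad:
        raise ValueError("letters not in alphabet: %s" % "".join(bad))
    return text


print(normalize_dna("gatcgatggg"))                  # GATCGATGGG
try:
    normalize_dna("gatcgatggg", fold_case=False)
except ValueError as err:
    print("rejected:", err)
```

Either way, the problem is reported when the sequence is created, instead
of showing up later as a meaningless translation.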

-- 
Bertrand Neron

Groupe Logiciels et Banques de Donnees 
Institut Pasteur

Tel: 01 45 68 86 78
Fax: 01 40 61 30 80