[BioPython] Rethinking Seq objects
bneron at pasteur.fr
Tue May 3 05:27:39 EDT 2005
* Michiel Jan Laurens de Hoon <mdehoon at ims.u-tokyo.ac.jp> (20050503 15:45):
> Hi everybody,
>
> Recently, there was a discussion on biopython-dev about changes to the Seq
> and MutableSeq classes. I'd like to ask you if any of the proposed changes
> would cause you any problems.
>
> The current proposal is:
>
> 1) Make Seq objects mutable, and get rid of MutableSeq. The Seq class and
> the MutableSeq class basically describe the same thing, except that one is
> read-only and the other one is not. If desired, we can add a readonly flag
> to the class to describe if it is mutable or not. (Given that e.g.
> Numerical Python arrays don't have such a flag, my feeling is that it is
> not really needed for Seq objects either). For performance reasons, the new
> Seq class will be implemented in C.
>
> 2) By default, a Seq class doesn't assume a particular alphabet. Same as
> current behavior:
> >>> from Bio.Seq import *
> >>> Seq('ATCG')
> Seq('ATCG', Alphabet())
> However, if the user decides to specify the alphabet explicitly, input to
> the sequence will be checked for consistency with the alphabet. So
> >>> from Bio.Seq import *
> >>> from Bio.Alphabet import IUPAC
> >>> my_alpha = IUPAC.unambiguous_dna
> >>> s = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
> >>> s[:3] = "XYZ"
> will raise an error.
>
> 3) Make Seq objects understand circular genomes. Many bacterial genomes are
> circular. It would be nice if we could take the indices [-1000:1000] from a
> Seq object, if it is circular, or [3999000:4001000] if the sequence is
> circular with length 4000000.
> Circular genomes will likely be implemented as an optional keyword (perhaps
> "topology") when creating the Seq object, with corresponding set_topology,
> get_topology methods.
>
> 4) Perhaps it would be a good idea to add transcribe and translate methods
> to the Seq class. Currently, to translate a DNA sequence, we have to do
> >>> from Bio.Seq import Seq
> >>> from Bio import Translate
> >>> from Bio.Alphabet import IUPAC
> >>> my_alpha = IUPAC.unambiguous_dna
> >>> my_seq = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
> >>> standard_translator = Translate.unambiguous_dna_by_id[1]
> >>> standard_translator.translate(my_seq)
> Seq('DRWAY*DRKS', HasStopCodon(IUPACProtein(), '*'))
> which is too much typing for my taste.
>
>
> Questions/comments/suggestions are welcome. None of this has actually been
> coded yet, so it's all still open to discussion.
>
>
> --Michiel.
>
I agree with the suggestions above, but I'd like to add a remark on how
the Seq object handles the alphabet of the sequence, and more precisely
the case (upper or lower) of its letters.
Just an example:
Python 2.3.4 (#1, Mar 11 2005, 17:34:27)
[GCC 3.3.5 (Gentoo Linux 3.3.5-r1, ssp-3.3.2-3, pie-8.7.7.1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Seq import Seq
>>> from Bio import Translate
>>> from Bio.Alphabet import IUPAC
>>> my_alpha = IUPAC.unambiguous_dna
>>> my_seq_upper = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
>>> my_seq_lower = Seq('gatcgatgggcctattaggatcgaaaatcgc', my_alpha)
>>> standard_translator = Translate.unambiguous_dna_by_id[1]
>>> standard_translator.translate(my_seq_upper)
Seq('DRWAY*DRKS', HasStopCodon(IUPACProtein(), '*'))
>>> standard_translator.translate(my_seq_lower)
Seq('**********', HasStopCodon(IUPACProtein(), '*'))
>>>
Obviously, lower case does not work in the Seq object.
But I get no exception, either when the Seq is created or during translation.
Worse, the translate method returns a value, but that value is meaningless.
(Transcription behaves the same way.)
I think it would be a good thing to correct this behavior.
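
A minimal sketch of the kind of check that would avoid this, in plain
Python rather than Biopython code. The function name checked_sequence, its
ignore_case flag and the UNAMBIGUOUS_DNA constant are hypothetical and exist
only for illustration; how the real Seq class should do this is still open.

```python
# Hypothetical helper, not part of Biopython: reject letters that are
# outside the alphabet instead of silently accepting them.
UNAMBIGUOUS_DNA = "GATC"  # assumed letter set for unambiguous DNA


def checked_sequence(data, letters=UNAMBIGUOUS_DNA, ignore_case=False):
    """Return data if every character is in letters, else raise ValueError.

    With ignore_case=True the data is converted to upper case first.
    """
    if ignore_case:
        data = data.upper()
    bad = sorted(set(data) - set(letters))
    if bad:
        raise ValueError("letters not in alphabet: %s" % "".join(bad))
    return data


print(checked_sequence("GATCGATGGG"))                    # accepted as is
print(checked_sequence("gatcgatggg", ignore_case=True))  # upper-cased first
try:
    checked_sequence("gatcgatggg")                       # strict: lower case rejected
except ValueError as err:
    print("rejected:", err)
```

Either policy (raise an error, or normalise the case) would be better than
returning a translation made only of stop codons.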
--
Bertrand Neron
Groupe Logiciels et Banques de Donnees
Institut Pasteur
Tel: 01 45 68 86 78
Fax: 01 40 61 30 80