[BioPython] (no subject)

Thu May 5 09:29:46 EDT 2005

Hi Michiel and everyone,

Just a thought, so please don't flame me for it.
Since you will be making a new Seq object, would it be worth making it behave
more like a typical Python object?

But first a disclaimer: I realise the proposed change could mean breaking a
lot of code, so it might be a very bad idea in the end.

When I first used Biopython, I was surprised by the behaviour of the
Seq object with regard to the built-in str() and repr() functions
(I should have read the manual first, but hey...).

OK, here is the current Seq behaviour:
>>> from Bio.Seq import Seq
>>> a = 'a'*80
>>> a

'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaa'

>>> s = Seq(a)
>>> s

Seq('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaa', Alphabet())

>>> str(s)

"Seq('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaa ...', Alphabet())"

>>> repr(s)

"Seq('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaa', Alphabet())"

>>> s.tostring()

'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaa'

Now here is what I was expecting at the time, following the respective
meanings of str and repr:

>>> a = 'a'*80
>>> a

'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaa'

>>> s = Seq(a)
>>> s

Seq('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaa', Alphabet())

>>> str(s)

'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaa'

>>> repr(s)

"Seq('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaa', Alphabet())"

So what I would propose is to:
- change str(seq) to return the actual sequence, as seq.tostring() does right
  now;
- leave repr(seq) as it is;
- make seq.tostring() return str(seq) for backward compatibility (it would
  eventually be removed);
- add a new method, Seq.short() for example, which would behave like the
  current str(Seq).
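
To make this concrete, here is a minimal sketch of the four points above. It
is not Biopython code: the class name SketchSeq, the plain-string alphabet
label, and the default cutoff of 60 characters (read off the truncated output
earlier in this message) are assumptions for illustration only.

```python
import warnings


class SketchSeq:
    """Hypothetical stand-in for Seq; not the Biopython class."""

    def __init__(self, data, alphabet="Alphabet()"):
        self.data = data
        self.alphabet = alphabet  # kept as a plain label in this sketch

    def __repr__(self):
        # Unchanged: an unambiguous description of the object.
        return "Seq(%r, %s)" % (self.data, self.alphabet)

    def __str__(self):
        # Proposed: the full sequence, as tostring() returns today.
        return self.data

    def tostring(self):
        # Kept for backward compatibility, to be removed eventually.
        warnings.warn("tostring() is deprecated; use str(seq)",
                      DeprecationWarning, stacklevel=2)
        return str(self)

    def short(self, width=60):
        # Truncated display, like the current str(Seq) (assumed cutoff).
        if len(self.data) > width:
            return "Seq(%r, %s)" % (self.data[:width] + " ...", self.alphabet)
        return repr(self)
```

With this sketch, str(s) and s.tostring() give the same full string, while
s.short() gives the abbreviated form that str(s) gives today.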

I don't have any idea how much code this would break. Its feasibility
will also depend on the way the new Seq is released (I mean, do you
plan to have the current Seq and the new one co-existing for a while, or to
replace the old Seq directly?).
If the latter is the way we go, this change is certainly not desirable;
otherwise it might be something to consider.

Personally I have mixed feelings about it, but I think it is worth discussing
the matter now.

This change would make Seq objects behave more like a Python programmer
would expect. On the other hand, Biopython has been built on the current
model, and it might be a bad idea to change it after so much time.

Since the only real problem with this is the change to str(), it all
boils down to how frequently people use the current string conversion of Seq
in their code.
I do not have the impression that it is very frequent, but ...

What do you think?

Fred

On Tuesday 3 May 2005 08:45, Michiel Jan Laurens de Hoon wrote:
> Hi everybody,
>
> Recently, there was a discussion on biopython-dev about changes to the Seq
> and MutableSeq classes. I'd like to ask you if any of the proposed changes
> would cause you any problems.
>
> The current proposal is:
>
> 1) Make Seq objects mutable, and get rid of MutableSeq. The Seq class and
> the MutableSeq class basically describe the same thing, except that one is
> read-only and the other one is not. If desired, we can add a readonly flag
> to the class to describe if it is mutable or not. (Given that e.g.
> Numerical Python arrays don't have such a flag, my feeling is that it is
> not really needed for Seq objects either). For performance reasons, the new
> Seq class will be implemented in C.
>
> 2) By default, a Seq class doesn't assume a particular alphabet. Same as
> current behavior:
>
>  >>>  from Bio.Seq import *
>  >>>  Seq('ATCG')
>
> Seq('ATCG', Alphabet())
> However, if the user decides to specify the alphabet explicitly, input to
> the sequence will be checked for consistency with the alphabet. So
>
>  >>>  from Bio.Seq import *
>  >>>  from Bio.Alphabet import IUPAC
>  >>>  my_alpha = IUPAC.unambiguous_dna
>  >>>  s = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
>  >>>  s[:3] = "XYZ"
>
> will raise an error.
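
As a rough illustration of the check described in point 2 (not the planned C
implementation), a sketch might look like this. The class name CheckedSeq, the
`letters` argument, and the choice of ValueError are all assumptions; the
proposal only says that an error will be raised.

```python
class CheckedSeq:
    """Hypothetical mutable sequence that validates letters; not Biopython code."""

    def __init__(self, data, letters=None):
        self.letters = letters  # None means "no particular alphabet"
        self._check(data)
        self.data = list(data)

    def _check(self, text):
        # Without an explicit alphabet, anything is accepted.
        if self.letters is None:
            return
        bad = sorted(set(text) - set(self.letters))
        if bad:
            raise ValueError("letters not in alphabet: %s" % "".join(bad))

    def __setitem__(self, index, value):
        # Validate before modifying, so a failed assignment changes nothing.
        self._check(value)
        if isinstance(index, slice):
            self.data[index] = list(value)
        elif len(value) == 1:
            self.data[index] = value
        else:
            raise ValueError("a single position needs a single letter")

    def __str__(self):
        return "".join(self.data)
```

Here CheckedSeq('GATC', 'GATC')[:3] = "XYZ" raises ValueError, while the same
assignment on CheckedSeq('GATC') (no alphabet given) is accepted.
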
>
> 3) Make Seq objects understand circular genomes. Many bacterial genomes are
> circular. It would be nice if we could take the indices [-1000:1000] from a
> Seq object, if it is circular, or [3999000:4001000] if the sequence is
> circular with length 4000000.
> Circular genomes will likely be implemented as an optional keyword (perhaps
> "topology") when creating the Seq object, with corresponding set_topology,
> get_topology methods.
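
One way to read the circular indexing in point 3 is to take every index modulo
the sequence length. The helper below is only a sketch of that reading:
circular_slice is a hypothetical name, it works on plain strings, and the real
design (a topology keyword on Seq) is still open.

```python
def circular_slice(seq, start, stop):
    """Return seq[start:stop] on a circular string, wrapping around the ends."""
    n = len(seq)
    if n == 0 or stop <= start:
        return ""
    # Python's % always returns a non-negative result for a positive modulus,
    # so negative indices and indices past the end both wrap correctly.
    return "".join(seq[i % n] for i in range(start, stop))
```

For example, circular_slice("ABCDEFGH", -2, 3) gives "GHABC", and
circular_slice("ABCDEFGH", 6, 10) gives "GHAB".
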
>
> 4) Perhaps it would be a good idea to add transcribe and translate methods
> to the Seq class. Currently, to translate a DNA sequence, we have to do
>
>  >>> from Bio.Seq import Seq
>  >>> from Bio import Translate
>  >>> from Bio.Alphabet import IUPAC
>  >>> my_alpha = IUPAC.unambiguous_dna
>  >>> my_seq = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
>  >>> standard_translator = Translate.unambiguous_dna_by_id[1]
>  >>> standard_translator.translate(my_seq)
>
> Seq('AIVMGR*KGAR', IUPACProtein())
> which is too much typing for my taste.
>
>
> Questions/comments/suggestions are welcome. None of this has actually been
> coded yet, so it's all still open to discussion.
>
>
> --Michiel.

-- 
Frédéric Sohm
Equipe INRA U1126 "Morphogenèse du système nerveux des Chordés"
UPR 2197 DEPSN, CNRS
Institut de Neurosciences A. Fessard
1 Avenue de la Terrasse
91 198 GIF-SUR-YVETTE
FRANCE
Phone: +33 (0) 1 69 82 34 12
Fax:+33 (0) 1 69 82 34 47