[Biopython-dev] sequence class proposal

Blanca Postigo Jose Miguel jblanca at btc.upv.es
Mon May 26 05:24:30 UTC 2008


> One of your points seemed to be that the SeqRecord couldn't have a
> __getitem__ and methods like reverse, complement, etc.  I don't see
> why it couldn't have these.  Perhaps rather than introducing a whole
> new class, enhancing the SeqRecord would be a better avenue.
My main concern with SeqRecord is that it has a Seq. If we want a slice or a
reverse we would do:
my_seq = SeqRecord(Seq('ACTGTGAC'))
my_seq.seq[1:5]
my_seq.seq.reverse()
If we add per-residue annotations (like qualities) to SeqRecord, how could they
be reversed when we call .seq.reverse() directly? I don't know how this could
work.
my_seq = SeqRecord(Seq('ACTG'), Qual([10,20,30,40]))
my_seq.seq.reverse()
It would leave an invalid record, because the qualities no longer match the
residues:
str(my_seq.seq) -> 'GTCA'
str(my_seq.qual) -> [10,20,30,40]
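A minimal, self-contained sketch of this mismatch follows. The Seq and SeqRecord
classes here are hypothetical stand-ins written only for illustration (they are
not the real Biopython classes), and the in-place reverse() is an assumption
that mirrors the calls above:

```python
class Seq:
    """Hypothetical stand-in for a sequence with an in-place reverse()."""

    def __init__(self, data):
        self.data = list(data)

    def reverse(self):
        # Reverses only the residues; it knows nothing about any record.
        self.data.reverse()

    def __str__(self):
        return "".join(self.data)


class SeqRecord:
    """Hypothetical stand-in: HAS a Seq plus per-residue qualities."""

    def __init__(self, seq, qual):
        self.seq = seq
        self.qual = list(qual)


record = SeqRecord(Seq("ACTG"), [10, 20, 30, 40])
record.seq.reverse()          # bypasses the record entirely

print(str(record.seq))        # GTCA
print(record.qual)            # [10, 20, 30, 40] (still in the old order)
```

The record has no chance to react, because the call goes straight to the
contained object.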
One possibility is to have methods like __getitem__ both in SeqRecord and in
Seq, so that both of these would work:
my_seq.seq[1:3]
my_seq[1:3]
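A rough sketch of how a record-level __getitem__ and reverse could keep
residues and qualities in step. QualRecord and its attribute names are
hypothetical, chosen only for this example; this is not an existing Biopython
class, and it only handles slices:

```python
class QualRecord:
    """Hypothetical record: residues plus one quality value per residue."""

    def __init__(self, residues, qual):
        if len(residues) != len(qual):
            raise ValueError("need exactly one quality per residue")
        self.residues = str(residues)
        self.qual = list(qual)

    def __getitem__(self, index):
        # Apply the same slice to both parts so they stay aligned.
        if not isinstance(index, slice):
            raise TypeError("this sketch only supports slices")
        return QualRecord(self.residues[index], self.qual[index])

    def reverse(self):
        # Return a new record with both parts reversed together.
        return self[::-1]

    def __str__(self):
        return self.residues


rec = QualRecord("ACTG", [10, 20, 30, 40])

sub = rec[1:3]
print(str(sub), sub.qual)     # CT [20, 30]

rev = rec.reverse()
print(str(rev), rev.qual)     # GTCA [40, 30, 20, 10]
```

Because every operation goes through the record, the qualities cannot drift
away from the residues.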
Just for testing I have written a RichSeq that is compatible with both Seq and
SeqRecord, but that's very confusing. Does such a SeqRecord HAVE a sequence, or
IS it one? It could work, but I feel that it is wrong, and that it is easier to
explain to the users that a new, improved SeqRecord has been created (RichSeq)
and that they should migrate to it.
Another problem is difficult to solve. If RichSeq is compatible with Seq, as
Michel wants (and I agree with that), how could it also be compatible with
SeqRecord? The parameters of their constructors are not compatible:
SeqRecord(seq, ...)
Seq(data, alphabet...)
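To make the clash concrete, here is a small sketch with hypothetical stand-in
classes (not the real Biopython ones), reduced to the first two positional
parameters of each constructor. It assumes RichSeq keeps the Seq-style order,
with the record fields coming after alphabet:

```python
class Seq:
    """Hypothetical stand-in with the Seq-style signature."""

    def __init__(self, data, alphabet=None):
        self.data = data
        self.alphabet = alphabet


class SeqRecord:
    """Hypothetical stand-in with the SeqRecord-style signature."""

    def __init__(self, seq, id="<unknown id>"):
        self.seq = seq
        self.id = id


class RichSeq(Seq):
    """Hypothetical RichSeq: Seq arguments first, record fields after."""

    def __init__(self, data, alphabet=None, id="<unknown id>"):
        Seq.__init__(self, data, alphabet)
        self.id = id


# Code written for SeqRecord passes the id as the second positional argument.
rec = SeqRecord(Seq("ACTG"), "seq1")
print(rec.id)                 # seq1

# The same call style against RichSeq silently binds "seq1" to alphabet.
rich = RichSeq("ACTG", "seq1")
print(rich.alphabet)          # seq1
print(rich.id)                # <unknown id>
```

Keyword arguments would avoid the clash, but existing positional calls would
still break.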
I would happily improve on RichSeq, but I don't know how to do it in a sane way.
What do you think?

>
> Also, I do think we should bear in mind the BioSQL sequence
> representation, which we currently expose in a SeqRecord/Seq like way.
>  I wouldn't want to lose this / have to completely re-write the
> Biopython BioSQL code.
I will look into that.
Best regards,

Jose Blanca


>
> Peter
>
> On Sun, May 25, 2008 at 9:12 PM, Blanca Postigo Jose Miguel
> <jblanca at btc.upv.es> wrote:
> > Dear biopythonistas:
> > First of all, my apologies for the MutableSeq reimplementation. I did it
> > just for the sake of learning more about Python and Biopython, not to
> > achieve a speedier implementation. It has been a good learning exercise
> > for me, but now let's go for the meat...
> >
> > Everything that follows is just my opinion on the sequence classes. Mine
> > is not a well-informed opinion, and I would just like to show you my
> > ideas to get some feedback and to learn from you.
> >
> > Since this remodelling of the sequence classes is a complex topic, I
> > would like to explain my ideas about it in some order. I won't go into
> > implementation details; I will just discuss the API of the classes.
> > I think that Seq and MutableSeq are pretty OK, although MutableSeq has
> > some extra methods that depend on the implementation and are not relevant
> > for a sequence class (append, insert, pop, remove). In general Seq and
> > MutableSeq should have the same API; that would make them simpler to use.
> >
> > I think that the main problem is SeqRecord. SeqRecord IS NOT a sequence,
> > it HAS a sequence; that's its main flaw. A more capable Seq class should
> > be a Seq. My proposal is to create a RichSeq that inherits from Seq and a
> > MutableRichSeq that inherits from MutableSeq. I've been doing some coding
> > and some thinking about that. I'm discussing this with you because I
> > would like to improve the design of the API of such a sequence, and I
> > could implement it. Its main design guidelines would be:
> > - Compatible with Seq or with MutableSeq. Every time that you can use a
> > Seq class you can also use a more capable RichSeq without changing
> > anything in your program.
> > - RichSeq IS a Seq, it inherits from Seq.
> > - RichSeq is similar to SeqRecord, but they aren't compatible.
> >        The SeqRecord constructor is:
> >    def __init__(self, seq, id = "<unknown id>", name = "<unknown name>",
> >                 description = "<unknown description>", dbxrefs = None,
> >                 features = None):
> >        and the RichSeq one may be:
> >    def __init__(self, seq=None, alphabet = None,
> >                 id = "<unknown id>", name = "<unknown name>",
> >                 description = "<unknown description>", dbxrefs = None,
> >                 features = None):
> >        RichSeq has a seq (or it could be called data) and an alphabet
> >        (like the Seq class), while SeqRecord has a Seq object.
> >        RichSeq would not have a .seq property.
> > - RichSeq has a __getitem__ method capable of things like RichSeq[1:2].
> > It would also have the methods reverse, complement, etc. That's not
> > possible with SeqRecord.
> > - RichSeq should be a new-style class; what about Seq and MutableSeq?
> > - From a comment by Michel:
> >        1) A Seq object is basically a string, so it should behave as if
> >        it were subclassed from string.
> >        2) As a result, functions that have a sequence as an argument, but
> >        don't need the added features of a Seq object, should work with
> >        strings as well as Seq objects.
> >        4) Currently, Seq objects have an associated alphabet; SeqRecord
> >        objects have annotations, dbxrefs, a description, features, id,
> >        and name. I think a new Seq object should have both, so that we
> >        can avoid having both a Seq and a SeqRecord class. Of course, some
> >        or all of these fields can remain None. (I would add that even
> >        the seq could be None)
> > If Biopython had a class like RichSeq I wouldn't use SeqRecord. Also, the
> > transition from SeqRecord to RichSeq would be very easy, and both classes
> > could coexist for as long as you like.
> > Also, per-residue annotation is very easy to implement using the
> > features. In fact I have already done it using a RichFeature class, but I
> > will discuss that in another mail.
> > RichSeq is easier to extend than SeqRecord; that's its main advantage. I
> > have pretty wild plans for a class like RichSeq. A class like
> > SeqWithQuality or the Bio::Seq::MetaI from Bioperl would be very easy to
> > derive from RichSeq. They would be just easier interfaces to the more
> > capable and general RichSeq. Even Alignment would be derived from
> > RichSeq. An Alignment IS a sequence with subsequences in it. I have also
> > implemented a prototype of that and it works quite OK with very little
> > coding.
> > These are the more general remarks about RichSeq. What do you think? Is
> > it a good idea to go beyond SeqRecord for Biopython? Could something like
> > RichSeq be a possible way to do it?
> >
> > Now I would like to list the open discussion points regarding the
> > sequence class APIs.
> > - annotations is not in the constructor of SeqRecord. There are two
> > options: add it to the RichSeq constructor or remove it altogether. In my
> > implementation a feature can span the whole sequence length or can have a
> > range attached. In this way annotations are just a special case of
> > features. We would have to decide between dict and list for the API.
> >
> > - __getitem__ should always return a RichSeq. It's more consistent to
> > return the same type for a_seq[1:2] and a_seq[1]. If someone wants a
> > character they can do str(seq)[1].
> >
> > - no seq property in RichSeq.
> >
> > - __str__ is enough, so tostring() is not necessary; for more complex
> > representations we have __repr__. tostring() could be kept for
> > compatibility with the Seq and MutableSeq API.
> >
> > - What to do with id, name and the str annotations when a slice is
> > requested? If seq.name is 'a_sequence', should seq[1:10].name be
> > 'a_sequence', 'a_sequence [1:10]' or ''? The same problem arises with
> > __add__ and __radd__. This is a problem, but one of the three
> > alternatives should be chosen and explained in the documentation. A
> > better solution is in my RichFeature class, but I won't discuss it now.
> >
> > - __iter__ iterates over the sequence as a character string.
> >
> > - __add__ and __radd__
> >
> > - .upper(), .count(), .lower()
> >
> > - .data property. I think that this is an implementation detail and it
> > should be deprecated in Seq and MutableSeq.
> >
> > Well, that's all; sorry for the long mail. I'm enjoying working on this
> > problem and learning from you.
> > Best regards,
> >
> > Jose Blanca
> >
> >
>

