[BioPython] Adding startswith and endswith methods to the Seq object

Peter biopython at maubp.freeserve.co.uk
Thu Apr 16 09:10:53 UTC 2009


> Peter wants to have the Seq.startswith function and do stuff like:
>>if record.seq.startswith(primer) :
>>  record = record[crop:]
>
> Leighton would like to have an even more powerful method which would
> do things like:
>>>> Seq("TAG").startswith("TA[CG]")
>
> Which is quite cool, but Peter raises objections to the semantics of
> startswith called with arbitrary strings.
>
> I think that the issue would be resolved if the startswith method
> would not accept strings, but Seqs or Motifs.

Note that the existing search-related Seq methods (find, rfind, split,
rsplit) already take a string or another Seq object, so I was
intending (with the patch on Bug 2809) that startswith and endswith
would do the same.  However, while they accept Seq objects like
Seq("TAN",generic_dna), these methods all still search for "TAN"
literally, just like a Python string would.
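
To illustrate the literal matching, here is a minimal sketch that uses
plain Python strings as a stand-in for Seq objects (an assumption, so
that it runs without Biopython).  The "N" is treated as an ordinary
character, not as an ambiguity code:

```python
# Plain str used as a stand-in for a Seq object (assumption for this sketch).
target = "TAGGCTTANCC"

# "N" is not expanded to A/C/G/T; only the literal text "TAN" is found.
literal_hit = target.find("TAN")

# The target starts with "TAG", so a literal "TAN" prefix does not match.
starts_with_tan = target.startswith("TAN")
starts_with_tag = target.startswith("TAG")
```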

Having these Seq object methods all cope with a Motif object is an
interesting idea - I hadn't thought of that.  We can have string or
Seq arguments act as dumb python strings (no ambiguity magic), but
giving a Motif object allows the ambiguity matches to be handled
explicitly.
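
A rough sketch of that dispatch, just to make the idea concrete.
ToySeq and ToyMotif are hypothetical stand-ins written for this
example (they are not Bio.Seq or Bio.Motif), and the toy motif simply
wraps a regular expression:

```python
import re


class ToyMotif:
    """Hypothetical stand-in for a motif; wraps a regular expression."""

    def __init__(self, regex):
        self._pattern = re.compile(regex)

    def matches_start(self, text):
        # re.match anchors at the start of the text.
        return self._pattern.match(text) is not None


class ToySeq:
    """Hypothetical stand-in for Seq; holds the sequence as a plain string."""

    def __init__(self, data):
        self.data = data

    def __str__(self):
        return self.data

    def startswith(self, prefix):
        if isinstance(prefix, ToyMotif):
            # Ambiguity-aware path: delegate the matching to the motif.
            return prefix.matches_start(self.data)
        # str or ToySeq argument: literal comparison, no ambiguity handling.
        return self.data.startswith(str(prefix))


seq = ToySeq("TAGGCT")
literal = seq.startswith("TAN")                    # "N" taken literally
via_seq = seq.startswith(ToySeq("TAG"))            # literal, Seq-like argument
via_motif = seq.startswith(ToyMotif("TA[ACGT]"))   # explicit ambiguity
```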

I would like to clarify that I was thinking more the other way round:
the Motif object has a search method where you give it a Seq (or
string?) to be searched.  Much like Python's regular expression
objects take the target string as an argument.  One advantage of doing
it this way round is the Seq object is kept quite simple (which I
think is a good thing), and all the ambiguity complexity lives in
Bio.Motif instead.
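
For comparison, this is the shape of the regular expression API I
have in mind, using only Python's re module.  The compiled pattern
plays the part a Motif object might play, and a plain string stands in
for the Seq (both are assumptions for the sake of the sketch):

```python
import re

# The pattern object owns the matching logic; the target is just an argument.
pattern = re.compile("TA[CG]")

target = "GGTACTT"  # plain string standing in for a Seq

anywhere = pattern.search(target)    # first match anywhere in the target
at_start = pattern.match(target)     # anchored at the start, like startswith
prefix_ok = pattern.match("TAGGG")   # this target does start with TA[CG]
```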

> Assuming that we would have a nice way of generating appropriate
> motifs, it would lead to simple code:
>
> m=Motif.from_IUPAC("TAN")
>
> or alternatively
>
> m=Motif.from_re("TA[CG]")
>
> s.startswith(m)
>
> Currently there are no methods from_IUPAC or from_re, but it should be
> fairly straightforward to implement them (if there is interest).

I think there is interest - although you might want to have
from_IUPAC_protein, from_IUPAC_DNA, from_IUPAC_RNA.  With just
m=Motif.from_IUPAC("TAN") it isn't clear whether that is protein or
DNA.  If Motif.from_IUPAC only took a Seq object with a relevant
alphabet, that would resolve the ambiguity, but it would not be as
easy to use.
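
As a sketch of what a DNA-specific constructor might do internally,
the following turns an IUPAC DNA string into a regular expression.
The helper name iupac_dna_to_regex is hypothetical (nothing like it is
claimed to exist in Bio.Motif), and a plain string stands in for the
sequence.  Because the DNA table is chosen explicitly, "TAN" can only
mean T, A, then any base:

```python
import re

# IUPAC ambiguous DNA codes mapped to the bases they stand for.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def iupac_dna_to_regex(motif):
    """Hypothetical helper: convert an IUPAC DNA string to a regex string."""
    parts = []
    for letter in motif.upper():
        bases = IUPAC_DNA[letter]  # raises KeyError for non-DNA letters
        # Unambiguous letters stay as-is; ambiguous ones become a character class.
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


regex_text = iupac_dna_to_regex("TAN")
dna_pattern = re.compile(regex_text)

ambiguous_hit = dna_pattern.match("TAGGCT") is not None  # N matches the G
literal_hit = "TAGGCT".startswith("TAN")                 # literal comparison
```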

> Writing the startswith method using a motif instance is very straightforward.

If you say so :)

> There is one caveat: implementing complex regexps with Bio.Motif might
> not be as efficient as using regexps directly, but again I could work
> on improving the Motif class.
>
> hope this helps

Let's have a look at this (after Biopython 1.50 is out).

Peter



