[Biopython-dev] [Bug 2779] Seq.count() docstring should note unexpected behaviour

bugzilla-daemon at portal.open-bio.org
Thu Mar 5 09:23:31 UTC 2009


http://bugzilla.open-bio.org/show_bug.cgi?id=2779





------- Comment #1 from lpritc at scri.sari.ac.uk  2009-03-05 04:23 EST -------
I think that's a good point about expected behaviour for count() in a
biological sequence.  Presumably, we all expect Seq('GGG').count('GG') to
find all overlapping matches and return the value 2, in order to make
intuitive 'biological' sense.  There are, after all, two 'GG's in that
sequence.  This doesn't correspond to string count() behaviour, or to
standard re module behaviour: both count only non-overlapping matches, so
they report 1 here.
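To make the mismatch concrete, here is a minimal sketch using a plain Python
string in place of a Seq object (that substitution is an assumption made for
illustration; the variable names are mine):

```python
import re

seq = "GGG"
motif = "GG"

# str.count() resumes searching after the end of each match,
# so overlapping occurrences are not counted.
str_count = seq.count(motif)

# re.findall() likewise consumes each match before searching again.
re_count = len(re.findall(motif, seq))

print(str_count, re_count)
```

Both counts come out as 1, not the 2 one might expect biologically.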

The obvious way round it, which I've used before, is to compile the search
string as a regular expression, and restart the search one symbol after the
start of the preceding match (if any):

>>> import re
>>> startpos = 0
>>> seq = 'GGGG'
>>> motif = 'GG'
>>> motif_re = re.compile(motif)
>>> matches = []
>>> while True:
...     m = motif_re.search(seq, startpos)
...     if m is None:
...         break
...     startpos = m.start() + 1
...     matches.append(m)
... 
>>> matches
[<_sre.SRE_Match object at 0x68f38>, <_sre.SRE_Match object at 0x96ac60>,
<_sre.SRE_Match object at 0x96a950>]
>>> [(m.start(), m.group()) for m in matches]
[(0, 'GG'), (1, 'GG'), (2, 'GG')]

This could probably be done more efficiently.  Is something like this already
implemented in Bio.Motif?
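
As a rough sketch of two possibly simpler alternatives (neither is part of
Biopython as far as this message establishes; count_overlapping and
overlapping_starts are hypothetical names, and plain strings stand in for Seq
objects): a str.find() loop avoids the regular expression entirely, and a
zero-width lookahead lets re.finditer() report overlapping start positions.

```python
import re


def count_overlapping(seq, motif):
    """Count occurrences of motif in seq, allowing overlaps (hypothetical helper)."""
    if not motif:
        raise ValueError("motif must be non-empty")
    count = 0
    pos = seq.find(motif)
    while pos != -1:
        count += 1
        # Restart one symbol after the start of the previous hit.
        pos = seq.find(motif, pos + 1)
    return count


def overlapping_starts(seq, motif):
    """Return the start index of every (possibly overlapping) occurrence."""
    # A lookahead matches without consuming characters, so finditer()
    # advances one position at a time. re.escape() guards against
    # motifs that contain regex metacharacters.
    pattern = re.compile("(?=%s)" % re.escape(motif))
    return [m.start() for m in pattern.finditer(seq)]


print(count_overlapping("GGG", "GG"))
print(overlapping_starts("GGGG", "GG"))
```

For 'GGG' the count is 2, and for 'GGGG' the start positions are 0, 1 and 2,
agreeing with the session above.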


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
