[BioPython] The count method of a Seq (or MutableSeq) object

Peter biopython at maubp.freeserve.co.uk
Thu Mar 5 13:26:10 UTC 2009


On Thu, Mar 5, 2009 at 1:11 PM, Noel O'Boyle <baoilleach at gmail.com> wrote:
> +1 for (b)
>
> Seq.count() should behave like a biological sequence.
>
> Here's an example in the wild of this type of analysis:
> http://www.computational-genomics.net/case_studies/haemophilus_demo.html#14
>
> It's from a bioinformatics textbook with example code in Matlab. I was
> helping a colleague who was trying to reproduce the analysis with
> BioPython. Everything was fine until the dimer frequencies were found
> to disagree. After implementing the count ourselves, we were able to
> reproduce the results. It was then we realised that BioPython was
> behaving in an unexpected and non-useful way.

I agree that in this context it is not useful to have the Seq object's
count method do a non-overlapping search.

However, calling it "unexpected" is debatable, and probably depends
on the user's background.  If you already knew Python before using
Biopython, I would argue that the non-overlapping search is expected
because that is what Python strings do.  On the other hand, I'm sure
many Biopython users learn Python and Biopython together - and one
might still argue that having strings and Seq objects do different
things is unexpected.
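
To illustrate the difference under discussion, here is a minimal sketch
using plain Python strings only (no Biopython calls).  The variable names
are just for illustration; the sliding-window count is one simple way to
get the overlapping result, not necessarily how anyone in this thread
implemented it.

```python
# A short sequence where the dimer "AA" can overlap with itself.
seq = "AAAA"

# str.count() resumes searching after the end of each match, so
# overlapping occurrences are skipped: matches at positions 0 and 2.
non_overlapping = seq.count("AA")

# A sliding window tests every start position, so matches at
# positions 0, 1 and 2 are all counted.
overlapping = sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "AA")

print(non_overlapping, overlapping)
```

For dimer frequency analysis it is the second, overlapping figure that
is normally wanted.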

Overall, between options (a) and (b), I'd pick consistency with the
Python string (a), even if it isn't ideal.

There is another idea; let's call it option (c).  Give the Seq
object's count method an optional boolean argument to enable an
overlapping search (which I would want to default to off, matching the
Python string behaviour).  This makes switching between string and Seq
objects easier, and makes the more useful (but probably slower)
overlap-aware count quite accessible and discoverable.
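
A rough sketch of what option (c) might look like, written as a
standalone function on plain strings rather than as a Seq method.  The
function name "count" and the keyword "overlap" are hypothetical,
chosen only for illustration; this is not the Biopython API.

```python
def count(sequence, sub, overlap=False):
    """Count occurrences of sub in sequence (hypothetical helper).

    With overlap=False (the default) this is exactly str.count().
    With overlap=True every start position is considered, so
    occurrences that overlap each other are all counted.
    """
    if not overlap:
        return sequence.count(sub)
    total = 0
    start = sequence.find(sub)
    while start != -1:
        total += 1
        # Resume one character after the start of the last match,
        # not after its end, so overlapping matches are found.
        start = sequence.find(sub, start + 1)
    return total


print(count("AAAA", "AA"))                # default: same as "AAAA".count("AA")
print(count("AAAA", "AA", overlap=True))  # overlap-aware count
```

Because the default matches str.count(), code written for strings would
keep working unchanged, and the overlap-aware behaviour is an explicit
opt-in.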

Peter


