[BioPython] The count method of a Seq (or MutableSeq) object

Leighton Pritchard lpritc at scri.ac.uk
Thu Mar 5 13:34:03 UTC 2009


Hi,

On 05/03/2009 12:26, "Peter" <biopython at maubp.freeserve.co.uk> wrote:

> We should either:
> 
> (a) stick with the python string compatible behaviour (which has been
> a general principle for the Seq class), but document this issue more
> clearly as a non-overlapping search does run counter to some potential
> biological uses.
> 
> or,
> 
> (b) Or change the behaviour as Leighton suggests to do an overlapping
> search.  This could break any code relying on the old python
> string-like behaviour.
> 
> What do people here think?  Any preferences?

Not surprisingly, I favour (b).

The intended domain of use for Seq is as a proxy for a biological entity and
I think that, just as we extend methods to reflect useful
biologically-themed operations, we should also override methods as
appropriate to reflect those same themes.

I can think of a number of run-of-the-mill use cases where we would want to
know about the count of (potentially) overlapping matches of a subsequence
in a biological sequence, for short sequence repeats (SSRs), restriction
sites, protein sequence motifs, and so on.  Also, if we want simply to test
the expected number of occurrences of the dimer 'AA' in a larger sequence
with a given base composition, a non-overlapping count() method will give a
misleading answer, as it will underreport occurrences of 'AA' in any run of
three or more consecutive 'A's.  I think that the overlapping approach (b) should
at least be a default setting, even if we choose to make overlap/non-overlap
an argument to the method.
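
To make the difference concrete, here is a minimal sketch on plain Python
strings rather than Seq objects.  The helper name count_subseq and its
overlap argument are hypothetical, chosen only for illustration; they are not
part of Biopython's API.

```python
def count_subseq(seq, sub, overlap=True):
    """Count occurrences of sub in seq (hypothetical helper, plain strings)."""
    if not sub:
        raise ValueError("sub must be non-empty")
    count = 0
    start = seq.find(sub)
    while start != -1:
        count += 1
        # Step one position for overlapping matches, or past the whole
        # match for non-overlapping ones (the str.count() behaviour).
        start = seq.find(sub, start + (1 if overlap else len(sub)))
    return count


run = "AAAA"
print(count_subseq(run, "AA"))                 # 3
print(count_subseq(run, "AA", overlap=False))  # 2
print(run.count("AA"))                         # 2
```

With overlap defaulting to True, the dimer count on a run of four 'A's is 3,
whereas the string-like count reports 2.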

For some searches that potentially could have overlaps we might want to know
what biological question is being asked before choosing which approach to
take.  We may, for example, desire different behaviour from query sequences
like 'AGCCAG' depending on circumstances.  This query on 'AGCCAGCCAG' will
return 1 if no overlap is allowed, and 2 if an overlap is allowed.
The same query on 'AGCCAGAGCCAG' will return 2 in both cases.  If we care
about 'AGCCAG' as a restriction site, then we would want an overlapping
search.  If we care about 'AGCCAG' as a simple repeat unit, then we might
want a non-overlapping search instead (assuming that the circumstances of
the search are such that this is a sensible answer).  Having the option
might be useful.
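
The two 'AGCCAG' queries above can be checked with a short sketch, again on
plain strings.  The function overlapping_count is a hypothetical name; it
uses a zero-width regular-expression lookahead as one possible way to count
overlapping matches, and str.count() supplies the non-overlapping figure.

```python
import re


def overlapping_count(seq, motif):
    """Count motif occurrences in seq, including overlapping ones."""
    # A zero-width lookahead matches at every start position of motif,
    # so matches that share characters are all counted.
    return len(re.findall("(?=%s)" % re.escape(motif), seq))


motif = "AGCCAG"
for target in ("AGCCAGCCAG", "AGCCAGAGCCAG"):
    # Columns: target, non-overlapping count, overlapping count
    print(target, target.count(motif), overlapping_count(target, motif))
# AGCCAGCCAG 1 2
# AGCCAGAGCCAG 2 2
```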

A non-overlapping search might also be useful in those cases where existing
code already corrects for nonintuitive behaviour of count().  This is only
going to apply to code that has been produced since release 1.45, so may
only have limited impact, if any.  I would argue that, since a correction
was needed, by parsimony the original behaviour was probably what required
the change.

On the whole, I think that an overlapping count() is the most intuitive and
most likely use case.  I see that there's an argument for consistency with
string.count(), in that dyed-in-the-wool programmers might find it hard to
shift mental gears from one to the other, but I'm not sure that it's a good
argument, for the following reason.

The following statements are true:

A String is a Python sequence type.  Its count() method returns a
non-overlapping count of the query substring.

A List is a Python sequence type.  Its count() method returns the number of
elements that match the query.

A Tuple is a Python sequence type.  Before Python 2.6 it doesn't have a
count() method, although you might imagine that it could stand to have one.

There isn't any cross-sequence object consistency regarding count().  Should
we choose String-like or List-like behaviour when dealing with a MutableSeq?
I don't think that we should seek consistency with String at the expense of
utility or biological intuition, when:

A Seq/MutableSeq is a (Bio)Python sequence type.  Its count() method returns
the overlapping count of the query substring.

Fits nicely with the other three statements, in that none of them are
consistent with any other ;)
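
The String and List statements are easy to check in an interpreter.  This is
only an illustration of the built-in types; the variable names are arbitrary.

```python
text = "AAAA"
as_list = list(text)  # ['A', 'A', 'A', 'A']

print(text.count("AA"))     # 2: non-overlapping substring matches
print(as_list.count("A"))   # 4: elements equal to "A"
print(as_list.count("AA"))  # 0: no single element equals "AA"
```

The same query, 'AA', gives 2 on the string and 0 on the list, so "count()"
already means different things across the built-in sequence types.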

L.


-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405




