[BioPython] The count method of a Seq (or MutableSeq) object

Thu Mar 5 16:28:11 UTC 2009

Hi,
This is a little deja vu, as I feel this type of thing has come up
before. I cannot speak for anyone else, but if I sound different from
what I said then, I was obviously convinced by those arguments, as
that sounds better than "I forgot" :-)

More seriously, ignoring the reading frame or the genetic code when
counting is rather bad form!

I can not think of a relevant case involving a protein sequence -
although counting pairs of cysteines in insulin-like sequences could
be a situation of importance (related to disulphide bonds).

An example for nucleic acid sequences: counting 'TTT' in the made-up
sequence 'TTTTTTTGG' gives two in frames 1 and 2 but only one in
frame 3.
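
To make the frame dependence concrete, here is a minimal sketch using a
plain Python string rather than a Seq object. The helper name
count_in_frame is hypothetical (it is not part of Biopython); it simply
steps through the sequence one codon at a time from the chosen frame.

```python
def count_in_frame(seq, codon, frame):
    """Count `codon` in `seq` when read in reading frame 1, 2 or 3.

    Hypothetical helper for illustration only, not a Biopython function.
    """
    size = len(codon)
    start = frame - 1  # frame 1 starts at index 0
    # Step through the sequence one whole codon at a time.
    return sum(
        1
        for i in range(start, len(seq) - size + 1, size)
        if seq[i:i + size] == codon
    )


seq = "TTTTTTTGG"
for frame in (1, 2, 3):
    print(frame, count_in_frame(seq, "TTT", frame))
# prints:
# 1 2
# 2 2
# 3 1
```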

Also, a weaker concern: a sum of counts that is greater than or equal
to the length of the sequence is not a desirable property unless the
user is informed that duplicates were found.
In the above case, five sounds rather wrong when one says that a DNA
sequence of nine bases can produce five phenylalanines (TTT codes for
Phe)!

Yes, context is everything, because three different results are not nice.

Don't get me wrong, I know that finding duplicates is important, just
that it should not be done here - there must be different functions.
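
As a rough sketch of what "different functions" could look like, again
on plain Python strings: both function names below are hypothetical and
only illustrate the split, they are not existing Biopython methods.

```python
def count_nonoverlapping(seq, sub):
    """Python-style count: matches may not share characters."""
    return seq.count(sub)


def count_overlapping(seq, sub):
    """Count every position at which `sub` starts, overlaps included."""
    count = 0
    start = seq.find(sub)
    while start != -1:
        count += 1
        # Resume one character later so overlapping matches are found.
        start = seq.find(sub, start + 1)
    return count


print(count_nonoverlapping("AAAA", "AA"))  # 2
print(count_overlapping("AAAA", "AA"))     # 3
```

Keeping the two separate means the default stays identical to the
Python string method, and anyone who wants duplicates has to ask for
them by name.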

Thus, I vote for (a), and I also prefer that the default syntax be
consistent with the Python language.

If this change is made, then all of Biopython must be revised to be
consistent - reading frames and similar discussions included...

Bruce

On Thu, Mar 5, 2009 at 9:23 AM, Noel O'Boyle <baoilleach at gmail.com> wrote:
> 2009/3/5 Peter <biopython at maubp.freeserve.co.uk>:
>> On Thu, Mar 5, 2009 at 2:49 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>>>
>>>
>>> I vote (b).
>>> Another option is to continue to use count() for a Python-style count,
>>> and to add a new method that does an overlapping-type count. For this
>>> new method we'd need a clear but short name, and I can't think of
>>> anything now.
>>>
>>> --Michiel.
>>
>> Did you like plan (c), which preserves the Python string style count
>> as the default but offers the overlapping count via an optional
>> argument?
>>
>> i.e.
>> >>> from Bio.Seq import Seq
>> >>> nuc = Seq("AAAA")
>> >>> nuc.count("AA")  # default is non-overlapping
>> 2
>> >>> nuc.count("AA", overlap=True)
>> 3
>> >>> nuc.count("AA", overlap=False)
>> 2
>>
>> Peter
>
> I think we are arguing here over which should be the default value.
>
> Several people here believe that behaviour analogous to Python's
> string.count will reduce bug reports and user confusion. However,
> no-one except Leighton has been able to come up with a single use case
> where the current behaviour is useful (and even that example, with
> respect, was flimsy). So we end up with a method which adheres
> magnificently to the principle of least surprise, but which is of no
> use to users. Aren't you trying to provide methods which are useful
> for biological analysis? Isn't that the purpose of wrapping the string
> in the first place?
>
> Noel (getting far too excited over painting this bikeshed)