[BioPython] Question about Seq.count()

Thu Oct 18 10:22:22 UTC 2007

Jimmy Musselwhite wrote:
> Just kidding, it didn't work great. It only "fixed" it because I was
> printing out the output of count() and so it was just executing 100 times
> slower and thus eating RAM 100 times slower :(
> 
> It doesn't seem like there is a good way for me to fix this.

Both of these use Python's string count method to count "GG". The only 
difference is that the tostring() method has the additional small 
overhead of an extra function call:

my_seq.data.count("GG")
my_seq.tostring().count("GG")

However, comparing these:

my_seq.data.count("G")         # using python's string count method
my_seq.tostring().count("G")   # using python's string count method
my_seq.count("G")              # using an iterator internally

It could be that the Seq object's current single-letter search is simply 
very memory efficient compared to the Python string's more flexible 
multi-letter search.

How are you measuring the RAM?  I'd like to see memory usage figures for 
the five simple examples above (the two "GG" counts and the three "G" 
counts) on a large sequence, plus the same counts done directly on the 
equivalent string.
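
For the plain-string side of that comparison, something like the rough
sketch below could produce the figures. It assumes Python 3.4 or later,
where the standard library's tracemalloc module is available. The names
peak_memory and count_by_iteration are invented for this example, and
count_by_iteration is only a stand-in for an iterator-style count, not
Biopython's actual code. Note that tracemalloc reports memory allocated
by Python while tracing is on, not the total RAM used by the process.

```python
import tracemalloc


def peak_memory(func, *args):
    """Run func(*args) and return (result, peak bytes allocated meanwhile)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def count_by_iteration(sequence, letter):
    """Hypothetical iterator-style count: visit one character at a time."""
    return sum(1 for c in sequence if c == letter)


# Build the test string before tracing starts, so only the counting
# itself is measured. 500,000 letters; adjust to taste.
big = "ACGGT" * 100_000

results = {
    "str_count_G": peak_memory(big.count, "G"),
    "str_count_GG": peak_memory(big.count, "GG"),
    "iter_count_G": peak_memory(count_by_iteration, big, "G"),
}

for label, (count, peak) in results.items():
    print("%-14s count=%d  peak=%d bytes" % (label, count, peak))
```

Passing my_seq.count, my_seq.data.count and so on to the same helper
would give numbers that can be compared directly with these.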

Are you using Linux, Windows or Mac OS, and what version of Python?  I 
know there have been some string optimisations in Python 2.5 (although I 
don't know if any are relevant to the count method).

Peter