[BioPython] Question about Seq.count()

Fri Oct 19 13:38:50 UTC 2007

2007/10/18, Peter <biopython at maubp.freeserve.co.uk>:
>  >>> from Bio.Seq import Seq
>  >>> my_seq = Seq("AAACACACGGTTTT")
>  >>> my_seq.count("G")
> 2
>  >>> my_seq.count("GG")
> 0

I've found the bug!

The code for Bio.Seq.count is:

                                         def count(self, item):
        return len([x for x in self.data if x == item])

it does not work for patterns of two nucleotides, because '[x for x in
self.data]' reiterates on a list of strings of one letter each:

>>> s = Seq( 'ACTTgGCATYCGgtGACGACTGGGcATCGGTCAGTCGGTTT')
>>> [x for x in s.data]
['A', 'C', 'T', 'T', 'g', 'G', 'C', 'A', 'T', 'Y', 'C', 'G', 'g', 't',
'G', 'A', 'C', 'G', 'A', 'C', 'T', 'G', 'G', 'G', 'c', 'A', 'T', 'C',
'G', 'G', 'T', 'C', 'A', 'G', 'T', 'C', 'G', 'G', 'T', 'T', 'T']
>>> for x in s.data:
>>>     print x, 'GG', x == 'GG'
(always false)

Something like [len('GG' in s.data)] also won't work, because "'GG' in
s.data" returns a Boolean value:
>>> 'GG' in s.data
True

What about using regular expressions instead?

>>> import re
>>> r = re.compile('GG')
>>> count = len(r.findall(my_seq.data))

They don't seem to be too different as for the execution time:

# for i in $( seq 10); do time python -m re -c '"cdasd".count("cc")';
done 2>&1| grep real
real    0m0.091s
real    0m0.106s
real    0m0.081s
real    0m0.110s
real    0m0.076s
real    0m0.109s
real    0m0.109s
real    0m0.062s
real    0m0.110s
real    0m0.062s

# for i in $(seq 10); do time python -m re -c 'len(re.findall("cc",
"cdasd"))'; done 2>&1|grep real
real    0m0.065s
real    0m0.108s
real    0m0.079s
real    0m0.082s
real    0m0.111s
real    0m0.113s
real    0m0.110s
real    0m0.112s
real    0m0.112s
real    0m0.111s

Compiling a short pattern with the re module shouldn't take too much
time and maybe in future implementations, it will allows us to do more
interesting things: for example, we will be able to add an
'ignorecase' parameter to Seq.count:

>>> Bio.Seq('ACAGtcAGgCATGCGG').count('GG', 'ignorecase')
2
>>> Bio.Seq('ACAGtcAGgCATGCGG').count('GG')
1

What do you think?

Cheers,
Giovanni

-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://dalloliogm.wordpress.com