[BioPython] Question about Seq.count()

Jimmy Musselwhite jimmy.musselwhite at gmail.com
Wed Oct 17 21:20:41 UTC 2007


Hello all
I have a script that runs through a list of about 250,000 sequence
records and counts the occurrences of substrings 3-5 nucleotides in
length.

Here is some example code

search = 'ATTCG'

from Bio import SeqIO

# use SeqIO to get a big list of records
sequences = list(SeqIO.parse(file, "fasta"))

for record in sequences :

Now the code I want to write inside that loop is
record.seq.count(search)

but what I am forced to do is
record.seq.tostring().count(search)

The problem is that calling .tostring() on every single seq object
devastates my memory usage in a BIG way. It eats up about 1.2 GB and then
crashes. If I remove the .tostring() and just tell it to search for 'A',
it runs fine and uses memory at about 1/100th the rate.

So my question boils down to: is there any way to make .count() search
for strings and not just single characters? Otherwise my work is going to
grind to a halt here.
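
For illustration, here is a minimal sketch of one possible workaround,
assuming the records do not need to stay in memory once they have been
counted. It is plain Python and does not use Biopython at all: iter_fasta
and count_motif are hypothetical helper names, and the parser only handles
simple FASTA text. The idea is to stream one record at a time rather than
building a list of all of them, and to count on an ordinary string, whose
count() method accepts substrings (it counts non-overlapping matches).

```python
from typing import Iterable, Iterator, Tuple


def iter_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs one record at a time.

    Hypothetical helper: a bare-bones FASTA reader, not Biopython's parser.
    """
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)


def count_motif(lines: Iterable[str], motif: str) -> int:
    """Sum the non-overlapping occurrences of motif over all records."""
    return sum(seq.count(motif) for _, seq in iter_fasta(lines))


# Tiny in-memory example; a real run would pass an open file handle instead.
example = [">r1", "ATTCGATTCG", "AATT", ">r2", "CCATTCGG"]
total = count_motif(example, "ATTCG")
```

With a real file the same function can take the open handle directly, for
example count_motif(open(path), search), where path is a placeholder for the
FASTA file name. Only one record's sequence is held as a string at any time.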

Thanks!


