[Biopython] Indexing large sequence files
Cedar McKay
cmckay at u.washington.edu
Thu Jun 18 14:54:44 EDT 2009
> Can you assume the records in the two files are in the same order? That
> would allow an iterative approach - making a single pass over both files,
> calling the .next() methods explicitly to keep things in sync.
I can't assume order.
> Are you looking for matches based on their identifier? If you can afford
> to have two python sets in memory with 5 million keys, then can you do
> something like this?:
I don't have a good sense of whether I can keep two sets of 5 million
keys each in memory in Python. I haven't tried it before.
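One rough way to answer that question empirically (not from this thread; a stdlib sketch under the assumption that ids are short strings, here 20 characters) is to measure a small sample set with sys.getsizeof and extrapolate:

```python
import sys

def estimate_set_memory(n_keys, key_len=20, sample=10000):
    """Rough estimate of the bytes needed for a set of n_keys short strings.

    Builds a small sample set, measures the set's own table plus the
    string objects it holds, and scales the per-key cost up to n_keys.
    """
    # Zero-padded ids of the assumed length (key_len includes the "id_" prefix).
    keys = set("id_%0*d" % (key_len - 3, i) for i in range(sample))
    per_key = (sys.getsizeof(keys)
               + sum(sys.getsizeof(k) for k in keys)) / float(sample)
    return int(per_key * n_keys)

# Ballpark for two sets of 5 million ids each.
total = 2 * estimate_set_memory(5_000_000)
print("approx. %.1f GB" % (total / 1e9))
```

On a typical CPython build this lands around a gigabyte for the two sets combined, which is tight but plausible on a modern workstation; actual id lengths change the figure.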
> #Untested. Using generator expressions so that we don't keep all
> #the record objects in memory at once - just their identifiers
> keys1 = set(rec.id for rec in SeqIO.parse(open(file1), "fasta"))
> common = set(rec.id for rec in SeqIO.parse(open(file2), "fasta") if rec.id in keys1)
> del keys1 #free memory
> #Now loop over the files a second time, extracting what you need.
> #(I'm not 100% clear on what you want to output)
I'll think about this approach more.
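For reference, the two-pass idea quoted above can be written with only the standard library, so the FASTA handling is explicit; with Biopython you would use SeqIO.parse exactly as in the quoted snippet. The helper names and file arguments here are illustrative, not from the thread:

```python
def fasta_ids(path):
    """Yield the identifier (first word after '>') of each FASTA record."""
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                yield line[1:].split(None, 1)[0]

def common_ids(file1, file2):
    """Return the set of record ids present in both files (two passes)."""
    # Pass 1: collect ids from the first file; only the ids are kept,
    # never the sequences themselves.
    keys1 = set(fasta_ids(file1))
    # Pass 2: stream the second file and keep ids seen in the first.
    return set(i for i in fasta_ids(file2) if i in keys1)
```

Because both passes stream the files record by record, peak memory is proportional to the number of ids, not to the total sequence data.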
> Not unexpectedly I would say. Was the documentation or tutorial
> misleading? I thought it was quite explicit about the fact SeqIO.to_dict
> built an in memory dictionary.
The docs were not misleading. I simply don't have a good gut sense of
what is and isn't reasonable with Python/Biopython. I have written
scripts expecting them to take minutes and had them run in seconds,
and the other way around too. I was aware that putting 5 million FASTA
records into memory might not work, but I thought it was worth a try.
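Since SeqIO.to_dict holds every parsed record in memory, a lighter alternative for a file this size is to index only each record's byte offset and seek back for the few records actually needed. This is a hand-rolled stdlib sketch of that idea; the function names are mine, not Biopython's:

```python
def index_fasta_offsets(path):
    """Map each FASTA record id to its byte offset in the file."""
    offsets = {}
    with open(path, "rb") as handle:
        pos = handle.tell()
        line = handle.readline()
        while line:
            if line.startswith(b">"):
                rec_id = line[1:].split(None, 1)[0].decode()
                offsets[rec_id] = pos
            pos = handle.tell()
            line = handle.readline()
    return offsets

def fetch_record(path, offsets, rec_id):
    """Read back one record's raw text using its stored offset."""
    with open(path, "rb") as handle:
        handle.seek(offsets[rec_id])
        lines = [handle.readline()]  # the '>' header line
        line = handle.readline()
        while line and not line.startswith(b">"):
            lines.append(line)
            line = handle.readline()
    return b"".join(lines).decode()
```

The dictionary then costs only an id string and an integer per record, so 5 million entries stay well under the footprint of the full parsed records.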
Thanks again for all your personal attention and help.
best,
Cedar