[Biopython] Indexing large sequence files
Cedar McKay
cmckay at u.washington.edu
Thu Jun 18 14:54:44 EDT 2009
> Can you assume the records in the two files are in the same order? That
> would allow an iterative approach - making a single pass over both files,
> calling the .next() methods explicitly to keep things in sync.
I can't assume order.
> Are you looking for matches based on their identifier? If you can afford
> to have two python sets in memory with 5 million keys, then can you do
> something like this?:
I don't have a good sense of whether I can keep two sets of 5 million
keys each in memory in Python. I haven't tried it before.
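One rough way to answer that question empirically (not from this thread; a stdlib sketch under the assumption that ids are short strings, here 20 characters) is to measure a small sample set with sys.getsizeof and extrapolate:

```python
import sys

def estimate_set_memory(n_keys, key_len=20, sample=10000):
    """Rough estimate of the bytes needed for a set of n_keys short strings.

    Builds a small sample set, measures the set's own table plus the
    string objects it holds, and scales the per-key cost up to n_keys.
    """
    # Zero-padded ids of the assumed length (key_len includes the "id_" prefix).
    keys = set("id_%0*d" % (key_len - 3, i) for i in range(sample))
    per_key = (sys.getsizeof(keys)
               + sum(sys.getsizeof(k) for k in keys)) / float(sample)
    return int(per_key * n_keys)

# Ballpark for two sets of 5 million ids each.
total = 2 * estimate_set_memory(5_000_000)
print("approx. %.1f GB" % (total / 1e9))
```

On a typical CPython build this lands around a gigabyte for the two sets combined, which is tight but plausible on a modern workstation; actual id lengths change the figure.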
> #Untested. Using generator expressions so that we don't keep all
> #the record objects in memory at once - just their identifiers
> keys1 = set(rec.id for rec in SeqIO.parse(open(file1), "fasta"))
> common = set(rec.id for rec in SeqIO.parse(open(file2), "fasta") if rec.id in keys1)
> del keys1 #free memory
> #Now loop over the files a second time, extracting what you need.
> #(I'm not 100% clear on what you want to output)
I'll think about this approach more.
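For reference, the two-pass idea quoted above can be written with only the standard library, so the FASTA handling is explicit; with Biopython you would use SeqIO.parse exactly as in the quoted snippet. The helper names and file arguments here are illustrative, not from the thread:

```python
def fasta_ids(path):
    """Yield the identifier (first word after '>') of each FASTA record."""
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                yield line[1:].split(None, 1)[0]

def common_ids(file1, file2):
    """Return the set of record ids present in both files (two passes)."""
    # Pass 1: collect ids from the first file; only the ids are kept,
    # never the sequences themselves.
    keys1 = set(fasta_ids(file1))
    # Pass 2: stream the second file and keep ids seen in the first.
    return set(i for i in fasta_ids(file2) if i in keys1)
```

Because both passes stream the files record by record, peak memory is proportional to the number of ids, not to the total sequence data.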
> Not unexpectedly I would say. Was the documentation or tutorial
> misleading? I thought it was quite explicit about the fact SeqIO.to_dict
> built an in memory dictionary.
The docs were not misleading. I simply don't have a good gut sense of
what is and isn't reasonable with Python/Biopython. I have written
scripts expecting them to take minutes and had them run in seconds,
and the other way around too. I was aware that putting 5 million FASTA
records into memory might not work, but I thought it was worth a try.
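Since SeqIO.to_dict holds every parsed record in memory, a lighter alternative for a file this size is to index only each record's byte offset and seek back for the few records actually needed. This is a hand-rolled stdlib sketch of that idea; the function names are mine, not Biopython's:

```python
def index_fasta_offsets(path):
    """Map each FASTA record id to its byte offset in the file."""
    offsets = {}
    with open(path, "rb") as handle:
        pos = handle.tell()
        line = handle.readline()
        while line:
            if line.startswith(b">"):
                rec_id = line[1:].split(None, 1)[0].decode()
                offsets[rec_id] = pos
            pos = handle.tell()
            line = handle.readline()
    return offsets

def fetch_record(path, offsets, rec_id):
    """Read back one record's raw text using its stored offset."""
    with open(path, "rb") as handle:
        handle.seek(offsets[rec_id])
        lines = [handle.readline()]  # the '>' header line
        line = handle.readline()
        while line and not line.startswith(b">"):
            lines.append(line)
            line = handle.readline()
    return b"".join(lines).decode()
```

The dictionary then costs only an id string and an integer per record, so 5 million entries stay well under the footprint of the full parsed records.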
Thanks again for all your personal attention and help.
best,
Cedar