[Biopython] Indexing large sequence files

Peter biopython at maubp.freeserve.co.uk
Thu Jun 18 18:00:25 UTC 2009


Hi Cedar,

I'm assuming you left the mailing list off the CC in your reply by accident.

On Thu, Jun 18, 2009 at 6:35 PM, Cedar McKay <cmckay at u.washington.edu> wrote:
>
>> Do you only care about FASTA files? Might you also want to index
>> say a UniProt/SwissProt file, a large GenBank file, or a big FASTQ
>> file?
>
> Right now I only need it for FASTA, but I can easily imagine wanting to do
> something similar with FASTQ quite soon. I understand that indexing
> interleaved file formats is much more difficult, but I think it would be
> useful and adequate if SeqIO allowed indexing of any serial file format.

OK.

>> Presumably you need random access to the file (and can't simply use
>> a for loop to process it record by record).
>
> I do, unless someone can think of something clever. My problem is this:
>
> I have two files, each with 5 million FASTA sequences. Most sequences (but
> not all!) in file A have a "mate" in file B (and vice versa). My current
> approach is to iterate over file A using SeqIO.parse, then, record by
> record, look up (using the dictionary-like indexed file that we are currently
> discussing) whether the "mate" sequence exists in file B. If it does exist,
> write the pair of sequences (from both A and B) together into file C.

Can you assume the records in the two files are in the same order? That
would allow an iterative approach - making a single pass over both files,
calling the .next() methods explicitly to keep things in sync.
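
For example, something along these lines (untested, and assuming for the
sake of the sketch that the two files really do pair up one-to-one in the
same order - any missing mates would need extra handling - and the
filenames are just placeholders):

from Bio import SeqIO

iter_a = SeqIO.parse(open("fileA.fasta"), "fasta")
iter_b = SeqIO.parse(open("fileB.fasta"), "fasta")
out_handle = open("fileC.fasta", "w")
for rec_a in iter_a:
    #Advance file B in step with file A
    rec_b = iter_b.next()
    #Replace this test with whatever defines a "mate" for you
    if rec_a.id == rec_b.id:
        SeqIO.write([rec_a, rec_b], out_handle, "fasta")
out_handle.close()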

Are you looking for matches based on their identifier? If you can afford
to have two Python sets in memory with 5 million keys, then you could do
something like this:

#Untested. Using generator expressions so that we don't keep all
#the record objects in memory at once - just their identifiers
#(file1 and file2 are the paths to your two FASTA files)
from Bio import SeqIO
keys1 = set(rec.id for rec in SeqIO.parse(open(file1), "fasta"))
common = set(rec.id for rec in SeqIO.parse(open(file2), "fasta")
             if rec.id in keys1)
del keys1 #free memory
#Now loop over the files a second time, extracting what you need
#(I'm not 100% clear on what you want to output).
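
For that second pass, maybe something like this (again untested, and since
I'm not sure what file C should contain, this just writes out every record
from either file whose identifier is in the common set - getting each pair
of mates adjacent in the output would need a bit more work, e.g. the
same-order trick above or a proper index):

out_handle = open("fileC.fasta", "w")
for filename in (file1, file2):
    for rec in SeqIO.parse(open(filename), "fasta"):
        if rec.id in common:
            SeqIO.write([rec], out_handle, "fasta")
out_handle.close()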

>> Do you care about the time taken to build the index, the time to access
>> a record, or both?
>
> Truly, I'm not very performance sensitive at this time. I'm simply trying to
> process some files, one way or the other, and the current SeqIO.to_dict
> method just dies altogether on such big files.

Not unexpectedly, I would say. Was the documentation or tutorial
misleading? I thought it was quite explicit about the fact that
SeqIO.to_dict builds an in-memory dictionary.

>> Do you expect to actually use most of the records, or just a small
>> fraction?
>
> I use nearly all records.

In that case, the pickle idea I was exploring back in 2007 should work
fine. We would incrementally parse all the records as SeqRecord objects,
pickle them, and store them in an index file. You pay a high cost up front
(object construction and pickling), but access should be very fast. I'll have
to see if I can find my old code... or read up on the Python shelve module
before deciding whether using that directly would be more sensible.
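
As a very rough sketch of the shelve idea (untested, with made-up
filenames, and assuming SeqRecord objects pickle cleanly):

import shelve
from Bio import SeqIO

#Build the index once: each SeqRecord is pickled to disk, keyed by its id
index = shelve.open("fileB.idx")
for rec in SeqIO.parse(open("fileB.fasta"), "fasta"):
    index[rec.id] = rec
index.close()

#Later (even in another script), reopen the shelf and look up records
index = shelve.open("fileB.idx")
if "some_example_id" in index:
    record = index["some_example_id"] #an unpickled SeqRecord
index.close()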

Peter


