[Biopython-dev] Indexing (large) sequence files with Bio.SeqIO

Thu Aug 20 09:58:03 EDT 2009

I just have two suggestions:

Since indexed_dict returns a dictionary-like object, it may make sense for the _IndexedSeqFileDict to inherit from a dict.

Another issue is whether we can fold indexed_dict and to_dict into one.
Right now we have 

def to_dict(sequences, key_function=None) :

def indexed_dict(filename, format, alphabet=None) :

What if we have a single function "dictionary" that can take sequences, a handle, or a filename, and optionally the format, alphabet, key_function, and a parameter "indexed" that indicates if the file should be indexed or kept into memory? Or something like that.

Otherwise, the code looks really nice. Thanks!

--Michiel

--- On Thu, 8/20/09, Peter <biopython at maubp.freeserve.co.uk> wrote:

> From: Peter <biopython at maubp.freeserve.co.uk>
> Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
> To: "Biopython-Dev Mailing List" <biopython-dev at lists.open-bio.org>
> Cc: "Cedar McKay" <cmckay at u.washington.edu>
> Date: Thursday, August 20, 2009, 7:28 AM
> Hi all,
> 
> You may recall a thread back in June with Cedar Mckay (cc'd
> - not
> sure if he follows the dev list or not) about indexing
> large sequence
> files - specifically FASTA files but any sequential file
> format. I posted
> some rough code which did this building on Bio.SeqIO:
> http://lists.open-bio.org/pipermail/biopython/2009-June/005281.html
> 
> I have since generalised this, and have something which I
> think
> would be ready for merging into the main trunk for wider
> testing.
> The code is on github on my "index" branch at the moment,
> http://github.com/peterjc/biopython/commits/index
> 
> This would add a new function to Bio.SeqIO, provisionally
> called
> indexed_dict, which takes two arguments: filename and
> format
> name (e.g. "fasta"), plus an optional alphabet. This will
> return a
> dictionary like object, using SeqRecord identifiers as
> keys, and
> SeqRecord objects as values. There is (deliberately) no way
> to
> allow the user to choose a different keying mechanism
> (although
> I can see how to do this at a severe performance cost).
> 
> As with the Bio.SeqIO.convert() function, the new addition
> of
> Bio.SeqIO.indexed_dict() will be the only public API
> change.
> Everything else is deliberately private, allowing us the
> freedom
> to change details if required in future.
> 
> The essential idea is the same my post in June. Nothing
> about
> the existing SeqIO framework is changed (so this won't
> break
> anything). For each file we scan though it looking for new
> records,
> not the file offset, and extract the identifier string.
> These are stored
> in a normal (private) Python dictionary. On requesting a
> record, we
> seek to the appropriate offset and parse the data into a
> SeqRecord.
> For simple file formats we can do this by calling
> Bio.SeqIO.parse().
> 
> For complex file formats (such as SFF files, or anything
> else with
> important information in a header), the implementation is a
> little
> more complicated - but we can provide the same API to the
> user.
> 
> Note that the indexing step does not fully parse the file,
> and
> thus may ignore corrupt/invalid records. Only when (if)
> they are
> accessed will this trigger a parser error. This is a shame,
> but
> means the indexing can (in general) be done very fast.
> 
> I am proposing to merge all of this (except the SFF file
> support),
> but would welcome feedback (even after a merger). I
> already
> have basic unit tests, covering the following SeqIO file
> formats:
> "ace", "embl", "fasta", "fastq" (all three variants),
> "genbank"/"gb",
> "ig", "phd", "pir", and "swiss" (plus "sff" but I don't
> think that
> parser is ready to be checked in yet).
> 
> An example using the new code, this takes just a few
> seconds
> to index this 238MB GenBank file, and record access is
> almost
> instant:
> 
> >>> from Bio import SeqIO
> >>> gb_dict = SeqIO.indexed_dict("gbpln1.seq",
> "gb")
> >>> len(gb_dict)
> 59918
> >>> gb_dict.keys()[:5]
> ['AB246540.1', 'AB038764.1', 'AB197776.1', 'AB036027.1',
> 'AB161026.1']
> >>> record = gb_dict["AB433451.1"]
> >>> print record.id, len(record),
> len(record.features)
> AB433451.1 590 2
> 
> And using a 1.3GB FASTQ file, indexing is about a minute,
> and
> again, record access is almost instant:
> 
> >>> from Bio import SeqIO
> >>> fq_dict =
> SeqIO.indexed_dict("SRR001666_1.fastq", "fastq")
> >>> len(fq_dict)
> 7047668
> >>> fq_dict.keys()[:4]
> ['SRR001666.2320093', 'SRR001666.2320092',
> 'SRR001666.1250635',
> 'SRR001666.2354360']
> >>> record = fq_dict["SRR001666.2765432"]
> >>> print record.id, record.seq
> SRR001666.2765432 CTGGCGGCGGTGCTGGAAGGACTGACCCGCGGCATC
> 
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>