[Biopython-dev] SeqIO.InterlacedSequenceIterator

Thu Dec 13 16:29:19 UTC 2012

Hi!

I work a lot with FASTA files. They can be large (>50 GB) and contain many sequences (>40,000,000). Often I need to get a single sequence from the file. With a flat FASTA file this requires parsing, on average, half of the file before finding it. I would like to write something that solves this problem, and rather than starting a new repository, I thought I could contribute to Biopython.

As I just wrote, the iterator-based approach to parsing sequence files has its limits. I was thinking of something that is indexed, and not a hack like the one I sometimes see, where a second ".fai" file is added next to the ".fa" file. The natural thing to do is to put these entries in an SQLite file. The case for such solutions is well made here: http://defindit.com/readme_files/sqlite_for_data.html
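
To make the idea concrete, here is a rough sketch using only Python's standard sqlite3 module. The table name "sequences", its columns, and the helpers parse_fasta and fasta_to_sqlite are hypothetical names chosen for illustration; none of them exist in Biopython, and a real implementation would probably store SeqRecord-compatible fields instead.

```python
import sqlite3


def _entry(header, chunks):
    # The id is taken to be the first whitespace-separated token of the header.
    parts = header.split(None, 1)
    seq_id = parts[0] if parts else ""
    return seq_id, header, "".join(chunks)


def parse_fasta(lines):
    """Yield (id, description, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _entry(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _entry(header, chunks)


def fasta_to_sqlite(lines, db_path):
    """Load FASTA entries into a (hypothetical) 'sequences' table and return the connection."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sequences "
        "(id TEXT PRIMARY KEY, description TEXT, seq TEXT, length INTEGER)"
    )
    # 'with conn' wraps the inserts in one transaction and commits at the end.
    with conn:
        conn.executemany(
            "INSERT INTO sequences VALUES (?, ?, ?, ?)",
            ((i, d, s, len(s)) for i, d, s in parse_fasta(lines)),
        )
    return conn
```

Because id is the primary key, SQLite maintains an index on it, so a lookup such as SELECT seq FROM sequences WHERE id = ? does not need to scan the whole table. An open file handle can be passed as the lines argument, since it iterates line by line.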

Now I looked into the Biopython source code, and it seems everything is based on returning a generator object which essentially has only one method: next(), giving SeqRecords. For what I want to do, I would also need a get(id) method, plus any other methods that could be added to query the DB in a useful fashion (e.g. SELECT entries where length > 5). I see there is a class called InterlacedSequenceIterator(SequenceIterator) that contains a __getitem__(i) method, but it's unclear how I should go about implementing that. Any help or example on how to add such a format to SeqIO?
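
Roughly, the interface I am imagining looks like the sketch below. It is not based on InterlacedSequenceIterator or any existing SeqIO class: IndexedSequences, add and longer_than are hypothetical names, the table layout is an assumption, and records are returned as plain (id, seq) tuples rather than SeqRecords to keep the example dependency-free. Here __getitem__ is keyed by the sequence id, which is my own choice for this illustration.

```python
import sqlite3


class IndexedSequences:
    """Hypothetical container: iterable like a parser, but also queryable by id."""

    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS sequences "
            "(id TEXT PRIMARY KEY, seq TEXT, length INTEGER)"
        )

    def add(self, seq_id, seq):
        # Store the length alongside the sequence so it can be queried directly.
        with self.conn:
            self.conn.execute(
                "INSERT INTO sequences VALUES (?, ?, ?)", (seq_id, seq, len(seq))
            )

    def __iter__(self):
        # Same contract as an iterator-based parser: one record at a time,
        # in insertion order.
        return iter(self.conn.execute("SELECT id, seq FROM sequences ORDER BY rowid"))

    def get(self, seq_id, default=None):
        # Primary-key lookup, so no scan over the whole table.
        row = self.conn.execute(
            "SELECT id, seq FROM sequences WHERE id = ?", (seq_id,)
        ).fetchone()
        return row if row is not None else default

    def __getitem__(self, seq_id):
        row = self.get(seq_id)
        if row is None:
            raise KeyError(seq_id)
        return row

    def longer_than(self, min_length):
        # Example of a query that a plain generator cannot answer without a full pass.
        return self.conn.execute(
            "SELECT id, seq FROM sequences WHERE length > ? ORDER BY rowid",
            (min_length,),
        ).fetchall()
```

With something like this, db.get("some_id") or db["some_id"] would fetch one entry directly, while "for record in db" would still behave like the current iterators.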

Thanks!

Lucas Sinclair, PhD student
Ecology and Genetics
Uppsala University