[Biopython-dev] SeqIO.InterlacedSequenceIterator

Peter Cock p.j.a.cock at googlemail.com
Thu Dec 13 17:40:46 UTC 2012


On Thu, Dec 13, 2012 at 4:29 PM, Lucas Sinclair <lucas.sinclair at me.com> wrote:
> Hi !
>
> I'm working a lot with fasta files. They can be large (>50GB) and contain
> lots of sequences (>40,000,000). Often I need to get one sequence from the
> file. WIth a flat FASTA file this requires parsing, on average, half of the
> file before finding it. I would like to write something that solves this
> problem, and rather than making a new repository, I thought I could
> contribute to biopython.
>
> As I just wrote, the iterator nature of parsing sequences files has it's
> limits. I was thinking of something that is indexed. And not some hack like
> I see sometimes where a second".fai" file is added nest to the ".fa" file.
> The natural thing to do is to put these entries in a SQLite file. The
> appraisal of such solutions is well made here:
> http://defindit.com/readme_files/sqlite_for_data.html
>
> Now I looked into the biopython source code, and it seems everything is
> based on returning a generator object which essentially has only one method:
> next() giving SeqRecords. For what I want to do, I would also need the
> get(id) method. Plus any other methods that could now be added to query the
> DB in a useful fashion (e.g. SELECT entry where length > 5). I see there is
> a class called InterlacedSequenceIterator(SequenceIterator) that contains a
> __getitem__(i) method, but it's unclear how to I should go about
> implementing that. Any help/example on how to add such a format to SeqIO ?
>
> Thanks !

Have you looked at Bio.SeqIO.index (index held in memory) and
Bio.SeqIO.index_db (index held in an SQLite3 database), and do
they solve your needs?

Note these only index the location of records - unlike tabix/fai indexes
which also look at the line length to be able to pull out subsequences.
This means the Bio.SeqIO indexing isn't ideal for dealing with large
records were you are only interested in small subsequences.

Peter



More information about the Biopython-dev mailing list