[Biopython-dev] SeqIO.InterlacedSequenceIterator

Lucas Sinclair lucas.sinclair at me.com
Fri Dec 14 10:07:55 UTC 2012


Hello,

Thanks for your response. Yes, I looked at Bio.SeqIO.index; it builds an index, but the index is held in memory, so it must be recomputed every time the interpreter is restarted. This step wastes enough time for me that I would like to compute the index on my 50GB file once and then be done with it. SQLite really is the technology of choice for such a problem... I suppose you agree that storing all this sequence information in flat ASCII files is not practical.

Actually, I found a reasonable workaround that achieves this result with these two commands:
$ formatdb -i reads -p T -o T -n reads
$ blastdbcmd -db reads -dbtype prot -entry "105107064179" -outfmt %f -out test.fasta

But then I need to make calls to subprocess...
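Those subprocess calls can at least be kept in one place. A minimal sketch of such a wrapper, assuming blastdbcmd is on the PATH (the function names here are my own invention, not an existing API):

```python
import subprocess

def blastdbcmd_args(db, entry, out):
    # Build the argument list matching the blastdbcmd invocation above.
    return ["blastdbcmd", "-db", db, "-dbtype", "prot",
            "-entry", entry, "-outfmt", "%f", "-out", out]

def extract_record(db, entry, out):
    # Shell out to blastdbcmd; raises CalledProcessError if it fails.
    subprocess.check_call(blastdbcmd_args(db, entry, out))
```

Separating the argument-building step from the call makes the wrapper easy to test without a BLAST database on hand.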

Since I thought my first small contribution to Biopython was fun to do (https://github.com/biopython/biopython/commit/1c72a63b35db70d11c628b83a0269d1a9c6443a4), I may still feel like writing a proper solution. Would such a thing be a welcome addition to Bio.SeqIO? If so, where would I place it? The schema would be an SQLite file with a single table named "sequences", whose columns correspond to the attributes of a SeqRecord. But calling parse would then need to return a different type of object than a generator: an object with a __getitem__ method.
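To make the idea concrete, here is a minimal sketch of that schema using only the stdlib sqlite3 module. The SeqStore class, its column set, and the longer_than helper are illustrations of my own, not an existing Bio.SeqIO API; a real version would cover all SeqRecord attributes and return SeqRecord objects:

```python
import sqlite3

class SeqStore:
    """Dict-like access to sequences kept in a single-table SQLite file."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS sequences ("
            " id TEXT PRIMARY KEY, description TEXT, seq TEXT)")

    def add(self, record_id, description, seq):
        self.conn.execute(
            "INSERT OR REPLACE INTO sequences VALUES (?, ?, ?)",
            (record_id, description, seq))
        self.conn.commit()

    def __getitem__(self, record_id):
        # O(log n) lookup via the primary-key index, instead of
        # scanning half the flat file on average.
        row = self.conn.execute(
            "SELECT description, seq FROM sequences WHERE id = ?",
            (record_id,)).fetchone()
        if row is None:
            raise KeyError(record_id)
        return row

    def longer_than(self, n):
        # Example of a richer query, e.g. SELECT entries where length > n.
        return [r[0] for r in self.conn.execute(
            "SELECT id FROM sequences WHERE length(seq) > ?", (n,))]
```

Passing ":memory:" as the path lets you try it without touching disk.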

Sincerely,

Lucas Sinclair, PhD student
Ecology and Genetics
Uppsala University

On 13 Dec 2012, at 18:40, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Thu, Dec 13, 2012 at 4:29 PM, Lucas Sinclair <lucas.sinclair at me.com> wrote:
>> Hi !
>> 
>> I'm working a lot with fasta files. They can be large (>50GB) and contain
>> lots of sequences (>40,000,000). Often I need to get one sequence from the
>> file. With a flat FASTA file this requires parsing, on average, half of the
>> file before finding it. I would like to write something that solves this
>> problem, and rather than making a new repository, I thought I could
>> contribute to biopython.
>> 
>> As I just wrote, the iterator nature of parsing sequence files has its
>> limits. I was thinking of something that is indexed, and not some hack like
>> I see sometimes where a second ".fai" file is added next to the ".fa" file.
>> The natural thing to do is to put these entries in a SQLite file. The
>> appraisal of such solutions is well made here:
>> http://defindit.com/readme_files/sqlite_for_data.html
>> 
>> Now I looked into the biopython source code, and it seems everything is
>> based on returning a generator object which essentially has only one method:
>> next() giving SeqRecords. For what I want to do, I would also need the
>> get(id) method. Plus any other methods that could now be added to query the
>> DB in a useful fashion (e.g. SELECT entry where length > 5). I see there is
>> a class called InterlacedSequenceIterator(SequenceIterator) that contains a
>> __getitem__(i) method, but it's unclear how I should go about
>> implementing that. Any help/example on how to add such a format to SeqIO ?
>> 
>> Thanks !
> 
> Have you looked at Bio.SeqIO.index (index held in memory) and
> Bio.SeqIO.index_db (index held in an SQLite3 database), and do
> they solve your needs?
> 
> Note these only index the location of records - unlike tabix/fai indexes
> which also look at the line length to be able to pull out subsequences.
> This means the Bio.SeqIO indexing isn't ideal for dealing with large
> records where you are only interested in small subsequences.
> 
> Peter




More information about the Biopython-dev mailing list