[Biopython-dev] UniProt GOA parser

Peter Cock p.j.a.cock at googlemail.com
Wed May 22 09:45:00 EDT 2013


On Mon, May 20, 2013 at 7:09 PM, Iddo Friedberg <idoerg at gmail.com> wrote:
>> Do you want to have a go at re-using the index code in Bio.File
>> (the back end for SeqIO and SearchIO's indexing)? Let me know
>> if the current setup is too mysterious and I can try and document
>> more of it and/or do this for the GOA module.
>
> I'd like to have a go..
>
> ./I

Great, a few more details then:

The second part of Bio/File.py has three private classes,
_IndexedSeqFileProxy, _IndexedSeqFileDict and
_SQLiteManySeqFilesDict, which can be used for any
sequential record file format (meaning records stored one
after the other, not just biological sequences).

These are used by the Bio.SeqIO.index() and index_db()
functions, and their sisters in Bio.SearchIO.

The idea is you write a subclass of _IndexedSeqFileProxy
for your new file format, and then this gets used by either
_IndexedSeqFileDict (in memory offset dictionary) or
_SQLiteManySeqFilesDict (SQLite offset dictionary).
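To make the two flavours concrete, here is a rough standalone
sketch of the same offsets held in a plain dict and in SQLite.
The entries, the table name and the column names are all made up
for illustration; the real schema in Bio/File.py may well differ.

```python
import sqlite3

# Made-up (identifier, start offset, length in bytes) entries
entries = [("alpha", 0, 12), ("beta", 12, 12), ("gamma", 24, 12)]

# In-memory flavour: a plain dictionary of offsets
offsets = {key: (start, length) for key, start, length in entries}

# SQLite flavour: the same data in a table (hypothetical schema)
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE toy_offsets "
    "(key TEXT PRIMARY KEY, start_offset INTEGER, length INTEGER)"
)
con.executemany("INSERT INTO toy_offsets VALUES (?, ?, ?)", entries)
con.commit()

# Looking up one identifier gives the same answer either way
row = con.execute(
    "SELECT start_offset, length FROM toy_offsets WHERE key = ?",
    ("beta",),
).fetchone()
```

Either way the lookup only returns where the record lives; the
parsing is left to the format-specific proxy class.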

Your _IndexedSeqFileProxy subclass has to define
an __iter__ method which loops over the file, yielding
a tuple for each record: the identifier string, the
start offset, and ideally the length in bytes.
It must also define a get method which seeks to a
given offset and then parses the record found there.
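As a rough sketch of that interface, here is a standalone toy
class. It does not inherit from the real _IndexedSeqFileProxy,
and the class name and the one-line-per-record "key<TAB>value"
format are invented for illustration only.

```python
import io


class ToyLineProxy:
    """Hypothetical stand-in for an _IndexedSeqFileProxy subclass.

    Assumes a made-up format: one record per line, "key<TAB>value".
    """

    def __init__(self, handle):
        # Binary handle, so that offsets and lengths are in bytes
        self._handle = handle

    def __iter__(self):
        """Yield (identifier, start offset, length in bytes) per record."""
        handle = self._handle
        handle.seek(0)
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            key = line.split(b"\t", 1)[0].decode("ascii")
            yield key, offset, len(line)

    def get(self, offset):
        """Seek to the offset and parse the record found there."""
        self._handle.seek(offset)
        line = self._handle.readline().rstrip(b"\n")
        key, value = line.split(b"\t", 1)
        return key.decode("ascii"), value.decode("ascii")


handle = io.BytesIO(b"alpha\tfirst\nbeta\tsecond\ngamma\tthird\n")
proxy = ToyLineProxy(handle)

# Build an in-memory offset dictionary, then fetch one record by key
offsets = {key: offset for key, offset, length in proxy}
record = proxy.get(offsets["beta"])
```

The point is that indexing only needs one cheap pass over the
file, and the full parsing cost is paid per record on access.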

For the GOA files, the __iter__ loop will just spot
batches of lines for the same identifier which together
make up a single record.
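Something along these lines, as a standalone sketch rather than
working Bio.File code. I am assuming here that comment lines start
with "!", that the identifier is the second tab-separated column,
and that all the lines for one identifier are contiguous. The
function name and the sample rows (with most columns left out)
are made up.

```python
import io


def iter_goa_offsets(handle):
    """Yield (identifier, start offset, length in bytes) per batch.

    Hypothetical helper; expects a binary handle.
    """
    handle.seek(0)
    current_id = None
    start = 0
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b"!") or b"\t" not in line:
            # Comment or blank line: close any batch in progress
            if current_id is not None:
                yield current_id, start, offset - start
                current_id = None
            continue
        ident = line.split(b"\t")[1].decode("ascii")
        if ident != current_id:
            # New identifier: close the old batch, start a new one
            if current_id is not None:
                yield current_id, start, offset - start
            current_id, start = ident, offset
    if current_id is not None:
        # offset is now the end of the file
        yield current_id, start, offset - start


data = (
    b"!gaf-version: 2.0\n"
    b"UniProtKB\tP12345\tGENE1\t\tGO:0005515\n"
    b"UniProtKB\tP12345\tGENE1\t\tGO:0005634\n"
    b"UniProtKB\tQ67890\tGENE2\t\tGO:0003677\n"
)
index = list(iter_goa_offsets(io.BytesIO(data)))
```

With the start offset and length recorded, the get method can
seek, read exactly that many bytes, and parse the batch of lines
as one record.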

I managed to explain the setup to Bow, and he got it
to work for SearchIO, but we were doing face to face
video chats for that during GSoC last year. Fresh eyes
will surely find some more rough edges in my docs ;)

Regards,

Peter

