[Biopython-dev] UniProt GOA parser
Iddo Friedberg
idoerg at gmail.com
Fri May 10 12:32:43 EDT 2013
On Fri, May 10, 2013 at 12:26 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Fri, May 10, 2013 at 5:20 PM, Iddo Friedberg <idoerg at gmail.com> wrote:
> > On Fri, May 10, 2013 at 6:06 AM, Peter Cock wrote:
> >>
> >> Would it make sense to want random access to the GOA files based
> >> on the identifier (DB_Object_ID and/or DB_Object_Symbol)? That
> >> should be fairly straightforward to do, building on the indexing code
> >> for Bio.SeqIO and SearchIO.
> >
> >
> > Would that require reading it all into memory? Uniprot_GOA files
> > are huge; it is impractical to read them in fully.
>
> Not at all - we'd record a dictionary mapping the record ID to an offset
> in the file on disk, or record this mapping in an SQLite index file.
>
OK, that's good then.
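
To make the offset idea concrete, here is a minimal sketch of an in-memory
index. It is not Biopython's actual indexing code: the function names
build_offset_index and fetch_record are hypothetical, the sample lines are
invented and truncated to five columns, and it assumes a tab-separated GAF
file with "!" comment lines, DB_Object_ID in the second column, and all
lines for one ID stored consecutively.

```python
import io

# Invented toy data: a comment line plus three truncated annotation lines.
SAMPLE = (
    b"!gaf-version: 2.0\n"
    b"UniProtKB\tA0001\tSYM1\t\tGO:0000001\n"
    b"UniProtKB\tA0001\tSYM1\t\tGO:0000002\n"
    b"UniProtKB\tB0002\tSYM2\t\tGO:0000003\n"
)


def build_offset_index(handle):
    """Map each DB_Object_ID to the byte offset of its first line.

    The handle must be opened in binary mode so that tell() returns
    real byte offsets. Only one line is held in memory at a time.
    """
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b"!"):
            continue  # skip header/comment lines
        fields = line.rstrip(b"\r\n").split(b"\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # Keep only the first offset seen for each ID.
        index.setdefault(fields[1].decode("utf-8"), offset)
    return index


def fetch_record(handle, index, object_id):
    """Seek to the stored offset and read the consecutive lines for one ID."""
    handle.seek(index[object_id])
    rows = []
    for raw in iter(handle.readline, b""):
        fields = raw.decode("utf-8").rstrip("\r\n").split("\t")
        if len(fields) < 2 or fields[1] != object_id:
            break  # reached the next ID (or a comment/blank line)
        rows.append(fields)
    return rows


if __name__ == "__main__":
    handle = io.BytesIO(SAMPLE)  # stands in for open(path, "rb")
    index = build_offset_index(handle)
    for row in fetch_record(handle, index, "A0001"):
        print(row)
```

Memory use grows with the number of distinct IDs rather than with the file
size. For files with too many IDs to hold in a dictionary, the same
ID-to-offset pairs could be written to an SQLite table instead, as Peter
suggests.
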
> >> Note here I am picturing combining all the (consecutive) lines
> >> for the same DB_Object_ID - currently the parser is line based,
> >> but batching by DB_Object_ID would be a straightforward change
> >> and may better suit some uses.
> >
> > Perhaps only for organism-specific files, which in some cases can
> > be read fully into memory.
>
> The examples I looked at only seemed to have a dozen or so
> lines for each DB_Object_ID - but perhaps these were easy
> cases? How many lines per DB_Object_ID in the worst cases?
>
> Peter
>
I actually thought you were suggesting that the whole file should be
read into memory, not just buffered by DB_Object_ID. My mistake.
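
For the record, buffering by DB_Object_ID could look roughly like the
sketch below. This is not the current parser: batch_by_object_id is a
hypothetical name, and it assumes tab-separated lines with DB_Object_ID in
the second column, "!" comment lines, and consecutive lines per ID.

```python
from itertools import groupby


def batch_by_object_id(lines):
    """Yield (DB_Object_ID, rows) for each run of consecutive lines.

    Only the rows of the current ID are held in memory, so the input
    can be a file handle of any size.
    """
    rows = (
        line.rstrip("\r\n").split("\t")
        for line in lines
        if line.strip() and not line.startswith("!")  # skip blanks/comments
    )
    # groupby only merges adjacent items, which matches the assumption
    # that lines for one ID are consecutive.
    for object_id, group in groupby(rows, key=lambda row: row[1]):
        yield object_id, list(group)


if __name__ == "__main__":
    # Invented toy lines, truncated to five columns.
    toy_lines = [
        "!gaf-version: 2.0\n",
        "UniProtKB\tA0001\tSYM1\t\tGO:0000001\n",
        "UniProtKB\tA0001\tSYM1\t\tGO:0000002\n",
        "UniProtKB\tB0002\tSYM2\t\tGO:0000003\n",
    ]
    for object_id, batch in batch_by_object_id(toy_lines):
        print(object_id, len(batch))
```

If an ID's lines were ever split across the file, this would yield that ID
more than once, so the consecutive-lines assumption is worth checking
against real files.
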
--
Iddo Friedberg
http://iddo-friedberg.net/contact.html
++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.>
++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----.
.>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>>
>>----.<--.>++++++.<<<<------------------------------------.