[Biopython-dev] UniProt GOA parser

Iddo Friedberg idoerg at gmail.com
Fri May 10 16:20:16 UTC 2013


On Fri, May 10, 2013 at 6:06 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Thu, May 9, 2013 at 12:28 AM, Iddo Friedberg <idoerg at gmail.com> wrote:
> > A new UniProt-GOA parser is available for you to poke around:
> >
> > https://github.com/idoerg/biopython/tree/uniprot-goa/Bio/UniProtGOA
> >
>
> I think for the namespace, we might be better off using Bio.UniProt.GOA,
> where Iddo's parser would be in Bio/UniProt/GOA.py and any other
> UniProt specific code could also go under Bio/UniProt - for example
> a web API.
>

OK.


>
> Some of Bio.SwissProt might also migrate here over time.
>
> > More on UniProt-GOA: http://www.ebi.ac.uk/GOA
> >
> > There are three file formats: GAF (gene association file), GPA (gene
> > product association) and GPI (gene product information), explained here:
> > http://www.ebi.ac.uk/GOA/downloads
> >
> > Input GAF files can be very large, due to the growth of UniProt-GOA. If
> > you would like to test in a timely fashion, I suggest you get historical
> > files, which are smaller. Once you get to version numbers above 40, the
> > runtime for the example code in UniProtGOA.py goes over 2 minutes (on my
> > i5 machine).
>
> Would it make sense to want random access to the GOA files based
> on the identifier (DB_Object_ID and/or DB_Object_Symbol)? That
> should be fairly straightforward to do, building on the indexing code
> for Bio.SeqIO and SearchIO.
>

Would that require reading it all into memory? UniProt-GOA files are huge,
and it is impractical to read them in fully.
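
For reference, a lookup by identifier does not have to hold the file in
memory: one pass can record only the byte offset of each line per
DB_Object_ID, and later lookups seek straight to those offsets. The sketch
below is only an illustration of that idea, not the Bio.SeqIO indexing code.
It assumes a tab-separated GAF file opened in binary mode, with "!" comment
lines and DB_Object_ID in the second column; build_offset_index and fetch
are hypothetical names.

```python
from collections import defaultdict


def build_offset_index(handle):
    """Map each DB_Object_ID to the byte offsets of its lines.

    `handle` must be a seekable file opened in binary mode. Only the
    identifiers and integer offsets are kept in memory, not the records.
    """
    index = defaultdict(list)
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break  # end of file
        if line.startswith(b"!"):
            continue  # GAF header/comment line
        fields = line.split(b"\t")
        if len(fields) > 1:
            # Column 2 (index 1) is assumed to be DB_Object_ID.
            index[fields[1].decode()].append(offset)
    return index


def fetch(handle, index, object_id):
    """Return the annotation rows for one identifier as lists of fields."""
    rows = []
    for offset in index.get(object_id, []):
        handle.seek(offset)
        line = handle.readline().decode()
        rows.append(line.rstrip("\n").split("\t"))
    return rows
```

For the largest files the index itself could still grow big, in which case
it would have to live on disk (for example in SQLite) rather than in a dict.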


>
> Note here I am picturing combining all the (consecutive) lines
> for the same DB_Object_ID - currently the parser is line based,
> but batching by DB_Object_ID would be a straightforward change
> and may better suit some uses.
>

Perhaps only for organism-specific files, which in some cases can be read
fully into memory.
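
As a rough illustration of the batching idea, consecutive lines sharing a
DB_Object_ID can be grouped lazily, so only one batch is held in memory at a
time. This is a sketch and not the parser in the branch: it assumes
well-formed, tab-separated GAF text with "!" comment lines and DB_Object_ID
in the second column, and gaf_rows and batch_by_object_id are hypothetical
names.

```python
from itertools import groupby


def gaf_rows(handle):
    """Yield each annotation line as a list of fields, skipping comments."""
    for line in handle:
        if line.startswith("!") or not line.strip():
            continue
        yield line.rstrip("\n").split("\t")


def batch_by_object_id(handle):
    """Yield (DB_Object_ID, rows) for each run of consecutive lines.

    groupby only merges adjacent rows, so an identifier that reappears
    later in the file produces a second, separate batch.
    """
    for object_id, rows in groupby(gaf_rows(handle), key=lambda row: row[1]):
        yield object_id, list(rows)
```

Usage would be along the lines of
"for object_id, rows in batch_by_object_id(open(path)): ...".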

>
> > Old GAF files are available here:
> > ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/old/UNIPROT/
> >
> > Current GPI and GPA files are not very large.
> >
> > Thanks to Peter for his help on this.
> >
> > Best,
> >
> > Iddo
>
> Peter
>



-- 
Iddo Friedberg
http://iddo-friedberg.net/contact.html
++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.>
++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----.
.>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>>
>>----.<--.>++++++.<<<<------------------------------------.


