[Biopython-dev] Merging Uniprot XML parser?
Peter
biopython at maubp.freeserve.co.uk
Fri Nov 5 17:53:50 UTC 2010
On Fri, Nov 5, 2010 at 4:43 PM, Andrea Pierleoni wrote:
>
> On Tue, Oct 19, 2010 at 4:54 PM, Peter wrote:
>> I've now merged this into the trunk (with a git rebase first so the
>> history is linear - no branch+merge), and Andrea has agreed to
>> retest it. Other testing and comments are most welcome.
>>
>> Peter
>>
>
>
> I've done a couple of tests, from the master biopython branch.
> The uniprot-xml parser successfully parsed the 2010_11 release
> of uniprot containing 522,019 entries.
>
> The plain text 'swiss' parser took 6 mins to parse the complete flatfile
> uniprot db on my system (python 2.6 on a macbook pro, core2duo).
> the uniprot-xml parser took 12 minutes to do the same task when using
> cElementTree and looks pretty good to me (compare this to the 8
> minutes I needed to download the gzipped db).
I think I have a slightly older version as it only has 519348 entries.
My timings using Python 2.6 on Mac OS X, looping over each file
with Bio.SeqIO.parse() and incrementing a counter:
uniprot_sprot.fasta, 232 MB, 15s ("fasta")
uniprot_sprot.dat, 2.2 GB, 4m57s ("swiss")
uniprot_sprot.xml, 4.5 GB, 10m34s ("uniprot-xml")
Note the XML file is about twice the size of the plain text swiss
format file, and as you noted, takes about twice as long to parse.
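For anyone wanting to repeat the timings, here is a minimal sketch of that kind of counting script. It is not the actual count.py; the function names count_records and time_parse are made up for illustration, and it assumes a Biopython recent enough that SeqIO.parse() accepts a filename.

```python
import time


def count_records(records):
    """Count the items from any iterator without keeping them in memory."""
    count = 0
    for _ in records:
        count += 1
    return count


def time_parse(filename, format_name):
    """Return (number of records, elapsed seconds) for one file.

    Hypothetical helper: format_name is a Bio.SeqIO format string such as
    "fasta", "swiss" or "uniprot-xml".
    """
    # Imported here so count_records() above works without Biopython.
    from Bio import SeqIO

    start = time.time()
    # Passing the filename lets SeqIO open the file in the mode it needs.
    n = count_records(SeqIO.parse(filename, format_name))
    return n, time.time() - start
```

Calling time_parse("uniprot_sprot.dat", "swiss") would then give the record count and the wall-clock time for the plain text file.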
> However it took more than 80 mins to do the same task using
> ElementTree. So be aware that the parser can become very slow
> without the C library.
>
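The usual way to prefer the C library and fall back to the pure Python one is a guarded import. Below is a rough sketch of that pattern combined with a streaming count of entry elements. It is not the code in Bio/SeqIO/UniprotIO.py (which, going by the traceback further down, uses iterparse with both "start" and "end" events); count_entries and SAMPLE are illustrative names only.

```python
import io

try:
    # Python 2.5 to 3.8: ask for the C implementation explicitly.
    from xml.etree import cElementTree as ElementTree
except ImportError:
    # Python 3.9+ (cElementTree was removed; ElementTree uses its C
    # accelerator when available) or platforms without the C module.
    from xml.etree import ElementTree

# XML namespace used by UniProt XML files.
NS = "{http://uniprot.org/uniprot}"


def count_entries(handle):
    """Count <entry> elements from a binary file handle, streaming."""
    count = 0
    for event, elem in ElementTree.iterparse(handle, events=("end",)):
        if elem.tag == NS + "entry":
            count += 1
            # Drop the entry's children and text so memory use stays low;
            # the emptied element itself remains attached to the root.
            elem.clear()
    return count


# Tiny made-up document, just to show the call.
SAMPLE = (b'<uniprot xmlns="http://uniprot.org/uniprot">'
          b"<entry><accession>A</accession></entry>"
          b"<entry><accession>B</accession></entry>"
          b"</uniprot>")
print(count_entries(io.BytesIO(SAMPLE)))
```

With a guard like this the code still runs where the C library is missing, just far more slowly, which matches the 12 minutes versus 80 minutes reported above.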
> I'm currently retesting also on TrEMBL, but I don't think there is going
> to be any problem.
OK - those files are about 10 times bigger, right?
> I have no idea of the performance with Jython and similar
> variants of Python, nor whether it works at all.
The tests all pass with Jython 2.5.1 (running under Mac OS X),
and here are some timings:
uniprot_sprot.fasta, 232 MB, 21s ("fasta")
uniprot_sprot.dat, 2.2 GB, 8m34s ("swiss")
uniprot_sprot.xml, 4.5 GB, FAILED ("uniprot-xml")
The XML file failed almost immediately with this traceback:
Traceback (most recent call last):
  File "../count.py", line 13, in <module>
    for record in SeqIO.parse(open(filename), format_name):
  File "../count.py", line 13, in <module>
    for record in SeqIO.parse(open(filename), format_name):
  File "/Users/xxx/jython2.5.1/Lib/site-packages/Bio/SeqIO/UniprotIO.py", line 80, in UniprotIterator
    for event, elem in ElementTree.iterparse(handle, events=("start", "end")):
  File "/Users/xxx/jython2.5.1/Lib/xml/etree/ElementTree.py", line 937, in next
    self._parser.feed(data)
  File "/Users/xxx/jython2.5.1/Lib/xml/etree/ElementTree.py", line 1245, in feed
    self._parser.Parse(data, 0)
  File "/Users/xxx/jython2.5.1/Lib/xml/parsers/expat.py", line 195, in Parse
    self._data.append(data)
        at java.util.Arrays.copyOf(Arrays.java:2882)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
        at java.lang.StringBuilder.append(StringBuilder.java:119)
        at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
java.lang.OutOfMemoryError: java.lang.OutOfMemoryError: Java heap space
Note this wasn't the machine itself running out of memory (it had GBs
free); rather, the Java VM ran out of its own heap space. That's a bit
frustrating - but Kyle's email suggests things could improve in the
next Jython release.
Peter