[Biopython-dev] Python 3 and Bio.SeqIO.index()
Peter
biopython at maubp.freeserve.co.uk
Wed Jul 14 17:47:45 UTC 2010
Hi all,
From background reading I knew that text IO was very slow in
Python 3.0, and that this had been improved in Python 3.1, although
there is still an overhead for the unicode conversion, e.g.
http://dabeaz.blogspot.com/2010/01/reexamining-python-3-text-io.html
First some good news - using Bio.SeqIO.convert() for FASTQ to
FASTA seems to be faster under Python 3.1 than Python 2.7 (on
a Windows XP 32bit machine).
Now for the bad news - using Bio.SeqIO.index() is much slower. I
decided to simplify this down to a minimal test case, and confirmed
my hunch: indexing files in the new default unicode text mode comes
with a major time penalty (a factor of about one hundred).
I've attached four versions of the same script which scans a FASTA
file building a dictionary of record offsets.
* fast in Python 2 using the default non-unicode strings
* slower in Python 3 using the default unicode strings
* slower in Python 3 using Latin-1 encoded unicode strings
* faster in Python 3 using binary mode and bytes
The basic Python 3 script was created using 2to3 from the Python 2
version. I manually changed this to make the latin variant, and the
binary bytes version.
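
The attached scripts are not reproduced in the list archive, so here
is a rough sketch of what the binary mode variant might look like. It
is not the original index3b.py: the function name index_fasta_binary
and the choice of the first word after ">" as the dictionary key are
assumptions made for illustration only.

```python
def index_fasta_binary(filename):
    """Map each FASTA record identifier to the byte offset of its '>' line.

    Hypothetical helper for illustration, not part of Biopython.
    """
    offsets = {}
    # Binary mode: no unicode decoding, and tell() is a plain byte offset.
    with open(filename, "rb") as handle:
        while True:
            offset = handle.tell()  # position before reading the line
            line = handle.readline()
            if not line:
                break  # end of file
            if line.startswith(b">"):
                # Assumed key: the first whitespace-delimited word after '>'.
                parts = line[1:].split(None, 1)
                key = parts[0].decode("ascii") if parts else ""
                offsets[key] = offset
    return offsets
```

Wrapping a call such as index_fasta_binary("ls_orchid.fasta") between
two time.time() readings would give an "Indexed in ...s" figure of the
kind shown in the sample output.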
Sample output on an example file with just 94 entries:
c:\python27\python index2.py ls_orchid.fasta - Indexed in 0.02s
c:\python31\python index3.py ls_orchid.fasta - Indexed in 12.20s
c:\python31\python index3latin.py ls_orchid.fasta - Indexed in 11.78s
c:\python31\python index3b.py ls_orchid.fasta - Indexed in 0.02s
Here the Python 2 version and the Python 3 binary examples
are both extremely fast, while Python 3 unicode is very slow.
There may be a tiny benefit to using the Latin-1 encoding as
suggested in the blog post I linked to above.
Using a FASTA file with 7 million entries (converted from SRA
entry SRR001666_1.fastq), we have:
c:\python27\python index2.py SRR001666_1.fasta - Indexed in 79.45s
c:\python31\python index3latin.py SRR001666_1.fasta - Over an hour (I killed it)
c:\python31\python index3b.py SRR001666_1.fasta - Indexed in 34.31s
I think the reason that Python 3 binary is faster than Python 2
is that we are using universal newlines mode in Python 2, which
adds an overhead (both for reading, and in calculating the offset).
Given the way the Bio.SeqIO.index() API works, we have control over
the file mode. I think we are going to have to open the file in binary
mode to index efficiently. This may mean an extra wrapper for
handling cross-platform newline characters (something that Python
2.x does for us).
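
As a minimal sketch of what such a wrapper could look like (an
assumption about one possible approach, not an existing Biopython
function), the hypothetical generator iter_lines_with_offsets below
reads from a binary handle and yields each line together with its byte
offset, treating "\n", "\r\n" and a bare "\r" as line endings:

```python
def iter_lines_with_offsets(handle):
    """Yield (byte_offset, line_without_ending) from a binary file handle.

    Hypothetical helper for illustration only. Offsets are tracked by
    counting bytes, so handle.tell() is never called.
    """
    offset = 0
    for raw in handle:  # binary iteration only splits on b"\n"
        # A chunk can still hold bare b"\r" breaks (old Mac style files);
        # bytes.splitlines splits on b"\n", b"\r\n" and b"\r".
        for piece in raw.splitlines(True):  # True keeps the line endings
            yield offset, piece.rstrip(b"\r\n")
            offset += len(piece)
```

Note that a file using only "\r" line endings would arrive as a single
chunk, so this sketch would hold the whole file in memory in that case.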
I'd also be interested to try making the optimized functions in
Bio.SeqIO.convert() use binary mode too and see if that makes
them any faster (even on Python 2).
In general, perhaps it would be useful if on Python 3 Bio.SeqIO
could cope with opening text files in either unicode text mode
or in binary mode? These issues may also influence what we
decide to use for Seq objects by default (bytes versus unicode).
Of course, the more special cases like this we have to worry
about, the more complex a single codebase becomes...
Peter
-------------- next part --------------
A non-text attachment was scrubbed...
Name: index3.py
Type: application/octet-stream
Size: 625 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20100714/0cc7a3eb/attachment-0008.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: index3b.py
Type: application/octet-stream
Size: 637 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20100714/0cc7a3eb/attachment-0009.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: index3latin.py
Type: application/octet-stream
Size: 645 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20100714/0cc7a3eb/attachment-0010.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: index2.py
Type: application/octet-stream
Size: 607 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20100714/0cc7a3eb/attachment-0011.obj>