[Biopython-dev] Python 3 and Bio.SeqIO.index()

Wed Jul 14 18:09:15 UTC 2010

On Wed, Jul 14, 2010 at 6:47 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> Using a FASTA file with 7 million entries (converted from SRA
> entry SRR001666_1.fastq), we have:
>
> c:\python27\python index2.py SRR001666_1.fasta - Indexed in 79.45s
> c:\python31\python index3latin.py SRR001666_1.fasta - Over an hour (I killed it)
> c:\python31\python index3b.py SRR001666_1.fasta - Indexed in 34.31s
>
> I think the reason that Python 3 binary is faster than Python 2
> is we are using universal read lines mode in Python 2, which will
> add an overhead (both for reading, and in calculating the offset).

Confirmed - switching the mode from "rU" to "rb" in index2.py to give index2b.py:

c:\python27\python index2.py SRR001666_1.fasta - Indexed in 76.96s
c:\python27\python index2b.py SRR001666_1.fasta - Indexed in 36.62s
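
For anyone wanting to try this, here is a rough sketch of the binary mode approach (this is not the index2b.py script itself, and the function name index_fasta_offsets is made up for illustration). Opening the file with "rb" means no newline translation is done, so the offset of each record can be tracked by simply adding up the length of each line read:

```python
import os
import tempfile


def index_fasta_offsets(path):
    """Map each FASTA record identifier to the byte offset of its '>' line.

    Hypothetical helper for illustration only, not Biopython's code.
    """
    offsets = {}
    position = 0
    # Binary mode: lines come back as bytes with their original line
    # endings, so len(line) is the exact number of bytes consumed.
    with open(path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                words = line[1:].split(None, 1)
                # Identifier is the first word after '>' (may be empty).
                key = words[0].decode("ascii") if words else ""
                offsets[key] = position
            position += len(line)
    return offsets


# Small demo using Windows style (CRLF) line endings.
with tempfile.NamedTemporaryFile(suffix=".fasta", delete=False) as tmp:
    tmp.write(b">a desc\r\nACGT\r\n>b\r\nGG\r\n")
demo_offsets = index_fasta_offsets(tmp.name)
os.remove(tmp.name)
print(demo_offsets)
```

The offsets are true byte positions in the file, so they can be passed straight to handle.seek() on a handle that is also opened in binary mode.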

I've had a quick go at doing this for Bio.SeqIO.index(). The catch is
that get_raw() then returns the underlying newlines as they appear in
the file (which we can fix if need be), but otherwise it seems to work
(unit tests pass). This may be worth following up on regardless of the
Python 3 work, since the speed-up is pretty good (from 97s to 52s on
this example on Windows). We'd need more testing for cross-platform
issues, of course.
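
To illustrate the newline catch, here is a minimal sketch of what a raw record lookup on a binary handle looks like, and one possible fix. This is an assumption about how it could be done, not what the attached patch does; read_raw_record and its normalize_newlines argument are invented names:

```python
import io


def read_raw_record(handle, offset, normalize_newlines=False):
    """Return the raw bytes of the FASTA record starting at offset.

    Hypothetical helper for illustration only; handle must be opened
    in binary mode.
    """
    handle.seek(offset)
    lines = [handle.readline()]  # the '>' title line
    while True:
        line = handle.readline()
        if not line or line.startswith(b">"):
            break  # end of file or start of the next record
        lines.append(line)
    raw = b"".join(lines)
    if normalize_newlines:
        # Optional fix: map Windows CRLF to plain LF.
        raw = raw.replace(b"\r\n", b"\n")
    return raw


# In-memory stand-in for a FASTA file with CRLF line endings.
demo = io.BytesIO(b">a desc\r\nACGT\r\n>b\r\nGG\r\n")
raw_a = read_raw_record(demo, 0)
clean_b = read_raw_record(demo, 15, normalize_newlines=True)
print(repr(raw_a))
print(repr(clean_b))
```

Without the normalisation step the caller sees whatever line endings the file was written with, which is the behaviour change described above.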

I wonder if the same speed up happens on Linux / Mac OS X?
Something to try tomorrow I guess.

Peter
-------------- next part --------------
A non-text attachment was scrubbed...
Name: seqio_index_b.patch
Type: application/octet-stream
Size: 1288 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20100714/d97bb4a0/attachment-0002.obj>