[Biopython-dev] Reading sequences: FormatIO, SeqIO, etc

Fri Jul 28 15:05:21 UTC 2006

Jeffrey Chang wrote:
> ...  However, Biopython already has at least 
> 3 Fasta parsers!
>    Bio/Fasta
>    Bio/SeqIO/FASTA
>    Bio/expressions/fasta
> 
> Bio/Fasta, the one you compared against, is easily the slowest one.  
> Bio/SeqIO/FASTA is very similar to your implementation and not likely 
> to be significantly faster or slower.  Bio/expressions/fasta uses 
> Martel.  I don't know how well that will perform.  The parsing part 
> should be blazingly fast (since it is mostly in C), but building the 
> object will be slow.  It might be a wash.

The following timings are for iterating over a large FASTA file 
(Escherichia_coli_K12, NC_000913.ffn, with 5254 nucleotide CDS sequences).

The test script is attached; the test input is available here:

ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K12/NC_000913.ffn

I used Biopython 1.42 with Python 2.3 on Windows XP on a laptop computer.
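
For anyone who wants to repeat this kind of measurement, here is a minimal
sketch of a timing harness. It is not the attached test script: the name
time_iteration is made up for illustration, and it is written for current
Python (time.perf_counter) rather than the Python 2.3 used above. It simply
times one full pass over whatever iterator it is given.

```python
import time


def time_iteration(make_iterator):
    """Time one full pass over an iterator.

    make_iterator is a zero-argument callable returning a fresh iterator,
    so that building the parser is included in the measurement.
    Returns (record_count, elapsed_seconds).
    """
    start = time.perf_counter()
    count = 0
    for _record in make_iterator():
        count += 1
    elapsed = time.perf_counter() - start
    return count, elapsed


if __name__ == "__main__":
    # Stand-in iterator; swap in a real parser opened on the FASTA file.
    count, seconds = time_iteration(lambda: iter(range(5254)))
    print("%i records in %.2fs" % (count, seconds))
```

To compare parsers, pass a callable that opens the file and returns each
parser's iterator in turn, and repeat each run a few times to smooth out
disk caching effects.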

Apart from Fasta.RecordParser, these all return a SeqRecord object with 
a generic alphabet:

0.89s SeqIO.FASTA.FastaReader (for record in iterator)
0.88s SeqIO.FASTA.FastaReader (iterator.next)
0.88s SeqIO.FASTA.FastaReader (iterator[i])

5.52s FormatIO/SeqRecord (for record in iterator)
5.41s FormatIO/SeqRecord (iterator.next)

6.06s Fasta.RecordParser (for record in iterator)
6.10s Fasta.SequenceParser (for record in iterator)
6.27s Fasta.SequenceParser (iterator.next)

As you can see, SeqIO.FASTA.FastaReader (written in simple Python) is 
roughly six to seven times faster than both of the Martel-based parsers.
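
To illustrate what a "simple Python" FASTA parser looks like, here is a
rough sketch of a line-by-line iterator. This is not the actual
SeqIO.FASTA.FastaReader code: the name iter_fasta is hypothetical, and it
yields plain (title, sequence) tuples rather than SeqRecord objects. It is
written for current Python rather than Python 2.3.

```python
import io


def iter_fasta(handle):
    """Yield (title, sequence) tuples from a FASTA-formatted file handle."""
    title = None
    chunks = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header line closes the previous record, if any.
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:]
            chunks = []
        elif title is not None:
            # Sequence lines are concatenated; text before the first
            # header is ignored.
            chunks.append(line.strip())
    if title is not None:
        yield title, "".join(chunks)


example = io.StringIO(">gene1 first\nATGC\nGGTA\n>gene2\nTTAA\n")
records = list(iter_fasta(example))
print(records)
# prints [('gene1 first', 'ATGCGGTA'), ('gene2', 'TTAA')]
```

Most of the work is a single pass over the lines with a string join per
record, so there is very little per-record overhead.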

I have also tried this on a file with 2000 records and saw similar scaling.

Peter
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: test_fasta_methods.py
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20060728/93dbbbb7/attachment.ksh>