[Biopython-dev] Reading sequences: FormatIO, SeqIO, etc

Mon Jul 31 12:12:26 UTC 2006

Leighton Pritchard wrote:
> Just to add to the confusion, when parsing large FASTA sequence files, I
> have been using a home-rolled Flex/Pyrex parser (if you'd like a copy,
> drop me a line).  I've used Peter's test framework on the same input
> file (NC_000913.ffn), using BioPython 1.41, with Python 2.4 on Fedora
> Core 3 (up-to-date, eh? ;) ) to get the following typical results:

Times for NC_000913.ffn when returning SeqRecord objects:
> 4.07s FormatIO/SeqRecord (for record in iterator)
> 4.05s FormatIO/SeqRecord (iterator.next)
> 5.00s Fasta.SequenceParser (for record in iterator)
> 4.80s Fasta.SequenceParser (iterator.next)

 > 0.32s SeqIO.FASTA.FastaReader (for record in iterator)
 > 0.30s SeqIO.FASTA.FastaReader (iterator.next)
 > 0.31s SeqIO.FASTA.FastaReader (iterator[i])

> 0.32s SeqUtils/quick_FASTA_reader (conversion to SeqRecord)
> 0.17s pyfastaseqlexer/next_record (conversion to SeqRecord)
> 0.16s pyfastaseqlexer/quick_FASTA_reader (conversion to SeqRecord)

And again, but for Phytophthora infestans ESTs with 72000 entries
 > 51.22s FormatIO/SeqRecord (for record in iterator)
 > 45.64s FormatIO/SeqRecord (iterator.next)
 > 59.97s Fasta.SequenceParser (for record in iterator)
 > 58.70s Fasta.SequenceParser (iterator.next)

 > 4.26s SeqIO.FASTA.FastaReader (for record in iterator)
 > 4.10s SeqIO.FASTA.FastaReader (iterator.next)
 > 4.30s SeqIO.FASTA.FastaReader (iterator[i])

 > 2.97s SeqUtils/quick_FASTA_reader (conversion to SeqRecord)
 > 2.11s pyfastaseqlexer/next_record (conversion to SeqRecord)
 > 1.35s pyfastaseqlexer/quick_FASTA_reader (conversion to SeqRecord)

I imagine this file is much larger than what most of our users work
with, but it does clearly show that the Martel parsers do not scale well.

Out of interest, are the sequences in this file split into multiple
lines (e.g. max length 80) or are they all single (long) lines?  I would
expect the latter to be quicker to load due to fewer string operations.

 > Of course, the hassles of including a Flex-based parser in a general
 > BioPython release probably outweigh the marginal time-saving benefits
 > (see MMCIFlex for details ;) ).  I think SeqIO.FASTA.FastaReader and
 > SeqUtils.quick_FASTA_reader do a good, quick job as it stands, and
 > beat the inclusion of a Flex-based parser hands-down in terms of
 > maintainability and portability.

I agree with you completely that we should avoid the Flex parser on
those grounds, as we can get "close enough" with pure Python,
especially if we do something about the overhead of Seq and SeqRecord
objects.

I did some work on a brand new SeqIO over the weekend, and got the
FASTA iterator slightly quicker too.

The SeqUtils/quick_FASTA_reader is interesting in that it loads the
entire file into memory in one go, and then parses it.  On the other
hand it's not perfect: I would use "\n>" as the split marker rather than
">", which could appear in the description of a sequence.
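
To make the "\n>" point concrete, here is a minimal sketch of a
whole-file reader that splits on "\n>".  This is not the actual
SeqUtils.quick_FASTA_reader code; the function name split_fasta is made
up for illustration, and it returns plain (title, sequence) tuples
rather than SeqRecord objects.

```python
def split_fasta(text):
    """Return a list of (title, sequence) tuples from FASTA-formatted text.

    Hypothetical helper for illustration only.  Records are split on
    "\n>" so that a ">" inside a description line is left alone.
    """
    # Prepend a newline so the first record is delimited like the others,
    # then drop whatever precedes the first record (normally an empty string).
    entries = ("\n" + text).split("\n>")[1:]
    records = []
    for entry in entries:
        lines = entry.split("\n")
        title = lines[0].rstrip("\r")
        # Join the remaining lines, stripping stray whitespace and "\r"
        sequence = "".join(line.strip() for line in lines[1:])
        records.append((title, sequence))
    return records


example = ">seq1 note: a>b\nACGT\nTTGA\n>seq2\nGGCC\n"
records = split_fasta(example)
```

With a plain split on ">" the first title above would be cut in two at
"a>b"; splitting on "\n>" keeps it whole.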

The iterator approach is probably slower but requires much less memory.
How big is your 72,000 entry file in MB?  Do we need to worry about
the size of the raw file in memory?  Allowing the parsers to load it
all in one go could make things much faster...
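
For comparison, a minimal sketch of the iterator approach, which only
ever holds one record in memory.  Again this is an illustration, not
the SeqIO.FASTA.FastaReader code; the name iter_fasta is made up, and
it assumes the handle yields one line at a time (an open file, or just
a list of lines).

```python
def iter_fasta(handle):
    """Yield (title, sequence) tuples one record at a time.

    Hypothetical helper for illustration only.  `handle` is any
    iterable of lines, e.g. an open file object.
    """
    title = None
    chunks = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header line: emit the record collected so far, if any
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:]
            chunks = []
        elif title is not None:
            # Sequence line belonging to the current record
            chunks.append(line.strip())
    # Emit the final record once the input is exhausted
    if title is not None:
        yield title, "".join(chunks)


lines = [">seq1 note: a>b\n", "ACGT\n", "TTGA\n", ">seq2\n", "GGCC\n"]
records = list(iter_fasta(lines))
```

Because a header is only recognised at the start of a line, this has
the same protection against ">" in a description as splitting on "\n>".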

Peter