[Biopython-dev] Performance of Bio.File.UndoHandle
Jeffrey Chang
jchang at jeffchang.com
Fri Oct 17 15:03:10 EDT 2003
On Thursday, October 16, 2003, at 05:45 AM, Michael Hoffman wrote:
> On Wed, 15 Oct 2003, Jeffrey Chang wrote:
>
>> That is a nice implementation. However, Biopython already has at least
>> 3 Fasta parsers!
>> Bio/Fasta
>> Bio/SeqIO/FASTA
>> Bio/expressions/fasta
>
> There sure are. We should probably be cutting them rather than adding
> them I suppose. :-) Have you thought of deprecating Bio.Fasta since it
> is the slowest?
Yes, that will probably be done eventually. However, it does have a
nice interface that's consistent with the other parsers, e.g. for
GenBank, and it's documented. We'd be deprecating the best documented
parser for faster ones that aren't documented. (As you noticed, not
even docstrings.) It's a trade-off. The decision would be much clearer
if the other parsers had better documentation! ;)
> I know that the official path is to get people towards FormatIO but
> Bio.expressions.fasta is more than 12x slower than my
> implementation/Bio.SeqIO.FASTA (comparable as you predicted)! For one
> test:
>
> FormatIO: 3.085s/3.094s/3.154s
> LightIterator: 0.246s/0.243s/0.245s
Yikes! Your code is correct. In fairness, though, the fasta parser that
FormatIO uses is doing more work, such as trying to detect database IDs
(GenBank, EMBL, DDBJ, NBRF) in the description line. If that's
something that's not generally needed, perhaps that functionality
should be off by default, so that the parser would be faster.
Everybody likes that, right?
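To make the "off by default" idea concrete, here is a minimal sketch of a
fasta iterator where ID detection is opt-in. This is not the FormatIO or
Bio.expressions code; iter_fasta, detect_ids, parse_description and the
prefix table are hypothetical names made up for illustration, and the
prefix list is only a small sample.

```python
import io

# Hypothetical prefix table, for illustration only. It is NOT the list
# that FormatIO actually checks.
KNOWN_DB_PREFIXES = {"gb": "GenBank", "emb": "EMBL", "dbj": "DDBJ", "pir": "NBRF"}


def parse_description(description):
    """Return (database, id) for a 'prefix|id|...' description, else (None, None)."""
    fields = description.split("|")
    if len(fields) >= 2 and fields[0] in KNOWN_DB_PREFIXES:
        return KNOWN_DB_PREFIXES[fields[0]], fields[1]
    return None, None


def _make_record(description, chunks, detect_ids):
    record = {"description": description, "sequence": "".join(chunks)}
    if detect_ids:
        # The extra work is done only when the caller asks for it.
        record["database"], record["id"] = parse_description(description)
    return record


def iter_fasta(handle, detect_ids=False):
    """Yield one dict per fasta record; database ID detection is off by default."""
    description, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if description is not None:
                yield _make_record(description, chunks, detect_ids)
            description, chunks = line[1:], []
        elif description is not None and line:
            chunks.append(line)
    if description is not None:
        yield _make_record(description, chunks, detect_ids)


if __name__ == "__main__":
    data = ">gb|M73307|AGMA13GT example\nACGT\nTTAA\n>plain description\nGG\n"
    for rec in iter_fasta(io.StringIO(data), detect_ids=True):
        print(rec)
```

People who only want the sequences take the fast path, and people who
need the IDs pass detect_ids=True and pay for the extra parsing.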
Jeff