[BioPython] Our first bug!

Andrew Dalke dalke@bioreason.com
Sun, 05 Sep 1999 00:45:51 -0600


Bradley Marshall <bradbioperl@yahoo.com> said:

> I found a bug!!

>> PERL is very good for parsing large data

> I assume you mean PYTHON is very good at.....

Since I wrote it, I'll answer :)

I based this statement on a claim by Tim Peters in comp.lang.python
http://www.deja.com/[ST_rn=ps]/getdoc.xp?AN=505288860&fmt=text
> Line-at-a-time input on my platform is about 3 times faster in
> Perl than Python; indeed, it's faster in Perl than in C!  It's not
> loop mechanisms or even refcounting "to blame" here (btw, Perl
> does refcounting too ...).  It's that Perl breaks into the stdio
> abstraction, going under the covers in platform + compiler specific
> ways, peeking and poking C stdio FILE structs directly.
> 
> Very clever, but a lot of work.  Most fgets implementations suck,
> and Perl deserves all the credit for not settling for that.  I
> doubt Python will do it too, not as a matter of principle, but
> because nobody capable of it will make time to do it.

I also ran some timing comparisons of Prosite patterns against
randomly generated strings and found the Perl5 regex engine was
about twice as fast as the then-new Python re module.  Interestingly,
Perl4's was about twice as fast as Perl5's!
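
A rough sketch of how the Python side of such a timing run could be
set up.  The pattern below is only written in the style of a Prosite
signature and hand-translated to re syntax; it is an assumption for
illustration, not one of the patterns from the original test, and the
helper names are hypothetical.

```python
import random
import re
import timeit

# Assumed example pattern, shaped like a Prosite signature such as
# C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H and translated by hand.
ZINC_FINGER_LIKE = re.compile(r"C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H")

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_protein(length, rng):
    """Build one random sequence over the 20 standard amino acids."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def count_matching(pattern, sequences):
    """Return how many sequences contain at least one match."""
    return sum(1 for seq in sequences if pattern.search(seq))


def time_search(pattern, sequences, repeat=3):
    """Return the best wall-clock time (seconds) over several runs."""
    timings = timeit.repeat(
        lambda: count_matching(pattern, sequences),
        number=1,
        repeat=repeat,
    )
    return min(timings)
```

Timing the same pattern and the same strings in each language is what
makes the numbers comparable; the absolute values depend heavily on
the machine and interpreter version.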

So I think that the Perl implementation is better than Python's
if you are interested in fast parsing of large data sets, and
if you don't need many different modules working together.


Still, I should clarify that section somewhat.  It's meant to say
that there are very good reasons for the historical use of Perl
in bioinformatics, and that if Python is to make inroads then it
must address those advantages.

For example, most files in this field are parsed line by line,
and most of my code uses .readline().  According to Guido
http://www.deja.com/[ST_rn=ps]/getdoc.xp?AN=400462944&fmt=text
.readlines() is much faster, so that may be something to point out.
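
A minimal sketch of the two reading styles, for comparison.  The
function names, the sample record format, and the 64 KB size hint are
assumptions made for the example; the relative speed will vary by
platform and Python version, so it is worth timing on real data.

```python
import os
import tempfile


def count_with_readline(path):
    """Count lines by calling .readline() once per line."""
    count = 0
    with open(path) as infile:
        while True:
            line = infile.readline()
            if not line:  # an empty string signals end of file
                break
            count += 1
    return count


def count_with_readlines(path, sizehint=64 * 1024):
    """Count lines by fetching batches of lines with .readlines()."""
    count = 0
    with open(path) as infile:
        while True:
            lines = infile.readlines(sizehint)
            if not lines:  # an empty list signals end of file
                break
            count += len(lines)
    return count


def make_sample_file(num_lines=1000):
    """Write a small line-oriented file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as out:
        for i in range(num_lines):
            out.write("ID   RECORD_%d\n" % i)
    return path
```

Passing a size hint keeps memory bounded on large files while still
cutting the number of calls made from Python code.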

I've also wondered if Andrew Kuchling's mmapfile module would help
with file read performance.
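
A sketch of what memory-mapped line reading might look like.  It uses
the standard-library mmap module as a stand-in, on the assumption that
it behaves similarly; the mmapfile module's own interface may differ,
and the function name here is hypothetical.

```python
import mmap
import os
import tempfile


def count_lines_mmap(path):
    """Count lines by reading them from a read-only memory map."""
    count = 0
    with open(path, "rb") as infile:
        # Length 0 maps the whole file; this fails on an empty file.
        with mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            while True:
                line = mapped.readline()
                if not line:  # empty bytes signals end of the map
                    break
                count += 1
    return count
```

Whether this beats ordinary buffered reads is an open question that
only a timing run on representative files can answer.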


						Andrew Dalke
						dalke@bioreason.com