[BioPython] blast parser slows down under python2.3

Andrew Nunberg anunberg at oriongenomics.com
Wed Sep 3 10:34:15 EDT 2003


I was wondering whether you could patch the stable branch.  The reason 
is that some of what I do may turn into production scripts, and I would 
rather not have them tied to bleeding-edge code.
Thanks

On Tuesday, September 2, 2003, at 12:45 PM, Jeffrey Chang wrote:

> Yep, it's patched against the CVS.  There is no tag -- just the most 
> recent version.  We've only been tagging the releases.
>
> Jeff
>
>
> On Tuesday, September 2, 2003, at 07:30  AM, Andrew Nunberg wrote:
>
>> I take it that you are applying these patches in CVS?
>> I have only downloaded the tarball for BioPython.  Would you suggest I 
>> check it out from CVS, and if so, what tag should I use?
>>
>> Andy
>>
>> On Sunday, August 31, 2003, at 05:44 PM, Jeffrey Chang wrote:
>>
>>> I have applied the patch.  Thanks very much!
>>>
>>> The regression tests now work again.  For the tests that print out 
>>> booleans, I am now explicitly printing out 0 or 1, for backwards 
>>> compatibility.
>>>
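The boolean issue described above can be illustrated with a minimal sketch. It assumes the regression tests print flag values directly; `format_flag` is a hypothetical helper and not part of Biopython. Converting to int before printing keeps the output at 0 or 1 whether or not the interpreter has a separate boolean type.

```python
def format_flag(value):
    """Return "1" or "0" for any truthy or falsy value.

    Hypothetical helper: printing int(...) avoids the "True"/"False"
    text that a boolean produces when printed directly.
    """
    return str(int(bool(value)))


if __name__ == "__main__":
    print(format_flag(3 > 2))
    print(format_flag(3 > 5))
```
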
>>> I have also gone through and changed some more instances of apply to 
>>> the new call syntax.  Please let me know if there appear to be any 
>>> problems.
>>>
>>> Jeff
>>>
>>>
>>>
>>> On Friday, August 29, 2003, at 12:07  PM, Jeffrey Chang wrote:
>>>
>>>> Hey, thanks very much for the note, and the patch (mailed 
>>>> separately).
>>>>
>>>> Python 2.3 also seems to have broken some of the regression tests.  
>>>> The boolean type gets printed out as "True" and "False" rather than 
>>>> 1 or 0 as before.
>>>>
>>>> I'll take a look at these over the weekend.
>>>>
>>>> Jeff
>>>>
>>>>
>>>>
>>>>
>>>> On Friday, August 29, 2003, at 10:24  AM, Peter Slickers wrote:
>>>>
>>>>> The biopython blast parser runs at only half of the speed
>>>>> seen with python2.2 when executed with python2.3.
>>>>>
>>>>>
>>>>> This effect is monitored best with a huge blast output file.
>>>>> My setup for measuring the performance is quite simple.
>>>>> I have used a small python script which just parses a blast
>>>>> file and stores the content in memory. I have started this
>>>>> script with the time command, and the python interpreter
>>>>> was explicitly specified as either python2.2 or python2.3.
>>>>> Each run was repeated four times.
>>>>>
>>>>> --------------------------------------------------------------
>>>>> command                                    CPU time in sec
>>>>> --------------------------------------------------------------
>>>>> time python2.2 parser.py blastout.txt      5.11,3.58,3.98,4.15
>>>>> time python2.3 parser.py blastout.txt      8.85,7.97,7.30,7.12
>>>>> --------------------------------------------------------------
>>>>> (with biopython 1.21)
>>>>>
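For readers who would rather measure from inside Python than with the shell's time command, the repeated-run measurement above might look roughly like the sketch below. `cpu_times` and the workload are hypothetical stand-ins; the actual parser.py script is not shown in this thread. `time.process_time()` is a modern-Python call and reports CPU time, not wall-clock time.

```python
import time


def cpu_times(func, runs=4):
    """Call func() `runs` times and return the CPU seconds used by each call."""
    results = []
    for _ in range(runs):
        start = time.process_time()
        func()
        results.append(time.process_time() - start)
    return results


if __name__ == "__main__":
    # Stand-in workload; a real measurement would parse a large BLAST report.
    timings = cpu_times(lambda: sum(range(100000)))
    print(", ".join("%.2f" % t for t in timings))
```
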
>>>>> I stumbled upon the cause when running the python profiler
>>>>> on the blast parser. It turned out that more
>>>>> than half of the CPU time was spent in the warnings module,
>>>>> which is part of the python standard library
>>>>> (/usr/local/lib/python2.3/warnings.py).
>>>>>
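A profiling run of this kind can be sketched as follows. This is only an illustration under stated assumptions: it uses the cProfile and pstats modules of current Python, not the profiler available in 2003, and `parse_lines` is a hypothetical stand-in for the real parser. Sorting by cumulative time is what makes an unexpectedly expensive module stand out in the report.

```python
import cProfile
import io
import pstats


def parse_lines(text):
    """Hypothetical stand-in for a parser: read the input line by line."""
    handle = io.StringIO(text)
    count = 0
    while handle.readline():
        count += 1
    return count


def profile_report(text, top=5):
    """Profile parse_lines(text); return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(parse_lines, text)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, out.getvalue()


if __name__ == "__main__":
    _, report = profile_report("line\n" * 1000)
    print(report)
```
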
>>>>> Further digging revealed that the function warn() is called
>>>>> each time the readline() method from class UndoHandle is
>>>>> executed (file site-packages/Bio/File.py).
>>>>>
>>>>> Within the readline() method the python built-in function
>>>>> apply() is heavily used. But as of python2.3 the use of
>>>>> apply() is deprecated, and therefore the warn() function is called
>>>>> by the interpreter each time the apply() function is used.
>>>>>
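The mechanism described above, a deprecated function passing through the warnings machinery on every call, can be sketched like this. `legacy_apply` is a hypothetical stand-in written for modern Python; the warning category and message are assumptions, not taken from the Python 2.3 source. Recording the warnings with an "always" filter makes the per-call work visible: one warning is raised per call, even though with default filters nothing might ever be printed.

```python
import warnings


def legacy_apply(func, args=(), keywds=None):
    """Hypothetical stand-in for a deprecated apply()-style function."""
    # The warning is raised on every call, so every call pays for it.
    warnings.warn(
        "use func(*args, **keywds) instead",
        PendingDeprecationWarning,  # assumed category, for illustration only
        stacklevel=2,
    )
    return func(*args, **(keywds or {}))


def count_warn_calls(n):
    """Call legacy_apply n times and count the warnings that were raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        for _ in range(n):
            legacy_apply(len, ("abc",))
    return len(caught)
```
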
>>>>>
>>>>> According to the python2.3 manual, the apply() function should be
>>>>> substituted by the "extended call syntax" (which was introduced
>>>>> in python2.0).
>>>>>
>>>>> To test my hypothesis that the performance loss is caused by
>>>>> the apply() function, I took the standard genetic approach
>>>>> of knock-out and complementation: I created a modified version
>>>>> of Bio/File.py where all occurrences of apply() were replaced
>>>>> by "extended call syntax". After that, I ran the benchmark again:
>>>>>
>>>>> --------------------------------------------------------------
>>>>> command                                    CPU time in sec
>>>>> --------------------------------------------------------------
>>>>> time python2.2 parser.py blastout.txt      4.11,3.53,4.07,4.03
>>>>> time python2.3 parser.py blastout.txt      4.94,4.96,4.54,5.24
>>>>> --------------------------------------------------------------
>>>>> (with modified Bio/File.py)
>>>>>
>>>>>
>>>>> The numbers show that my patch restores most of the speed
>>>>> of the blast parser under python2.3.
>>>>>
>>>>>
>>>>>
>>>>> Conclusion:  the "newer, better, faster" dogma is not true with python.
>>>>>
>>>>>
>>>>> Here is an example of what the patch looks like:
>>>>>
>>>>>   old:     line = apply(self._handle.readline, args, keywds)
>>>>>   new:     line = self._handle.readline(*args,**keywds)
>>>>>
>>>>>
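The "new" form of the patch can be tried out in a small self-contained sketch. `ForwardingHandle` is a hypothetical wrapper written for illustration only; it is not the actual UndoHandle code from Bio/File.py. The built-in apply() no longer exists in Python 3, so only the extended call syntax is shown.

```python
import io


class ForwardingHandle:
    """Hypothetical file wrapper that forwards readline() to a real handle."""

    def __init__(self, handle):
        self._handle = handle
        self._saved = []  # lines pushed back by saveline()

    def readline(self, *args, **keywds):
        if self._saved:
            return self._saved.pop(0)
        # Extended call syntax: pass all arguments through unchanged.
        return self._handle.readline(*args, **keywds)

    def saveline(self, line):
        """Push a line back so the next readline() returns it again."""
        if line:
            self._saved.insert(0, line)


handle = ForwardingHandle(io.StringIO("first line\nsecond line\n"))
first = handle.readline()
handle.saveline(first)      # undo the read
again = handle.readline()   # same line comes back
second = handle.readline()
```

Because `*args` and `**keywds` are passed straight through, the wrapper accepts whatever arguments the underlying readline() accepts, without going through a separate function call.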
>>>>> -- 
>>>>>
>>>>>
>>>>> Peter
>>>>> -------------------------------------------------------------------
>>>>> Peter Slickers                             piet at clondiag.com
>>>>> Clondiag Chip Technologies                 http://www.clondiag.com/
>>>>> Löbstedter Str. 105
>>>>> 07749 Jena
>>>>> Germany
>>>>>
>>>>> Phone:  03641/5947-65                      Fax:  03641/5947-20
>>>>> -------------------------------------------------------------------
>>>>>
>>>>> _______________________________________________
>>>>> BioPython mailing list  -  BioPython at biopython.org
>>>>> http://biopython.org/mailman/listinfo/biopython
>>>>
>>>
>> ---------------------------------------------------
>> Andrew Nunberg Ph.D
>> Bioinfomagician
>> Orion Genomics
>> 4041 Forest Park
>> St Louis, MO
>> 314-615-6989
>> anunberg at oriongenomics.com
>> www.oriongenomics.com
>>
>
>
---------------------------------------------------
Andrew Nunberg Ph.D
Bioinfomagician
Orion Genomics
4041 Forest Park
St Louis, MO
314-615-6989
anunberg at oriongenomics.com
www.oriongenomics.com



