[Bioperl-l] Re: [Bioclusters] BioPerl and memory handling

Ian Korf iankorf at mac.com
Tue Nov 30 02:42:23 EST 2004


I tried your example (fixing the syntax error of $sel->{'foo'} to
$self->{'foo'}) and I find that I get back half the memory after undef,
which is exactly the behavior I described. This could be due to
differences in Perl versions. Which version of Perl are you using?
perl -v gives me:
This is perl, v5.8.1-RC3 built for darwin-thread-multi-2level
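
For the record, what I ran boils down to the code below (with the typo
fixed). The sleep at the end is my addition, only there so the resident
size can be watched in top while the script sits idle:

    use strict;

    package Test;

    sub new {
        my $class = shift;
        my $self  = {};
        bless $self, $class;
        $self->{'foo'} = 'N' x 100000000;   # big string inside the object
        return $self;
    }

    package main;

    my $ob = Test->new();   # footprint peaks here, as in your test
    undef $ob;              # about half of that comes back on my perl
    sleep 30;               # added only so you can watch the process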

On Nov 29, 2004, at 8:12 PM, Malay wrote:

> Thanks, Ian, for your mail. But you have missed a major point of the
> original discussion: what happens to objects? So I did the same test
> that you did, using an object. Here is the result.
>
> use strict;
> package Test;
>
> sub new {
>    my $class = shift;
>    my $self = {};
>    bless $self, $class;
>    $sel->{'foo'} = 'N' x 100000000;
>    return $self;
> }
>
> package main;
>
> my $ob = Test->new();    #uses 197 MB as you said.
>
> undef $ob;  ## still uses 197 MB ???!!!!
>
> This was the original point. Perl never releases the memory from the
> initial object creation. In fact, try doing this in whatever way
> possible, reusing references or undeffing it: the memory usage will
> never go down below 197 MB for the entire execution of the
> program.
>
> So I humbly differ on the wisdom of any elaborate in-memory object
> hierarchy in Perl. The language is not meant for that. But I am
> nobody; stalwarts will differ in opinion.
>
> -Malay
>
> Ian Korf wrote:
>
>> After a recent conversation about memory in Perl, I decided to do 
>> some actual experiments. Here's the email I composed on the subject.
>>
>>
>> I looked into the Perl memory issue. It's true that if you allocate a
>> huge amount of memory, Perl doesn't like to give it back. But
>> it's not as bad a situation as you might think. Let's say you do
>> something like
>>
>>     $FOO = 'N' x 100000000;
>>
>> That will allocate a chunk of about 192 MB on my system. It doesn't
>> matter if this is a package variable or a lexical.
>>
>>     our $FOO = 'N' x 100000000; # 192 MB
>>     my  $FOO = 'N' x 100000000; # 192 MB
>>
>> If you put this in a subroutine
>>
>>     sub foo {my $FOO = 'N' x 100000000}
>>
>> and you call this a bunch of times
>>
>>     foo(); foo(); foo(); foo(); foo(); foo(); foo();
>>
>> the memory footprint stays at 192 MB. So Perl's garbage collection
>> works just fine. Perl doesn't let go of the memory it has taken from
>> the OS, but it is happy to reassign the memory it has reserved.
>>
>> Here's something odd. The following labeled block looks like it 
>> should use no memory.
>>
>>     BLOCK: {
>>         my  $FOO = 'N' x 100000000;
>>     }
>>
>> The weird thing is that after executing the block, the memory
>> footprint is still 192 MB, as if it hadn't been garbage collected.
>>
>> Now look at this:
>>
>>     my $foo = 'X' x 100000000;
>>     undef $foo;
>>
>> This has a memory footprint of 96 MB. After some more
>> experimentation, I have come up with the following interpretation of
>> memory allocation and garbage collection in Perl. Perl will reuse
>> memory for a variable of a given name (whether package or lexical in
>> scope), so there is no fear of memory leaks in loops, for example. But
>> each differently named variable will retain its own minimum memory.
>> That minimum is the size of the largest chunk ever allocated to
>> that variable, or half that amount if other variables have taken some
>> of that space already. You can get any variable to automatically give
>> up half its memory with undef, but this takes a little more CPU time.
>> Here's some test code that shows this behavior.
>>
>> sub foo {my $FOO = 'N' x 100000000}
>> for (my $i = 0; $i < 50; $i++) {foo()} # 29.420u 1.040s
>>
>> sub bar {my $BAR = 'N' x 100000000; undef $BAR}
>> for (my $i = 0; $i < 50; $i++) {bar()} # 26.880u 21.220s
>>
>> The increase from 1 sec to 21 sec system CPU time is all the extra 
>> memory allocation and freeing associated with the undef statement. 
>> Why the user time is less in the undef example is a mystery to me.
>>
>> OK, to make a hideously long story short, use undef to save memory 
>> and use the same variable name over and over if you can.
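>>
>> Spelled out as a plain loop instead of a sub, that advice is just
>> this sort of thing (a sketch, using the same allocation as above):
>>
>>     my $chunk;                        # one name, reused every pass
>>     for my $i (1 .. 50) {
>>         $chunk = 'N' x 100000000;     # same allocation as foo()/bar()
>>         # ... do something with $chunk here ...
>>         undef $chunk;                 # trades a little CPU for memory
>>     }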
>>
>> ---
>>
>> But this email thread has turned to BPlite, of which I am the original
>> author. BPlite is designed to parse a stream and reads only a minimal
>> amount of information at a time. The disadvantage of this is that if
>> you want to know something about statistics, you can't get it until
>> the end of the report (the original BPlite ignored statistics
>> entirely). I like the new SearchIO interface better than BPlite, but
>> for my own uses I generally work from a table format and
>> don't really use a BLAST parser very often.
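>>
>> For what it's worth, the stream-parsing style with BPlite looks
>> roughly like the sketch below, with the equivalent SearchIO loop for
>> comparison (this is from memory, so check the POD for the current
>> method names; 'report.blast' is just a placeholder):
>>
>>     use Bio::Tools::BPlite;
>>     use Bio::SearchIO;
>>
>>     # BPlite: pulls one subject / HSP at a time off the stream
>>     my $report = Bio::Tools::BPlite->new(-fh => \*STDIN);
>>     while (my $sbjct = $report->nextSbjct) {
>>         while (my $hsp = $sbjct->nextHSP) {
>>             print join("\t", $report->query, $sbjct->name,
>>                        $hsp->score), "\n";
>>         }
>>     }
>>
>>     # SearchIO: the equivalent loop in the newer interface
>>     my $in = Bio::SearchIO->new(-format => 'blast',
>>                                 -file   => 'report.blast');
>>     while (my $result = $in->next_result) {
>>         while (my $hit = $result->next_hit) {
>>             while (my $hsp = $hit->next_hsp) {
>>                 print join("\t", $result->query_name, $hit->name,
>>                            $hsp->evalue), "\n";
>>             }
>>         }
>>     }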
>>
>> -Ian
>>
>> On Nov 29, 2004, at 3:03 PM, Mike Cariaso wrote:
>>
>>> This message is being cross posted from bioclusters to
>>> bioperl. I'd appreciate a clarification from anyone in
>>> bioperl who can speak more authoritatively than my
>>> semi-speculation.
>>>
>>>
>>> Perl does have a garbage collector. It is not wildly
>>> sophisticated. As you've suggested, it uses simple
>>> reference counting. This means that circular
>>> references will cause memory to be held until program
>>> termination.
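>>>
>>> If you do end up with a cycle, the usual escape hatch
>>> is weaken from Scalar::Util; a minimal sketch:
>>>
>>>     use Scalar::Util qw(weaken);
>>>
>>>     my $parent = {};
>>>     my $child  = { parent => $parent };
>>>     $parent->{child} = $child;    # cycle: parent <-> child
>>>
>>>     # with plain references neither hash is ever freed;
>>>     # weakening one side of the cycle lets reference
>>>     # counting reclaim both once the lexicals go away
>>>     weaken($child->{parent});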
>>>
>>> However I think you are overstating the inefficiency
>>> in the system. While the perl GC *may* not release
>>> memory to the system, it does at least allow memory to
>>> be reused within the process.
>>>
>>> If the system instead behaved as you describe, I think
>>> perl would hemorrhage memory and would be unsuitable
>>> for any long running processes.
>>>
>>> However, I can say with considerable certainty that
>>> BPLite is able to handle BLAST reports which
>>> cause SearchIO to thrash. I've attributed this to
>>> BPLite being a true stream processor, while SearchIO
>>> seems to slurp the whole file and object hierarchy
>>> into memory.
>>>
>>> I know that SearchIO is the preferred BLAST parser, but
>>> it seems that BPLite is not quite dead, for the
>>> reasons above. If this is in fact the unique benefit of
>>> BPLite, perhaps the documentation should be clearer
>>> about this, as I suspect I'm not the only person to
>>> have had to re-engineer a substantial piece of code to
>>> adjust between their different models. Had I known of
>>> this difference early on, I would have chosen BPLite.
>>>
>>> So, bioperlers (especially Jason Stajich), can you shed
>>> any light on this vestigial bioperl organ?
>>>
>>>
>>>
>>> --- Malay <mbasu at mail.nih.gov> wrote:
>>>
>>>> Michael Cariaso wrote:
>>>>
>>>>> Michael Maibaum wrote:
>>>>>
>>>>>> On 10 Nov 2004, at 18:25, Al Tucker wrote:
>>>>>>
>>>>>>> Hi everybody.
>>>>>>>
>>>>>>> We're new to the Inquiry Xserve scientific cluster and trying
>>>>>>> to iron out a few things.
>>>>>>>
>>>>>>> One thing we seem to be coming up against is an out-of-memory
>>>>>>> error when getting large sequence analysis results (5,000 seq -
>>>>>>> at least - and above) back from BTblastall. The problem seems
>>>>>>> to be with BioPerl.
>>>>>>>
>>>>>>> Might anyone here know if BioPerl knows enough not to try to
>>>>>>> access more than 4 GB of RAM in a single process (an OS X
>>>>>>> limit)? I'm told Blastall and BTblastall are, and will chunk
>>>>>>> problems accordingly, but we're not certain if BioPerl is when
>>>>>>> called to merge large BLAST results back together. It's the
>>>>>>> default version 1.2.3 that's supplied, btw, and OS X 10.3.5
>>>>>>> with all current updates just short of the latest 10.3.6
>>>>>>> update.
>>>>>>
>>>>>> BioPerl tries to slurp up the entire result set from a BLAST
>>>>>> query, builds objects for each little bit of the result set, and
>>>>>> uses lots of memory. It doesn't have anything smart at all about
>>>>>> breaking up the job within the result set, afaik.
>>>>
>>>> This is not really true. The SearchIO module, as far as I know,
>>>> works on a stream.
>>>>
>>>>>> I ended up stripping out results that hit a certain threshold
>>>>>> size to run on a different, large-memory Opteron/Linux box, and
>>>>>> I'm experimenting with replacing BioPerl with BioPython etc.
>>>>>>
>>>>>> Michael
>>>>>
>>>>> You may find that the BPLite parser works better when dealing
>>>>> with large BLAST result files. It's not as clean or as well
>>>>> maintained, but it does the job nicely for my current needs,
>>>>> which overloaded the usual parser.
>>>>
>>>> There is basically no difference between BPLite and other BLAST
>>>> parser interfaces in Bioperl.
>>>>
>>>> The problem lies in the core of Perl itself. Perl does not release
>>>> memory to the system even after the reference count of an object
>>>> created in memory goes to 0, unless the program is actually over.
>>>> Perl's object system is highly inefficient at handling large
>>>> numbers of objects created in memory.
>>>>
>>>> -Malay
>>>
>>> =====
>>> Mike Cariaso
>>>
>>
>
>
> _______________________________________________
> Bioclusters maillist  -  Bioclusters at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters
>



More information about the Bioperl-l mailing list