[Bioperl-l] malloc errors while using Bio::SeqIO?

Rutger Vos rvos at interchange.ubc.ca
Fri Aug 8 19:59:20 EDT 2008


Hi,

while going through a large genbank file
(ftp://ftp.ncbi.nlm.nih.gov/genbank/gbpri21.seq.gz) I ran into malloc
errors. Just for the record (I doubt this does anyone any good), I
got:

perl(391) malloc: *** vm_allocate(size=8421376) failed (error code=3)
perl(391) malloc: *** error: can't allocate region
perl(391) malloc: *** set a breakpoint in szone_error to debug
Out of memory!

What I was trying to do is go through the file, and only write out
those seq objects that aren't human, and that have CDS features, i.e.:

################################################

#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

my $dir = shift @ARGV; # the directory with *.gz files
my $out = shift @ARGV; # the directory to write to...
mkdir $out if not -d $out; # ...which may need to be created

opendir my $dirhandle, $dir or die $!;
for my $archive ( readdir $dirhandle ) {
    next if $archive !~ /\.gz$/;
    my $file = $archive;
    $file =~ s/\.gz$//;

    # external call to the gunzip utility,
    # such that we keep the archive
    system( "gunzip -c \"${dir}/${archive}\" > \"${dir}/${file}\"" );

    # object that parses genbank files,
    # returns Bio::Seq objects
    my $reader = Bio::SeqIO->new(
        '-format' => 'genbank',
        '-file'   => "${dir}/${file}"
    );

    # object that receives Bio::Seq objects,
    # writes genbank files
    my $writer = Bio::SeqIO->new(
        '-format' => 'genbank',
        '-file'   => ">${out}/${file}",
    );
    while ( my $seq = $reader->next_seq ) {
        my $name = $seq->species->binomial;
        if ( $name ne 'Homo sapiens' ) {

            # search for coding sequences among the features
            my $HasCDS = 0;
            FEATURE: for my $f ( $seq->get_SeqFeatures ) {
                if ( $f->primary_tag eq 'CDS' ) {
                    $HasCDS++;
                    last FEATURE;
                }
            }

            # write the sequence to file
            if ( $HasCDS ) {
                $writer->write_seq( $seq );
            }
        }
    }

    # delete the extracted, unfiltered file
    unlink "${dir}/${file}";
}

################################################

Okay, so it runs out of memory. Can I do something to fix that? Should
I flush on either of the I/O objects after each $seq? Could there be
memory leaks in the Bio::Seq objects? Should I $seq->DESTROY them
explicitly or something like that?

Thanks,

Rutger

-- 
Dr. Rutger A. Vos
Department of zoology
University of British Columbia
http://www.nexml.org
http://rutgervos.blogspot.com



More information about the Bioperl-l mailing list