[Bioperl-l] malloc errors while using Bio::SeqIO?
Rutger Vos
rvos at interchange.ubc.ca
Fri Aug 8 23:59:20 UTC 2008
Hi,
while going through a large genbank file
(ftp://ftp.ncbi.nlm.nih.gov/genbank/gbpri21.seq.gz) I ran into malloc
errors. Just for the record (I doubt this does anyone any good), I
got:
perl(391) malloc: *** vm_allocate(size=8421376) failed (error code=3)
perl(391) malloc: *** error: can't allocate region
perl(391) malloc: *** set a breakpoint in szone_error to debug
Out of memory!
What I was trying to do is go through the file, and only write out
those seq objects that aren't human, and that have CDS features, i.e.:
################################################
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;
my $dir = shift @ARGV; # the directory with *.gz files
my $out = shift @ARGV; # the directory to write to...
mkdir $out if not -d $out; # ...which may need to be created
opendir my $dirhandle, $dir or die $!;
for my $archive ( readdir $dirhandle ) {
next if $archive !~ /\.gz$/;
my $file = $archive;
$file =~ s/\.gz$//;
# external call to the gunzip utility,
# such that we keep the archive
system( "gunzip -c \"${dir}/${archive}\" > \"${dir}/${file}\"" );
# object that parses genbank files,
# returns Bio::Seq objects
my $reader = Bio::SeqIO->new(
'-format' => 'genbank',
'-file' => "${dir}/${file}"
);
# object that receives Bio::Seq objects,
# writes genbank files
my $writer = Bio::SeqIO->new(
'-format' => 'genbank',
'-file' => ">${out}/${file}",
);
while ( my $seq = $reader->next_seq ) {
my $name = $seq->species->binomial;
if ( $name ne 'Homo sapiens' ) {
# search for coding sequences among the features
my $HasCDS = 0;
FEATURE: for my $f ( $seq->get_SeqFeatures ) {
if ( $f->primary_tag eq 'CDS' ) {
$HasCDS++;
last FEATURE;
}
}
# write the sequence to file
if ( $HasCDS ) {
$writer->write_seq( $seq );
}
}
}
# delete the extracted, unfiltered file
unlink "${dir}/${file}";
}
################################################
Okay, so it runs out of memory. Can I do something to fix that? Should
I flush on either of the I/O objects after each $seq? Could there be
memory leaks in the Bio::Seq objects? Should I $seq->DESTROY them
explicitly or something like that?
Thanks,
Rutger
--
Dr. Rutger A. Vos
Department of zoology
University of British Columbia
http://www.nexml.org
http://rutgervos.blogspot.com
More information about the Bioperl-l
mailing list