[Bioperl-l] Asking for advice on full EMBL extraction

Thu May 7 21:24:53 UTC 2009

I'm not sure whether this will help with your problem, or how it behaves in terms of memory use, but using "ordinary" Perl to split the large EMBL file into individual records and handing them to Bio::SeqIO one at a time might work.
Give this a go:

============================
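#!perl -w

use strict;

use Bio::SeqIO;
use IO::String;

# every EMBL record ends with a line containing only "//"
use constant SEP => "//\n";

# stream the compressed file through gunzip rather than unpacking it on disk
my $file = 'rel_ann_mus_01_r99.dat.gz';
open(my $fh, '-|', "gunzip -c $file") or die "Cannot run gunzip on $file: $!";

my $index = 1;

# read one record at a time; get_next_record returns undef at end of file
while (defined(my $record = get_next_record($fh))) {

	# skip any whitespace left over after the last "//"
	next unless $record =~ /\S/;

	# give Bio::SeqIO an in-memory filehandle holding just this record
	my $stringfh = IO::String->new($record);
	my $seqio = Bio::SeqIO->new( -fh => $stringfh, -format => "EMBL" );

	while ( my $seq_object = $seqio->next_seq ) {
		print "Dealing with entry: ".$index++."\t".$seq_object->id."\n";

		# show the features
		for my $feat_object ($seq_object->get_SeqFeatures) {
			print "primary tag: ", $feat_object->primary_tag, "\n";
			for my $tag ($feat_object->get_all_tags) {
				print "  tag: ", $tag, "\n";
				for my $value ($feat_object->get_tag_values($tag)) {
					print "    value: ", $value, "\n";
				}
			}
		}
	}

}

close($fh);

# return the next record, up to and including its "//" terminator
sub get_next_record{
	my($fh) = @_;
	local $/ = SEP;    # restored automatically when the sub returns
	my $record = <$fh>;
	return $record;
}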
======================================== 
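
To see the record-splitting trick on its own, without BioPerl or the big file, here is a minimal sketch. The two truncated records in $data are made up purely for illustration, and read_records is a hypothetical helper name, not part of any library. It assumes only core Perl (5.8 or later), which can open a filehandle on a scalar.

```perl
#!perl -w
use strict;

# Two made-up, heavily truncated EMBL-style records (illustrative data only)
my $data = "ID   X1; SV 1\nXX\n//\nID   X2; SV 1\nXX\n//\n";

# Core Perl can open a filehandle on a scalar, so no extra modules are needed
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";

# Hypothetical helper: read every "//"-terminated record from a filehandle
sub read_records {
	my ($fh) = @_;
	local $/ = "//\n";    # one "line" is now one whole record
	my @records;
	while (defined(my $record = <$fh>)) {
		next unless $record =~ /\S/;    # ignore trailing whitespace
		push @records, $record;
	}
	return @records;
}

my @records = read_records($fh);
print scalar(@records), " records\n";

for my $record (@records) {
	# the ID line is the first line of each record
	my ($id) = $record =~ /^ID\s+(\w+)/;
	print "record for $id\n";
}
```

This should report 2 records, one for X1 and one for X2. Because $/ is set with "local", the normal line-by-line behaviour comes back as soon as read_records returns, so the rest of a script is unaffected.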

--Russell

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of brian li
> Sent: Friday, 8 May 2009 1:00 a.m.
> To: Chris Fields
> Cc: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction
> 
> My server has 32 GB RAM.
> 
> The OS on my server is the 64-bit version of Ubuntu Server Edition 8.04
> LTS. I have also run my example code on another server with the 32-bit
> version of Ubuntu Server Edition 8.04 and 4 GB RAM. It segfaults there too.
> 
> -Brian
> 
> On Thu, May 7, 2009 at 8:07 PM, Chris Fields <cjfields at illinois.edu> wrote:
> > I noticed that Russell has 16GB RAM on his setup.  Was yours equivalent?
> >
> > chris
> >
> > On May 7, 2009, at 12:32 AM, brian li wrote:
> >
> >> Thank you very much for your offer.
> >>
> >> The director of our lab wants me to do the extraction every time a new
> >> release of EMBL is published. I can't push the task to you every time.
> >>
> >> I can offer more information about the server I run my script on if needed.
> >>
> >> -Brian
> >>
> >> On Thu, May 7, 2009 at 1:01 PM, Smithies, Russell
> >> <Russell.Smithies at agresearch.co.nz> wrote:
> >>>
> >>> Sadly, that's the same code as I ran but I had a Data::Dump in the
> >>> middle.
> >>> Versions of Perl and BioPerl are the same.
> >>> We're running RHEL 5 (kernel 2.6.18-92.1.18.el5) with 16GB RAM
> >>>
> >>> If you get a full script running on a smaller dataset, I could probably
> >>> run it on the bigger stuff and give you back tab-separated (or is that
> >>> tab\tseparated ?) data for loading into your db.
> >>>
> >>> --Russell
> >>>
> >>>> -----Original Message-----
> >>>> From: brian li [mailto:brianli.cas at gmail.com]
> >>>> Sent: Thursday, 7 May 2009 4:50 p.m.
> >>>> To: Smithies, Russell
> >>>> Cc: bioperl-l at lists.open-bio.org
> >>>> Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction
> >>>>
> >>>> Dear Russell,
> >>>>
> >>>> My example code is as follows. I have omitted the parsing step, and
> >>>> these lines alone still give me "Segmentation Fault".
> >>>>
> >>>> # Start of code
> >>>> my $seqio = Bio::SeqIO->new(-file => 'rel_ann_mus_01_r99.dat',
> >>>>                                             -format => 'EMBL');
> >>>> my $index = 1;
> >>>> while (my $seq = $seqio->next_seq)
> >>>> {
> >>>>    print "Dealing with entry: $index\n";
> >>>>    $index++;
> >>>> }
> >>>> # End
> >>>>
> >>>> The platform I run this code on:
> >>>> BioPerl 1.6.0
> >>>> Perl 5.8.8
> >>>> Ubuntu 8.04 LTS Server 64-bit version (Linux 2.6.24-23-server)
> >>>>
> >>>> I monitored memory usage while running the code above. There is
> >>>> always around 20 GB of free memory (buffers included) left, so I
> >>>> don't think the segfault can be explained by a memory shortage alone.
> >>>>
> >>>> Brian
> >>>>
> >>>>
> >>>> On Thu, May 7, 2009 at 11:32 AM, Smithies, Russell
> >>>> <Russell.Smithies at agresearch.co.nz> wrote:
> >>>>>
> >>>>> Hi Brian,
> >>>>> I hate to say it but it worked OK for me using
> >>>>> rel_ann_mus_01_r99.dat.gz and simple example Bio::SeqIO code from
> >>>>> bugzilla
> >>>>>
> >>>>> It's not using more than 1GB memory on our server and doesn't segfault.
> >>>>>
> >>>>> Send me your example code and I'll give it a go if you like.
> >>>>>
> >>>>>
> >>>>> Russell Smithies
> >>>>>
> >>>>> Bioinformatics Applications Developer
> >>>>> T +64 3 489 9085
> >>>>> E  russell.smithies at agresearch.co.nz
> >>>>>
> >>>>> Invermay  Research Centre
> >>>>> Puddle Alley,
> >>>>> Mosgiel,
> >>>>> New Zealand
> >>>>> T  +64 3 489 3809
> >>>>> F  +64 3 489 9174
> >>>>> www.agresearch.co.nz
> >>>>>
> >>>>>
> >>>
> >>
> >> _______________________________________________
> >> Bioperl-l mailing list
> >> Bioperl-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
> >