[Bioperl-l] Asking for advice on full EMBL extraction

Sun May 10 22:43:48 EDT 2009

Thanks for your advice.

I agree with you that some features lines are causing the segfault.
But I don't know which ones. I am afraid splitting big files by seqs
could not help as I don't know which chunk has a mean feature :) Can't
try some code and run for every file. I think for now I have to just
skip the features and make the extraction run first.

--Brian

On Mon, May 11, 2009 at 4:49 AM, Smithies, Russell
<Russell.Smithies at agresearch.co.nz> wrote:
> How about splitting the big file into smaller chunks and processing one sequence at a time?
> It could be one specific feature line that's causing the segfault and nothing to do with file size.
> You should be able to split the file with awk as well (I like awk :-)
>
> zcat rel_ann_mus_01_r99.dat.gz | awk 'BEGIN{RS="//";OFS="\n"}{$1=$1; print > "chunk"NR}'
>
> --Russell
>
>> -----Original Message-----
>> From: brian li [mailto:brianli.cas at gmail.com]
>> Sent: Saturday, 9 May 2009 2:49 a.m.
>> To: Smithies, Russell
>> Cc: bioperl-l at lists.open-bio.org; Jason Stajich; Chris Fields
>> Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction
>>
>> open $fh, "gunzip -c rel_ann_mus_01_r99.dat.gz | awk
>> '!/^FT|^CO/{print}' |" works.
>> open $fh, "gunzip -c rel_ann_mus_01_r99.dat.gz | awk '!/^SQ|^
>> /{print}' |" segfaults.
>>
>> So it seems the features are causing problems. Although I still don't
>> know how that hurts my os to pop a segfault, my extraction can move on
>> again. Maybe I can find a clue when I know more about my os's memory
>> management strategy.
>>
>> Really appreciate all your help.
>>
>> -Brian
>>
>> On Fri, May 8, 2009 at 8:03 AM, Smithies, Russell
>> <Russell.Smithies at agresearch.co.nz> wrote:
>> > I think the problem here though is the size of the sequences rather than too
>> > many features.
>> >
>> > If one was inclined to bodge/hack and didn't care about sequence, I guess
>> > you could filter them out with awk so Bio::SeqIO doesn't have to create the
>> > Bio::PrimarySeq J
>> >
>> > Probably breaks the EMBL file spec .
>> >
>> > Eg.
>> >
>> > open( $fh, "gunzip -c rel_ann_mus_01_r99.dat.gz | awk '!/^SQ|^ /{print}' |"
>> > ) or die;
>> >
>> >
>> >
>> >
>> >
>> > --Russell
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > From: Jason Stajich [mailto:jason.stajich at gmail.com] On Behalf Of Jason
>> > Stajich
>> > Sent: Friday, 8 May 2009 11:25 a.m.
>> > To: Smithies, Russell
>> > Cc: 'brian li'; 'Chris Fields'; 'bioperl-l at lists.open-bio.org'
>> > Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction
>> >
>> >
>> >
>> > It parses from a stream or file, one sequence at a time so it only reads a
>> > single sequence out at a time, but it does have to parse that whole sequence
>> > record which is where feature rich sequences might be causing problems.
>> >
>> >
>> >
>> > I think per your other mention of Tie::File - the whole file is not going
>> > into memory so that is not the problem, it is the creation of many objects
>> > that it does as it parses the sequence that is likely the problem.  It will
>> > read up to the first "//" from that Tie::File anyways, that becomes an
>> > entire string which is then parsed to pull out the relevant features so you
>> > don't gain anything with Tie::File -- what would be the way to solve it is
>> > if the objects could be created and reside in a DB on disk rather than
>> > in-memory.  I'd really enjoy seeing more indexed and hashed data to objects
>> > stored on disk when mem requirements are such so that very large datasets
>> > can be handled more nimbly.
>> >
>> >
>> >
>> > I think there have been several attempts to simplify, but it basically means
>> > a dedicated developer to really overhaul or map to a new system.  What we've
>> > tried to build is a decent API so a new implementation can be done without
>> > affecting the 'next_seq' and 'write_seq' API.
>> >
>> >
>> >
>> > Non-withstanding the seemed API confusion caused by _ancient_ decisions on
>> > giving function names of Bio::SeqFeatureI 'seq' and Bio::PrimarySeq 'seq'
>> > which return different types -- don't forget that Lincoln's Bio::DB::Fasta
>> > uses the 'seq' method to return a sequence as a string as well so major API
>> > changes in general here will create in all likelihood a big split between
>> > the branches that will make any new Bioperl not match up well with existing
>> > scripts or libraries that use it - hence the reason for no "great
>> > realigning" to a completely well-planned out API rather than the organically
>> > grown whims of several generations of devs.  I say this in jest a bit - I do
>> > want to see changes, but I think it really will have to be called something
>> > else besides BioPerl to avoid confusion and the fact that a lot of things
>> > will break that depend on the current APIs.  BioPerl2 or something
>> > indicating a Perl6 association.
>> >
>> >
>> >
>> > -jason
>> >
>> > On May 7, 2009, at 3:05 PM, Smithies, Russell wrote:
>> >
>> > OK, I misunderstood, I thought the entire file loaded was loaded into memory
>> > first then each sequence was extracted from there.
>> > I hoped splitting into 588 individual sequences might help.
>> >
>> > --Russell
>> >
>> > From: Jason Stajich [mailto:jason.stajich at gmail.com] On Behalf Of Jason
>> > Stajich
>> > Sent: Friday, 8 May 2009 9:55 a.m.
>> > To: Smithies, Russell
>> > Cc: 'brian li'; 'Chris Fields'; 'bioperl-l at lists.open-bio.org'
>> > Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction
>> >
>> > Russell -
>> >
>> > I am not sure how that will help as only 1 sequence is parsed at a time by
>> > SeqIO parsers and they use the "//" delimiter.
>> >
>> > If the equivalent data exists in genbank format at NCBI I think _that_
>> >  module (Bio::SeqIO::genbank) has the ability to ignore
>> > annotations/features.  Really we have to re-work the whole thing to be more
>> > lightweight and lazy-parse.
>> >
>> > -jason
>> > On May 7, 2009, at 2:24 PM, Smithies, Russell wrote:
>> >
>> >
>> > I'm not sure if this will help with your problem or how it deals with memory
>> > management but using "ordinary" Perl to split the large EMBL file might
>> > work.
>> > Give this a go:
>> >
>> > ============================
>> > #!perl -w
>> >
>> > use Bio::SeqIO;
>> > use IO::String;
>> >
>> > use constant SEP => "//\n";
>> >
>> > open($fh, "gunzip -c rel_ann_mus_01_r99.dat.gz |") or die;
>> >
>> > my $index = 1;
>> >
>> > while(my $stringfh = new IO::String(get_next_record($fh))){
>> >
>> >          my $seqio = Bio::SeqIO->new( -fh     => $stringfh,-format => "EMBL"
>> > ) or die $!;
>> >
>> >          while ( my $seq_object = $seqio->next_seq ) {
>> >           print "Dealing with entry: ".$index++."\t".$seq_object->id."\n";
>> >
>> >           # show the features
>> >           for my $feat_object ($seq_object->get_SeqFeatures) {
>> >                        print "primary tag: ", $feat_object->primary_tag,
>> > "\n";
>> >                        for my $tag ($feat_object->get_all_tags) {
>> >                           print "  tag: ", $tag, "\n";
>> >                           for my $value ($feat_object->get_tag_values($tag))
>> > {
>> >                              print "    value: ", $value, "\n";
>> >                           }
>> >                        }
>> >                      }
>> >          }
>> >
>> > }
>> >
>> >
>> > sub get_next_record{
>> >          my($fh) = @_;
>> >          (my $old_sep,$/) = ($/,SEP);
>> >          my $record = <$fh>;
>> >          $/ = $old_sep;
>> >          return $record;
>> > }
>> > ========================================
>> >
>> >
>> > --Russell
>> >
>> >
>> >
>> > -----Original Message-----
>> > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
>> > bounces at lists.open-bio.org<mailto:bounces at lists.open-bio.org>] On Behalf Of
>> > brian li
>> > Sent: Friday, 8 May 2009 1:00 a.m.
>> > To: Chris Fields
>> > Cc: bioperl-l at lists.open-bio.org<mailto:bioperl-l at lists.open-bio.org>
>> > Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction
>> >
>> > My server has 32 GB RAM.
>> >
>> > The os of my server is 64-bit version of Ubuntu Server Edition 8.04
>> > LTS. And I have run my example code on another server with 32-bit
>> > version of Ubuntu Server Edition 8.04 and 4 GB RAM. Segfault again.
>> >
>> > -Brian
>> >
>> > On Thu, May 7, 2009 at 8:07 PM, Chris Fields
>> > <cjfields at illinois.edu<mailto:cjfields at illinois.edu>> wrote:
>> > I noticed that Russell has 16GB RAM on his setup.  Was yours equivalent?
>> >
>> > chris
>> >
>> > On May 7, 2009, at 12:32 AM, brian li wrote:
>> >
>> > Thank you very much for your offer.
>> >
>> > The director of our lab wants me to do the extraction every time a new
>> > release of EMBL is published. I can't push the task to you every time.
>> >
>> > I can offer more information of the server I run my script on if needed.
>> >
>> > -Brian
>> >
>> > On Thu, May 7, 2009 at 1:01 PM, Smithies, Russell
>> >
>> <Russell.Smithies at agresearch.co.nz<mailto:Russell.Smithies at agresearch.co.nz>>
>> > wrote:
>> >
>> > Sadly, that's the same code as I ran but I had a Data::Dump in the
>> > middle.
>> > Versions of Perl and BioPerl are the same.
>> > We're running RHEL 5 (kernel 2.6.18-92.1.18.el5) with 16GB RAM
>> >
>> > If you get a full script running on a smaller dataset, I could probably
>> > run it on the bigger stuff and give you back tab-separated (or is that
>> > tab\tseparated ?) data for loading into your db.
>> >
>> > --Russell
>> >
>> > -----Original Message-----
>> > From: brian li [mailto:brianli.cas at gmail.com]
>> > Sent: Thursday, 7 May 2009 4:50 p.m.
>> > To: Smithies, Russell
>> > Cc: bioperl-l at lists.open-bio.org<mailto:bioperl-l at lists.open-bio.org>
>> > Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction
>> >
>> > Dear Russell,
>> >
>> > My example code is as following. I omit the parse process and these
>> > lines give me "Segmentation Fault" too.
>> >
>> > # Start of code
>> > my $seqio = Bio::SeqIO->new(-file => 'rel_ann_mus_01_r99.dat',
>> >                                            -format => 'EMBL');
>> > my $index = 1;
>> > while (my $seq = $seqio->next_seq)
>> > {
>> >   print "Dealing with entry: $index\n";
>> >   $index++;
>> > }
>> > # End
>> >
>> > The platform I run this code on:
>> > BioPerl 1.6.0
>> > Perl 5.8.8
>> > Ubuntu 8.04 LTS Server 64-bit version (Linux 2.6.24-23-server)
>> >
>> > I have monitored the memory usage when I run the code above. There is
>> > always around 20GB free memory (buffer size counted in) left. So I
>> > suppose the segfault can't be explained just by memory shortage.
>> >
>> > Brian
>> >
>> >
>> > On Thu, May 7, 2009 at 11:32 AM, Smithies, Russell
>> >
>> <Russell.Smithies at agresearch.co.nz<mailto:Russell.Smithies at agresearch.co.nz>>
>> > wrote:
>> >
>> > Hi Brian,
>> > I hate to say it but it worked OK for me using
>> > rel_ann_mus_01_r99.dat.gz and
>> >
>> > simple example Bio::SeqIO code from bugzilla
>> >
>> > It's not using more than 1GB memory on our server and doesn't segfault.
>> >
>> > Send me your example code and I'll give it a go if you like.
>> >
>> >
>> > Russell Smithies
>> >
>> > Bioinformatics Applications Developer
>> > T +64 3 489 9085
>> > E
>> >  russell.smithies at agresearch.co.nz<mailto:russell.smithies at agresearch.co.nz>
>> >
>> > Invermay  Research Centre
>> > Puddle Alley,
>> > Mosgiel,
>> > New Zealand
>> > T  +64 3 489 3809
>> > F  +64 3 489 9174
>> > www.agresearch.co.nz<http://www.agresearch.co.nz>
>> >
>> >
>> > =======================================================================
>> > Attention: The information contained in this message and/or attachments
>> > from AgResearch Limited is intended only for the persons or entities
>> > to which it is addressed and may contain confidential and/or privileged
>> > material. Any review, retransmission, dissemination or other use of, or
>> > taking of any action in reliance upon, this information by persons or
>> > entities other than the intended recipients is prohibited by AgResearch
>> > Limited. If you have received this message in error, please notify the
>> > sender immediately.
>> > =======================================================================
>> >
>> >
>> > _______________________________________________
>> > Bioperl-l mailing list
>> > Bioperl-l at lists.open-bio.org<mailto:Bioperl-l at lists.open-bio.org>
>> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> >
>> >
>> >
>> > _______________________________________________
>> > Bioperl-l mailing list
>> > Bioperl-l at lists.open-bio.org<mailto:Bioperl-l at lists.open-bio.org>
>> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> >
>> > _______________________________________________
>> > Bioperl-l mailing list
>> > Bioperl-l at lists.open-bio.org<mailto:Bioperl-l at lists.open-bio.org>
>> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> >
>> > Jason Stajich
>> > jason at bioperl.org<mailto:jason at bioperl.org>
>> >
>> >
>> >
>> >
>> >
>> > Jason Stajich
>> >
>> > jason at bioperl.org
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>