[Bioperl-l] Asking for advice on full EMBL extraction

Fri May 8 14:48:32 UTC 2009

open $fh, "gunzip -c rel_ann_mus_01_r99.dat.gz | awk
'!/^FT|^CO/{print}' |" works.
open $fh, "gunzip -c rel_ann_mus_01_r99.dat.gz | awk '!/^SQ|^
/{print}' |" segfaults.

So it seems the features are causing problems. Although I still don't
know how that hurts my os to pop a segfault, my extraction can move on
again. Maybe I can find a clue when I know more about my os's memory
management strategy.

Really appreciate all your help.

-Brian

On Fri, May 8, 2009 at 8:03 AM, Smithies, Russell
<Russell.Smithies at agresearch.co.nz> wrote:
> I think the problem here though is the size of the sequences rather than too
> many features.
>
> If one was inclined to bodge/hack and didn’t care about sequence, I guess
> you could filter them out with awk so Bio::SeqIO doesn’t have to create the
> Bio::PrimarySeq J
>
> Probably breaks the EMBL file spec …
>
> Eg.
>
> open( $fh, "gunzip -c rel_ann_mus_01_r99.dat.gz | awk '!/^SQ|^ /{print}' |"
> ) or die;
>
>
>
>
>
> --Russell
>
>
>
>
>
>
>
> From: Jason Stajich [mailto:jason.stajich at gmail.com] On Behalf Of Jason
> Stajich
> Sent: Friday, 8 May 2009 11:25 a.m.
> To: Smithies, Russell
> Cc: 'brian li'; 'Chris Fields'; 'bioperl-l at lists.open-bio.org'
> Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction
>
>
>
> It parses from a stream or file, one sequence at a time so it only reads a
> single sequence out at a time, but it does have to parse that whole sequence
> record which is where feature rich sequences might be causing problems.
>
>
>
> I think per your other mention of Tie::File - the whole file is not going
> into memory so that is not the problem, it is the creation of many objects
> that it does as it parses the sequence that is likely the problem.  It will
> read up to the first "//" from that Tie::File anyways, that becomes an
> entire string which is then parsed to pull out the relevant features so you
> don't gain anything with Tie::File -- what would be the way to solve it is
> if the objects could be created and reside in a DB on disk rather than
> in-memory.  I'd really enjoy seeing more indexed and hashed data to objects
> stored on disk when mem requirements are such so that very large datasets
> can be handled more nimbly.
>
>
>
> I think there have been several attempts to simplify, but it basically means
> a dedicated developer to really overhaul or map to a new system.  What we've
> tried to build is a decent API so a new implementation can be done without
> affecting the 'next_seq' and 'write_seq' API.
>
>
>
> Non-withstanding the seemed API confusion caused by _ancient_ decisions on
> giving function names of Bio::SeqFeatureI 'seq' and Bio::PrimarySeq 'seq'
> which return different types -- don't forget that Lincoln's Bio::DB::Fasta
> uses the 'seq' method to return a sequence as a string as well so major API
> changes in general here will create in all likelihood a big split between
> the branches that will make any new Bioperl not match up well with existing
> scripts or libraries that use it - hence the reason for no "great
> realigning" to a completely well-planned out API rather than the organically
> grown whims of several generations of devs.  I say this in jest a bit - I do
> want to see changes, but I think it really will have to be called something
> else besides BioPerl to avoid confusion and the fact that a lot of things
> will break that depend on the current APIs.  BioPerl2 or something
> indicating a Perl6 association.
>
>
>
> -jason
>
> On May 7, 2009, at 3:05 PM, Smithies, Russell wrote:
>
> OK, I misunderstood, I thought the entire file loaded was loaded into memory
> first then each sequence was extracted from there.
> I hoped splitting into 588 individual sequences might help.
>
> --Russell
>
> From: Jason Stajich [mailto:jason.stajich at gmail.com] On Behalf Of Jason
> Stajich
> Sent: Friday, 8 May 2009 9:55 a.m.
> To: Smithies, Russell
> Cc: 'brian li'; 'Chris Fields'; 'bioperl-l at lists.open-bio.org'
> Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction
>
> Russell -
>
> I am not sure how that will help as only 1 sequence is parsed at a time by
> SeqIO parsers and they use the "//" delimiter.
>
> If the equivalent data exists in genbank format at NCBI I think _that_
>  module (Bio::SeqIO::genbank) has the ability to ignore
> annotations/features.  Really we have to re-work the whole thing to be more
> lightweight and lazy-parse.
>
> -jason
> On May 7, 2009, at 2:24 PM, Smithies, Russell wrote:
>
>
> I'm not sure if this will help with your problem or how it deals with memory
> management but using "ordinary" Perl to split the large EMBL file might
> work.
> Give this a go:
>
> ============================
> #!perl -w
>
> use Bio::SeqIO;
> use IO::String;
>
> use constant SEP => "//\n";
>
> open($fh, "gunzip -c rel_ann_mus_01_r99.dat.gz |") or die;
>
> my $index = 1;
>
> while(my $stringfh = new IO::String(get_next_record($fh))){
>
>          my $seqio = Bio::SeqIO->new( -fh     => $stringfh,-format => "EMBL"
> ) or die $!;
>
>          while ( my $seq_object = $seqio->next_seq ) {
>           print "Dealing with entry: ".$index++."\t".$seq_object->id."\n";
>
>           # show the features
>           for my $feat_object ($seq_object->get_SeqFeatures) {
>                        print "primary tag: ", $feat_object->primary_tag,
> "\n";
>                        for my $tag ($feat_object->get_all_tags) {
>                           print "  tag: ", $tag, "\n";
>                           for my $value ($feat_object->get_tag_values($tag))
> {
>                              print "    value: ", $value, "\n";
>                           }
>                        }
>                      }
>          }
>
> }
>
>
> sub get_next_record{
>          my($fh) = @_;
>          (my $old_sep,$/) = ($/,SEP);
>          my $record = <$fh>;
>          $/ = $old_sep;
>          return $record;
> }
> ========================================
>
>
> --Russell
>
>
>
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org<mailto:bounces at lists.open-bio.org>] On Behalf Of
> brian li
> Sent: Friday, 8 May 2009 1:00 a.m.
> To: Chris Fields
> Cc: bioperl-l at lists.open-bio.org<mailto:bioperl-l at lists.open-bio.org>
> Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction
>
> My server has 32 GB RAM.
>
> The os of my server is 64-bit version of Ubuntu Server Edition 8.04
> LTS. And I have run my example code on another server with 32-bit
> version of Ubuntu Server Edition 8.04 and 4 GB RAM. Segfault again.
>
> -Brian
>
> On Thu, May 7, 2009 at 8:07 PM, Chris Fields
> <cjfields at illinois.edu<mailto:cjfields at illinois.edu>> wrote:
> I noticed that Russell has 16GB RAM on his setup.  Was yours equivalent?
>
> chris
>
> On May 7, 2009, at 12:32 AM, brian li wrote:
>
> Thank you very much for your offer.
>
> The director of our lab wants me to do the extraction every time a new
> release of EMBL is published. I can't push the task to you every time.
>
> I can offer more information of the server I run my script on if needed.
>
> -Brian
>
> On Thu, May 7, 2009 at 1:01 PM, Smithies, Russell
> <Russell.Smithies at agresearch.co.nz<mailto:Russell.Smithies at agresearch.co.nz>>
> wrote:
>
> Sadly, that's the same code as I ran but I had a Data::Dump in the
> middle.
> Versions of Perl and BioPerl are the same.
> We're running RHEL 5 (kernel 2.6.18-92.1.18.el5) with 16GB RAM
>
> If you get a full script running on a smaller dataset, I could probably
> run it on the bigger stuff and give you back tab-separated (or is that
> tab\tseparated ?) data for loading into your db.
>
> --Russell
>
> -----Original Message-----
> From: brian li [mailto:brianli.cas at gmail.com]
> Sent: Thursday, 7 May 2009 4:50 p.m.
> To: Smithies, Russell
> Cc: bioperl-l at lists.open-bio.org<mailto:bioperl-l at lists.open-bio.org>
> Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction
>
> Dear Russell,
>
> My example code is as following. I omit the parse process and these
> lines give me "Segmentation Fault" too.
>
> # Start of code
> my $seqio = Bio::SeqIO->new(-file => 'rel_ann_mus_01_r99.dat',
>                                            -format => 'EMBL');
> my $index = 1;
> while (my $seq = $seqio->next_seq)
> {
>   print "Dealing with entry: $index\n";
>   $index++;
> }
> # End
>
> The platform I run this code on:
> BioPerl 1.6.0
> Perl 5.8.8
> Ubuntu 8.04 LTS Server 64-bit version (Linux 2.6.24-23-server)
>
> I have monitored the memory usage when I run the code above. There is
> always around 20GB free memory (buffer size counted in) left. So I
> suppose the segfault can't be explained just by memory shortage.
>
> Brian
>
>
> On Thu, May 7, 2009 at 11:32 AM, Smithies, Russell
> <Russell.Smithies at agresearch.co.nz<mailto:Russell.Smithies at agresearch.co.nz>>
> wrote:
>
> Hi Brian,
> I hate to say it but it worked OK for me using
> rel_ann_mus_01_r99.dat.gz and
>
> simple example Bio::SeqIO code from bugzilla
>
> It's not using more than 1GB memory on our server and doesn't segfault.
>
> Send me your example code and I'll give it a go if you like.
>
>
> Russell Smithies
>
> Bioinformatics Applications Developer
> T +64 3 489 9085
> E
>  russell.smithies at agresearch.co.nz<mailto:russell.smithies at agresearch.co.nz>
>
> Invermay  Research Centre
> Puddle Alley,
> Mosgiel,
> New Zealand
> T  +64 3 489 3809
> F  +64 3 489 9174
> www.agresearch.co.nz<http://www.agresearch.co.nz>
>
>
> =======================================================================
> Attention: The information contained in this message and/or attachments
> from AgResearch Limited is intended only for the persons or entities
> to which it is addressed and may contain confidential and/or privileged
> material. Any review, retransmission, dissemination or other use of, or
> taking of any action in reliance upon, this information by persons or
> entities other than the intended recipients is prohibited by AgResearch
> Limited. If you have received this message in error, please notify the
> sender immediately.
> =======================================================================
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org<mailto:Bioperl-l at lists.open-bio.org>
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org<mailto:Bioperl-l at lists.open-bio.org>
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org<mailto:Bioperl-l at lists.open-bio.org>
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> Jason Stajich
> jason at bioperl.org<mailto:jason at bioperl.org>
>
>
>
>
>
> Jason Stajich
>
> jason at bioperl.org
>
>
>
>
>
>
>
>