[Bioperl-l] Asking for advice on full EMBL extraction

Thu May 7 19:25:16 EDT 2009

It parses from a stream or file one sequence at a time, so only a
single sequence record is read at once. It does have to parse that
whole record, though, which is where feature-rich sequences might be
causing problems.

As for your other mention of Tie::File: the whole file is not going
into memory, so that is not the problem. The likely problem is the
large number of objects created as the record is parsed. The parser
will read up to the first "//" from that Tie::File anyway, and that
becomes one string which is then parsed to pull out the relevant
features, so you don't gain anything with Tie::File. What would solve
it is if the objects could be created and reside in a DB on disk
rather than in memory. I'd really like to see more indexed and hashed
data, with objects stored on disk when memory requirements call for
it, so that very large datasets can be handled more nimbly.
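
To make the on-disk idea concrete, here is a minimal sketch in plain
Perl. It is not how BioPerl does it: it only keeps an
"accession => byte offset" index in an SDBM_File hash on disk and
seeks back into the flat file on demand, so nothing but the requested
record is ever held in memory. The two-record demo file, the index
name demo_index and the helper fetch_record are all made up for
illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;                      # O_RDWR, O_CREAT flags for tie
use SDBM_File;                  # small on-disk hash shipped with core Perl
use File::Temp qw(tempdir);

# Tiny stand-in for an EMBL flat file: two records, each ending in "//".
my $dir  = tempdir(CLEANUP => 1);
my $flat = "$dir/demo.dat";
open(my $out, '>', $flat) or die "Cannot write $flat: $!";
print {$out}
    "ID   AB000001; SV 1; linear; mRNA; STD; MUS; 12 BP.\n",
    "SQ   Sequence 12 BP;\n     acgtacgtac gt\n//\n",
    "ID   AB000002; SV 1; linear; mRNA; STD; MUS; 8 BP.\n",
    "SQ   Sequence 8 BP;\n     ttttaaaa\n//\n";
close($out) or die "Cannot close $flat: $!";

# Pass 1: store only "accession => byte offset" in a hash tied to disk.
tie(my %offset_of, 'SDBM_File', "$dir/demo_index", O_RDWR | O_CREAT, 0666)
    or die "Cannot tie index: $!";

open(my $in, '<', $flat) or die "Cannot read $flat: $!";
{
    local $/ = "//\n";          # one whole record per read
    my $start = tell($in);      # offset where the next record begins
    while (defined(my $record = <$in>)) {
        if ($record =~ /^ID\s+([^;\s]+)/m) {
            $offset_of{$1} = $start;
        }
        $start = tell($in);
    }
}

# Pass 2: fetch a single record by accession without rescanning the file.
# fetch_record is a hypothetical helper, not a BioPerl function.
sub fetch_record {
    my ($fh, $index, $id) = @_;
    return unless exists $index->{$id};
    seek($fh, $index->{$id}, 0) or die "Cannot seek: $!";
    local $/ = "//\n";
    return scalar <$fh>;
}

my $record = fetch_record($in, \%offset_of, 'AB000002');
print $record;

untie %offset_of;
close($in);
```

The record text that comes back could then be handed to a parser one
entry at a time. Storing parsed objects themselves on disk would need
a serializer and a store without SDBM's small per-entry size limit,
which is why only offsets are kept here.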

I think there have been several attempts to simplify, but it basically
needs a dedicated developer to really overhaul the parsers or map them
to a new system. What we've tried to build is a decent API, so that a
new implementation can be done without affecting the 'next_seq' and
'write_seq' API.
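
As a rough illustration of what a drop-in implementation behind the
same 'next_seq' call could look like, here is a sketch of a lazy
reader in plain Perl. The LazyEMBL::Reader and LazyEMBL::Record
packages and the feature_count method are hypothetical and exist only
for this example; they are not part of BioPerl. Each record is kept
as raw text, and a field is only parsed the first time it is asked
for.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical record class: holds the raw text, parses fields on demand.
package LazyEMBL::Record;

sub new {
    my ($class, $text) = @_;
    return bless { text => $text }, $class;
}

# Accession from the ID line; parsed on first call, then cached.
sub id {
    my ($self) = @_;
    unless (exists $self->{id}) {
        ($self->{id}) = $self->{text} =~ /^ID\s+([^;\s]+)/m;
    }
    return $self->{id};
}

# Number of feature keys: FT lines with a key right after "FT   ".
# Qualifier lines are indented further and so are not counted.
sub feature_count {
    my ($self) = @_;
    unless (exists $self->{feature_count}) {
        my @keys = $self->{text} =~ /^FT {3}\S+/mg;
        $self->{feature_count} = scalar @keys;
    }
    return $self->{feature_count};
}

# Hypothetical reader exposing the familiar next_seq loop.
package LazyEMBL::Reader;

sub new {
    my ($class, %args) = @_;
    return bless { fh => $args{-fh} }, $class;
}

sub next_seq {
    my ($self) = @_;
    local $/ = "//\n";                      # one record per read
    while (defined(my $text = readline($self->{fh}))) {
        next unless $text =~ /^ID\s/m;      # skip trailing blank chunks
        return LazyEMBL::Record->new($text);
    }
    return;                                 # end of input
}

package main;

# Made-up two-record input, read through an in-memory filehandle.
my $data = join '',
    "ID   AB000001; SV 1; linear; mRNA; STD; MUS; 12 BP.\n",
    "FT   source          1..12\n",
    "FT                   /organism=\"Mus musculus\"\n",
    "FT   CDS             1..12\n",
    "SQ   Sequence 12 BP;\n     acgtacgtac gt\n//\n",
    "ID   AB000002; SV 1; linear; mRNA; STD; MUS; 8 BP.\n",
    "SQ   Sequence 8 BP;\n     ttttaaaa\n//\n";

open(my $fh, '<', \$data) or die "Cannot open in-memory handle: $!";
my $reader = LazyEMBL::Reader->new(-fh => $fh);

my @summary;
while (my $seq = $reader->next_seq) {
    push @summary, $seq->id . ':' . $seq->feature_count;
    print "Dealing with entry: ", $seq->id,
          " (", $seq->feature_count, " feature keys)\n";
}
```

The calling loop is the same shape as a Bio::SeqIO loop, which is the
point: the cost of building feature objects is only paid if a caller
actually asks for them.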

Notwithstanding the apparent API confusion caused by _ancient_
decisions to give both Bio::SeqFeatureI and Bio::PrimarySeq a 'seq'
method, which return different types (and don't forget that Lincoln's
Bio::DB::Fasta also uses a 'seq' method to return a sequence as a
string), major API changes here will in all likelihood create a big
split between the branches, so that any new BioPerl would not match up
well with existing scripts or libraries that use it. Hence no "great
realigning" to a completely well-planned API in place of the
organically grown whims of several generations of devs. I say this
partly in jest. I do want to see changes, but I think the result will
really have to be called something other than BioPerl, both to avoid
confusion and because a lot of things that depend on the current APIs
will break. BioPerl2, or something indicating a Perl6 association.

-jason
On May 7, 2009, at 3:05 PM, Smithies, Russell wrote:

> OK, I misunderstood. I thought the entire file was loaded into memory
> first and then each sequence was extracted from there.
> I hoped splitting into 588 individual sequences might help.
>
> --Russell
>
> From: Jason Stajich [mailto:jason.stajich at gmail.com] On Behalf Of  
> Jason Stajich
> Sent: Friday, 8 May 2009 9:55 a.m.
> To: Smithies, Russell
> Cc: 'brian li'; 'Chris Fields'; 'bioperl-l at lists.open-bio.org'
> Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction
>
> Russell -
>
> I am not sure how that will help as only 1 sequence is parsed at a  
> time by SeqIO parsers and they use the "//" delimiter.
>
> If the equivalent data exists in genbank format at NCBI I think  
> _that_  module (Bio::SeqIO::genbank) has the ability to ignore  
> annotations/features.  Really we have to re-work the whole thing to  
> be more lightweight and lazy-parse.
>
> -jason
> On May 7, 2009, at 2:24 PM, Smithies, Russell wrote:
>
>
> I'm not sure if this will help with your problem or how it deals  
> with memory management but using "ordinary" Perl to split the large  
> EMBL file might work.
> Give this a go:
>
> ============================
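> #!perl -w
>
> use strict;
> use Bio::SeqIO;
> use IO::String;
>
> use constant SEP => "//\n";
>
> # Stream the compressed file through gunzip rather than unpacking it first.
> open(my $fh, '-|', 'gunzip', '-c', 'rel_ann_mus_01_r99.dat.gz')
>     or die "Cannot run gunzip: $!";
>
> my $index = 1;
>
> # Test the record itself for end of input: an IO::String object is
> # always true, so looping on it would never terminate.
> while (defined(my $record = get_next_record($fh))) {
>     next unless $record =~ /\S/;    # skip a trailing blank chunk
>
>     my $stringfh = IO::String->new($record);
>     my $seqio = Bio::SeqIO->new(-fh => $stringfh, -format => 'EMBL')
>         or die "Cannot create Bio::SeqIO object";
>
>     while (my $seq_object = $seqio->next_seq) {
>         print "Dealing with entry: ", $index++, "\t", $seq_object->id, "\n";
>
>         # show the features
>         for my $feat_object ($seq_object->get_SeqFeatures) {
>             print "primary tag: ", $feat_object->primary_tag, "\n";
>             for my $tag ($feat_object->get_all_tags) {
>                 print "  tag: ", $tag, "\n";
>                 for my $value ($feat_object->get_tag_values($tag)) {
>                     print "    value: ", $value, "\n";
>                 }
>             }
>         }
>     }
> }
>
> close($fh);
>
> # Read one record, up to and including the "//" terminator line.
> sub get_next_record {
>     my ($fh) = @_;
>     local $/ = SEP;    # restored automatically when the sub returns
>     return scalar <$fh>;
> }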
> ========================================
>
>
> --Russell
>
>
>
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of brian li
> Sent: Friday, 8 May 2009 1:00 a.m.
> To: Chris Fields
> Cc: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction
>
> My server has 32 GB RAM.
>
> The OS on my server is the 64-bit version of Ubuntu Server Edition 8.04
> LTS. I have also run my example code on another server with the 32-bit
> version of Ubuntu Server Edition 8.04 and 4 GB RAM. It segfaults there too.
>
> -Brian
>
> On Thu, May 7, 2009 at 8:07 PM, Chris Fields <cjfields at illinois.edu> wrote:
> I noticed that Russell has 16GB RAM on his setup.  Was yours  
> equivalent?
>
> chris
>
> On May 7, 2009, at 12:32 AM, brian li wrote:
>
> Thank you very much for your offer.
>
> The director of our lab wants me to do the extraction every time a new
> release of EMBL is published. I can't push the task to you every time.
>
> I can offer more information about the server I run my script on if
> needed.
>
> -Brian
>
> On Thu, May 7, 2009 at 1:01 PM, Smithies, Russell
> <Russell.Smithies at agresearch.co.nz> wrote:
>
> Sadly, that's the same code as I ran but I had a Data::Dump in the
> middle.
> Versions of Perl and BioPerl are the same.
> We're running RHEL 5 (kernel 2.6.18-92.1.18.el5) with 16GB RAM
>
> If you get a full script running on a smaller dataset, I could probably
> run it on the bigger stuff and give you back tab-separated (or is that
> tab\tseparated ?) data for loading into your db.
>
> --Russell
>
> -----Original Message-----
> From: brian li [mailto:brianli.cas at gmail.com]
> Sent: Thursday, 7 May 2009 4:50 p.m.
> To: Smithies, Russell
> Cc: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction
>
> Dear Russell,
>
> My example code is as follows. I have omitted the parsing step, and
> these lines alone still give me a "Segmentation Fault".
>
> # Start of code
> use strict;
> use warnings;
> use Bio::SeqIO;
>
> my $seqio = Bio::SeqIO->new(-file   => 'rel_ann_mus_01_r99.dat',
>                             -format => 'EMBL');
> my $index = 1;
> while (my $seq = $seqio->next_seq)
> {
>   print "Dealing with entry: $index\n";
>   $index++;
> }
> # End
>
> The platform I run this code on:
> BioPerl 1.6.0
> Perl 5.8.8
> Ubuntu 8.04 LTS Server 64-bit version (Linux 2.6.24-23-server)
>
> I have monitored the memory usage when I run the code above. There is
> always around 20GB free memory (buffer size counted in) left. So I
> suppose the segfault can't be explained just by memory shortage.
>
> Brian
>
>
> On Thu, May 7, 2009 at 11:32 AM, Smithies, Russell
> <Russell.Smithies at agresearch.co.nz> wrote:
>
> Hi Brian,
> I hate to say it, but it worked OK for me using
> rel_ann_mus_01_r99.dat.gz and the simple example Bio::SeqIO code from
> Bugzilla.
>
> It's not using more than 1GB memory on our server and doesn't segfault.
>
> Send me your example code and I'll give it a go if you like.
>
>
> Russell Smithies
>
> Bioinformatics Applications Developer
> T +64 3 489 9085
> E  russell.smithies at agresearch.co.nz
>
> Invermay  Research Centre
> Puddle Alley,
> Mosgiel,
> New Zealand
> T  +64 3 489 3809
> F  +64 3 489 9174
> www.agresearch.co.nz<http://www.agresearch.co.nz>
>
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>
> Jason Stajich
> jason at bioperl.org
>
>
>
>

Jason Stajich
jason at bioperl.org