[Bioperl-l] Asking for advice on full EMBL extraction

Thu May 7 20:03:58 EDT 2009

I think the problem here though is the size of the sequences rather than too many features.
If one was inclined to bodge/hack and didn't care about sequence, I guess you could filter them out with awk so Bio::SeqIO doesn't have to create the Bio::PrimarySeq :)
Probably breaks the EMBL file spec ...
Eg.
open( $fh, "gunzip -c rel_ann_mus_01_r99.dat.gz | awk '!/^SQ|^ /{print}' |" ) or die;

--Russell

From: Jason Stajich [mailto:jason.stajich at gmail.com] On Behalf Of Jason Stajich
Sent: Friday, 8 May 2009 11:25 a.m.
To: Smithies, Russell
Cc: 'brian li'; 'Chris Fields'; 'bioperl-l at lists.open-bio.org'
Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction

It parses from a stream or file, one sequence at a time so it only reads a single sequence out at a time, but it does have to parse that whole sequence record which is where feature rich sequences might be causing problems.

I think per your other mention of Tie::File - the whole file is not going into memory so that is not the problem, it is the creation of many objects that it does as it parses the sequence that is likely the problem.  It will read up to the first "//" from that Tie::File anyways, that becomes an entire string which is then parsed to pull out the relevant features so you don't gain anything with Tie::File -- what would be the way to solve it is if the objects could be created and reside in a DB on disk rather than in-memory.  I'd really enjoy seeing more indexed and hashed data to objects stored on disk when mem requirements are such so that very large datasets can be handled more nimbly.

I think there have been several attempts to simplify, but it basically means a dedicated developer to really overhaul or map to a new system.  What we've tried to build is a decent API so a new implementation can be done without affecting the 'next_seq' and 'write_seq' API.

Non-withstanding the seemed API confusion caused by _ancient_ decisions on giving function names of Bio::SeqFeatureI 'seq' and Bio::PrimarySeq 'seq' which return different types -- don't forget that Lincoln's Bio::DB::Fasta uses the 'seq' method to return a sequence as a string as well so major API changes in general here will create in all likelihood a big split between the branches that will make any new Bioperl not match up well with existing scripts or libraries that use it - hence the reason for no "great realigning" to a completely well-planned out API rather than the organically grown whims of several generations of devs.  I say this in jest a bit - I do want to see changes, but I think it really will have to be called something else besides BioPerl to avoid confusion and the fact that a lot of things will break that depend on the current APIs.  BioPerl2 or something indicating a Perl6 association.

-jason
On May 7, 2009, at 3:05 PM, Smithies, Russell wrote:

OK, I misunderstood, I thought the entire file loaded was loaded into memory first then each sequence was extracted from there.
I hoped splitting into 588 individual sequences might help.

--Russell

From: Jason Stajich [mailto:jason.stajich at gmail.com] On Behalf Of Jason Stajich
Sent: Friday, 8 May 2009 9:55 a.m.
To: Smithies, Russell
Cc: 'brian li'; 'Chris Fields'; 'bioperl-l at lists.open-bio.org<mailto:'bioperl-l at lists.open-bio.org>'
Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction

Russell -

I am not sure how that will help as only 1 sequence is parsed at a time by SeqIO parsers and they use the "//" delimiter.

If the equivalent data exists in genbank format at NCBI I think _that_  module (Bio::SeqIO::genbank) has the ability to ignore annotations/features.  Really we have to re-work the whole thing to be more lightweight and lazy-parse.

-jason
On May 7, 2009, at 2:24 PM, Smithies, Russell wrote:

I'm not sure if this will help with your problem or how it deals with memory management but using "ordinary" Perl to split the large EMBL file might work.
Give this a go:

============================
#!perl -w

use Bio::SeqIO;
use IO::String;

use constant SEP => "//\n";

open($fh, "gunzip -c rel_ann_mus_01_r99.dat.gz |") or die;

my $index = 1;

while(my $stringfh = new IO::String(get_next_record($fh))){

         my $seqio = Bio::SeqIO->new( -fh     => $stringfh,-format => "EMBL" ) or die $!;

         while ( my $seq_object = $seqio->next_seq ) {
          print "Dealing with entry: ".$index++."\t".$seq_object->id."\n";

          # show the features
          for my $feat_object ($seq_object->get_SeqFeatures) {
                       print "primary tag: ", $feat_object->primary_tag, "\n";
                       for my $tag ($feat_object->get_all_tags) {
                          print "  tag: ", $tag, "\n";
                          for my $value ($feat_object->get_tag_values($tag)) {
                             print "    value: ", $value, "\n";
                          }
                       }
                     }
         }

}

sub get_next_record{
         my($fh) = @_;
         (my $old_sep,$/) = ($/,SEP);
         my $record = <$fh>;
         $/ = $old_sep;
         return $record;
}
========================================

--Russell

-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
bounces at lists.open-bio.org<mailto:bounces at lists.open-bio.org>] On Behalf Of brian li
Sent: Friday, 8 May 2009 1:00 a.m.
To: Chris Fields
Cc: bioperl-l at lists.open-bio.org<mailto:bioperl-l at lists.open-bio.org>
Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction

My server has 32 GB RAM.

The os of my server is 64-bit version of Ubuntu Server Edition 8.04
LTS. And I have run my example code on another server with 32-bit
version of Ubuntu Server Edition 8.04 and 4 GB RAM. Segfault again.

-Brian

On Thu, May 7, 2009 at 8:07 PM, Chris Fields <cjfields at illinois.edu<mailto:cjfields at illinois.edu>> wrote:
I noticed that Russell has 16GB RAM on his setup.  Was yours equivalent?

chris

On May 7, 2009, at 12:32 AM, brian li wrote:

Thank you very much for your offer.

The director of our lab wants me to do the extraction every time a new
release of EMBL is published. I can't push the task to you every time.

I can offer more information of the server I run my script on if needed.

-Brian

On Thu, May 7, 2009 at 1:01 PM, Smithies, Russell
<Russell.Smithies at agresearch.co.nz<mailto:Russell.Smithies at agresearch.co.nz>> wrote:

Sadly, that's the same code as I ran but I had a Data::Dump in the
middle.
Versions of Perl and BioPerl are the same.
We're running RHEL 5 (kernel 2.6.18-92.1.18.el5) with 16GB RAM

If you get a full script running on a smaller dataset, I could probably
run it on the bigger stuff and give you back tab-separated (or is that
tab\tseparated ?) data for loading into your db.

--Russell

-----Original Message-----
From: brian li [mailto:brianli.cas at gmail.com]
Sent: Thursday, 7 May 2009 4:50 p.m.
To: Smithies, Russell
Cc: bioperl-l at lists.open-bio.org<mailto:bioperl-l at lists.open-bio.org>
Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction

Dear Russell,

My example code is as following. I omit the parse process and these
lines give me "Segmentation Fault" too.

# Start of code
my $seqio = Bio::SeqIO->new(-file => 'rel_ann_mus_01_r99.dat',
                                           -format => 'EMBL');
my $index = 1;
while (my $seq = $seqio->next_seq)
{
  print "Dealing with entry: $index\n";
  $index++;
}
# End

The platform I run this code on:
BioPerl 1.6.0
Perl 5.8.8
Ubuntu 8.04 LTS Server 64-bit version (Linux 2.6.24-23-server)

I have monitored the memory usage when I run the code above. There is
always around 20GB free memory (buffer size counted in) left. So I
suppose the segfault can't be explained just by memory shortage.

Brian

On Thu, May 7, 2009 at 11:32 AM, Smithies, Russell
<Russell.Smithies at agresearch.co.nz<mailto:Russell.Smithies at agresearch.co.nz>> wrote:

Hi Brian,
I hate to say it but it worked OK for me using
rel_ann_mus_01_r99.dat.gz and

simple example Bio::SeqIO code from bugzilla

It's not using more than 1GB memory on our server and doesn't segfault.

Send me your example code and I'll give it a go if you like.

Russell Smithies

Bioinformatics Applications Developer
T +64 3 489 9085
E  russell.smithies at agresearch.co.nz<mailto:russell.smithies at agresearch.co.nz>

Invermay  Research Centre
Puddle Alley,
Mosgiel,
New Zealand
T  +64 3 489 3809
F  +64 3 489 9174
www.agresearch.co.nz<http://www.agresearch.co.nz>

=======================================================================
Attention: The information contained in this message and/or attachments
from AgResearch Limited is intended only for the persons or entities
to which it is addressed and may contain confidential and/or privileged
material. Any review, retransmission, dissemination or other use of, or
taking of any action in reliance upon, this information by persons or
entities other than the intended recipients is prohibited by AgResearch
Limited. If you have received this message in error, please notify the
sender immediately.
=======================================================================

_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org<mailto:Bioperl-l at lists.open-bio.org>
http://lists.open-bio.org/mailman/listinfo/bioperl-l

_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org<mailto:Bioperl-l at lists.open-bio.org>
http://lists.open-bio.org/mailman/listinfo/bioperl-l

_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org<mailto:Bioperl-l at lists.open-bio.org>
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Jason Stajich
jason at bioperl.org<mailto:jason at bioperl.org>

Jason Stajich
jason at bioperl.org<mailto:jason at bioperl.org>