[Bioperl-l] Asking for advice on full EMBL extraction

Thu May 7 23:04:52 UTC 2009

I guess Tie::File is going to do the same thing?
(This works on my 32-bit Windows PC with 2 GB RAM, but it is slow.)

--Russell

=====================

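#!perl -w

use strict;
use Bio::SeqIO;
use IO::String;
use Tie::File;

# Each "//\n"-terminated EMBL entry becomes one array element.
# autochomp => 0 keeps the "//" terminator on every record.
tie(my @array, 'Tie::File', "rel_ann_mus_01_r99.dat",
    recsep => "//\n", autochomp => 0) or die $!;

# scalar(@array) is the record count; $#array is only the last index.
print "loaded " . scalar(@array) . " records\n";

for my $i (0 .. $#array) {
    print "$i\n";
    my $seqio = Bio::SeqIO->new(-fh     => IO::String->new($array[$i]),
                                -format => "EMBL");

    # should only be one seq
    my $seq_object = $seqio->next_seq or next;
    print "Dealing with entry: $i\t" . $seq_object->id . "\n";
}

untie @array;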

=====================
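
A minimal sketch of how Tie::File behaves with a "//\n" record separator, run against a throwaway file rather than the real release. The temp file and its two entries are invented for illustration; `recsep` and `autochomp` are documented Tie::File options. It shows why the record count should come from `scalar(@array)` rather than `$#array`.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Tie::File;
use File::Temp qw(tempfile);

# Write a tiny stand-in for an EMBL flat file: two made-up entries,
# each terminated by a "//" line.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "ID   ENTRY1; SV 1; linear; mRNA; STD; MUS; 10 BP.\n//\n",
                "ID   ENTRY2; SV 1; linear; mRNA; STD; MUS; 12 BP.\n//\n";
close $tmp_fh or die "close failed: $!";

# autochomp => 0 keeps the "//\n" terminator on every element.
tie(my @records, 'Tie::File', $tmp_name, recsep => "//\n", autochomp => 0)
    or die "Cannot tie $tmp_name: $!";

my $count      = scalar @records;    # number of records
my $last_index = $#records;          # highest index, one less than the count

print "records: $count, last index: $last_index\n";
for my $i (0 .. $last_index) {
    print "record $i starts: ", substr($records[$i], 0, 11), "\n";
}

untie @records;
```

As far as I know, Tie::File does not load the file into memory, but asking for the array size makes it scan to the end of the file to find every record boundary, which may account for some of the slowness on a large release file.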

From: Jason Stajich [mailto:jason.stajich at gmail.com] On Behalf Of Jason Stajich
Sent: Friday, 8 May 2009 9:55 a.m.
To: Smithies, Russell
Cc: 'brian li'; 'Chris Fields'; 'bioperl-l at lists.open-bio.org'
Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction

Russell -

I am not sure how that will help, as the SeqIO parsers only parse one sequence at a time and they already use the "//" delimiter.

If the equivalent data exists in GenBank format at NCBI, I think _that_ module (Bio::SeqIO::genbank) has the ability to ignore annotations/features. Really, we have to rework the whole thing to be more lightweight and to parse lazily.
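
To illustrate what "more lightweight" could mean in practice, here is a rough sketch in plain Perl that reads one "//"-terminated entry at a time and keeps only the identifier from the ID line, building no feature objects at all. This is not a BioPerl API: `next_entry_id` is a hypothetical helper, and the two sample entries are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read "//"-terminated entries from $fh and return the
# primary identifier of the next one that has an ID line; undef at end of input.
sub next_entry_id {
    my ($fh) = @_;
    local $/ = "//\n";    # one whole entry per read
    while (defined(my $record = <$fh>)) {
        return $1 if $record =~ /^ID\s+([^;\s]+)/m;
    }
    return;
}

# Made-up sample data standing in for an EMBL flat file.
my $sample = <<'END';
ID   AB000001; SV 1; linear; mRNA; STD; MUS; 12 BP.
XX
DE   Invented example entry one.
XX
SQ   Sequence 12 BP;
     acgtacgtac gt                                                     12
//
ID   AB000002; SV 1; linear; mRNA; STD; MUS; 8 BP.
XX
DE   Invented example entry two.
XX
SQ   Sequence 8 BP;
     acgtacgt                                                           8
//
END

open(my $fh, '<', \$sample) or die "Cannot open in-memory sample: $!";
my @ids;
while (defined(my $id = next_entry_id($fh))) {
    push @ids, $id;
}
close $fh;

for my $i (0 .. $#ids) {
    print "Dealing with entry: ", $i + 1, "\t$ids[$i]\n";
}
```

Only one entry is held in memory at a time, so this is suitable when just a few fields are needed; anything that depends on parsed features would still need a real parser.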

-jason
On May 7, 2009, at 2:24 PM, Smithies, Russell wrote:

I'm not sure whether this will help with your problem, or how it handles memory, but using "ordinary" Perl to split the large EMBL file into records might work.
Give this a go:

============================
#!perl -w

use strict;
use Bio::SeqIO;
use IO::String;

use constant SEP => "//\n";

# Stream the compressed file through gunzip instead of unpacking it to disk.
open(my $fh, '-|', 'gunzip -c rel_ann_mus_01_r99.dat.gz')
    or die "Cannot run gunzip: $!";

my $index = 1;

# Test the record itself: IO::String->new(undef) still returns a true
# object, so looping on the handle would never stop at end of input.
while (defined(my $record = get_next_record($fh))) {

    my $seqio = Bio::SeqIO->new(-fh     => IO::String->new($record),
                                -format => "EMBL");

    while (my $seq_object = $seqio->next_seq) {
        print "Dealing with entry: " . $index++ . "\t" . $seq_object->id . "\n";

        # show the features
        for my $feat_object ($seq_object->get_SeqFeatures) {
            print "primary tag: ", $feat_object->primary_tag, "\n";
            for my $tag ($feat_object->get_all_tags) {
                print "  tag: ", $tag, "\n";
                for my $value ($feat_object->get_tag_values($tag)) {
                    print "    value: ", $value, "\n";
                }
            }
        }
    }
}

close($fh);

# Read one "//"-terminated record; returns undef at end of input.
sub get_next_record {
    my ($fh) = @_;
    local $/ = SEP;    # restored automatically when the sub returns
    return scalar <$fh>;
}
========================================

--Russell

-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of brian li
Sent: Friday, 8 May 2009 1:00 a.m.
To: Chris Fields
Cc: bioperl-l at lists.open-bio.org
Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction

My server has 32 GB RAM.

The server runs the 64-bit version of Ubuntu Server Edition 8.04 LTS.
I have also run my example code on another server, with the 32-bit
version of Ubuntu Server Edition 8.04 and 4 GB RAM. Segfault again.

-Brian

On Thu, May 7, 2009 at 8:07 PM, Chris Fields <cjfields at illinois.edu> wrote:
I noticed that Russell has 16GB RAM on his setup. Was yours equivalent?

chris

On May 7, 2009, at 12:32 AM, brian li wrote:

Thank you very much for your offer.

The director of our lab wants me to do the extraction every time a new
release of EMBL is published. I can't push the task to you every time.

I can offer more information about the server I run my script on if needed.

-Brian

On Thu, May 7, 2009 at 1:01 PM, Smithies, Russell
<Russell.Smithies at agresearch.co.nz> wrote:

Sadly, that's the same code as I ran but I had a Data::Dump in the
middle.
Versions of Perl and BioPerl are the same.
We're running RHEL 5 (kernel 2.6.18-92.1.18.el5) with 16GB RAM

If you get a full script running on a smaller dataset, I could probably
run it on the bigger stuff and give you back tab-separated (or is that
tab\tseparated ?) data for loading into your db.

--Russell

-----Original Message-----
From: brian li [mailto:brianli.cas at gmail.com]
Sent: Thursday, 7 May 2009 4:50 p.m.
To: Smithies, Russell
Cc: bioperl-l at lists.open-bio.org
Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction

Dear Russell,

My example code is as follows. I have omitted the parsing step, and these
lines alone still give me "Segmentation Fault".

# Start of code
use strict;
use warnings;
use Bio::SeqIO;

my $seqio = Bio::SeqIO->new(-file   => 'rel_ann_mus_01_r99.dat',
                            -format => 'EMBL');
my $index = 1;
while (my $seq = $seqio->next_seq)
{
   print "Dealing with entry: $index\n";
   $index++;
}
# End

The platform I run this code on:
BioPerl 1.6.0
Perl 5.8.8
Ubuntu 8.04 LTS Server 64-bit version (Linux 2.6.24-23-server)

I monitored the memory usage while running the code above. There is
always around 20GB of free memory (counting buffers) left, so I
suppose the segfault can't be explained just by a memory shortage.
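
If memory is not the cause, one way to narrow things down would be to find which entry is being read when the crash happens. The following is only a sketch of that idea in plain Perl: `index_records` is a hypothetical helper, and the two-entry input is made up. It records the byte offset and length of each "//"-terminated record without parsing it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

$| = 1;    # unbuffered STDOUT, so the last line printed survives a crash

# Hypothetical helper: scan a handle and return [start_offset, length]
# for every "//"-terminated record, without parsing the contents.
sub index_records {
    my ($fh) = @_;
    local $/ = "//\n";
    my @index;
    my $start = tell($fh);
    while (defined(my $record = <$fh>)) {
        push @index, [$start, length $record];
        $start = tell($fh);
    }
    return @index;
}

# Made-up two-entry input standing in for the real release file.
my $sample = "ID   A;\n//\nID   BB;\n//\n";
open(my $fh, '<', \$sample) or die "Cannot open in-memory sample: $!";
my @index = index_records($fh);
close $fh;

for my $i (0 .. $#index) {
    my ($offset, $length) = @{ $index[$i] };
    print "entry ", $i + 1, " starts at byte $offset ($length bytes)\n";
}
```

On the real file, printing such a line just before each record is handed to the parser would leave the offending entry's offset as the last line of output, and a seek to that offset would let the single record be extracted into a small test file.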

Brian

On Thu, May 7, 2009 at 11:32 AM, Smithies, Russell
<Russell.Smithies at agresearch.co.nz> wrote:

Hi Brian,
I hate to say it, but it worked OK for me using
rel_ann_mus_01_r99.dat.gz and the simple example Bio::SeqIO code from Bugzilla.

It's not using more than 1GB memory on our server and doesn't segfault.

Send me your example code and I'll give it a go if you like.

Russell Smithies

Bioinformatics Applications Developer
T +64 3 489 9085
E  russell.smithies at agresearch.co.nz

Invermay  Research Centre
Puddle Alley,
Mosgiel,
New Zealand
T  +64 3 489 3809
F  +64 3 489 9174
www.agresearch.co.nz

_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Jason Stajich
jason at bioperl.org