[Bioperl-l] Re: Large File Manipulation

darin.m.london@gsk.com darin.m.london@gsk.com
Wed, 24 Jul 2002 08:33:34 -0400


Dinakar,
Are you able to cat the file from Linux?  If so, you might just feed a pipe
to BioPerl and not worry about which file-open library it is using. You can
pass a piped command to the -file parameter.

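#!/usr/bin/env perl

use strict;
use warnings;
use lib '/home/desas2/perl_mod/lib/site_perl/5.6.0/';

use Bio::SeqIO;

# A trailing "|" makes the -file value a command whose output is read,
# so perl itself never opens the large file.
my $seqio = Bio::SeqIO->new( -file   => 'cat /home/desas2/data/nt |',
                             -format => 'Fasta' );

# Print the first 5 sequences, stopping early if the input runs out.
my $count = 5;
while ( $count > 0 and my $seqobj = $seqio->next_seq() ) {
    print $seqobj->seq(), "\n";
    $count--;
}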
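
For what it is worth, here is a minimal sketch of the same pipe idea in core Perl only, with the pipe opened explicitly so that a failure to start cat is reported. The helper name first_fasta_headers is made up for illustration, and the list form of the pipe open assumes perl 5.8 or later. A handle opened this way could presumably be given to Bio::SeqIO through its -fh parameter instead of -file.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: run a command, read FASTA text from its output,
# and return the header lines of the first $limit records.
sub first_fasta_headers {
    my ($limit, @command) = @_;

    # List-form pipe open: no shell is involved, and perl itself never
    # opens the large file; the external program does.
    open(my $fh, '-|', @command)
        or die "Cannot run @command: $!";

    my @headers;
    while (my $line = <$fh>) {
        next unless $line =~ /^>/;    # FASTA header lines start with '>'
        chomp $line;
        push @headers, $line;
        last if @headers >= $limit;
    }

    # Closing early may make the child exit on a broken pipe; that
    # status is deliberately ignored in this sketch.
    close $fh;
    return @headers;
}

# Usage: perl first_headers.pl /path/to/fasta
if (@ARGV) {
    print "$_\n" for first_fasta_headers(5, 'cat', $ARGV[0]);
}
```

Whether cat itself copes with a file this size depends on how it was compiled, so this only helps if the cat test above succeeds.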

-
Darin M. London
Bioinformatics Investigator
R & D - Bioinformatics Operations
GlaxoSmithKline
Internal
MAIN A1909.1f

External
5 Moore Drive
P.O. Box 13398
MAIN A1909.1f
Research Triangle Park
NC 27709
Phone: (919) 483 - 0710



                                                                                                                     
bioperl-l-request@bioperl.org
Sent by: bioperl-l-admin@bioperl.org
23-Jul-2002 19:11
Please respond to bioperl-l@bioperl.org

To:      bioperl-l
cc:
Subject: Bioperl-l digest, Vol 1 #824 - 9 msgs
                                                                                                                     
                                                                                                                     



Send Bioperl-l mailing list submissions to
           bioperl-l@bioperl.org

To subscribe or unsubscribe via the World Wide Web, visit
           http://bioperl.org/mailman/listinfo/bioperl-l
or, via email, send a message with subject or body 'help' to
           bioperl-l-request@bioperl.org

You can reach the person managing the list at
           bioperl-l-admin@bioperl.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Bioperl-l digest..."


Today's Topics:

   1. Re: [Biojava-l] Re: [Bioperl-l] RE: [Open-bio-l] seq namespace method
      (Brian Gilman)
   2. Re: [Biojava-l] Re: [Bioperl-l] RE: [Open-bio-l] seq namespace method
      (Matthew Pocock)
   3. Re: Windows vs Linux vs ??? (Jason Stajich)
   4. writing remote blasts to file (Richard Adams)
   5. Re: remove.t in bioperl-db (Ewan Birney)
   6. RE: Identifiable and Describable (Hilmar Lapp)
   7. need help with large genbank file (Dinakar Desai)
   8. Re: need help with large genbank file (Chris Dagdigian)
   9. Re: need help with large genbank file (Dinakar Desai)

--__--__--

Message: 1
Date: Tue, 23 Jul 2002 16:25:41 -0400 (EDT)
From: Brian Gilman <gilmanb@genome.wi.mit.edu>
To: "Michael L. Heuer" <heuermh@acm.org>
cc: Matthew Pocock <matthew_pocock@yahoo.co.uk>,
       Ewan Birney <birney@ebi.ac.uk>, Hilmar Lapp <hlapp@gnf.org>,
       lstein@cshl.org, sac@bioperl.org,
       "OBDA BioSQL (E-mail)" <open-bio-l@open-bio.org>,
       "BioPerl (E-mail)" <bioperl-l@bioperl.org>, biojava-l@biojava.org
Subject: Re: [Biojava-l] Re: [Bioperl-l] RE: [Open-bio-l] seq namespace
method

Hello Guys,

           I'd like to work with Michael to see where we differ in our LSID
implementations and where we agree. Then we can both submit an
implementation to biojava.

           I'd like to note that this is still a work in progress!! We are in
the midst of gathering feedback on the spec and will produce another draft
after our next I3C meeting, which is next week.

           Michael, we should get together and discuss this over a
beer...You're only down the road!

                                                    Best,

                                                              -B

-----------------------
Brian Gilman <gilmanb@genome.wi.mit.edu>
Group Leader Medical & Population Genetics Dept.
MIT/Whitehead Inst. Center for Genome Research
One Kendall Square, Bldg. 300 / Cambridge, MA 02139-1561 USA
phone +1 617  252 1069 / fax +1 617 252 1902


On Mon, 22 Jul 2002, Michael L. Heuer wrote:

>
> Hello Matthew,
>
> It's probably best to wait a bit before importing this into biojava
> proper.  Things are still a little bit up in the air here on the list(s),
> and I believe there is a LSID implementation in the omnigene codebase
> that we should take a look at before committing to any one design.
>
> I like the idea of removing the accessors from the Identifiable interface,
> but would advocate leaving them in the abstract and support
> implementations.
>
>    michael
>
>
> On Fri, 19 Jul 2002, Matthew Pocock wrote:
>
> > Great Michael,
> >
> > LifeScienceIdentifier looks good. I'd prefer Identifiable to drop all
> > methods except getPreferredIdentifier() and getIdentifiers(). I think
> > the other methods duplicate already available information (e.g.
> > getAuthority etc can be found by calling
> > getPreferredIdentifier().getAuthority() ). Putting functionality/api in
> > exactly one place is generally a good idea. What are your thoughts?
> >
> > What package would you like to put these classes/interfaces into? An
> > existing package, or org.biojava.bio.lsid or
> > org.biojava.bio.program.lsid?
> >
> > Matthew
> >
> > Michael L. Heuer wrote:
> > > Having spent too much time in meetings lately, I felt the need to
> > > actually code something, and whipped up a java implementation of
> > > LSIDs based on the discussion here.  Being that this has been a moving
> > > target, I may not have nailed it exactly right -- I went with
> > > containment and the idea of a "preferred identifier", e.g.
> > >
> > > interface LifeScienceIdentifier extends Immutable {
> > >   public String getAuthority();
> > >   public String getNamespace();
> > >   public Object getObjectId();
> > >   public Object getVersion();
> > >
> > >   public String asIdentifier();  // authority:namespace:object_id
> > >   public String asCommonName();  //  namespace:object_id.version
> > > }
> > >
> > > interface Identifiable {
> > >   // delegate to preferred
> > >   public String getAuthority();
> > >   public String getNamespace();
> > >   public Object getObjectId();
> > >   public Object getVersion();
> > >
> > >   public LifeScienceIdentifier getPreferredIdentifier();
> > >   public boolean hasMultipleIdentifiers();
> > >   public List getIdentifiers();
> > > }
> > >
> > > A tarball with implementations, factories, junit tests, etc. is
> > > available
> > >
> > > http://shore.net/~heuermh/lsid-PROPOSAL.tar.gz
> > >
> > >
> > >    michael
> > >
> > >
> >
> > _______________________________________________
> > Open-Bio-l mailing list
> > Open-Bio-l@open-bio.org
> > http://open-bio.org/mailman/listinfo/open-bio-l
> >
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l@biojava.org
> http://biojava.org/mailman/listinfo/biojava-l
>


--__--__--

Message: 2
Date: Tue, 23 Jul 2002 21:59:39 +0100
From: Matthew Pocock <matthew_pocock@yahoo.co.uk>
To: Brian Gilman <gilmanb@genome.wi.mit.edu>
CC: "Michael L. Heuer" <heuermh@acm.org>, Ewan Birney <birney@ebi.ac.uk>,
       Hilmar Lapp <hlapp@gnf.org>, lstein@cshl.org, sac@bioperl.org,
       "OBDA BioSQL (E-mail)" <open-bio-l@open-bio.org>,
       "BioPerl (E-mail)" <bioperl-l@bioperl.org>, biojava-l@biojava.org
Subject: Re: [Biojava-l] Re: [Bioperl-l] RE: [Open-bio-l] seq namespace
method

Hi Brian.

Great to hear that you're interested. The bottom line from my point of
view is that any LSID object should natively fit into the java naming &
directory system so that it's a resolvable name just like all the other
name implementations. That way we are not tied to LSID unnecessarily,
but we are able to make sensible name objects for embl entries et al.
Everything else is up to you guys.

Perhaps if there was a Nameable interface with getName() that returns a
jndi name object, and we mixin this interface as needed? Oh, I don't
know. Come up with something. Surprise us.

I look forward to seeing (or not noticing ;-) ) what you come up with.

Matthew

ps Will either of you be at BOSC or ISMB? There will be beer there :-)





--__--__--

Message: 3
Date: Tue, 23 Jul 2002 12:53:27 -0400 (EDT)
From: Jason Stajich <jason@cgt.mc.duke.edu>
To: Michael J Scott <mscott.law@verizon.net>
cc: <bioperl-l@bioperl.org>
Subject: Re: [Bioperl-l] Windows vs Linux vs ???

use linux or freebsd.  you'll be happier in the long run.
my opinion only of course.

On Sun, 21 Jul 2002, Michael J Scott wrote:

> I am a new graduate student who has been tasked with developing some
> level of expertise in bioinformatics and perl.  I currently work on a
> Windows 98 platform.  If I stay with Windows, am I just buying myself
> problems down the road?  Should I consider developing some Linux
> expertise and creating a Linux partition for the BioPerl work?
>
> Thanks for any advice.  Opinions welcome.  Slanderous comments about
> MS gladly accepted.
>
> Michael Scott
> University of Texas at Arlington
>

--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu


--__--__--

Message: 4
Date: Tue, 23 Jul 2002 12:32:18 +0100
From: Richard Adams <Richard.Adams@ed.ac.uk>
Organization: Department of Medical Sciences, Edinburgh University
To: bioperl-l@bioperl.org
Subject: [Bioperl-l] writing remote blasts to file

   Hello,
I'm sure I'm missing something obvious but am trying just to get a
remote blast report to file for future
import into SearchIO for parsing.  Since there doesn't seem to be the
equivalent to the standaloneblast's
    $factory->outfile('>OUT.blast')
I'm trying to use the TableWriter modules

e.g.,
use Bio::Tools::RemoteBlast;
use Bio::SearchIO;
use Bio::SearchIO::Writer::HSPTableWriter;

    $report = $factory->retrieve_blast($rid);
    # assuming retrieval has worked

    $factory->remove_rid($rid);
    my $writer = Bio::SearchIO::Writer::HSPTableWriter->new();

    my $out = Bio::SearchIO->new(-format => 'blast',
                                 -file   => ">OUT4.out",
                                 -writer => $writer);

    while ( $result = $report->next_result ) {
        $out->write_result($result);
    }

Examining the contents of OUT4.out shows a listing full of 0  or 0.00
instead of the HSP details
If I use ResultTableWriter this just outputs a single integer .
But the blast did generate hits.
How can I just send the whole report to file, and/or get the
HSPTableWriter to access the data in the blast report?
I'm sure this has a trivial answer but would greatly appreciate any
assistance.

Richard Adams
Molecular Medicine Centre
University of Edinburgh
UK



--__--__--

Message: 5
Date: Tue, 23 Jul 2002 08:57:14 +0100 (BST)
From: Ewan Birney <birney@ebi.ac.uk>
To: Hilmar Lapp <hlapp@gnf.org>
cc: "BioPerl (E-mail)" <bioperl-l@bioperl.org>
Subject: Re: [Bioperl-l] remove.t in bioperl-db

On Wed, 10 Jul 2002, Ewan Birney wrote:

>
>
> On Tue, 9 Jul 2002, Hilmar Lapp wrote:
>
> > remove.t needs to be rewritten completely; actual SQL should be absent
> > from test scripts because otherwise there is no way to hook up your
> > adaptor. Would anyone be hurt if I remove remove.t completely?
>
> Not from my perspective...
>

These are very old mail messages from my perspective - not sure if this is
a bioperl problem or an Orange mail network problem....


>
> >
> >
> >        -hilmar
> > --
> > -------------------------------------------------------------
> > Hilmar Lapp                            email: lapp at gnf.org
> > GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
> > -------------------------------------------------------------
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l@bioperl.org
> > http://bioperl.org/mailman/listinfo/bioperl-l
> >
>
>

-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
<birney@ebi.ac.uk>.
-----------------------------------------------------------------


--__--__--

Message: 6
Subject: RE: [Bioperl-l] Identifiable and Describable
Date: Tue, 23 Jul 2002 14:59:41 -0700
From: "Hilmar Lapp" <hlapp@gnf.org>
To: <birney@ebi.ac.uk>, <bioperl-l@bioperl.org>

I've seen lsid_string() there too, which I think shouldn't be there. It
pertains to a specific implementation of IdentifiableI, namely the LSID
implementation. Another one might have mobyid_string(), or whatever.

This leads to my other suggestion: I'd plead to have Identifiable in most
if not all cases implemented by composition and not direct inheritance.
This keeps the implementation flexible and open. With the design that
Bio::PrimarySeq is-a IdentifiableI you've got to inherit from the
interface, but internally the methods should all delegate to an object that
implements IdentifiableI.

I.e., you would have

           Bio::Identifier::SimpleID is-a IdentifiableI
           Bio::Identifier::LSID is-a IdentifiableI
           Bio::Identifier::MobyID is-a IdentifiableI
           Bio::Identifier::EnsemblID is-a IdentifiableI
           # ... and whatever more you wish

           Bio::PrimarySeqI is-a IdentifiableI

           Bio::PrimarySeq

                     sub identifier { ... } # get/set IdentifiableI impl.

                     sub object_id {
                               my ($self, @args) = @_;
                               return $self->identifier()->object_id(@args);
                     }

                     ...

Those who want an LSID implementation can have it. Those who want another
can have it too.
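
To make the delegation concrete, here is a small self-contained sketch in plain Perl. It is only an illustration under assumptions: My::SimpleID and My::Seq are made-up package names standing in for an identifier implementation and for Bio::PrimarySeq, and the values at the bottom are sample data, not real records.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical identifier class holding the four IdentifiableI fields.
package My::SimpleID;

sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}

# Generate a combined getter/setter for each field.
for my $field (qw(object_id version namespace authority)) {
    no strict 'refs';
    *{ __PACKAGE__ . "::$field" } = sub {
        my $self = shift;
        $self->{$field} = shift if @_;
        return $self->{$field};
    };
}

# Hypothetical sequence class: it has-a identifier and delegates to it.
package My::Seq;

sub new {
    my ($class, %args) = @_;
    my $self = bless {}, $class;
    $self->identifier($args{identifier} || My::SimpleID->new);
    return $self;
}

# Get/set the object that actually implements the identifier methods.
sub identifier {
    my $self = shift;
    $self->{identifier} = shift if @_;
    return $self->{identifier};
}

# Each interface method simply forwards to the contained identifier.
sub object_id { my $self = shift; return $self->identifier->object_id(@_) }
sub version   { my $self = shift; return $self->identifier->version(@_) }
sub namespace { my $self = shift; return $self->identifier->namespace(@_) }
sub authority { my $self = shift; return $self->identifier->authority(@_) }

package main;

# Sample values only.
my $seq = My::Seq->new(
    identifier => My::SimpleID->new(
        authority => 'ncbi',
        namespace => 'genbank',
        object_id => 'X12345',
        version   => 1,
    ),
);
print join(':', $seq->authority, $seq->namespace, $seq->object_id), "\n";
```

Swapping in a different identifier class then needs no change to My::Seq, as long as the new class offers the same four methods.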

A somewhat unrelated question is how authority and namespace map to biosql.
There is only biodatabase now with a name field ... Do we simply add an
authority attribute? (i.e. one combination of authority/namespace makes one
biodatabase entry?)

           -hilmar


> -----Original Message-----
> From: Ewan Birney [mailto:birney@ebi.ac.uk]
> Sent: Monday, July 22, 2002 4:12 PM
> To: bioperl-l@bioperl.org
> Subject: [Bioperl-l] Identifiable and Describable
>
>
>
> I have put in two interfaces, IdentifiableI and DescribableI into
> bioperl.
>
>
> IdentifiableI defines
>
>    ->object_id
>    ->version
>    ->namespace
>    ->authority
>
> I debated adding the common ideas of display_id and description which
> also are often properties associated with Sequences/Go terms/Pfam
> domains/Interpros whatever, but realised that ideally this is another
> interface, which I described as DescribableI
>
>   Methods
>
>    display_name
>    description
>
>
> I made PrimarySeqI isa both an IdentifiableI and DescribableI
> and adjusted
> PrimarySeq implementation - need to also handle Seq and then the SeqIO
> system.
>
>
> This is not written in stone yet, so people can still talk me
> out of this
> route, but it feels comfortable for me so far....
>
>
>
>
>

--__--__--

Message: 7
Date: Tue, 23 Jul 2002 17:30:52 -0500
From: Dinakar Desai <Desai.Dinakar@mayo.edu>
CC: Bioperl <bioperl-l@bioperl.org>, "Desai, Dinakar"
<Desai.Dinakar@mayo.edu>
Subject: [Bioperl-l] need help with large genbank file

Hello:

I am new to Perl and BioPerl. I have downloaded a file from NCBI
(ftp://ftp.ncbi.nih.gov/blast/db/nt) and this file is quite large. I am
trying to parse this file for a certain pattern with BioPerl, and I get an
error. I have looked into largefasta.pm, and they suggest not using it.
I would appreciate it if you could help me with this problem.

My code to test only 5 records out of this big file is as follows:
<code>
#!/usr/bin/env perl

use lib '/home/desas2/perl_mod/lib/site_perl/5.6.0/';

use Bio::SeqIO;

$seqio = Bio::SeqIO->new( -file =>"/home/desas2/data/nt", '-format' =>
'Fasta');

$count = 5;
while ($count > 0 and $seqobj = $seqio->next_seq()) {
         print $seqobj->seq(), "\n";
         $count--;
}
</code>
and the error message is:
<error>
------------ EXCEPTION  -------------
MSG: Could not open /home/desas2/data/nt for reading: File too large
STACK Bio::Root::IO::_initialize_io /home/desas2/perl_mod/lib/site_perl/5.6.0//Bio/Root/IO.pm:244
STACK Bio::SeqIO::_initialize /home/desas2/perl_mod/lib/site_perl/5.6.0//Bio/SeqIO.pm:381
STACK Bio::SeqIO::new /home/desas2/perl_mod/lib/site_perl/5.6.0//Bio/SeqIO.pm:314
STACK Bio::SeqIO::new /home/desas2/perl_mod/lib/site_perl/5.6.0//Bio/SeqIO.pm:327
STACK toplevel ./test_fasta.pl:8

--------------------------------------
</error>

Do you have any suggestions on how I could read this big file and
get sequence objects? I know how to manipulate sequence objects.

Thank you.

Dinakar





--__--__--

Message: 8
Date: Tue, 23 Jul 2002 19:00:22 -0400
From: Chris Dagdigian <dag@sonsorol.org>
To: Dinakar Desai <Desai.Dinakar@mayo.edu>
CC: Bioperl <bioperl-l@bioperl.org>
Subject: Re: [Bioperl-l] need help with large genbank file


Dinakar,

The file is too big for perl to open a filehandle on (at least that is
what your error message states).

I know from painful experience :) that the file you are trying to read
is larger than 2GB when it is uncompressed into its native form.  If
your computer, filesystem, kernel or operating system cannot handle
files larger than 2GB in size then you will get these sorts of errors.

There are various tricks to make things work. Systems with 64-bit
architectures (like Alphaservers) do not have these problems at all.

Linux solved this in the kernel a long time ago and the common linux
filesystems can all handle large files. There are however binary
programs that you may run into like 'cat', 'more', 'uncompress' etc.
etc. that will coredump or segfault on large files because they were not
compiled to support 64-bit offsets.
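
One quick way to see whether a given perl binary is in that category is to ask its build configuration. The sketch below uses the core Config module; the helper name large_file_report is made up, and treating an offset size under 8 bytes as "no large-file support" is a rule of thumb rather than a guarantee.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Config;

# Hypothetical helper: report how this perl was built with respect to
# large files, based on two entries in %Config.
sub large_file_report {
    return (
        uselargefiles => defined $Config{uselargefiles} ? 1 : 0,
        lseeksize     => $Config{lseeksize} || 0,    # bytes in a file offset
    );
}

my %report = large_file_report();
printf "uselargefiles: %s\n", $report{uselargefiles} ? 'yes' : 'no';
printf "file offset size: %d bytes\n", $report{lseeksize};
print "This perl probably cannot open files over 2 GB\n"
    if $report{lseeksize} < 8;
```

The same information is available from the command line with perl -V:uselargefiles.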

Without knowing your operating system or local configuration I'd
recommend that you experiment with breaking NT into several smaller
pieces. You should be able to determine experimentally the filesize
limit that you appear to have.
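
As a rough sketch of the splitting step, the script below reads FASTA from a filehandle and starts a new output file at a record boundary once the current piece has reached a size limit. The helper name split_fasta, the output naming scheme, and the byte limit in the usage line are all assumptions for illustration. Reading from STDIN means perl never has to open the big file itself, though whatever feeds the pipe still has to cope with it.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: copy FASTA records from $in into numbered files
# "$prefix.000", "$prefix.001", ... and start a new file at the next
# header line once the current file has reached $max_bytes.
sub split_fasta {
    my ($in, $prefix, $max_bytes) = @_;
    my ($out, $written, @files);

    while (my $line = <$in>) {
        # Only switch files on a header line so no record is cut in two.
        if ($line =~ /^>/ && (!$out || $written >= $max_bytes)) {
            if ($out) { close $out or die "close failed: $!"; }
            my $name = sprintf '%s.%03d', $prefix, scalar @files;
            open($out, '>', $name) or die "Cannot write $name: $!";
            push @files, $name;
            $written = 0;
        }
        next unless $out;    # skip anything before the first header
        print {$out} $line;
        $written += length $line;
    }
    if ($out) { close $out or die "close failed: $!"; }
    return @files;
}

# Usage: cat /path/to/nt | perl split_fasta.pl nt_part 1000000000
if (@ARGV == 2) {
    print "$_\n" for split_fasta(\*STDIN, @ARGV);
}
```

A piece can overshoot the limit by up to one record, so pick a limit comfortably below whatever ceiling you find.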

-Chris







--
Chris Dagdigian, <dag@sonsorol.org>
Independent life science IT & research computing consulting
Office: 617-666-6454, Mobile: 617-877-5498, Fax: 425-699-0193
Work: http://BioTeam.net PGP KeyID: 83D4310E  Yahoo IM: craffi


--__--__--

Message: 9
Date: Tue, 23 Jul 2002 18:10:49 -0500
From: Dinakar Desai <Desai.Dinakar@mayo.edu>
CC: Bioperl <bioperl-l@bioperl.org>
Subject: Re: [Bioperl-l] need help with large genbank file


Thank you very much for your email. I am running this script on :
Linux  2.4.7-10 #1 Thu Sep 6 16:46:36 EDT 2001 i686 unknown
it has about 2.5 GB memory.

I used Biopython and I could open the file and do some work. I thought I
would try BioPerl (which seems to be more mature) and I ran into this problem.

The size of the file is 6298460844 bytes (about 6.3 GB).

Can you suggest how I can break this file into smaller files and then
parse them.



Thank you.

Dinakar

--

Dinakar Desai, Ph.D
perl -e '$_ = "mqonx.zako\@ude";$_=~ tr /qnxzk\@.ue/npqmy.\@eu/; print'
----------------------

Everything should be made as simple as possible, but no
simpler.-----Albert Einstein



--__--__--



End of Bioperl-l Digest