[Bioperl-l] Placement of LargePrimarySeq

David Block dblock@gene.pbi.nrc.ca
Wed, 20 Sep 2000 08:38:18 -0600 (CST)


My primary concern was the speed of accessing subseqs.  I am not sure whether
it is faster than a Perl 'substr' call, but I was leery of throwing that much
data in one place.  My thinking was that this way Perl wouldn't have to go
back and forth between disk and RAM to look at one variable.

I'll send the code snippet shortly - I'm still at home right now.

-Dave

On Wed, 20 Sep 2000, James Gilbert wrote:

> 
> 
> Dave,
> 
> It's interesting to hear that this works.  My
> experience is that splitting a long sequence into
> an array of "lines" consumes more memory than a
> string containing the sequence.
> 
> 	James
> 
> On Mon, 18 Sep 2000, David Block wrote:
> 
> > We've been working on chromosome-level sequences for a while now, with
> > Arabidopsis (chr 2 and 4 were released last year).  What I did (in pure
> > Perl fashion) was to read the sequence into a string once, then use a
> > simple regex to save the sequence in lines with an equal number of
> > characters in each line (I think I used 70).  Then I saved that file
> > (chr2.clean).  I read in chr2.clean and store it as an array of lines, so
> > finding any given subsequence is a simple arithmetical calculation and an
> > array lookup.  We are working with 128 or 256 MB RAM machines here, and
> > everything is fine.
> > 
> > I even store the array in a closure so multiple sequence objects can read
> > from it at the same time.  It takes about 10 seconds or so to load up on
> > my machine, but then it's there for the life of the process, and
> > retrieving subsequences takes no time.
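
A minimal sketch of the lookup described above, written in Python rather
than Perl for brevity.  It is not the Workbench code: the names
split_fixed_width and make_subseq_reader, the 70-character width, and the
1-based inclusive coordinates are all assumptions made for illustration.
The list of lines is held in a closure, so several callers can share one
loaded copy.

```python
def split_fixed_width(sequence, width=70):
    """Split a sequence string into lines of `width` characters.

    Only the last line may be shorter than `width`.
    """
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


def make_subseq_reader(lines, width=70):
    """Return a closure over `lines` that extracts subsequences by arithmetic."""
    total = sum(len(line) for line in lines)

    def subseq(start, end):
        # Assumed convention: 1-based, inclusive coordinates.
        if not 1 <= start <= end <= total:
            raise ValueError(f"range {start}..{end} is outside 1..{total}")
        # One integer division gives the line index and the offset within it.
        first_line, offset = divmod(start - 1, width)
        last_line = (end - 1) // width
        # Join only the lines that overlap the requested range.
        chunk = "".join(lines[first_line:last_line + 1])
        return chunk[offset:offset + (end - start + 1)]

    return subseq


if __name__ == "__main__":
    genome = "ACGT" * 50
    reader = make_subseq_reader(split_fixed_width(genome))
    print(reader(69, 75))  # a range that spans the first line boundary
```

Because every line has the same width, no scan of the sequence is needed;
the cost of a lookup depends only on the length of the requested range.
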
> > 
> > Wanna see the code?  It's not SeqIO compliant, but it's part of my
> > Sequence object which implements Bio::SeqI.  It depends for now on knowing
> > the file location, which is bad.  All of this will be part of the final
> > version of Workbench, which will run on perl/MySQL.
> > 
> > HTH,
> > 
> > Dave "It's Monday" Block
> > 
> > On Mon, 18 Sep 2000, James Gilbert wrote:
> > 
> > > 
> > > 
> > > Ewan,
> > > 
> > > I've looked at the problem, and it isn't where I
> > > thought it was in the code.
> > > 
> > > I made a test sequence 40Mbp long.  I can read it
> > > into a string, but when I try to copy the string,
> > > > I get the "Out of memory!" error.  (And this is on
> > > > a machine with 1 GB RAM).
> > > 
> > > > Perhaps Perl's memory allocator is calculating a
> > > > silly number.  It might be possible to write a
> > > > PrimarySeqI object as a C extension, with a more
> > > > conservative memory allocation scheme.
> > > 
> > > 	James
> > > 
> > > On Mon, 18 Sep 2000, James Gilbert wrote:
> > > 
> > > > Ewan,
> > > > 
> > > > This reminds me that I should put in a fix I've
> > > > thought of in SeqIO::fasta to stop the memory
> > > > exploding on very large sequences.
> > > > 
> > > > 	James
> > > > 
> > > > On Sun, 17 Sep 2000, Ewan Birney wrote:
> > > > 
> > > > > 
> > > > > Tomorrow I have to do some comparisons of very large sequence files
> > > > > (around chromosome 1 size, if people are interested...). Although I could
> > > > > potentially use bioperl sequences on a machine with a huge amount of real
> > > > > > memory, I decided to make a quick module that stores a sequence in a
> > > > > > file in /tmp/ and then executes the subseq command by using seek and read
> > > > > > commands.
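
A minimal sketch of the seek-and-read idea, in Python rather than Perl.
This is not the Bio::LargePrimarySeq code: the class name FileBackedSeq
and its methods are hypothetical, and 1-based inclusive coordinates are
assumed.  The sequence is stored one byte per residue in a temporary file
(Python's tempfile picks the system temp directory, often /tmp), so a file
offset is simply a sequence position minus one.

```python
import tempfile


class FileBackedSeq:
    """Hypothetical file-backed sequence; a stand-in, not Bio::LargePrimarySeq."""

    def __init__(self):
        # Binary temporary file: one byte per residue, so offsets map to positions.
        self._fh = tempfile.TemporaryFile(mode="w+b")
        self._length = 0

    def append(self, residues):
        """Append residues to the end of the stored sequence."""
        data = residues.encode("ascii")
        self._fh.seek(0, 2)  # whence=2 moves to the end of the file
        self._fh.write(data)
        self._length += len(data)

    def length(self):
        return self._length

    def subseq(self, start, end):
        """Return residues start..end (1-based, inclusive) via seek and read."""
        if not 1 <= start <= end <= self._length:
            raise ValueError(f"range {start}..{end} is outside 1..{self._length}")
        self._fh.seek(start - 1)
        return self._fh.read(end - start + 1).decode("ascii")

    def close(self):
        self._fh.close()


if __name__ == "__main__":
    seq = FileBackedSeq()
    seq.append("ACGT" * 1000)
    print(seq.length(), seq.subseq(1, 8))
    seq.close()
```

Only the requested range is ever held in memory, at the price of a disk
access per subseq call; that is the cost a caching layer would try to hide.
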
> > > > > 
> > > > > > I have this object as Bio::LargePrimarySeq. Does anyone have any
> > > > > > objections to having this object in the Bio:: area directly, or should
> > > > > > I put it somewhere else? (Basically, how do people feel about cluttering
> > > > > > up the top level Bio:: area, or should I make a Bio::Seq:: directory?
> > > > > > NB - there might be some other extensions, like Bio::CachePrimarySeq which
> > > > > > can cache subseq calls to improve performance for LargePrimarySeq and
> > > > > > the Ensembl database equivalents...)
> > > > > 
> > > > > 
> > > > > > I need to write a SeqIO system for making this and also for writing out
> > > > > > very large fasta files. (It should step through the sequence one MB at a
> > > > > > time using the subseq method, rather than getting the whole thing out as
> > > > > > a seq.) Options:
> > > > > 
> > > > > 	(a) make a new Bio::SeqIO::bigfasta module, and ->next_seq would
> > > > > make sequences with LargePrimarySeq and ->write_seq would write with
> > > > > this subseq method
> > > > > 
> > > > > 	(b) parameterise Bio::SeqIO::fasta for both of these. (have to 
> > > > > handle boring don't use $/ stuff as reading can't put everything between
> > > > > '>' as a string, as the whole point is not to have the entire sequence as
> > > > > a string in memory)
> > > > > 
> > > > > I prefer (a) to (b).
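
A minimal sketch of the write side described above, in Python rather than
Perl: the sequence is pulled out in fixed-size steps through a subseq
call and written as wrapped fasta lines, so the whole sequence is never
held as one string.  The function name write_large_fasta, the subseq
callable (1-based, inclusive), and the 60-column line width are
assumptions for illustration, not part of any existing SeqIO module.

```python
import io


def write_large_fasta(out, seq_id, length, subseq, step=1_000_000, width=60):
    """Write one fasta record, fetching at most `step` residues at a time.

    `subseq(start, end)` is a hypothetical callable returning the residues
    in the 1-based inclusive range start..end.
    """
    out.write(f">{seq_id}\n")
    pending = ""  # residues fetched but not yet written as a full line
    for start in range(1, length + 1, step):
        end = min(start + step - 1, length)
        pending += subseq(start, end)
        # Emit only complete lines; carry the remainder into the next step.
        full = len(pending) - len(pending) % width
        for i in range(0, full, width):
            out.write(pending[i:i + width] + "\n")
        pending = pending[full:]
    if pending:
        out.write(pending + "\n")


if __name__ == "__main__":
    genome = "ACGT" * 40
    buffer = io.StringIO()
    write_large_fasta(buffer, "chr_test", len(genome),
                      lambda s, e: genome[s - 1:e], step=100)
    print(buffer.getvalue())
```

Carrying the leftover residues between steps keeps every line except the
last at the same width, even when the step is not a multiple of the width.
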
> > > > > 
> > > > > 
> > > > > 
> > > > > > I have to do this tomorrow, so if people have a view, make sure that view
> > > > > > gets back to me soon....
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > Of course this is all main trunk stuff, not on the branch.
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > -----------------------------------------------------------------
> > > > > Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
> > > > > <birney@ebi.ac.uk>. 
> > > > > -----------------------------------------------------------------
> > > > > 
> > > > > 
> > > > > _______________________________________________
> > > > > Bioperl-l mailing list
> > > > > Bioperl-l@bioperl.org
> > > > > http://bioperl.org/mailman/listinfo/bioperl-l
> > > > > 
> > > > 
> > > > James G.R. Gilbert
> > > > The Sanger Centre
> > > > Wellcome Trust Genome Campus
> > > > Hinxton
> > > > Cambridge                        Tel: 01223 494906
> > > > CB10 1SA                         Fax: 01223 494919
> > > > 
> > > > 
> > > 
> > > 
> > 
> 
> 
>