[Biojava-dev] Fwd: Assembly data reading

Mon Jul 13 02:29:56 EDT 2009

I agree that there is a strong need for this kind of thing in BioJava.

As Richard says, you probably can't fit it in memory, so you may want to
memory-map it. There are classes in the java.nio packages (for example
java.nio.channels.FileChannel and java.nio.MappedByteBuffer) that can help a
lot with this.
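
As a rough illustration of the memory-mapping suggestion, here is a minimal
sketch that maps only one region of a file and copies it out. The class and
method names (MappedFileReader, readRegion) are made up for this example and
are not part of BioJava; the offsets are assumed to come from whatever parser
or index you build.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for illustration only; not a BioJava class.
class MappedFileReader {

    // Returns `length` bytes starting at byte `offset` of `file`.
    // Only the requested region is mapped, and the operating system
    // pages it in on demand rather than reading the whole file.
    static byte[] readRegion(Path file, long offset, int length) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
            byte[] out = new byte[length];
            buffer.get(out); // copy the mapped bytes into an ordinary array
            return out;
        }
    }
}
```

A call such as MappedFileReader.readRegion(path, 4, 4) would return bytes 4
to 7 of the file. Note that a single mapping cannot exceed Integer.MAX_VALUE
bytes, so a very large assembly file would have to be mapped in windows.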

Also, I have had some success with in-memory compression of large files using
LZ compression. Essentially, the in-memory representation of the file is kept
LZ-compressed, and compression and decompression are handled on the fly. Again,
there are Java utility classes that can help, for example the Deflater and
Inflater classes in java.util.zip (DEFLATE is LZ77-based).
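
To make the in-memory compression idea concrete, here is a minimal sketch
that keeps a record compressed in memory and inflates it only when it is
read. It assumes java.util.zip (DEFLATE, which is LZ77 plus Huffman coding)
is an acceptable stand-in for "LZ compression"; the class name
CompressedRecord is made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical holder for illustration only; not a BioJava class.
class CompressedRecord {
    private final byte[] compressed;  // the only copy kept in memory
    private final int originalLength; // needed to size the output buffer

    CompressedRecord(String text) {
        byte[] raw = text.getBytes(StandardCharsets.UTF_8);
        originalLength = raw.length;
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(chunk);
            out.write(chunk, 0, n);
        }
        deflater.end(); // release native resources
        compressed = out.toByteArray();
    }

    // Decompresses on the fly; the result is not cached.
    String text() throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] raw = new byte[originalLength];
        int filled = 0;
        while (filled < originalLength && !inflater.finished()) {
            filled += inflater.inflate(raw, filled, originalLength - filled);
        }
        inflater.end();
        return new String(raw, StandardCharsets.UTF_8);
    }

    int compressedSize() {
        return compressed.length;
    }
}
```

How much this saves depends on the data. Repetitive sequence text compresses
well, but the trade-off is CPU time on every access.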

- Mark

On Mon, Jul 13, 2009 at 1:20 PM, Richard Holland
<holland at eaglegenomics.com> wrote:

> Nothing within BJ can parse the 454 .sff files directly. However I think
> there is a growing need for it so if anyone is willing to contribute
> code, it would be very welcome.
>
> There is also no .ace parser, although in 2007 someone volunteered to
> write one but nothing happened, and there was a previous post (many
> years ago!) from someone else who already had some working code but
> again nothing seems to have happened:
>
> http://portal.open-bio.org/pipermail/biojava-l/2001-June/001283.html
> http://lists.open-bio.org/pipermail/biojava-l/2007-July/005900.html
>
> So to start with, someone (perhaps yourself? that would be nice! :) )
> needs to volunteer to write either a .ace or .sff parser, or both.
>
> The thing to bear in mind with 454 contigs as you rightly point out is
> the sheer size of the things. The requirement to keep them entirely in
> memory is likely to be unworkable as it would leave little room for
> anything else to run on your average machine. I would suggest either
> memory-mapping the file itself, or parsing and writing out a
> memory-mapped summary file containing the bits of data you're interested
> in. (By memory-mapping I mean here keeping an index in memory that records
> where in the file each record is, so that when you need to access a record
> you load it on-the-fly from the file and drop it out of memory again
> immediately after use. An accelerated form of this is to put the loaded
> records into some kind of LRU cache which holds only the most recently
> accessed records, and then check that cache first to see if you've
> already loaded the record before accessing the file directly.)
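
The index-plus-LRU-cache scheme described above could be sketched roughly as
follows. This is only an illustration under assumptions: the names
IndexedRecordStore and Span are made up, the offset index is assumed to have
been built by an earlier pass over the file, and the fileReads counter exists
only to show when the cache is hit.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical store for illustration only; not a BioJava class.
class IndexedRecordStore {

    // Position and size of one record within the file.
    static final class Span {
        final long offset;
        final int length;
        Span(long offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    private final Path file;
    private final Map<String, Span> index;   // small, always in memory
    private final Map<String, String> cache; // bounded LRU cache of records
    int fileReads = 0;                       // demo counter only

    IndexedRecordStore(Path file, Map<String, Span> index, final int cacheSize) {
        this.file = file;
        this.index = index;
        // accessOrder = true keeps entries ordered from least to most
        // recently used; removeEldestEntry evicts once the cache is full.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > cacheSize;
            }
        };
    }

    // Returns the record text, or null if the id is not in the index.
    String get(String id) throws IOException {
        String cached = cache.get(id); // check the cache first
        if (cached != null) {
            return cached;
        }
        Span span = index.get(id);
        if (span == null) {
            return null;
        }
        byte[] raw = new byte[span.length];
        try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
            in.seek(span.offset); // jump straight to the record
            in.readFully(raw);
        }
        fileReads++;
        String record = new String(raw, StandardCharsets.US_ASCII);
        cache.put(id, record);
        return record;
    }
}
```

With a cache size of 2, reading records r1, r1, r2, r3, r1 in that order
touches the file four times: the second r1 is served from the cache, and the
last r1 has been evicted by then. This is a different mechanism from OS-level
mapping via java.nio, although the two can be combined.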
>
> cheers,
> Richard
>
>
> On Sun, 2009-07-12 at 23:41 +0200, Paolo Pavan wrote:
> > Hi,
> > I would like to post again, with some adjustments, a question I asked
> > some time ago, because this may be the more appropriate list. Apologies
> > for the repetition.
> > Could someone kindly give me some advice?
> >
> > thank you in advance,
> > Paolo
> >
> >
> > ---------- Forwarded message ----------
> > From: Paolo Pavan <paolo.pavan at gmail.com>
> > Date: 2009/7/9
> > Subject: Assembly data reading
> > To: Biojava-l at lists.open-bio.org
> >
> >
> > Hi everybody,
> > I'm fairly new to this topic. I would like to know if there is
> > something that can help me load data from a large 454 contig into my
> > Java program. I also need to keep in memory, and access, the data from
> > the individual reads that form the contig.
> > I suppose this information is in a *.sff file. If it is not
> > possible to load such a file, it would be OK to load a *.ace (phrap)
> > data file, which I also have.
> > Many thanks for any suggestions you can give me!
> >
> > Greetings,
> > Paolo
> > _______________________________________________
> > biojava-dev mailing list
> > biojava-dev at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-dev
> --
> Richard Holland, BSc MBCS
> Operations and Delivery Director, Eagle Genomics Ltd
> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/
>