[Biojava-dev] Fwd: Assembly data reading

Richard Holland holland at eaglegenomics.com
Mon Jul 13 01:20:35 EDT 2009


Nothing within BioJava can parse 454 .sff files directly. However, I
think there is a growing need for it, so if anyone is willing to
contribute code, it would be very welcome.

There is also no .ace parser. In 2007 someone volunteered to write one,
but nothing came of it, and there was an earlier post (many years ago!)
from someone else who already had some working code, but again nothing
seems to have happened:

http://portal.open-bio.org/pipermail/biojava-l/2001-June/001283.html
http://lists.open-bio.org/pipermail/biojava-l/2007-July/005900.html

So to start with, someone (perhaps yourself? that would be nice! :) )
needs to volunteer to write either a .ace or .sff parser, or both. 

The thing to bear in mind with 454 contigs, as you rightly point out, is
their sheer size. Keeping them entirely in memory is likely to be
unworkable, as it would leave little room for anything else to run on
your average machine. I would suggest either memory-mapping the file
itself, or parsing it and writing out a memory-mapped summary file
containing only the bits of data you're interested in. (By
memory-mapping I loosely mean keeping an index in memory that records
where in the file each record is, so that when you need a record you
load it on the fly from the file and drop it from memory again
immediately after use. A faster variant is to put loaded records into
some kind of LRU cache, which holds only the most recently accessed
records, and to check that cache first before going to the file
directly.)
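
A minimal sketch of the index-plus-LRU-cache idea described above, in
Java. It assumes, purely for illustration, a text file with one record
per line; the class IndexedRecordFile and its methods are hypothetical
names and not part of BioJava. Real .sff or .ace records would need a
format-specific reader in place of readLine().

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: one record per line, indexed by byte offset. */
class IndexedRecordFile implements AutoCloseable {
    private final RandomAccessFile file;
    // In-memory index: byte offset at which each record starts.
    private final List<Long> offsets = new ArrayList<>();
    // LRU cache of recently loaded records, keyed by record number.
    private final Map<Integer, String> cache;
    // Counts reads that had to go to the file (for illustration only).
    private int diskReads = 0;

    IndexedRecordFile(File f, final int cacheSize) throws IOException {
        this.file = new RandomAccessFile(f, "r");
        // accessOrder=true keeps entries ordered from least to most
        // recently accessed; removeEldestEntry evicts the oldest one
        // once the cache grows past cacheSize.
        this.cache = new LinkedHashMap<Integer, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, String> eldest) {
                return size() > cacheSize;
            }
        };
        // Single pass over the file to record where each line starts.
        // RandomAccessFile.readLine() is unbuffered and slow; a real
        // indexer would use a buffered reader that tracks offsets.
        long pos = 0;
        long length = file.length();
        while (pos < length) {
            offsets.add(pos);
            file.readLine();
            pos = file.getFilePointer();
        }
    }

    /** Returns record i, from the cache if present, else from the file. */
    String get(int i) throws IOException {
        String rec = cache.get(i);
        if (rec == null) {
            file.seek(offsets.get(i));
            rec = file.readLine();
            diskReads++;
            cache.put(i, rec);
        }
        return rec;
    }

    int size() { return offsets.size(); }

    int diskReads() { return diskReads; }

    @Override
    public void close() throws IOException { file.close(); }
}
```

With this sketch, only the offsets and at most cacheSize records are
held in memory at any time; everything else stays on disk until it is
requested through get().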

cheers,
Richard


On Sun, 2009-07-12 at 23:41 +0200, Paolo Pavan wrote:
> Hi,
> I would like to post again, with some adjustments, a question I asked
> some time ago, because this may be the more appropriate list. Apologies
> for the repetition.
> Can someone kindly give me their advice?
> 
> thank you in advance,
> Paolo
> 
> 
> ---------- Forwarded message ----------
> From: Paolo Pavan <paolo.pavan at gmail.com>
> Date: 2009/7/9
> Subject: Assembly data reading
> To: Biojava-l at lists.open-bio.org
> 
> 
> Hi everybody,
> I'm fairly new to this topic. I would like to know if there is
> something that can help me load data from a large 454 contig into my
> Java program. I also need to retain in memory and access data from the
> single reads forming the contig.
> I suppose this information is in a *.sff file; if it is not
> possible to load such a file, it would be OK to load a *.ace (phrap)
> data file, which I also have.
> Many thanks for any suggestions you can give me!
> 
> Greetings,
> Paolo
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-dev
-- 
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


