[Biojava-l] reading nib sequence files
mark.schreiber at group.novartis.com
Thu Jan 27 22:24:38 EST 2005
I think if you want to use Java the nio packages are the way to go.
Just my $0.02
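
For what it's worth, here is a minimal sketch of the kind of nio access meant here, assuming a plain file with one byte per base. The class and method names (RangeReader, readRange, demo) are made up for illustration and are not BioJava API; the try-with-resources syntax needs Java 7 or later.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Hypothetical helper, not part of BioJava.
final class RangeReader {

    /** Reads up to 'length' bytes starting at byte 'offset'; the rest of the file is never read. */
    static byte[] readRange(Path file, long offset, int length) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(length);
            // Positional reads do not move the channel's own position.
            while (buf.hasRemaining()) {
                int n = ch.read(buf, offset + buf.position());
                if (n < 0) {
                    break; // reached end of file before 'length' bytes were available
                }
            }
            return Arrays.copyOf(buf.array(), buf.position());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes a tiny throwaway file and pulls four bytes out of the middle of it. */
    static String demo() {
        try {
            Path tmp = Files.createTempFile("seq", ".txt");
            try {
                Files.write(tmp, "ACGTACGTNNACGT".getBytes(StandardCharsets.US_ASCII));
                return new String(readRange(tmp, 8, 4), StandardCharsets.US_ASCII);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints NNAC
    }
}
```

The cost of a read depends on the length requested rather than on the size of the file, which is the property being asked for.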
Dan Baggott <dan.baggott.work at gmail.com>
Sent by: biojava-l-bounces at portal.open-bio.org
01/28/2005 07:01 AM
Please respond to baggott2
To: biojava-list List <biojava-l at biojava.org>
cc: (bcc: Mark Schreiber/GP/Novartis)
Subject: Re: [Biojava-l] reading nib sequence files
That question started off a flurry... Thanks for the input! So, from
my narrow and selfish perspective, the short of this thread is that
there isn't any "ready to go" nib i/o code and that the existing
BioJava parsing framework is not designed to deal with binary files so
it would be less than trivial to adapt it.
I don't have much experience with reading from large files (binary or
otherwise). Is there a general consensus on the path of least
resistance for implementing fast random access to large-ish nucleotide
sequences (i.e. on the order of a human chromosome in size)? I'm not so
concerned about the size of the sequence files, just speed of access.
I mentioned the nib format in the first place because I was impressed
with the speed at which Jim Kent's nibFrag utility extracts sequence
-- pretty much immediately from the human perspective.
Dan
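
A rough sketch of a nibFrag-style extraction in Java, for illustration only. The file layout is an assumption taken from UCSC's published description of the format and should be checked against nib.c: a 4-byte signature 0x6BE93D3A, a 4-byte base count, then two bases per byte with the first base in the high nibble; codes 0=T, 1=C, 2=A, 3=G, 4=N, with the top bit of a nibble marking a masked (lower-case) base. NibFragSketch and its methods are hypothetical names.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical reader; the layout constants below are assumptions (see above).
final class NibFragSketch {
    static final int SIGNATURE = 0x6BE93D3A;  // assumed nib magic number
    static final int HEADER_BYTES = 8;        // signature + base count
    static final char[] CODES = {'T', 'C', 'A', 'G', 'N'};

    /** Decodes one 4-bit code; bit 3 is assumed to flag a masked base. */
    static char decode(int nibble) {
        int code = nibble & 0x7;
        char c = code < CODES.length ? CODES[code] : 'N';
        return (nibble & 0x8) != 0 ? Character.toLowerCase(c) : c;
    }

    /** Fills 'buf' from the channel starting at file position 'pos'. */
    static void readFully(FileChannel ch, ByteBuffer buf, long pos) throws IOException {
        while (buf.hasRemaining()) {
            if (ch.read(buf, pos + buf.position()) < 0) {
                throw new EOFException("truncated nib file");
            }
        }
    }

    /** Returns bases [start, start + length), 0-based, reading only the bytes that hold them. */
    static String fragment(Path nib, int start, int length) {
        try (FileChannel ch = FileChannel.open(nib, StandardOpenOption.READ)) {
            ByteBuffer header = ByteBuffer.allocate(HEADER_BYTES).order(ByteOrder.BIG_ENDIAN);
            readFully(ch, header, 0);
            if (header.getInt(0) != SIGNATURE) {
                // The file may have been written on a machine with the other byte order.
                header.order(ByteOrder.LITTLE_ENDIAN);
                if (header.getInt(0) != SIGNATURE) {
                    throw new IllegalArgumentException("not a nib file");
                }
            }
            int total = header.getInt(4);
            if (start < 0 || length < 0 || start > total - length) {
                throw new IndexOutOfBoundsException("range outside sequence of " + total + " bases");
            }
            if (length == 0) {
                return "";
            }
            int firstByte = start / 2;               // byte holding the first requested base
            int endByte = (start + length + 1) / 2;  // exclusive
            ByteBuffer data = ByteBuffer.allocate(endByte - firstByte);
            readFully(ch, data, HEADER_BYTES + firstByte);
            StringBuilder sb = new StringBuilder(length);
            for (int i = start; i < start + length; i++) {
                int b = data.get(i / 2 - firstByte) & 0xFF;
                sb.append(decode(i % 2 == 0 ? b >>> 4 : b & 0xF));
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a five-base file holding "ACgTN" and extracts bases 1..3. */
    static String demo() {
        try {
            Path tmp = Files.createTempFile("demo", ".nib");
            try {
                ByteBuffer b = ByteBuffer.allocate(11).order(ByteOrder.LITTLE_ENDIAN);
                b.putInt(SIGNATURE).putInt(5).put((byte) 0x21).put((byte) 0xB0).put((byte) 0x40);
                Files.write(tmp, b.array());
                return fragment(tmp, 1, 3);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints CgT
    }
}
```

Because every base occupies exactly four bits, the byte holding base i sits at offset 8 + i/2, so a fragment costs one short positional read however large the chromosome is.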
On Tue, 25 Jan 2005 08:29:37 +1300, Smithies, Russell
<Russell.Smithies at agresearch.co.nz> wrote:
> You don't need to extract the whole file with ZipInputStream first.
> I managed to get the part I wanted by setting the offset to the start of
> the sequence (was using zipped chromosomes in fasta format) and the
> buffer to the length I wanted.
> It was a year or 2 ago and I probably don't have the code anymore but it
> is possible ;-)
>
> Russell Smithies
>
> Bioinformatics Software Developer
> AgResearch Invermay
> Private Bag 50034
> Puddle Alley
> Mosgiel
> New Zealand
>
> -----Original Message-----
> From: biojava-l-bounces at portal.open-bio.org
> [mailto:biojava-l-bounces at portal.open-bio.org] On Behalf Of Richard
> HOLLAND
>
> Sent: Monday, 24 January 2005 10:19 p.m.
> To: VERHOEF Frans; mark.schreiber at group.novartis.com
> Cc: biojava-list List; Thomas Down
> Subject: RE: [Biojava-l] reading nib sequence files
>
> The trouble with ZIP is that to do random-access reads of the sequence
> (eg. give me all bases from X to Y) you have to unzip the whole sequence
> each time. That makes it quite a bit slower. The solution needs to be a
> compression algorithm of some kind which allows instant random access
> without slowing down the create/update process too much either. Hence a
> custom fixed-width binary solution would be the first thing that comes
> to mind, but it may not be the only one.
>
> Richard Holland
> Bioinformatics Specialist
> GIS extension 8199
>
> ---------------------------------------------
> This email is confidential and may be privileged. If you are not the
> intended recipient, please delete it and notify us immediately. Please
> do not copy or use it for any purpose, or disclose its content to any
> other person. Thank you.
> ---------------------------------------------
>
> > -----Original Message-----
> > From: VERHOEF Frans
> > Sent: Monday, January 24, 2005 5:16 PM
> > To: Richard HOLLAND; mark.schreiber at group.novartis.com
> > Cc: Thomas Down; biojava-list List
> > Subject: RE: [Biojava-l] reading nib sequence files
> >
> >
> > You could always ZIPStream it out for even more compression.
> >
> > Frans
> >
> > -----Original Message-----
> > From: biojava-l-bounces at portal.open-bio.org
> > [mailto:biojava-l-bounces at portal.open-bio.org] On Behalf Of
> > Richard HOLLAND
> > Sent: Monday, January 24, 2005 04:59 PM
> > To: mark.schreiber at group.novartis.com
> > Cc: Thomas Down; biojava-list List
> > Subject: RE: [Biojava-l] reading nib sequence files
> >
> > NIB files store one base per 4 bits, non-variable, giving a
> > 50% compression rate and a maximum arity of 16 different base
> > values per position.
> >
> > Richard Holland
> > Bioinformatics Specialist
> > GIS extension 8199
> >
> >
> >
> > > -----Original Message-----
> > > From: mark.schreiber at group.novartis.com
> > > [mailto:mark.schreiber at group.novartis.com]
> > > Sent: Monday, January 24, 2005 4:53 PM
> > > To: Richard HOLLAND
> > > Cc: baggott2 at llnl.gov; biojava-list List; Thomas Down
> > > Subject: RE: [Biojava-l] reading nib sequence files
> > >
> > >
> > > BioJava does already do some compression on large sequences (or at
> > > least it used to). Like you say you can bit pack a lot. Ambiguity
> > > causes problems as you can have more than four symbols for DNA
> > > (including n, y, r etc).
> > >
> > > Does Jim Kent's schema offer better compression? Even if it doesn't,
> > > the use of a ByteBuffer will probably increase the speed of the
> > > current implementations.
> > >
> > > - Mark
> > >
> > > "Richard HOLLAND" <hollandr at gis.a-star.edu.sg>
> > > 01/24/2005 04:47 PM
> > >
> > >
> > > To: Mark Schreiber/GP/Novartis at PH, "Thomas Down"
> > > <td2 at sanger.ac.uk>
> > > cc: "biojava-list List" <biojava-l at biojava.org>,
> > > <baggott2 at llnl.gov>
> > > Subject: RE: [Biojava-l] reading nib sequence files
> > >
> > >
> > > I think the idea of storing sequences internally as compressed binary
> > > sequence would be a good idea regardless, for any symbol list.
> > > Currently each Symbol in a SymbolList requires one word of memory (the
> > > size of a memory pointer to the singleton Symbol instances). Therefore
> > > any SymbolList of length X containing symbols from an n-ary alphabet
> > > would require X words of memory to store it, plus the overhead of the
> > > SymbolList and n Symbol singleton instances (admittedly shared between
> > > all SymbolLists currently in memory).
> > >
> > > If you used a compressed binary format internally, doing away with
> > > explicit Symbol references and representing each symbol in a
> > > ByteBuffer as binary values (00 for A, 01 for T, 10 for C, 11 for G
> > > etc.), you would require much less space than even the singleton model
> > > above. This way you could fit four DNA symbols into a single byte of
> > > memory, as opposed to four words of memory. The number of bits required
> > > for a symbol in any given alphabet is merely log base 2 of the size of
> > > the alphabet, rounded up to the nearest whole number, e.g. for the
> > > English alphabet of 26 letters only, you would need 5 bits, or in terms
> > > of whole bytes, you would be able to fit 8 symbols into 5 bytes.
> > >
> > > To do this you would need to define a 'bits' parameter on the alphabet
> > > which is calculated from the number of symbols in the alphabet, a
> > > 'bitMap' parameter on the alphabet which maps symbols to bit values
> > > (and vice versa with 'inverseBitMap'), and keep a separate 'length'
> > > parameter in the SymbolList which would be used to tell the binary
> > > decoder when to stop parsing the sequence (as you can only store whole
> > > bytes, there will often be trailing zeroes in the buffer which could be
> > > misleading without this extra parameter).
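
The scheme described above can be sketched roughly as follows. This is an illustration under assumptions, not BioJava code: PackedSymbols, bitsFor and packedBytes are hypothetical names, the alphabet is given as a plain String whose character positions serve as the bit values, and a byte array stands in for the ByteBuffer.

```java
// Hypothetical fixed-width packed sequence; not part of BioJava.
final class PackedSymbols {
    private final char[] alphabet; // code -> symbol (the 'inverseBitMap' idea)
    private final int bits;        // bits per symbol (the 'bits' parameter)
    private final byte[] data;     // packed codes, most significant bit first
    private final int length;      // number of symbols really stored

    /** ceil(log2(alphabetSize)), with a minimum of one bit. */
    static int bitsFor(int alphabetSize) {
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(alphabetSize - 1));
    }

    PackedSymbols(String alphabet, String sequence) {
        this.alphabet = alphabet.toCharArray();
        this.bits = bitsFor(this.alphabet.length);
        this.length = sequence.length();
        this.data = new byte[(int) (((long) length * bits + 7) / 8)];
        for (int i = 0; i < length; i++) {
            int code = alphabet.indexOf(sequence.charAt(i)); // the 'bitMap' lookup
            if (code < 0) {
                throw new IllegalArgumentException("symbol not in alphabet: " + sequence.charAt(i));
            }
            for (int b = 0; b < bits; b++) {
                if (((code >> (bits - 1 - b)) & 1) != 0) {
                    long bit = (long) i * bits + b;
                    int shift = (int) (bit & 7);
                    data[(int) (bit >>> 3)] |= (byte) (0x80 >>> shift);
                }
            }
        }
    }

    /** Decodes the symbol at position i on the fly. */
    char symbolAt(int i) {
        if (i < 0 || i >= length) {
            throw new IndexOutOfBoundsException("index " + i);
        }
        int code = 0;
        for (int b = 0; b < bits; b++) {
            long bit = (long) i * bits + b;
            int set = (data[(int) (bit >>> 3)] >> (7 - (int) (bit & 7))) & 1;
            code = (code << 1) | set;
        }
        return alphabet[code];
    }

    int packedBytes() {
        return data.length;
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(symbolAt(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        PackedSymbols dna = new PackedSymbols("ATCG", "GATTACA"); // A=00, T=01, C=10, G=11
        System.out.println(dna + " in " + dna.packedBytes() + " bytes"); // prints GATTACA in 2 bytes
    }
}
```

With A mapped to 00, the two unused padding bits at the end of "GATTACA" would decode as an extra A, which is why the separate length has to be kept.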
> > >
> > > You could always return singleton Symbol objects if requested, by
> > > decoding the binary sequence on the fly, but you would no longer need
> > > to store the sequence using them.
> > >
> > > Is this worth considering for the big BioJava rewrite?
> > >
> > > Richard Holland
> > > Bioinformatics Specialist
> > > GIS extension 8199
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: mark.schreiber at group.novartis.com
> > > > [mailto:mark.schreiber at group.novartis.com]
> > > > Sent: Monday, January 24, 2005 4:37 PM
> > > > To: Thomas Down
> > > > Cc: biojava-list List; Richard HOLLAND; baggott2 at llnl.gov
> > > > Subject: Re: [Biojava-l] reading nib sequence files
> > > >
> > > >
> > > > I'd need to brush up on my nio, and my C!
> > > >
> > > > Thomas Down <td2 at sanger.ac.uk>
> > > > 01/24/2005 04:34 PM
> > > >
> > > >
> > > > To: "Richard HOLLAND" <hollandr at gis.a-star.edu.sg>
> > > > cc: "<baggott2 at llnl.gov>", biojava-list List
> > > > <biojava-l at biojava.org>, Mark
> > > > Schreiber/GP/Novartis at PH
> > > > Subject: Re: [Biojava-l] reading nib sequence files
> > > >
> > > >
> > > >
> > > > On 24 Jan 2005, at 02:48, Richard HOLLAND wrote:
> > > >
> > > > > It's a compressed binary format. I doubt BioJava would be able to
> > > > > read it without a lot of effort as the current parser framework is
> > > > > set up for text input only.
> > > >
> > > > Nib support probably wouldn't fit into the text-oriented parsing
> > > > framework, but I'm sure it could be supported somehow if there was
> > > > demand. A quick google doesn't turn up any format documentation, but
> > > > Jim Kent's IO code is at:
> > > >
> > > > http://www.soe.ucsc.edu/~kent/src/unzipped/lib/nib.c
> > > >
> > > > One interesting way to handle this might be to open the nib file as a
> > > > MappedByteBuffer, and back a SymbolList directly using that --
> > > > potentially giving us an efficient way of working with huge sequences.
> > > > Any interest in that?
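
A very rough sketch of the memory-mapped idea, under simplifying assumptions: the file holds one ASCII byte per symbol rather than nib's packed nibbles, and a CharSequence stands in for BioJava's SymbolList, whose interface is not reproduced here. MappedSequence is a hypothetical name.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical read-only view over a mapped file; a stand-in for a SymbolList.
final class MappedSequence implements CharSequence {
    private final ByteBuffer buf; // the mapping; pages are loaded by the OS on demand
    private final int offset;
    private final int length;

    private MappedSequence(ByteBuffer buf, int offset, int length) {
        this.buf = buf;
        this.offset = offset;
        this.length = length;
    }

    /** Maps the whole file read-only; nothing is copied onto the Java heap. */
    static MappedSequence open(Path file) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            if (size > Integer.MAX_VALUE) {
                throw new IllegalArgumentException("file needs more than one mapping");
            }
            // The mapping stays valid after the channel is closed.
            return new MappedSequence(ch.map(FileChannel.MapMode.READ_ONLY, 0, size), 0, (int) size);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int i) {
        if (i < 0 || i >= length) {
            throw new IndexOutOfBoundsException("index " + i);
        }
        return (char) (buf.get(offset + i) & 0xFF); // absolute get, no shared position
    }

    /** Returns a view onto the same mapping; no bytes are copied. */
    @Override
    public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException(start + ".." + end);
        }
        return new MappedSequence(buf, offset + start, end - start);
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(charAt(i));
        }
        return sb.toString();
    }

    /** Maps a small throwaway file and reads a slice from the middle. */
    static String demo() {
        try {
            Path tmp = Files.createTempFile("mapped", ".txt");
            tmp.toFile().deleteOnExit(); // a mapped file may not be deletable while mapped
            Files.write(tmp, "ACGTACGTNNACGT".getBytes(StandardCharsets.US_ASCII));
            return open(tmp).subSequence(8, 12).toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints NNAC
    }
}
```

A single mapping is limited to Integer.MAX_VALUE bytes, so a real implementation would need several mappings for larger files, and a nib-backed version would unpack two symbols per byte inside charAt.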
> > > >
> > > > Thomas.
> >
> > _______________________________________________
> > Biojava-l mailing list - Biojava-l at biojava.org
> > http://biojava.org/mailman/listinfo/biojava-l
> >
>
> =======================================================================
> Attention: The information contained in this message and/or attachments
> from AgResearch Limited is intended only for the persons or entities
> to which it is addressed and may contain confidential and/or privileged
> material. Any review, retransmission, dissemination or other use of, or
> taking of any action in reliance upon, this information by persons or
> entities other than the intended recipients is prohibited by AgResearch
> Limited. If you have received this message in error, please notify the
> sender immediately.
> =======================================================================
>
>