[Biopython] Bit encoded sequences, was: Correcting short read errors based on k-mer coverage

Peter biopython at maubp.freeserve.co.uk
Mon Sep 28 13:02:33 UTC 2009


On Mon, Sep 28, 2009 at 1:18 PM, Dan Bolser <dan.bolser at gmail.com> wrote:
>
> 2009/9/28 Peter <biopython at maubp.freeserve.co.uk>:
>>
>> An alternative approach reduced the memory needed for this example
>> from 1.25GB resident to about 0.8GB resident, while still taking about
>> 5 mins. Instead of storing the kmers as strings, I encoded them as large
>> integers (basically using 2 bits per letter instead of 8 bits). This means
>> for kmers up to and including 32-mers, you need only a 64bit unsigned
>> long. You can do this in Python, but my initial code was a bit slow - so
>> I redid it as a Python C extension. The only problem here is what to do
>> with ambiguous sequences - for example any N characters?
>
> Sounds like a good solution... does BioPython have any modules for
> hiding this kind of compressed sequence representation? i.e. using
> some object to represent the string instead of a string, where the
> object has this compression 'under the hood'?

Not at the moment, no. However, a bit encoded Seq subclass for
unambiguous DNA or RNA is something I had in mind while looking
at encoding kmers. I think BioJava does something like this already.

Another reason to look at this is with an eye on the future for Python
3, which makes unicode the default (although byte strings remain,
we'll probably want to use them in the Seq object).

> I think most people reserve this kind of compression for
> 'non-ambiguous' strings only, and cludge the ambiguity codes[1] if
> necessary.
>
> [1] http://droog.gs.washington.edu/parc/images/iupac.html

For ambiguous DNA or RNA, you can use four bits for each bp
(i.e. can it be an A, C, G, or T - thus you might encode this as
1000 = A, 0100 = C, ..., 1100 = K and 1111 = N). This requires
50% of the memory of the naive one byte for each bp scheme.

For unambiguous DNA or RNA you just need two bits, say
A = 00, C = 01, G = 10 and T=11. This mapping should make
taking the complement very fast via bit flipping. Dealing with
the reverse complement requires a little more thought (e.g.
byte alignment issues if the sequence is not a multiple of
four bp in length).

For proteins things are less easy, you'd need at least five bits
(2^5 = 32 combinations) which isn't ideal. You can have
compression, but the byte boundaries may slow things down.

Then there are things like gap characters, stop symbols, mixed
case and any other ad-hoc additions.

For ambiguous single case DNA/RNA we could do two bp per
byte, which in itself may not be worth the hassle. There would
be scope for (reverse) complement optimisations, however, if
it allowed faster sequence matching things become more
interesting...

But certainly, for unambiguous single case DNA/RNA we could
get four bp per byte, which seems a worthwhile improvement.

>> This still used a Python dictionary to hold the (integer) encoded kmer
>> sequences as keys, and their (integer) counts as values. As noted
>> before, there are disk based options here like an SQLite database.
>
> Yeah, I was wondering about a Berkeley DB or similar.

Berkeley DB is certainly a sensible option to look at. Python 2.x
includes DBD wrappers, but sadly this has been dropped from
the standard libraries in Python 3.x which is why I had a slighly
leaning to trying SQLite first of all.

> I wonder if there is any way to do this approximately and still get
> good error correction statistics? (I'm thinking about the way BowTie
> works using approximate hash matching to pre-filter alignments).

I don't know exactly what BowTie does.

> Any hints from the two papers?

Those that I have looked at are vague about the implementation
details, but I may just have not read them carefully enough.

Peter



More information about the Biopython mailing list