[Biopython-dev] Indexing sequences compressed with BGZF (Blocked GNU Zip Format)

Kevin Jacobs <jacobs@bioinformed.com> bioinformed at gmail.com
Tue Nov 8 13:11:56 EST 2011


On Tue, Nov 8, 2011 at 12:52 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Tue, Nov 8, 2011 at 5:40 PM, Kevin Jacobs wrote:
> > I've added a proper LRU uncompressed block cache to the samtools tabix
> code,
> > if that would be of any help.  It greatly improves performance for many
> > access patterns.  (I didn't look to see if you'd already done that in
> your
> > code.)
> > -Kevin
>
> Hi Kevin,
>
> Is this already in the mainline samtools tabix repository?
>
> The current implementation in my Python code just caches the
> current block - but a simple pool had occurred to me. How many
> blocks (given each is 64kb) and how best to pick that number
> isn't obvious to me. Perhaps you can suggest some sensible
> defaults?
>
> In fact, a proper LRU cache would make sense for the handle
> pool in Bio.SeqIO.index_db(...) as well.
>
>
Hi Peter,

There is a random-eviction cache implemented in the mainline that is okay,
but it is turned off by default and, if enabled, can be very inefficient if
it keeps evicting your most active blocks.  Converting the cache to LRU
was very simple and I've been using it locally for some time now, but I
haven't had time to send the changes on to Heng Li.
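For concreteness, a minimal LRU block cache along these lines could look
like the sketch below, keyed by each block's start offset in the compressed
file.  The class and parameter names are placeholders for illustration, not
the actual samtools or Biopython code:

    from collections import OrderedDict

    class BlockCache(object):
        """Illustrative LRU cache of decompressed BGZF blocks (sketch only)."""

        def __init__(self, max_blocks=16):
            self.max_blocks = max_blocks
            self._blocks = OrderedDict()  # block start offset -> decompressed bytes

        def get(self, offset):
            data = self._blocks.pop(offset, None)
            if data is not None:
                # Re-insert so the most recently used block sorts last.
                self._blocks[offset] = data
            return data

        def put(self, offset, data):
            self._blocks[offset] = data
            if len(self._blocks) > self.max_blocks:
                # Evict the least recently used block (the oldest entry).
                self._blocks.popitem(last=False)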

I choose the size of the cache based on the application and access
patterns.  For roughly sequential sequence queries (a la samtools faidx or
Pysam Fastafile), all one needs is a handful of active blocks (say 16).
When repeatedly querying tabix files via pysam, I typically use 128 blocks
for the best trade-off between memory and performance.  Choosing a cache
size for BAM files is much more complicated, and I use a wide range of
settings depending on how many parallel BAM streams are open and what
access patterns are employed.
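In code that amounts to nothing more than picking a different capacity per
access pattern, e.g. with the hypothetical cache sketched above:

    fasta_cache = BlockCache(max_blocks=16)   # near-sequential faidx-style access
    tabix_cache = BlockCache(max_blocks=128)  # repeated tabix-style region queries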

The cache size numbers needed to be quite a bit larger before switching to
LRU (which was a bit surprising).  However, using even a small cache is
vastly beneficial for many access patterns.  The cost of re-reading a
block from disk can be mitigated by the OS filesystem cache, but the
decompression step takes non-trivial CPU time and can be triggered dozens
or hundreds of times per block for some sensible-seeming access patterns.
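To make that concrete, the read path is roughly the following.  This is
only a sketch: it assumes the standard 18-byte BGZF header with BSIZE as
the sole extra subfield and reuses the hypothetical BlockCache from above;
the real samtools and Biopython code differ in the details.

    import struct
    import zlib

    def fetch_block(handle, offset, cache):
        # A repeated seek+read is often cheap thanks to the OS page cache,
        # but the zlib inflate below is pure CPU time; the block cache is
        # what avoids paying it again and again.
        data = cache.get(offset)
        if data is None:
            handle.seek(offset)
            header = handle.read(18)                           # fixed BGZF header
            bsize = struct.unpack("<H", header[16:18])[0] + 1  # total block size
            body = handle.read(bsize - 18)                     # CDATA + CRC32 + ISIZE
            data = zlib.decompress(body[:-8], -15)             # raw deflate payload
            cache.put(offset, data)
        return data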

-Kevin

