[Biopython-dev] Indexing sequences compressed with BGZF (Blocked GNU Zip Format)

Peter Cock p.j.a.cock at googlemail.com
Thu Nov 10 00:01:19 UTC 2011

On Tue, Nov 8, 2011 at 6:28 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>> I choose the size of the cache based on the application and access patterns.
>> For roughly sequential sequence queries (a la samtools faidx or Pysam
>> Fastafile), all one needs is a handful of active blocks (say 16). When
>> repeatedly querying tabix files via pysam, I typically use 128 blocks for the
>> best trade-off between memory and performance. Choosing a cache size for
>> BAM files is much more complicated and I have a wide range of settings
>> depending on how many parallel BAM streams and access patterns are employed.
>> The cache size numbers needed to be quite a bit larger before switching to
>> LRU (which was a bit surprising). However, using even a small cache is
>> vastly beneficial for many access patterns. The cost of re-reading a block
>> from disk can be mitigated by the OS filesystem cache, but the decompression
>> step takes non-trivial CPU time and can be triggered dozens or hundreds of
>> times per block for some sensible-seeming access patterns.
>> -Kevin
> Certainly useful food for thought - thank you. I agree that the OS
> will probably cache commonly used BGZF blocks in the filesystem
> cache, but it doesn't solve the CPU overhead of decompression.
> In the case of Bio.SeqIO.index(...) which accesses one file, and
> Bio.SeqIO.index_db(...) which may access several files, we currently
> don't offer any end user options like this. However, there is an internal
> option for the max number of handles, and a similar option could
> control the number of BGZF blocks to cache. I could try 100
> blocks (100 times 64kb is about 6MB) as the default, and redo
> the UniProt timings (random access to sequences).
> That might be a good compromise, given the SeqIO indexing code
> has no easy way to know the calling code's usage patterns.

I've tried a cache of up to 100 BGZF blocks which are cleared
"randomly", and it doesn't make a noticeable difference to my
UniProt benchmark, which is a shame but not actually very
surprising. After all, that benchmark deliberately accesses the
records (and thus the blocks) in a random order, and the files
contain far more than 100 blocks.
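A back-of-envelope calculation shows why the cache can't help here.
This sketch uses illustrative figures (the 3 GB file size is a made-up
example, not the actual UniProt file measured above):

```python
# Why a 100-block cache barely helps under uniformly random access.

BLOCK_SIZE = 65536   # 64 KiB, the maximum BGZF block size
CACHE_BLOCKS = 100   # roughly 6 MB of decompressed data

def expected_hit_rate(file_size_bytes, cache_blocks=CACHE_BLOCKS):
    """Probability that a uniformly random block is already cached."""
    total_blocks = max(1, file_size_bytes // BLOCK_SIZE)
    return min(1.0, cache_blocks / total_blocks)

# A hypothetical 3 GB compressed file spans about 49,000 blocks,
# so a 100-block cache satisfies only ~0.2% of random lookups.
rate = expected_hit_rate(3 * 1024**3)
```

Only when the whole file fits in the cache (or the access pattern has
locality) does the hit rate become meaningful.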

I'll need a more realistic test case to properly evaluate the cache.

One example that comes to mind is iterating over BAM reads
(which would look at the blocks sequentially) but also jumping to
look at the partner reads (paired-end etc.) and then back again.
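That kind of "sequential scan with short jumps" pattern is exactly
where the eviction policy matters. Here is a toy simulation (the trace
and block numbers are made up, and the cache code is a sketch, not
Biopython's) comparing LRU eviction against arbitrary eviction via
dict.popitem():

```python
from collections import OrderedDict

def hits_lru(trace, size):
    """Count cache hits using least-recently-used eviction."""
    cache, hits = OrderedDict(), 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)  # mark as most recently used
        else:
            if len(cache) >= size:
                cache.popitem(last=False)  # evict least recently used
            cache[block] = True
    return hits

def hits_arbitrary(trace, size):
    """Count cache hits evicting an arbitrary entry (dict.popitem)."""
    cache, hits = {}, 0
    for block in trace:
        if block in cache:
            hits += 1
        else:
            if len(cache) >= size:
                cache.popitem()  # arbitrary (implementation defined)
            cache[block] = True
    return hits

# Sequential walk over 1000 blocks, with a jump back to the
# "partner read" five blocks earlier, then back again.
trace = []
for i in range(1000):
    trace.append(i)
    if i >= 5:
        trace.append(i - 5)  # jump to the mate's block
        trace.append(i)      # ...and back to where we were

lru = hits_lru(trace, 16)
arb = hits_arbitrary(trace, 16)
```

On this trace LRU keeps the small working set resident, while
arbitrary eviction can repeatedly discard the block that is just about
to be re-read, which may explain why the cache sizes needed to be
larger before Kevin switched to LRU.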


P.S. When I said "random", what I'm actually using is a Python
dictionary keyed on the block's start offset, and the dictionary's
popitem method to remove a cached block "at random" once I have
100 blocks in memory and need to free one. Of course, this isn't
really random, it is arbitrary and likely Python implementation
dependent.