[Biopython-dev] BGZF support, was Re: Biopython 1.60 plans and beyond

Wed Jun 20 16:22:51 EDT 2012

On Wed, Jun 20, 2012 at 8:16 PM, Kevin Jacobs <jacobs at bioinformed.com>
<bioinformed at gmail.com> wrote:
> Hi Peter,
>
> Better late than never.  I've been using the code for fairly simple things
> without any problems so far.  And thanks for writing it-- as you can tell I
> don't have a lot of free time and it has been on my TODO list forever.

I do like crossing things off TODO lists :)

> Here are my fairly unstructured comments:
>
> 1. Very long variable names for  block_start_offset, within_block_offset.
> Why not just  block  and  offset?  I'm usually all for being explicit, but
> those seem a bit excessive.   "raw" actually means uncompressed, so
> "compressed_size" and "uncompressed_size" are pretty clear to me
> (though not really that much shorter).

OK. I can have a look at that - the API names and arguments are
probably more important, but I did perhaps err on the side of
verbosity - maybe an over-reaction to reading the samtools code ;)

> 2. Purely stylistic:
>
> within_block_offset < 0 or within_block_offset >= 65536
>
> reads better to me as:
>
>     not (0 <= within_block_offset < 65536)
>

Really? As you stay, basically a style judgement, but I wonder
if mine is a fraction faster...

> 3. First few reads in _load_bgzf_block are not endian safe.  Better to do:
>
>   header_bytes  = handle.read(8)   # 4+1+1+2
>   gzip_mod_time,gzip_extra_flags,gzip_os,extra_len = \
> struct.unpack("<LBBH", header_bytes)

Good point - do you have access to a suitable machine
for testing? Our current nightly buildslaves are all rather
homogenous in terms of CPU type :(

> I know you don't actually use the values, but you take the time to name them
> after all.  Eventually, it would be nice to store the mod_time from the
> first block in the BgzfReader object.

We don't expose them in the API, no - in fact the GZIP header
of mod_time is blank in all the files I have looked at, so there
didn't seem to be any point. I also therefore deliberately left
it blank on output.

> 4. I know the struct module caches strings, but it may be useful to generate
> some of the inner loop struct objects beforehand (unpack_zzz =
> struct.Struct(zzz).unpack).

As a small optimisation? Yes, that sounds worth investigating
in the medium term (after Biopython 1.60), ideally with some
non-trivial benchmarks to measure it against.

> 5. Based on having implemented a LRU cache for the samtools/tabix bgzf
> implementation, I suggest using something the following simple LRU
> implementation:
>
> http://code.activestate.com/recipes/577969-simplified-highly-optimized-lru-cache/?in=user-178123

I'd mentioned trying using Python's OrderedDict here:
http://lists.open-bio.org/pipermail/biopython-dev/2012-April/009561.html

Presumably something more specialised like the code you linked
to could pay dividends if the number of blocks in the cache is
increased. Again, something to look at in the medium term.

> 6. I disagree with:
>
>         if size < 0:
>             raise NotImplementedError("Don't be greedy, that could be
> massive!")
>
>
> There are a million other places where this kind of safety check isn't in
> place, so why would we do so here?

A combination of being lazy (see 7 below), but also recognition that
most BGZF files will be large and doing this is unlikely to intensional.
Do you have a good use case?

> 7. Of course, that means that your recursive read implementation
> should be fixed.
>
> -Kevin

OK, 6+7 can also go on the wish list - but I think overall there is
nothing here (other than the endian issue) to warrant holding
back the BGZF support from an imminent Biopython 1.60
release.

Thank you Kevin - fresh eyes are much appreciated.

Unless anyone has further comment, I'll try to rebase & merge
the BGZF branch to the master this week.

Regards,

Peter