[Biopython-dev] [Biopython] SeqIO.index improvement suggestions

Peter biopython at maubp.freeserve.co.uk
Tue Dec 22 15:34:37 UTC 2009


On Sun, Dec 20, 2009 at 6:06 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Sat, Dec 19, 2009 at 10:42 PM, Eric Talevich wrote:
>> On Sat, Dec 19, 2009 at 1:57 AM, Peter wrote:
>>>
>>> This is a vague idea (which I haven't tried yet), but maybe the
>>> Bio.SeqIO.index() function could take an optional argument
>>> (gzip=True, or something more general like archive=...) which
>>> would cause the file to be opened via the gzip module instead?
>>
>> Or: open=open -- accept a function that opens the file; by default, the
>> built-in open function, but easily replaced by gzip.open, bz2.BZ2File, or a
>> user-defined function to open zip files (since that's less straightforward).
>
> That's what I had in mind with the "archive=..." bit (I should have
> been clearer), but "open" is probably a better name for it (assuming
> it isn't going to become a reserved word in future versions of Python).

Proof of concept on github:
http://github.com/peterjc/biopython/tree/index-zip

This uses open_function as the new argument name, both to match
the existing key_function and to avoid any confusion with the
built-in name open. I'm open to debate on this.
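
To make the idea concrete, here is a minimal sketch of how an indexing
function might accept an open_function. The index_fasta helper is
hypothetical (it is not the code on the branch above, and it only
handles FASTA); it assumes the supplied function takes a filename and
a mode and returns a binary handle supporting readline() and tell().

```python
import gzip


def index_fasta(filename, key_function=None, open_function=open):
    """Map each FASTA record key to the offset of its '>' line.

    Hypothetical helper for illustration; not Bio.SeqIO.index itself.
    """
    offsets = {}
    # Always ask for binary mode so tell() reports plain byte offsets.
    handle = open_function(filename, "rb")
    try:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break  # end of file
            if line.startswith(b">"):
                # Default key is the first word after the '>' marker.
                fields = line[1:].split(None, 1)
                key = fields[0].decode("ascii") if fields else ""
                if key_function is not None:
                    key = key_function(key)
                offsets[key] = offset
    finally:
        handle.close()
    return offsets
```

With this shape, index_fasta("example.fasta") would use the built-in
open, while index_fasta("example.fasta.gz", open_function=gzip.open)
would index the decompressed stream (the filenames are placeholders).
For a gzipped file the offsets refer to positions in the uncompressed
data, not in the file on disk.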

Points to note: this is untested on Windows. In particular, we need
to look at gzipped plain text files using DOS/Windows new lines
(a rare case?) as well as gzipped plain text files using Unix new
lines (which I'd expect to be the more common of the two). From my
initial checks, while gzip.open() does take a mode argument, it
doesn't seem to support the "rU" value for universal new line
read mode. This spoils my plan to give the open_function both
the filename and the desired mode (generally "rU", but "rb" for
SFF files etc).
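
One possible workaround, sketched below, is a wrapper that still
accepts both the filename and the mode but never hands "rU" on to
gzip.open(), leaving the caller to strip the line endings itself.
The names open_gzip and read_lines are hypothetical, and this is only
one way it could be done, not what the branch currently does.

```python
import gzip


def open_gzip(filename, mode="rU"):
    """Hypothetical open_function taking (filename, mode)."""
    # Map the text read modes onto binary mode, which gzip.open accepts.
    if mode in ("r", "rU"):
        mode = "rb"
    return gzip.open(filename, mode)


def read_lines(handle):
    """Return the lines of a binary handle without their line endings."""
    # Strips a trailing "\n" or "\r\n"; old Mac style "\r"-only files
    # are not split into lines by this approach.
    return [line.rstrip(b"\r\n") for line in handle]
```

The cost is that the parser sees bytes and has to cope with both
"\n" and "\r\n" itself, rather than relying on universal new line
mode to hide the difference.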

Peter


