[Biopython] SeqIO.index improvement suggestions

Renato Alves rjalves at igc.gulbenkian.pt
Fri Dec 18 17:17:44 UTC 2009



[I tried submitting this message to the dev mailing list, but it was
rejected since I'm not yet authorized to post there, so here it goes]

Hi everyone,

I'm working on changes to the Bio.SeqIO.index() function to make it more
consistent with .read() and .parse(), i.e. to accept a file handle instead
of a filename, and to add a way to cache the index in a file so that it
does not have to be rebuilt every time.

We are proposing these two changes because we were about to implement
our own indexing solution when we realized that one had been added in 1.52.

However the implementation in 1.52 has a few limitations.

The first is that we keep our database gzipped to save space and use
gzip.open() to create the file handle that is then passed to .parse().
The same cannot be done with .index(), which expects a filename.
Handle support is already implemented in
http://github.com/Unode/biopython/commit/6fc390151452e3ddf26a117269132125a3ffb3fe
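
To illustrate why a handle-based interface helps here, below is a minimal
sketch of offset indexing that works on any binary file-like object,
including a gzip handle. It is not the Biopython implementation or the code
in the commit above: the function name index_fasta_offsets, the FASTA-only
parsing and the in-memory sample data are assumptions made for the example,
and it is written for current Python 3.

```python
import gzip
import io


def index_fasta_offsets(handle):
    """Map each FASTA record id to the offset of its '>' line.

    Hypothetical helper: `handle` is any binary file-like object that
    supports readline() and tell().
    """
    offsets = {}
    while True:
        pos = handle.tell()  # position before reading the line
        line = handle.readline()
        if not line:
            break  # end of file
        if line.startswith(b">"):
            fields = line[1:].split(None, 1)
            if fields:  # skip header lines that carry no id
                offsets[fields[0].decode("ascii")] = pos
    return offsets


# In-memory gzipped FASTA data, standing in for gzip.open(path, "rb").
raw = b">seq1 first\nACGT\n>seq2 second\nGGCC\n"
handle = gzip.GzipFile(fileobj=io.BytesIO(gzip.compress(raw)), mode="rb")
offsets = index_fasta_offsets(handle)

# Offsets refer to the uncompressed stream, so seek() on the gzip handle
# jumps straight to the start of the requested record.
handle.seek(offsets["seq2"])
header = handle.readline()
```

Note that seeking backwards in a gzip stream forces decompression to restart
from the beginning, so random access through such a handle is correct but can
be slow on large files.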

The second is that we are going to use this feature to quickly search the
database in a web application. Since nothing persists across web
requests, we would need to recalculate the index on every request.

The details of how we plan to implement this are the following:

cPickle the internal dictionary of offsets and save it in the database
folder under the same name as the database + .index. Whether the database
file has changed will be checked using its name and timestamp. By default
.index() will look for this file, check the timestamp and use the cache
if it matches; otherwise the index will be recalculated. The save
function will be available like:

>>> d = SeqIO.index(...)
>>> d.save(filename)

where filename is optional and defaults to "%s.index" % _handle.name
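
The caching scheme described above could be sketched roughly as follows.
This is only an illustration under stated assumptions: load_or_build_index
and its build_index callback are hypothetical names, not part of Biopython,
the code uses Python 3's pickle in place of cPickle, and the name plus
modification time of the database file serve as the consistency stamp.

```python
import os
import pickle


def load_or_build_index(db_path, build_index, index_path=None):
    """Return the offsets dict for db_path, using a pickled cache if valid.

    Hypothetical helper: build_index(db_path) must return a dict of offsets.
    """
    if index_path is None:
        index_path = "%s.index" % db_path
    # Name and timestamp of the database identify the version that was indexed.
    stamp = (os.path.basename(db_path), os.path.getmtime(db_path))

    if os.path.exists(index_path):
        try:
            with open(index_path, "rb") as fh:
                cached = pickle.load(fh)
        except (OSError, EOFError, pickle.UnpicklingError):
            cached = None  # unreadable or truncated cache: rebuild below
        if isinstance(cached, dict) and cached.get("stamp") == stamp:
            return cached["offsets"]

    # Cache missing or stale: recalculate the index and save it again.
    offsets = build_index(db_path)
    with open(index_path, "wb") as fh:
        pickle.dump({"stamp": stamp, "offsets": offsets}, fh,
                    pickle.HIGHEST_PROTOCOL)
    return offsets
```

A web request handler would call load_or_build_index() on every request and
only pay the indexing cost when the database file has changed. Since
unpickling can execute arbitrary code, the .index file should only be loaded
from a location that untrusted users cannot write to.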

We already have a solution like this implemented as subclasses of
SeqIO._index; it's just a matter of reworking it and merging it into
Biopython, if you consider it a good addition to the code.

I would like to hear your comments and suggestions on this.

Regards,
Renato


