[Biopython-dev] [Biopython] SeqIO.index improvement suggestions

Fri Dec 18 23:39:28 UTC 2009

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Sorry for starting this on the discussion list; it took a bit longer
than I expected to get approval to post here.

I'm now moving the subject to the right place and leaving the full
quote history to make it easier to follow.

Quoting Peter on 12/18/2009 09:39 PM:
> Hi Renato,
> 
> I'm cooking dinner while writing this, so it won't be as in depth as
> usual...
> 
> On Fri, Dec 18, 2009 at 5:17 PM, Renato Alves <rjalves at igc.gulbenkian.pt> wrote:
>> [I tried submitting this message to the dev mailing list, but got
>> rejected since I'm not yet authorized to post there, so here it goes]
> 
> Have you definitely subscribed to the dev list? That should be all that
> is required to post there, and this discussion would be better suited
> there.
> 
>> Hi everyone,
>>
>> I'm working on changes to the Bio.SeqIO.index() function to make it more
>> consistent with the .read and .parse i.e. accept a filehandle instead of
>> a filename and also to include a way to cache the index into a file to
>> speed up the process.
>>
>> The reason we are implementing these two is that we were going to
>> implement our own index solution until we realized this was added in 1.52.
>>
>> However the implementation in 1.52 has a few limitations.
> 
> Yes, this was designed to cover basic use cases in a general way,
> but with the option in future to do other things - and in particular
> saving the index to disk was kept in mind.
> 
>> One limitation is that we are using a gzipped database for the sake of
>> space and using gzip.open() to create the file-handle that would then be
>> passed to .parse(). The same was not doable with .index().
>> This is already implemented in
>> http://github.com/Unode/biopython/commit/6fc390151452e3ddf26a117269132125a3ffb3fe
> 
> That was a deliberate choice in that the index code wants to "own"
> the handle. If other code has access to the handle, there is a risk
> of different bits of code moving the handle pointer etc. But, if you
> are careful it could be done.
The way I approached it was to reset the handle pointer to the first
position, since we would like to index the full file. But I understand
that if the user uses the same handle on different files, weird results
may happen.
A simple workaround could be to copy the filehandle object in such a
way that its properties are maintained (like being a gzip.open()
filehandle) but its use doesn't affect the use of the original handle.
However, I don't know if this is possible.
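
A rough sketch of that workaround, assuming the handle was opened for
reading from a path and exposes that path as .name (true for plain
files and for binary gzip.open() handles in Python 3, but not for
arbitrary file-like objects). reopen_like() is a hypothetical helper
name, not something in Biopython; it opens a second handle rather than
copying the first.

```python
import gzip


def reopen_like(handle):
    """Return a new, independent read handle on the same file as `handle`.

    Hypothetical helper: it relies on `handle.name` being the path the
    handle was opened from, so it fails for pipes, sockets, StringIO, etc.
    """
    name = getattr(handle, "name", None)
    if not isinstance(name, str):
        raise ValueError("cannot reopen a handle without a file name")
    if isinstance(handle, gzip.GzipFile):
        # Keep the gzip behaviour; the new handle has its own position.
        return gzip.open(name, "rb")
    mode = getattr(handle, "mode", None)
    if not isinstance(mode, str) or "r" not in mode or "+" in mode:
        raise ValueError("only plain read-only handles are supported")
    return open(name, mode)
```

The second handle keeps its own file position, so seeking on it does
not disturb whoever holds the original.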

> 
> There are also issues here in combination with saving the index.
> With a filename, the code can easily reopen the file in the same
> mode. With a handle, things are more tricky. You have non-file
> handles to consider - such as the gzip example. There is also the
> problem of recording the file mode (normal text, universal text,
> or binary - which we will need for SFF files - code already written).
> 
I see. Only after your comment did I realize that handle.name and
handle.mode are only available on normal filehandles. A gzip.open()
handle stores the filename in .filename, while its .mode seems to have
a different meaning.
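
One way to see the difference is to probe a handle defensively instead
of assuming the attributes exist. The sketch below uses only the
standard library; describe_handle() is a hypothetical name, and what a
gzip handle reports for these attributes may differ between Python
versions.

```python
def describe_handle(handle):
    """Return (name, mode) for a handle, using None for a missing attribute."""
    return getattr(handle, "name", None), getattr(handle, "mode", None)
```

A plain open() handle reports the path and mode it was opened with,
while an in-memory handle such as io.StringIO has neither, so an index
could not be reopened from that information alone.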
> If we do change the code to allow handles, it would have to be
> to allow handles OR filenames to be compatible with Biopython
> 1.52 and 1.53 (which take just filenames). This could be handled
> as in Bio.SeqIO.convert(), which also allows both (which was the
> subject of some discussion!).
> 
I'll have to look more closely at that example, keeping in mind that my
current implementation breaks compatibility with previous code and that
not everything needed (filename, mode, ...) is accessible from a
filehandle.
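
A minimal sketch of accepting either a filename or a handle while
staying compatible with filename-only callers. This is an assumption
about how it could be done, not how Bio.SeqIO.convert() is written;
open_source() is a hypothetical name.

```python
import os


def open_source(source, mode="r"):
    """Return (handle, owned) for a filename or an already open handle.

    `owned` is True only when this function opened the file, so the caller
    knows whether it is responsible for closing it.
    """
    if isinstance(source, (str, bytes, os.PathLike)):
        return open(source, mode), True
    if not hasattr(source, "read"):
        raise TypeError("expected a filename or a readable handle")
    # A caller-supplied handle is used as-is and never closed here.
    return source, False
```

Only the filename branch gives the code everything it needs to reopen
the file later; with a caller-supplied handle that information may
simply not be available.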
>> The second is that we are going to use this feature to quickly search
>> the database in a web application. Here we have the limitation that we
>> don't have persistence across web requests, which means that we would
>> need to recalculate the index on every web request.
>>
>> The details of how we plan to implement this are the following:
>>
>> cPickle the internal dictionary of offsets and save it on the database
>> folder with the same name as the database + .index. The consistency
>> check on whether the file has changed will be performed based on name
>> and timestamp. By default .index() will search for this file, check the
>> timestamp and use the cache if they match, otherwise they will be
>> recalculated. The save function will be available like:
>>
>>>>> d = SeqIO.index(...)
>>>>> d.save(filename)
>> where filename is optional and defaults to "%s.index" % _handle.name
>>
>> We already have a solution like this implemented with subclasses of
>> SeqIO._index; it's just a matter of reworking that and merging it into
>> Biopython if you consider it a good addition to the code.
>>
>> I would like to hear your comments and suggestions on this.
> 
> Yes, saving indexes is an obvious addition. I have explored
> using pickle via shelve, and also SQLite - there are
> implementations of this on my github repository - and I have
> begun to look into the existing OBF Open Biological
> Database Access (OBDA) specification for cross project
> compatibility. Other potential benefits here are reduced
> memory usage if we don't keep the dictionary
> of offsets in RAM.
I did try to use pickle directly on the dict-like object that is
returned from SeqIO.index(), but pickle was not happy with it. The
SQLite approach also crossed my mind, as did BioSQL or just some custom
SQL database, but the RAM approach seemed good enough, at least for our
current uses. I can see, though, that some file formats will require a
lot more RAM depending on what is indexed and their size. In the end we
settled on cPickled dictionaries for faster access.
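
To illustrate the caching scheme described above (offsets pickled next
to the database as name + ".index", validated by name and timestamp),
here is a standalone sketch using only the standard library. It is not
the Biopython implementation: build_offsets() and load_or_build() are
hypothetical names, only simple FASTA files are handled, and the cache
layout is an assumption. Unpickling is only safe for cache files you
wrote yourself.

```python
import os
import pickle


def build_offsets(path):
    """Map each FASTA record id to the byte offset of its '>' line."""
    offsets = {}
    with open(path, "rb") as handle:
        while True:
            pos = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split(None, 1)
                key = fields[0].decode() if fields else ""
                offsets[key] = pos
    return offsets


def load_or_build(path, cache=None):
    """Return the offsets for `path`, reusing a pickled cache when valid."""
    cache = cache or path + ".index"
    name = os.path.basename(path)
    stamp = os.path.getmtime(path)
    try:
        with open(cache, "rb") as handle:
            saved = pickle.load(handle)
        if saved["name"] == name and saved["mtime"] == stamp:
            return saved["offsets"]
    except (OSError, EOFError, pickle.UnpicklingError, KeyError, TypeError):
        pass  # missing or unusable cache: fall through and rebuild
    offsets = build_offsets(path)
    with open(cache, "wb") as handle:
        record = {"name": name, "mtime": stamp, "offsets": offsets}
        pickle.dump(record, handle, pickle.HIGHEST_PROTOCOL)
    return offsets
```

A web request would then call load_or_build("db.fasta") and only pay
the indexing cost when the database file has changed since the cache
was written.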
> 
> http://github.com/peterjc/biopython/tree/index-shelve
> http://github.com/peterjc/biopython/tree/index-sqlite
> 
> There is a potential complication with index sub-classes
> which do more specialised indexing (e.g. GenBank files,
> and for a more extreme case, SFF files). See:
> http://github.com/peterjc/biopython/tree/sff-seqio
For these I would have to rely on unit tests, since I'm not familiar
with the formats. Also, the implementation I did was based on the
current master branch of Biopython. I now realize a lot more has been
done outside of it that I should look into.
> 
> Anyway - great to see you are finding the code useful,
> and have some quite similar ideas for how to extend
> it further.
> 
> Peter
Thanks for all that info. I have a lot to dig into before I can see
whether I can actually contribute something. You seem to have pretty
much everything sorted ;)

Renato
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkssEqkACgkQYh11EUYTX9QWHwCeOIuuaEGA3qLvB1EHamDohpZ3
bj0AnRAkP9jOGpvTnSc0W7YgFyX/Ard/
=S45W
-----END PGP SIGNATURE-----