[Biopython] Fasta.index_file: functionality removed?

Thu Jun 18 10:00:29 UTC 2009

On Thu, Jun 18, 2009 at 2:13 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>
> For the short term, the easiest solution for you is probably to pick up Bio.Fasta
> from an older version of Biopython.

Note you would also need Martel and Mindy (still included in Biopython 1.50,
but won't be in Biopython 1.51), and ideally mxTextTools 2.0 (not
mxTextTools 3.0).

> For the long term, it's probably best to integrate the indexing functionality in
> some way in Bio.SeqIO. Do you have some suggestions on how (from a
> user's perspective) this functionality should look like?

We have thought about this before. Bio.SeqIO is a high level interface which
works for a broad range of file types, including interleaved file formats. An
index file approach only really makes sense for a minority of the supported file
formats: simple sequential files with no complicated file level header/footer
structure. That is, it could work on FASTA, GenBank, EMBL, SwissProt, FASTQ,
etc, but would be much more complicated for, say, ClustalW, PHYLIP, XML, SFF, ...
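
To make the "position in the parent file" idea concrete, here is a minimal
sketch for FASTA only, using nothing but the standard library. It is not the
old Bio.Fasta.index_file code and not an existing Bio.SeqIO feature; the
function names build_fasta_index and fetch_fasta_record are made up for
illustration, and it assumes a plain ASCII FASTA file where the record ID is
the first word after the ">".

```python
def build_fasta_index(path):
    """Map each record ID to the byte offset of its '>' header line."""
    index = {}
    # Binary mode so that tell() returns plain byte offsets.
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break  # end of file
            if line.startswith(b">"):
                fields = line[1:].split(None, 1)
                if fields:  # skip header lines with no ID at all
                    index[fields[0].decode("ascii")] = offset
    return index


def fetch_fasta_record(path, offset):
    """Return (title, sequence) for the record starting at this offset."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        title = handle.readline()[1:].strip().decode("ascii")
        chunks = []
        for line in handle:
            if line.startswith(b">"):
                break  # reached the next record
            chunks.append(line.strip())
    return title, b"".join(chunks).decode("ascii")
```

The index only holds one integer per record, so it stays small, but it is
useless without the parent file and this trick relies on each record being a
contiguous block, which is exactly what interleaved formats lack.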

An alternative approach might be to go to a full database (e.g. BioSQL),
although that is probably overkill here. There are other Python options like
pickle and/or shelve (see also Ivan Rossi's email) which I know other people
have used in combination with Bio.SeqIO in the past - I even tried it myself:

http://lists.open-bio.org/pipermail/biopython/2007-September/003748.html
http://lists.open-bio.org/pipermail/biopython-dev/2007-September/003071.html
http://lists.open-bio.org/pipermail/biopython-dev/2007-September/003072.html

Using pickle (or perhaps shelve) would allow a file format neutral solution
on SeqRecord objects (e.g. on top of Bio.SeqIO) at the cost of larger temp
files, because they store the whole record, not just a position in the
parent file. This can be an advantage, in that the index files themselves are
useful even without the parent file. Also, you could generate the set of
SeqRecord objects in a script (e.g. an on the fly filtered version of a FASTA
file). You don't have to be indexing a file :)
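
As a rough sketch of the shelve idea (again only an illustration, with
made-up helper names store_records and load_record), something like this
stores any picklable objects under string keys. With Biopython installed the
pairs could come from Bio.SeqIO.parse, assuming SeqRecord objects pickle
cleanly in your version; the code itself only needs the standard library.

```python
import shelve


def store_records(pairs, shelf_path):
    """Pickle (key, record) pairs into a new shelve file; return the count."""
    count = 0
    # flag="n" always starts a new, empty shelf.
    with shelve.open(shelf_path, flag="n") as shelf:
        for key, record in pairs:
            if key in shelf:
                raise ValueError("Duplicate key: %s" % key)
            shelf[key] = record  # the whole record is stored, not an offset
            count += 1
    return count


def load_record(shelf_path, key):
    """Fetch one record by key; the original file is not needed."""
    with shelve.open(shelf_path, flag="r") as shelf:
        return shelf[key]
```

For example, an on the fly filtered set could be stored with something like
store_records(((rec.id, rec) for rec in SeqIO.parse("example.fasta", "fasta")
if len(rec) >= 100), "example.shelf"), where the filenames and the length
cutoff are placeholders.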

Peter