[Biopython] Fasta.index_file: functionality removed?

Peter biopython at maubp.freeserve.co.uk
Thu Jun 18 05:23:27 EDT 2009


On Wed, Jun 17, 2009 at 6:37 PM, Cedar McKay<cmckay at u.washington.edu> wrote:
> Hello, I depend on functionality provided by Fasta.index_file to index a
> large file (5 million sequences), too large to put in memory, and access it
> in a dictionary-like way. Newer versions of Biopython have removed (or
> hopefully moved) this functionality.

Yes, that is correct.  I'd have to dig a little deeper for more details, but
Bio.Fasta.index_file and the associated Bio.Fasta.Dictionary were
deprecated in September 2007, so the warning would have first been in
Biopython 1.45 (released March 22, 2008). This was related to problems
from mxTextTools 3.0 in our Martel/Mindy parsing infrastructure (which
has been phased out and will not be included with Biopython 1.51 at all).
See:
http://lists.open-bio.org/pipermail/biopython/2007-September/003724.html

What version of Biopython were you using, and did you suddenly try
installing a very recent version and discover this? I'm trying to understand
if there is anything in our deprecation process we could have done differently.

> I attempted to figure out what happened
> to the functionality by searching the mailing list, to no avail. Also
> Biopython's ViewCVS page is down, so I can't pursue that route.

Apparently there is a glitch with one of the virtual machines hosting that;
the OBF are looking into it - I was hoping it would be fixed by now. CVS
itself is fine (if you want to use it directly), or you can also browse the
history on GitHub (although this doesn't show the release tags nicely).
http://github.com/biopython/biopython/tree/master

> So if someone would please suggest an alternative way to do the same thing
> in newer biopython versions, I'd appreciate it.  I tried SeqIO.to_dict, but it
> seems to load the whole 5 million sequences (or just the index?) into memory
> rather than make an index file. I become memory bound rather quickly this
> way, and then my script grinds to a halt.

Yes, SeqIO.to_dict() creates a standard in-memory Python dictionary,
which would be a bad idea for 5 million sequences. I'll reply about other
options in a second email.
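
For illustration only, here is a minimal sketch of the general idea behind an
on-disk index: record the byte offset of each ">" header line in a small
SQLite file, then seek straight to a record when it is requested. This is not
a Biopython API and not necessarily one of the options in the follow-up email.
The function names, the table layout, and the choice of SQLite are all
assumptions made for this example. It uses only the standard library and
treats the first word after ">" as the record ID.

```python
import sqlite3


def build_index(fasta_path, index_path):
    """Store the byte offset of every '>' header line in an SQLite file.

    Hypothetical helper for illustration; not part of Biopython.
    """
    con = sqlite3.connect(index_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS offsets (id TEXT PRIMARY KEY, pos INTEGER)"
    )
    # Binary mode so that tell() returns plain byte offsets.
    with open(fasta_path, "rb") as handle:
        while True:
            pos = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # Record ID: first whitespace-delimited word after '>'.
                fields = line[1:].split(None, 1)
                record_id = fields[0].decode("ascii") if fields else ""
                con.execute(
                    "INSERT OR REPLACE INTO offsets VALUES (?, ?)",
                    (record_id, pos),
                )
    con.commit()
    con.close()


def fetch_record(fasta_path, index_path, record_id):
    """Return (header, sequence) for one record by seeking to its offset."""
    con = sqlite3.connect(index_path)
    row = con.execute(
        "SELECT pos FROM offsets WHERE id = ?", (record_id,)
    ).fetchone()
    con.close()
    if row is None:
        raise KeyError(record_id)
    with open(fasta_path, "rb") as handle:
        handle.seek(row[0])
        header = handle.readline()[1:].strip().decode("ascii")
        chunks = []
        for line in handle:
            if line.startswith(b">"):
                break  # reached the next record
            chunks.append(line.strip().decode("ascii"))
    return header, "".join(chunks)
```

Only one record's sequence is held in memory at a time; the ID-to-offset
table lives in the index file, so it can be built once and reused between
runs.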

> As a side issue, how can I tell what version of biopython I'm using in old
> versions before "Bio.__version__" was introduced?

There was no official way, however, for some time the Martel version was
kept in sync so you could do this:

$ python
>>> import Martel
>>> print Martel.__version__
1.49

If you don't have mxTextTools installed, this will fail with an ImportError.
For more details see:
http://lists.open-bio.org/pipermail/biopython/2009-February/004940.html
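
The two checks can be combined into one helper that tries Bio.__version__
first and falls back to the Martel version. This is only a sketch: the
function name is made up for this example, and the Martel fallback is only
meaningful for the old releases where the two numbers were kept in sync.

```python
def guess_biopython_version():
    """Return a version string, or None if it cannot be determined.

    Hypothetical helper for illustration; not part of Biopython.
    """
    try:
        import Bio
    except ImportError:
        return None  # Biopython is not installed at all
    version = getattr(Bio, "__version__", None)
    if version is not None:
        return version
    # Older releases: fall back to the Martel version, which was kept in
    # sync for some time. The import fails without mxTextTools.
    try:
        import Martel
    except ImportError:
        return None
    return getattr(Martel, "__version__", None)
```

A result of None means either that Biopython is missing or that the release
is old enough to have no version attribute and Martel could not be imported.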

Peter


