[Biopython] Problems parsing with PSIBlastParser

Peter biopython at maubp.freeserve.co.uk
Tue Oct 13 13:36:44 UTC 2009


On Tue, Oct 13, 2009 at 12:58 PM, Miguel Ortiz Lombardia
<ibdeno at gmail.com> wrote:
>>
>> Hmm - the switch to using subprocess (on Python 2.4 or later) was made
>> in October 2008, and would have first appeared in Biopython 1.49. Maybe
>> you were using Biopython 1.48 before - or the issue is something else.
>>
>> Peter
>
>
> It may well have been 1.48... Having a closer look at the files from my last
> successful runs I discover they actually come from November 2008...
>
> I'm now running the tests you suggested.

Let me know what they show. How long do these BLAST runs take?
Perhaps I was ambitious with the number of suggestions to try ;)

Assuming the problem is with how we are calling the BLAST tool via the
subprocess module, I have two suggested fixes in mind. The first is to
change the _invoke_blast() function in Bio/Blast/NCBIStandalone.py,
essentially replacing these lines:

    blast_process.stdin.close()
    return blast_process.stdout, blast_process.stderr

With this:

    stdout, stderr = blast_process.communicate()
    from StringIO import StringIO
    return StringIO(stdout), StringIO(stderr)

We had to make a similar change to Bio.Clustalw for Bug 2804. Here the
communicate() method reads all of stdout and stderr into memory while the
tool runs, which avoids any deadlock from the tool blocking on a full pipe
while we are reading from the other handle. I hadn't made this change
before as it imposes a memory overhead (and BLAST output is often *very*
large, especially as XML), and until now there hadn't been any problems
reported. It would be worth trying in your situation (even just to confirm
the source of the error), but I don't think we should make this change for
the official distribution.
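
As a rough, self-contained illustration of the buffering approach (not the
actual Biopython code), the sketch below runs a child process that writes a
lot to both stderr and stdout, and collects everything with communicate().
The helper name run_and_buffer and the stand-in child command are
hypothetical; a small Python script is used in place of blastpgp so the
example runs anywhere. It is written for a current Python, hence io.StringIO
rather than the old StringIO module.

```python
import subprocess
import sys
from io import StringIO


def run_and_buffer(cmd):
    """Run cmd, hold all of its stdout/stderr in memory, return handles."""
    child = subprocess.Popen(
        cmd,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    # communicate() closes stdin and drains both pipes together, so the
    # child can never stall on a full pipe that nobody is reading.
    stdout, stderr = child.communicate()
    return StringIO(stdout), StringIO(stderr), child.returncode


# Stand-in for a BLAST run (an assumption for this sketch): a child that
# fills stderr well past a typical pipe buffer before writing to stdout.
noisy_child = [
    sys.executable,
    "-c",
    "import sys\n"
    "for i in range(20000):\n"
    "    sys.stderr.write('warning %d\\n' % i)\n"
    "for i in range(20000):\n"
    "    sys.stdout.write('hit %d\\n' % i)\n",
]

out_handle, err_handle, returncode = run_and_buffer(noisy_child)
```

The cost is the one described above: the whole output sits in memory before
any parsing can start.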

The second option (which I mentioned before) is to tell blastpgp to write
its output directly to a file, and then parse the file. This is how I normally
run large BLAST jobs. This is possible but not elegant via the function
Bio.Blast.NCBIStandalone.blastpgp (which always returns stdout/stderr
handles). Bug 2654 has an example:
http://bugzilla.open-bio.org/show_bug.cgi?id=2654
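
A minimal sketch of the write-to-a-file idea, independent of Biopython: the
child's stdout is sent straight to a file, and the file is read afterwards.
The helper run_to_file, the file name and the stand-in child command are all
hypothetical. This sketch redirects stdout; with the real tool you would
more likely pass its own output-file option (-o in the legacy NCBI tools, as
far as I recall), which this example does not attempt to show.

```python
import os
import subprocess
import sys
import tempfile


def run_to_file(cmd, out_path):
    """Run cmd with stdout written to out_path; return (returncode, stderr)."""
    with open(out_path, "w") as out_handle:
        child = subprocess.Popen(
            cmd,
            stdout=out_handle,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
        # Only stderr is piped, so communicate() buffers just the error
        # messages; the bulk output never passes through this process.
        _, stderr_text = child.communicate()
    return child.returncode, stderr_text


# Stand-in for a large BLAST job (an assumption for this sketch).
stand_in = [sys.executable, "-c", "for i in range(50000): print('hit %d' % i)"]
out_path = os.path.join(tempfile.mkdtemp(), "blast_output.txt")

returncode, stderr_text = run_to_file(stand_in, out_path)

# "Parse" the file afterwards; here that just means counting lines.
with open(out_path) as result_handle:
    hit_count = sum(1 for _ in result_handle)
```

For a real run the last step would hand the open file to the appropriate
BLAST parser instead of counting lines.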

However, what I want to recommend is to use the more flexible
Bio.Blast.Applications objects instead (in this case, the class
BlastpgpCommandline). I had planned to update the BLAST chapter
of the Biopython Tutorial to cover this, but it didn't happen in time for
the Biopython 1.52 release. That said, the alignment chapter goes
through several examples of this style of command line tool wrapper,
and the BLAST wrappers work in exactly the same way.

Using these "lower level" application wrappers, it is up to you to invoke
subprocess (or another system call) as you see fit (e.g. with pipes).
This is more flexible than the old Bio.Blast.NCBIStandalone.blastpgp
function (and others like it), where the way the tool is run is fixed.
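
To illustrate what "as you see fit" might look like once you have a command
line in hand (however it was built), here is one possible sketch: stdout is
read from a pipe line by line, while stderr goes to a temporary file so it
can never fill a pipe and stall the child. The function stream_stdout and
the stand-in child are hypothetical, and no Bio.Blast.Applications calls are
shown; the argument list here is written by hand.

```python
import subprocess
import sys
import tempfile


def stream_stdout(cmd):
    """Yield stdout lines from cmd as they arrive; stderr goes to a file."""
    with tempfile.TemporaryFile(mode="w+") as err_file:
        child = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=err_file,
            universal_newlines=True,
        )
        # Reading incrementally keeps memory use flat however large the
        # output is, unlike buffering everything with communicate().
        for line in child.stdout:
            yield line.rstrip("\n")
        child.stdout.close()
        if child.wait() != 0:
            err_file.seek(0)
            raise RuntimeError("command failed: %s" % err_file.read())


# Stand-in child (an assumption for this sketch) that interleaves heavy
# stderr and stdout traffic.
stand_in = [
    sys.executable,
    "-c",
    "import sys\n"
    "for i in range(30000):\n"
    "    sys.stderr.write('warning %d\\n' % i)\n"
    "    print('hit %d' % i)\n",
]

hits = list(stream_stdout(stand_in))
```

This trades the simplicity of communicate() for constant memory use, which
matters when the output is very large.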

Feel free to ask for clarification on this - questions now will help for
rewriting the BLAST chapter later on ;)

Regards,

Peter

P.S. See also http://docs.python.org/library/subprocess.html


