[Biopython] Problems parsing with PSIBlastParser
Peter
biopython at maubp.freeserve.co.uk
Tue Oct 13 11:41:58 UTC 2009
Don't forget to CC the mailing list ;)
On Tue, Oct 13, 2009 at 12:22 PM, Miguel Ortiz Lombardia
<ibdeno at gmail.com> wrote:
>
>
> Le 13 oct. 09 à 13:10, Peter a écrit :
>
>> On Mon, Oct 12, 2009 at 9:11 AM, Miguel Ortiz Lombardia
>> <ibdeno at gmail.com> wrote:
>>>
>>> Dear list members,
>>>
>>> I have a problem with NCBIStandalone.PSIBlastParser, which I need to use
>>> instead of NCBIXML since the latter one lacks some record properties that
>>> I need.
>>>
>>> My code used to work until recently (say three months) and now it seems
>>> something has changed in the latest biopython (1.52-1, I install it on an
>>> intel OSX 10.5.8 via fink). I get the same problem irrespectively of
>>> whether I use python 2.5 or 2.6.
>>
>> Thanks for filing the bug, and supplying the example output files.
>> http://bugzilla.open-bio.org/show_bug.cgi?id=2927
>>
>> Do you remember what version of Biopython you used to be running
>> before updating to 1.52? This would help to narrow down the change
>> triggering this problem.
>>
>
> Sorry, can't tell it for sure, but it was whatever version was current in
> March 2009.
Probably Biopython 1.49 then. That may help.
>> In the mean time, I have tried parsing your sample output, and it seems
>> fine:
>>
>> from Bio.Blast.NCBIStandalone import PSIBlastParser
>> b_parser = PSIBlastParser()
>> handle = open("Q3V4Q0.psiblast.txt")
>> b_record = b_parser.parse(handle)
>> handle.close()
>> for b_round in b_record.rounds :
>> print "Round %i has %i alignments" \
>> % (b_round.number, len(b_round.alignments))
>>
>>
>> Gives:
>>
>> Round 1 has 385 alignments
>> Round 2 has 1000 alignments
>> Round 3 has 1000 alignments
>> Round 4 has 1000 alignments
>> Round 5 has 1000 alignments
>>
>
> Yes, that's also what I see with my code: text files can be parsed.
OK - good. So it doesn't look like a parser bug.
>> So, if the file parser is fine, then my guess is this is something to do
>> with
>> how we are running PSI-BLAST via NCBIStandalone.blastpgp - and this
>> code has changed in recent releases. It used to use the python function
>> os.popen3 but this was deprecated in Python 2.6 and we now use the
>> subprocess library.
>
> I think this is the most likely explanation.
>
>> It is also possible that the command line options you used when running
>> BLAST by hand to supply me the example output differed from what
>> was used in your Python script.
>
> I don't think so, I just used the same command line that was launched from
> the python script (got it from a 'ps' command)
Great :)
>> What exactly did you type at the command line to make the example
>> output you sent me? I'd like to double check the Python code is using
>> the same thing...
>
> For plain text output:
>
> /usr/local/blast-2.2.22/bin/blastpgp -d /opt/BlastDBs/uniref100 -i
> Q3V4Q0.fasta -m 0 -v 500 -b 1000 -Q Q3V4Q0.uniref100.5.pssm -j 5 -p blastpgp
>> Q3V4Q0.psiblast.log
>
> For XML:
>
> /usr/local/blast-2.2.22/bin/blastpgp -d /opt/BlastDBs/uniref100 -i
> Q3V4Q0.fasta -m 7 -v 500 -b 1000 -Q Q3V4Q0.uniref100.5.pssm -j 5 -p blastpgp
>> Q3V4Q0.psiblast.xml.log
Because you capture the stdout to a file (rather than using the -o option),
the output files should be identical to those obtained by the python script.
I would need to install the same BLAST database etc in order to try and
debug this on my own machine, which is a hassle. So I'll try and ask you
to test a few things instead.
Could you try changing this line:
blast_out, error_info = NCBIStandalone.blastpgp(...)
to this:
temp_handle, error_info = NCBIStandalone.blastpgp(...)
from StringIO import StringIO
blast_out = StringIO(temp_handle.read())
temp_handle.close()
This will try to read in all the BLAST output (all 5MB of it) into memory
as a string, and turn it into a StringIO handle which the parser should
accept.
You could also try explicitly saving to a file:
temp_handle, error_info = NCBIStandalone.blastpgp(...)
temp_file = open("temp.txt", "w")
temp_file.write(temp_handle.read())
temp_file.close()
temp_handle.close()
blast_out = open("temp.txt")
or, perhaps:
temp_handle, error_info = NCBIStandalone.blastpgp(...)
temp_file = open("temp.txt", "w")
for line in temp_handle : temp_file.write(line)
temp_file.close()
temp_handle.close()
blast_out = open("temp.txt")
It would not surprise me to see these fail as before, but having a
look at the temp.txt file could be very instructive (especially if it
contains that odd query line you mentioned earlier).
I know that the Python subprocess module can have problems with
deadlocks when dealing with large amounts of piped data. There are
ways to cope, but the simplest option is to tell BLAST to save the
data to a file (instead of stdout) with the -o command line option.
This avoids sending large amounts of data via the stdout pipe. I can
explain how to do this within Biopython if you like (this email is
already very long).
Peter
More information about the Biopython
mailing list