[Biopython-dev] [Bug 2502] PSIBlastParser fails with blastpgp 2.2.18 though works with blastpgp 2.2.15

Fri May 23 10:06:44 UTC 2008

http://bugzilla.open-bio.org/show_bug.cgi?id=2502

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |biopython-
                   |                            |bugzilla at maubp.freeserve.co.
                   |                            |uk

------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2008-05-23 06:06 EST -------
I've worked out that the original problem was use to trying to parse XML output
with the Bio.Blast.NCBIStandalone.PSIBlastParser (which expects the plain text
output only).  Perhaps the error message could be more helpful in this
situation?

I'm using Biopython from CVS, but it seems to parse the plain text output from
both 2.2.15 and 2.2.18 fine.  Here is a modified version of your code which
reads from the example plain text files provided:

#!/usr/bin/env python
#
import os, re, string, operator
from Bio.Blast import NCBIStandalone
from sys import *

E_VALUE_THRESH = 0.005

nolf = re.compile('\n')
nogaps = re.compile('-')

blast_out = open("blastpgp.2.2.18.txt")
b_parser = NCBIStandalone.PSIBlastParser()
b_record = b_parser.parse(blast_out)

if b_record.converged == 1:
    print '*** Converged!!! ***'

fastaout = open('test_psiblast.fasta','w')
summout = open('test_psiblast.txt','w')

for alignment in b_record.rounds[-1].alignments:
    for hsp in alignment.hsps:
        if hsp.expect < E_VALUE_THRESH:
            ident = 100.0*hsp.identities[0]/hsp.identities[1]
            simil = 100.0*hsp.positives[0]/hsp.positives[1]
            mytitle = nolf.sub(' ',alignment.title)
            mysbjct = nogaps.sub('',hsp.sbjct)
            summout.write('****Alignment****\n')
            summout.write('sequence: %s\n' % mytitle[0:70])
            summout.write('e value: %e\n' % hsp.expect)
            summout.write('alignment length: %i\n' % hsp.positives[1])
            summout.write('identity:   %(ident)5.2f\n' % {'ident': ident} )
            summout.write('similarity: %(simil)5.2f\n' % {'simil': simil} )
            summout.write('query:   from %i to %i\n' % (hsp.query_start,
hsp.query_end))
            summout.write('subject: from %i to %i\n' % (hsp.sbjct_start,
hsp.sbjct_end))
            summout.write('%s ...\n' % hsp.query[0:75])
            summout.write('%s ...\n' % hsp.match[0:75])
            summout.write('%s ...\n' % hsp.sbjct[0:75])
            fastaout.write('%s\n%s\n' % (mytitle,mysbjct))

summout.close()
fastaout.close()
print "Done"

----------------------------------------------------------------------

So, as far as I can tell, the plain text PSI Blast parser is fine .

As I do not have the relevant databases installed, I have not tried using
Biopython to call blastpgp to run PSI-Blast.  It could be there is a problem
here with specifying the output format...

As to the XML output, you can sort of parse this with Bio.Blast.NCBIXML and I
think you get back an iterator yielding a record for each iteration.  However,
as the example you provided had only one query and one iteration, this should
be tested further.  The record is not showing all the information extracted by
the PSI-Blast text parse, which should be in the XML file.  Perhaps you would
like to investigate this?

Example code:

from Bio.Blast import NCBIStandalone, NCBIXML

for filename in ["blastpgp.2.2.15.txt", "blastpgp.2.2.18.txt"] :
    print
    print filename
    print "="*len(filename)
    handle = open(filename)
    record = NCBIStandalone.PSIBlastParser().parse(handle)
    print record.query
    if record.converged : print '*** Converged!!! ***'
    for iter_round in record.rounds :
        print "Iteration with %i alignments" \
              % (len(iter_round.alignments))
        print "%i new sequences, %i reused" \
              %(len(iter_round.new_seqs), len(iter_round.reused_seqs))
    print "End of plain text output"

for filename in ["blastpgp.2.2.15.xml", "blastpgp.2.2.18.xml"] :
    print
    print filename
    print "="*len(filename)
    handle = open(filename)
    for iter_round in NCBIXML.parse(handle) :
        print iter_round.query
        print "Iteration with %i alignments" \
              % (len(iter_round.alignments))
    print "End of XML output"

The output:

blastpgp.2.2.15.txt
===================
gi|50592994|ref|NP_003320.2| thioredoxin [Homo sapiens]
Iteration with 250 alignments
500 new sequences, 0 reused
End of plain text output

blastpgp.2.2.18.txt
===================
gi|50592994|ref|NP_003320.2| thioredoxin [Homo sapiens]
Iteration with 250 alignments
500 new sequences, 0 reused
End of plain text output

blastpgp.2.2.15.xml
===================
gi|50592994|ref|NP_003320.2| thioredoxin [Homo sapiens]
Iteration with 500 alignments
End of XML output

blastpgp.2.2.18.xml
===================
gi|50592994|ref|NP_003320.2| thioredoxin [Homo sapiens]
Iteration with 250 alignments
End of XML output

Notice that NCBI must have changed the XML format in some way (500 versus 250
alignments between versions 2.2.15 and 2.2.18).  I have not explored this in
any detail.

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.