[BioPython] Problems with the standalone blast parser.

humberto@hpcf.upr.edu humberto@hpcf.upr.edu
Thu, 04 Apr 2002 15:25:24 -0400


--==_Exmh_84453588P
Content-Type: text/plain; charset=us-ascii

I've got a problem with the Bio.Blast.NCBIStandalone module in 
biopython-1.00a4 on a redhat 6.2 system. It was crashing partway through a 
script that runs a platefull of fasta formatted sequences through blast, I've 
tracked down the error to a single sequence. What I would like is for my 
script to continue, perhaps printing an error message, instead of crashing.

The following nucleotide sequence:

>A6BFLYA09
ACCCTCCCCCCCCCCTCCTTGCATACTGATTATCCCCACCTGTTACCCTC
CCCCCACACCCTCCCCCCCATCCCACCNCACACCCCACACCNCTNCTCCC
TCCACCACCCCCACNCCCCCCACCCACCCCCCCACCCCCCCACCCN

when run through this command line:

/usr/local/blast/blastall -p blastx -d /usr/local/blast/nr -i badsequence -o 
badblast.out

produces these warnings and errors:

[blastall] WARNING:  [000.000]  A6BFLYA09: SetUpBlastSearch failed.
[blastall] ERROR:  [000.000]  A6BFLYA09: BLASTSetUpSearch: Unable to calculate 
Karlin-Altschul params, check query sequence
[blastall] ERROR:  [000.000]  A6BFLYA09: BLASTSetUpSearch: Unable to calculate 
Karlin-Altschul params, check query sequence

NCBI has this to say about the error message:

Q: Why do I get the message "ERROR: BLASTSetUpSearch: Unable to calculate
     Karlin-Altschul params, check query sequence" ? 

     This will happen if your entire query sequence has been masked by low 
complexity filtering. You
     will need to turn filtering off to get hits. For further information on 
filtering, please read the
     sections of the BLAST FAQs on Q: What is low-complexity sequence? and 
also Q: After
     running a search why do I see a string of "X"s (or "N"s) in my query 
sequence that I did not put
     there? 

But turning off the filtering gives me a bunch of bad quality hits.
the badblast.out file is produced, and is missing a section at the end:

--
BLASTX 2.2.2 [Dec-14-2001]


Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, 
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), 
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs",  Nucleic Acids Res. 25:3389-3402.

Query= A6BFLYA09
         (146 letters)

Database: /usr/local/blast/nr 
           855,904 sequences; 269,501,083 total letters

Searchingdone

 ***** No hits found ******

  Database: /usr/local/blast/nr
    Posted date:  Jan 22, 2002  7:32 AM
  Number of letters in database: 269,501,083
  Number of sequences in database:  855,904
  
Lambda     K      H
   0.311    0.135    0.440 

Gapped
Lambda     K      H
   0.267   0.0410    0.140 

--

Running this blast output through a simple biopython script crashes with a 
parse error:

python2.2 justparse badblast.out
Traceback (most recent call last):
  File "justparse", line 42, in ?
    justparse(my_blast_file)
  File "justparse", line 25, in justparse
    b_record = b_iterator.next()
  File "/usr/local/lib/python2.2/site-packages/Bio/Blast/NCBIStandalone.py", 
line 1353, in next
    return self._parser.parse(File.StringHandle(data))
  File "/usr/local/lib/python2.2/site-packages/Bio/Blast/NCBIStandalone.py", 
line 515, in parse
    self._scanner.feed(handle, self._consumer)
  File "/usr/local/lib/python2.2/site-packages/Bio/Blast/NCBIStandalone.py", 
line 85, in feed
    self._scan_database_report(uhandle, consumer)
  File "/usr/local/lib/python2.2/site-packages/Bio/Blast/NCBIStandalone.py", 
line 408, in _scan_database_report
    read_and_call_while(uhandle, consumer.noevent, blank=1)
  File "/usr/local/lib/python2.2/site-packages/Bio/ParserSupport.py", line 
345, in read_and_call_while
    line = safe_readline(uhandle)
  File "/usr/local/lib/python2.2/site-packages/Bio/ParserSupport.py", line 
442, in safe_readline
    raise SyntaxError, "Unexpected end of stream."
SyntaxError: Unexpected end of stream.

This is the script:

#!/usr/bin/env python2.2
# standard library
import sys

# biopython
from Bio.Blast import NCBIStandalone

my_blast_file = sys.argv[1]

def justparse(blastfile):

  blast_out = open(blastfile, "r")

  b_parser = NCBIStandalone.BlastParser()

  b_iterator = NCBIStandalone.Iterator(blast_out, b_parser)

  while 1:
    b_record = b_iterator.next()

    if b_record is None:
        break

    
    if len(b_record.alignments) == 0:
	print "No hits found"
    else:
        print "Hits"

# Main
justparse(my_blast_file)

--

Is there a way of writing this script so that it skips the bad reports instead 
of crashing?

When I run my blast inside biopython:

  blast_out, error_info = NCBIStandalone.blastall(my_blast_exe, progname,
                                                blastdb, blastfile,
                                                descriptions=3, alignments=3,
                                                show_gi="T", strands=1)

I guess the errors are in the error_info variable, but the biopython tutorial 
doesn't show how to recover from an error in the parser (actually, it avoids 
getting into the topic).

-- 
Humberto Ortiz Zuazaga
Programmer-Archaeologist
High Performance Computing facility
University of Puerto Rico
http://www.hpcf.upr.edu/~humberto/



--==_Exmh_84453588P
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: Exmh version 2.3 01/15/2001

iD8DBQE8rKikO8aX8Tqx8vgRAsiuAJ9Co6n23+GOFX5bf2MSW1tfZ23oFQCeNTiE
3Vm3+PtNDv8/NQkYDUYxp6A=
=ri8V
-----END PGP SIGNATURE-----

--==_Exmh_84453588P--