[BioPython] blastall does not flush buffers due to biopython buffering?

Martin MOKREJŠ mmokrejs at ribosome.natur.cuni.cz
Mon May 12 10:10:07 UTC 2008


Michiel de Hoon wrote:
> Can you show an example script that causes the UndoHandle to block? Just to understand better what is going on.

Maybe it is related to it, but sometimes blast process is probably close to exit:

mmokrejs  3343  3329  0 10:37 pts/2    00:00:00 [blastall] <defunct>

But sometimes I see it is still running with all its arguments and I can attach to it
by strace(1). So there are two issues.




I have hacked NCBIStandalone.blastall() like this:

+   print "Executing %s" % " ".join([blastcmd] + params)
    w, r, e = os.popen3(" ".join([blastcmd] + params))
    w.close()
    return File.UndoHandle(r), File.UndoHandle(e)




$ python test38a.py
Testing Bio.Blast.NCBIStandalone.blastall() functionality
Executing /usr/bin/blastall -p blastn -d /tmp/sequence_nucleic_acids_all.fa -i /tmp/blast_8zjEnKfa -m 0 -M NUC4.2 -S 1 -e 1 -W 4 -E 1 -G 2 -b 10 -v 10
BLASTN 2.2.18 [Mar-02-2008]
...
$ python test38b.py
Testing blasttest.blastall() functionality
Executing /usr/bin/blastall -p blastn -d /tmp/sequence_nucleic_acids_all.fa -i /tmp/blast_TR0YHefa -m 0 -M NUC.4.2 -S 1 -e 1000 -W 4 -E 1 -G 1 -b 9999999 -v 999
Fetching blast results
Traceback (most recent call last):
  File "test38b.py", line 15, in <module>
    print ''.join(_error_info.readlines())
  File "/usr/lib/python2.5/site-packages/Bio/File.py", line 37, in readlines
    lines = self._saved + self._handle.readlines(*args,**keywds)
KeyboardInterrupt
$ python test38b.py
Testing blasttest.blastall() functionality
Executing /usr/bin/blastall -p blastn -d /tmp/sequence_nucleic_acids_all.fa -i /tmp/blast_meaeD7fa -m 0 -M NUC.4.2 -S 1 -e 1 -W 4 -E 1 -G 2 -b 10 -v 10
Fetching blast results
BLASTN 2.2.18 [Mar-02-2008]
...
$ python test38b.py
Testing blasttest.blastall() functionality
Executing /usr/bin/blastall -p blastn -d /tmp/sequence_nucleic_acids_all.fa -i /tmp/blast_MRrSAWfa -m 0 -M NUC.4.2 -S 1 -e 1 -W 4 -E 1 -G 1 -b 9999999 -v 999
Fetching blast results
BLASTN 2.2.18 [Mar-02-2008]
...
$ python test38b.py
Testing blasttest.blastall() functionality
Executing /usr/bin/blastall -p blastn -d /tmp/sequence_nucleic_acids_all.fa -i /tmp/blast_VtMzeDfa -m 0 -M NUC.4.2 -S 1 -e 1000 -W 4 -E 1 -G 1 -b 9999999 -v 999
Fetching blast results
BLASTN 2.2.18 [Mar-02-2008]
...


So, it is not much reproducible. However, it is clear it has something
to do with the length of the output. As I have already said, I saw
by strace(1) many time that blastall did not write all its output.
I propose to ensure that the open3() or equivalent does not use buffered
outputs ("man perlopentut") and the UndoHandle avoided altogether.



$ cat test38a.py
#! /usr/bin/env python

from sys import path, argv
path.append('..')

from Bio.Blast import NCBIStandalone

print "Testing Bio.Blast.NCBIStandalone.blastall() functionality"

_we_got_some_results = 0

_blast_out, _error_info = NCBIStandalone.blastall('/usr/bin/blastall', 'blastn', '/tmp/sequence_nucleic_acids_all.fa', '/tmp/blast_8zjEnKfa', matrix='NUC4.2', wordsize=4, gap_open=2, gap_extend=1, strands=1, alignments=9999999, descriptions=999, expectation=1, align_view=0)

while 1:
    _line = _blast_out.readline()
    if _line:
        print _line,
        _we_got_some_results = 1
    else:
        break

if _we_got_some_results:
    while 1:
        _line = _error_info.readline()
        if _line:
            print _line,
        else:
            break

$ cat test38b.py
#! /usr/bin/env python

import os
from sys import path, argv
path.append('..')

print "Testing blasttest.blastall() functionality"

import blasttest

_blast_out, _error_info, _blast_file = blasttest.blastall('/tmp/sequence_nucleic_acids_all.fa', 'CCGCCGTCGCGGGCAGTGTCTAGCCAGGCCTTGACAAGCTA', '/tmp', 'sequence')

# we have to read first from stdout of blast otherwise reading from stderr blocks
print "Fetching blast results"
print ''.join(_blast_out.readlines())
print "Fetching blast error messages"
print ''.join(_error_info.readlines())

os.remove(_blast_file)

$ cat blasttest.py
#! /usr/bin/env python

import os
import file_io
import tempfile

from Bio.Blast import NCBIStandalone

def blastall(blast_db, blast_query_string, tmpdir, mode, align_view=0, matrix='NUC.4.2'):
    _fd, _blast_file = tempfile.mkstemp(suffix='fa', prefix='blast_', dir=tmpdir, text=False)

    if blast_query_string[0] != '>':
        os.write(_fd, '>your_query\n' + blast_query_string.replace('\n','') + '\n')
    else:
        os.write(_fd, blast_query_string + '\n')

    os.close(_fd)
    del(_fd)

    if not file_io.file_size(_blast_file):
        raise ValueError, "No input sequence provided by user"
        return None, None

    if mode == 'sequence':
        _wordsize = 4
        _gap_open = 1
        _gap_extend = 1
        _strands = 1
        _alignments = 9999999 # -b
        _descriptions = 999 # -v
        _expectation = 1000 # -e 1e-20
        _blast_out, _error_info = NCBIStandalone.blastall('/usr/bin/blastall', 'blastn', blast_db, _blast_file, matrix=matrix, wordsize=_wordsize, gap_open=_gap_open, gap_extend=_gap_extend, strands=_strands, alignments=_alignments, descriptions=_descriptions, expectation=_expectation, align_view=align_view)

    return _blast_out, _error_info, _blast_file

Martin



More information about the Biopython mailing list