[Biopython] Problems parsing with PSIBlastParser
Miguel Ortiz Lombardia
ibdeno at gmail.com
Thu Oct 15 09:04:33 EDT 2009
Le 14 oct. 09 à 23:43, Andrea a écrit :
> Miguel Ortiz Lombardia ha scritto:
>> Le 14 oct. 09 à 17:02, Andrea a écrit :
>>> Peter ha scritto:
>>>> On Wed, Oct 14, 2009 at 3:28 PM, Andrea <andrea at biodec.com> wrote:
>>>>
>>>>> Hi to everybody,
>>>>> I work with blast quite often and i could say i run hundreds of
>>>>> thousand
>>>>> of blastpgp. The "Query 0" outpt of blastpgp, is quite common for
>>>>> me, and
>>>>> i wrote a patch to my code, to remove these "nasty" lines, before
>>>>> passing
>>>>> the output to the parser.
>>>>>
>>>>> I found these type of lines in at least 1-2% of my runs. And i'm
>>>>> fully sure
>>>>> that i found them either in the output of blast via shell and in
>>>>> the output
>>>>> of blast via Biopython.
>>>>>
>>>>> The problem, according to me, is in the blastpgp algorithm and
>>>>> maybe
>>>>> could be managed in biopython (as i did in my code), cutting out
>>>>> these
>>>>> "Query 0" lines, because from the point of view of the alignments,
>>>>> they don't have any sense. It seems that blastpgp, wants to show
>>>>> which is the part of the target sequence align to the query
>>>>> before the
>>>>> starting point of the query itself (something like opening a gap,
>>>>> at the
>>>>> beginning of the query).
>>>>> And this happens "sometimes", and without any apparent reason.
>>>>>
>>>>
>>>> Andrea - do you have any small example output files with this
>>>> problem? If it does occur fairly often (1 to 2% of the time), then
>>>> we should try and update the parser to cope. Miguel's example
>>>> is useful for testing while working on a bug fix, but too big to
>>>> include as part the unit tests.
>>>>
>>>>
>>> mmm... i've to search. I've some "cache" of gzipped blastpgp
>>> outputs.
>>> But I'm not
>>> sure i've the original (maybe already patched).... waht I'm sure, is
>>> that in the
>>> next month I'm going to run almost 100.000 blasptpg so I'll for sure
>>> find
>>> something small. ;-)
>>>>> What i think, is that there aren't any problem with biopython in
>>>>> wrapping
>>>>> the blastpgp process and maybe, but i'm not sure, the difference
>>>>> in
>>>>> the
>>>>> output could be related to small differences in the parameter of
>>>>> the process
>>>>> (or in the environment... or in the .ncbirc file).
>>>>>
>>>>> I always was able to observe the identity between the blastpgp
>>>>> output
>>>>> via shell (bash) and the output of the popen wrapper.
>>>>>
>>>>
>>>> If you saw "Query 0" output at the command line (shell), then
>>>> that is
>>>> worth knowing.
>>
>> All I can say is that this is not what I observe.
>> 1. When I send directly from the shell exactly the same blastpgp
>> search ( I capture the full command line issued in the background by
>> the python script with a 'ps -a | grep blastpgp' ) I have never find
>> the 'Query: 0' lines.
>> 2. When I send the search from within the python script and use
>> 'nohup', the problem is reproducible, not random.
> yes, i'm sure is reproducible. I mean that what I've observed wasn't
> random on one sequence, but maybe along
> many sequences...
>> 3. If the script is sent without 'nohup', that is, if the shell keeps
>> full control of both standard error and output, then again, the
>> problem seems to disappear. I say 'seems' because I haven't tried
>> with
>> my longest ( more than 1300 aa ) sequences.
>> 4. When, from within the python script I use, as Peter suggested, the
>> BlastpgpCommandline class to ask blastpgp to send the output to a
>> file
>> ( the -o option ) the problem disappears irrespectively whether I use
>> or not 'nohup'.
>>
>> Therefore, in my opinion, the problem is not with blastpgp but with
>> the handling of its output by python or biopython.
>>
> I'm really curious. What you have is very strange, but i believe you
> fully.
>
> Is there the possibility to have:
> your database,
> your .bashrc
> the sequence
> the exact command line.
> the versione of blastpgp
> the versione of blastpgp (2.2.18 ?)
> the other things you use (matrix.... )
> the different possibilities you try....( nohup/python/shell )
> I should be reprodcible.
>
> Have you tried to observe the behaviour of the blastpgp process with a
> "strace" expecially at the
> beginning?
>
>
>>>>
>>> i think so.
>>>>> Miguel, could you check if really everything is identical?
>>>>> Because i'm
>>>>> really surprised of such a strange behaviour....
>>>>
>>>> Maybe the environment variables are different or something?
>>
>> Command options are absolutely the same, see above. I am surprised
>> too, but I don't think blastpgp is sensitive to any environment
>> variable and I don't see how they could change from an in-script to a
>> standalone run.
> I think only to .bashrc.
>>
>>>>
>>>>> Despite, according to me there aren't any problem in biopython,
>>>>> and
>>>>> maybe,
>>>>> Miguel will be able to discover some differences in the way
>>>>> blastpgp is
>>>>> launched, i would suggest to develop a patch (i could submit
>>>>> mine),
>>>>> that
>>>>> could remove "Query 0" lines.
>>
>> I couldn't find any differences, so I'm afraid I can't help... I'm
>> still testing the script, I will let you know if I find again this
>> problem.
> I will try to find the problem in my sequences (but i could say that
> is
> quite common)... and if i will
> find i will try with the same parameters and the shell...
>>
>>>>>
>>>> Could you upload your "Query 0" patch to Bug 2927?
>>>> http://bugzilla.open-bio.org/show_bug.cgi?id=2927
>>>>
>>> Now i'm wuite busy, because i'm working on a different project and
>>> i've
>>> to manage deliveries...
>>> but i will for sure upload my patch ASAP.
>>>>
>>>>> I aplogize if i understanded the problem wrongly and for the
>>>>> fact that
>>>>> i'm entering in the discussion in this moment (maybe when the
>>>>> discussion is finished)...
>>>>>
>>>>
>>>> Well I don't (yet) understand what the problem is either ;)
>>>>
>>>> Peter
>>>>
>>> Ciao
>>> andrea
>>
>>
>> Best,
>>
>>
>>
>> -- Miguel
>>
>>
> thanks.
> Ciao
> Andrea
Hi!
Some new findings that contradict my previous perception of the problem.
Tonight my script failed again after stumbling upon the same problem
for a different sequence. I have now investigated more carefully and
found:
1. The problem (a line with 'Query: 0 ---' that impaired parsing of
the blastpgp output) was encountered in all these cases:
a) nohup myscript.py [some script options] sequences.fasta >&
myscript.log &
b) myscript.py [some script options] sequences.fasta
c) /usr/local/blast-2.2.22/bin/blastpgp -d /opt/BlastDBs/nr -i
U7.fasta -m 0 -o tmp.bl.txt -v 500 -b 1000 -a 6 -Q U7.nr.5.pssm -j 5 -
h 0.001 -p blastpgp
That is, for the first time I was able to reproduce the problem from a
standalone run of blastpgp.
2. The problem disappears with a previous version of blastpgp
(2.2.18). Using this version, all these cases work:
a) nohup myscript.py [some script options] sequences.fasta >&
myscript.log &
b) myscript.py [some script options] sequences.fasta
c) /usr/local/blast-2.2.18/bin/blastpgp -d /opt/BlastDBs/nr -i
U7.fasta -m 0 -o tmp.bl.txt -v 500 -b 1000 -a 6 -Q U7.nr.5.pssm -j 5 -
h 0.001 -p blastpgp
So, it would seem that, as Andrea suggested, this is a bug in
blastpgp, to be more precise, after blastpgp-2.2.18.
3. In this particular case, I notice that the problem happens with a
sequence containing low complexity region(s). Now, I had thought that
the default in blastpgp was to filter those sequences out! I'm running
the original script again with blastpgp-2.2.22 with the filter on (-F
T) to see if the problem persists.
I will write to the blast-help address at the ncbi to let them know
about the problem.
Best,
-- Miguel
More information about the Biopython
mailing list