[Biopython] Problems parsing with PSIBlastParser

Andrea andrea at biodec.com
Thu Oct 15 15:03:38 UTC 2009


Miguel Ortiz Lombardia ha scritto:
>
> Le 15 oct. 09 à 15:54, Andrea a écrit :
>
>> Miguel Ortiz Lombardia ha scritto:
>>>
>>> Le 14 oct. 09 à 23:43, Andrea a écrit :
>>>
>>>> Miguel Ortiz Lombardia ha scritto:
>>>>> Le 14 oct. 09 à 17:02, Andrea a écrit :
>>>>>> Peter ha scritto:
>>>>>>> On Wed, Oct 14, 2009 at 3:28 PM, Andrea <andrea at biodec.com> wrote:
>>>>>>>
>>>>>>>> Hi to everybody,
>>>>>>>> I work with blast quite often and i could say i run hundreds of
>>>>>>>> thousand
>>>>>>>> of blastpgp. The "Query 0" outpt of blastpgp, is quite common for
>>>>>>>> me, and
>>>>>>>> i wrote a patch to my code, to remove these "nasty" lines, before
>>>>>>>> passing
>>>>>>>> the output to the parser.
>>>>>>>>
>>>>>>>> I found these type of lines in at least 1-2% of my runs. And i'm
>>>>>>>> fully sure
>>>>>>>> that i found them either in the output of blast via shell and in
>>>>>>>> the output
>>>>>>>> of blast via Biopython.
>>>>>>>>
>>>>>>>> The problem, according to me, is in the blastpgp algorithm and
>>>>>>>> maybe
>>>>>>>> could be managed in biopython (as i did in my code), cutting out
>>>>>>>> these
>>>>>>>> "Query 0" lines, because from the point of view of the alignments,
>>>>>>>> they don't have any sense. It seems that blastpgp, wants to show
>>>>>>>> which is the part of the target sequence align to the query
>>>>>>>> before the
>>>>>>>> starting point of the query itself (something like opening a gap,
>>>>>>>> at the
>>>>>>>> beginning of the query).
>>>>>>>> And this happens "sometimes", and without any apparent reason.
>>>>>>>>
>>>>>>>
>>>>>>> Andrea - do you have any small example output files with this
>>>>>>> problem? If it does occur fairly often (1 to 2% of the time), then
>>>>>>> we should try and update the parser to cope. Miguel's example
>>>>>>> is useful for testing while working on a bug fix, but too big to
>>>>>>> include as part the unit tests.
>>>>>>>
>>>>>>>
>>>>>> mmm... i've to search. I've some "cache" of gzipped blastpgp
>>>>>> outputs.
>>>>>> But I'm not
>>>>>> sure i've the original (maybe already patched).... waht I'm sure, is
>>>>>> that in the
>>>>>> next month I'm going to run almost 100.000 blasptpg so I'll for sure
>>>>>> find
>>>>>> something small. ;-)
>>>>>>>> What i think, is that there aren't any problem with biopython in
>>>>>>>> wrapping
>>>>>>>> the blastpgp process and maybe, but i'm not sure, the
>>>>>>>> difference in
>>>>>>>> the
>>>>>>>> output could be related to small differences in the parameter of
>>>>>>>> the process
>>>>>>>> (or in the environment... or in the .ncbirc file).
>>>>>>>>
>>>>>>>> I always was able to  observe  the identity  between the blastpgp
>>>>>>>> output
>>>>>>>> via shell (bash) and the output of the popen wrapper.
>>>>>>>>
>>>>>>>
>>>>>>> If you saw "Query 0" output at the command line (shell), then
>>>>>>> that is
>>>>>>> worth knowing.
>>>>>
>>>>> All I can say is that this is not what I observe.
>>>>> 1. When I send directly from the shell exactly the same blastpgp
>>>>> search ( I capture the full command line issued in the background by
>>>>> the python script with a 'ps -a | grep blastpgp' ) I have never find
>>>>> the 'Query: 0' lines.
>>>>> 2. When I send the search from within the python script and use
>>>>> 'nohup', the problem is reproducible, not random.
>>>> yes, i'm sure is reproducible. I  mean that what I've observed wasn't
>>>> random on one sequence, but maybe along
>>>> many sequences...
>>>>> 3. If the script is sent without 'nohup', that is, if the shell keeps
>>>>> full control of both standard error and output, then again, the
>>>>> problem seems to disappear. I say 'seems' because I haven't tried
>>>>> with
>>>>> my longest ( more than 1300 aa ) sequences.
>>>>> 4. When, from within the python script I use, as Peter suggested, the
>>>>> BlastpgpCommandline class to ask blastpgp to send the output to a
>>>>> file
>>>>> ( the -o option ) the problem disappears irrespectively whether I use
>>>>> or not 'nohup'.
>>>>>
>>>>> Therefore, in my opinion, the problem is not with blastpgp but with
>>>>> the handling of its output by python or biopython.
>>>>>
>>>> I'm really curious. What you have is very strange, but i believe you
>>>> fully.
>>>>
>>>> Is there the possibility to have:
>>>> your database,
>>>> your .bashrc
>>>> the sequence
>>>> the exact command line.
>>>> the versione of blastpgp
>>>> the versione of blastpgp (2.2.18 ?)
>>>> the other things you use (matrix.... )
>>>> the different possibilities you try....( nohup/python/shell )
>>>> I should be reprodcible.
>>>>
>>>> Have you tried to observe the behaviour of the blastpgp process with a
>>>> "strace" expecially at the
>>>> beginning?
>>>>
>>>>
>>>>>>>
>>>>>> i think so.
>>>>>>>> Miguel, could you check if really everything is identical?
>>>>>>>> Because i'm
>>>>>>>> really surprised of such a strange behaviour....
>>>>>>>
>>>>>>> Maybe the environment variables are different or something?
>>>>>
>>>>> Command options are absolutely the same, see above. I am surprised
>>>>> too, but I don't think blastpgp is sensitive to any environment
>>>>> variable and I don't see how they could change from an in-script to a
>>>>> standalone run.
>>>> I think only to .bashrc.
>>>>>
>>>>>>>
>>>>>>>> Despite, according to me there aren't any problem in biopython,
>>>>>>>> and
>>>>>>>> maybe,
>>>>>>>> Miguel will be able to discover some differences in the way
>>>>>>>> blastpgp is
>>>>>>>> launched, i would suggest to develop a patch (i could submit
>>>>>>>> mine),
>>>>>>>> that
>>>>>>>> could remove "Query 0" lines.
>>>>>
>>>>> I couldn't find any differences, so I'm afraid I can't help... I'm
>>>>> still testing the script, I will let you know if I find again this
>>>>> problem.
>>>> I will try to find the problem in my sequences (but i could say
>>>> that is
>>>> quite common)... and if i will
>>>> find i will try with the same parameters and the shell...
>>>>>
>>>>>>>>
>>>>>>> Could you upload your "Query 0" patch to Bug 2927?
>>>>>>> http://bugzilla.open-bio.org/show_bug.cgi?id=2927
>>>>>>>
>>>>>> Now i'm wuite busy, because i'm working on a different project and
>>>>>> i've
>>>>>> to manage deliveries...
>>>>>> but i will for sure upload my patch ASAP.
>>>>>>>
>>>>>>>> I aplogize if i understanded the problem wrongly and for the fact
>>>>>>>> that
>>>>>>>> i'm entering in the discussion in this moment (maybe when the
>>>>>>>> discussion is finished)...
>>>>>>>>
>>>>>>>
>>>>>>> Well I don't (yet) understand what the problem is either ;)
>>>>>>>
>>>>>>> Peter
>>>>>>>
>>>>>> Ciao
>>>>>> andrea
>>>>>
>>>>>
>>>>> Best,
>>>>>
>>>>>
>>>>>
>>>>> -- Miguel
>>>>>
>>>>>
>>>> thanks.
>>>> Ciao
>>>> Andrea
>>>
>>> Hi!
>>>
>>> Some new findings that contradict my previous perception of the
>>> problem.
>>> Tonight my script failed again after stumbling upon the same problem
>>> for a different sequence. I have now investigated more carefully and
>>> found:
>>>
>>> 1. The problem (a line with 'Query: 0 ---' that impaired parsing of
>>> the blastpgp output) was encountered in all these cases:
>>>
>>> a) nohup myscript.py [some script options] sequences.fasta >&
>>> myscript.log &
>>> b) myscript.py [some script options] sequences.fasta
>>> c) /usr/local/blast-2.2.22/bin/blastpgp -d /opt/BlastDBs/nr -i
>>> U7.fasta -m 0 -o tmp.bl.txt -v 500 -b 1000 -a 6 -Q U7.nr.5.pssm -j 5
>>> -h 0.001 -p blastpgp
>>>
>>> That is, for the first time I was able to reproduce the problem from a
>>> standalone run of blastpgp.
>>>
>>> 2. The problem disappears with a previous version of blastpgp
>>> (2.2.18). Using this version, all these cases work:
>>>
>>> a) nohup myscript.py [some script options] sequences.fasta >&
>>> myscript.log &
>>> b) myscript.py [some script options] sequences.fasta
>>> c) /usr/local/blast-2.2.18/bin/blastpgp -d /opt/BlastDBs/nr -i
>>> U7.fasta -m 0 -o tmp.bl.txt -v 500 -b 1000 -a 6 -Q U7.nr.5.pssm -j 5
>>> -h 0.001 -p blastpgp
>>>
>>> So, it would seem that, as Andrea suggested, this is a bug in
>>> blastpgp, to be more precise, after blastpgp-2.2.18.
>>>
>>> 3. In this particular case, I notice that the problem happens with a
>>> sequence containing low complexity region(s). Now, I had thought that
>>> the default in blastpgp was to filter those sequences out! I'm running
>>> the original script again with blastpgp-2.2.22 with the filter on (-F
>>> T) to see if the problem persists.
>>>
>>> I will write to the blast-help address at the ncbi to let them know
>>> about the problem.
>>>
>>> Best,
>>>
>>>
>>> -- Miguel
>>>
>>>
>> Hi,
>> Thanks for your updates!!!. I can say one thing:
>> I've used in the past these three versione of blastpgp:
>>  - 2.2.15
>>  - 2.2.18
>>  - 2.2.19
>> and i found the "Query 0" problem in all of them, but, if one
>> of them fails (i mean, gives "Query 0" output) the other may not fail
>> at all (they most probably not give the "Query 0" output).
>>
>> Another interesting things is that, with the three version, the same
>> database, and the same parameters, the output is quite different...
>> ...sorry.. very different...
>>
>> I'm also sure that it could happens also with the complexity region(s)
>> filter "True".
>> What i observe, is that there aren't parameters that make it
>> disappear. It
>> just disappear from a sequence, and it will appear in another.... in
>> other
>> word, changing parameters, make it "moving"  between sequences.
>>
>> I've never used blastpgp 2.2.22. So i cannot say anything about it.
>>
>> Thanks
>> Andrea
>
>
> Then it looks like something more weird than what I thought...
> Andrea, would you mind if I send your e-mail to the blast people? Or
> perhaps you can do it yourself... I wrote to blast-help at ncbi.nlm.nih.gov
If you can, for me is an help. I hope they will reply.
I can also send and email, buti f you have....
>
> I suspect they will tell us to use the XML output, but then, not all
> info I need seems to go there...
i think the same, and i suspect the XML output doesn't suffer of the
same problem.
>
> Thanks a lot!
>
>
To you!!
> -- Miguel
>
>
And for my patch, is not a patch.I've checked now. To be fully independent
from NcbiStandalone.py i didn't write a patch for it. I wrote a patch
in the sense that actually i remove from the blastpgp output, four
lines, starting
from the "Query 0" one, and then i submit the "new output" to the parser.
In this way i'm reading the file twice (so it's not a good idea), but i
don't mind
if the NcbiStandalone.py change, because I'm fully independent from it.

This is my "simple code":

## THIS IS NOT A PATCH. BUT IT WORKS.
## THIS MEANS THAT IF WE FIND THE WAY
## TO REMOVE FOUR LINES STARTING
## FROM "Query 0" THE PROBLEM IS REALLY
## SOLVED (NOW I DON'T HAVE PARSER
## PROBLEMS AT ALL).
## lines is a list derived from a  readlines() call of the
## output of blastpgp.
## newlines has to be reconverted into an handle
## object.
def removeQuery0lines(lines):
        newlines = []
        count = 0
        for l in lines:
                if count == 4: count = 0
                if count != 0: count+=1
                if l.startswith('Query: 0'): count = 1
                if count == 0: newlines.append(l)
        return newlines


It should be interesting to develope a patch that  works  inside the parser.
I will try to work on it, in November, becaue now i cannot.
The right function to manipulate it should be (inside NCBIStandalone.py):

def _scan_hsp_alignment(self, uhandle, consumer):
        # Query: 11
GRGVSACA-------TCDGFFYRNQKVAVIGGGNTAVEEALYLSNIASEVHLIHRRDGF
        #           GRGVS+         TC    Y  + + V GGG+ + EE   L     +   I R+
        # Sbjct: 12
GRGVSSVVRRCIHKPTCKE--YAVKIIDVTGGGSFSAEEVQELREATLKEVDILRKVSG
        #
        # Query: 64 AEKILIKR 71
        #              I +K
        # Sbjct: 70 PNIIQLKD 77
        #

        while 1:
            # Blastn adds an extra line filled with spaces before Query
            attempt_read_and_call(uhandle, consumer.noevent, start='     ')
            read_and_call(uhandle, consumer.query, start='Query')
            read_and_call(uhandle, consumer.align, start='     ')
            read_and_call(uhandle, consumer.sbjct, start='Sbjct')
            read_and_call_while(uhandle, consumer.noevent, blank=1)
            line = safe_peekline(uhandle)
            # Alignment continues if I see a 'Query' or the spaces for
Blastn.
            if not (line.startswith('Query') or line.startswith('     ')):
                break

changing it in:

def _scan_hsp_alignment(self, uhandle, consumer):
        # Query: 11
GRGVSACA-------TCDGFFYRNQKVAVIGGGNTAVEEALYLSNIASEVHLIHRRDGF
        #           GRGVS+         TC    Y  + + V GGG+ + EE   L     +   I R+
        # Sbjct: 12
GRGVSSVVRRCIHKPTCKE--YAVKIIDVTGGGSFSAEEVQELREATLKEVDILRKVSG
        #
        # Query: 64 AEKILIKR 71
        #              I +K
        # Sbjct: 70 PNIIQLKD 77
        #
        while 1:
            # Blastn adds an extra line filled with spaces before Query
            attempt_read_and_call(uhandle, consumer.noevent, start='     ')
            # Remove Query 0 start (It is only at the beginning...)
            q0_count = attempt_read_and_call(uhandle, consumer.noevent,
start='Query: 0')
            if q0_count:
                  # if "Query 0" remove its alignment
                  read_and_call(uhandle, consumer.noevent, start='     ')
                  read_and_call(uhandle, consumer.noevent, start='Sbjct')
                  read_and_call_while(uhandle, consumer.noevent, blank=1)
            # Remove Query 0 end
            read_and_call(uhandle, consumer.query, start='Query')
            read_and_call(uhandle, consumer.align, start='     ')
            read_and_call(uhandle, consumer.sbjct, start='Sbjct')
            read_and_call_while(uhandle, consumer.noevent, blank=1)
            line = safe_peekline(uhandle)
            # Alignment continues if I see a 'Query' or the spaces for
Blastn.
            if not (line.startswith('Query') or line.startswith('     ')):
                break

BUT, i'm not sure of the patch and i didn't try at all... so i cannot
submit... It needs to be tryed and tested!!!!
And i'm also not sure if it is the right place to patch....!!!!




I hope this could help....
Miguel, have you time to try and test?

Thanks a lot.
Andrea




More information about the Biopython mailing list