[Biopython] Problems parsing with PSIBlastParser

Miguel Ortiz Lombardia ibdeno at gmail.com
Thu Oct 15 15:33:59 UTC 2009


Le 15 oct. 09 à 17:03, Andrea a écrit :

> Miguel Ortiz Lombardia ha scritto:
>>
>> Le 15 oct. 09 à 15:54, Andrea a écrit :
>>
>>> Miguel Ortiz Lombardia ha scritto:
>>>>
>>>> Le 14 oct. 09 à 23:43, Andrea a écrit :
>>>>
>>>>> Miguel Ortiz Lombardia ha scritto:
>>>>>> Le 14 oct. 09 à 17:02, Andrea a écrit :
>>>>>>> Peter ha scritto:
>>>>>>>> On Wed, Oct 14, 2009 at 3:28 PM, Andrea <andrea at biodec.com>  
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi to everybody,
>>>>>>>>> I work with blast quite often and i could say i run hundreds  
>>>>>>>>> of
>>>>>>>>> thousand
>>>>>>>>> of blastpgp. The "Query 0" outpt of blastpgp, is quite  
>>>>>>>>> common for
>>>>>>>>> me, and
>>>>>>>>> i wrote a patch to my code, to remove these "nasty" lines,  
>>>>>>>>> before
>>>>>>>>> passing
>>>>>>>>> the output to the parser.
>>>>>>>>>
>>>>>>>>> I found these type of lines in at least 1-2% of my runs. And  
>>>>>>>>> i'm
>>>>>>>>> fully sure
>>>>>>>>> that i found them either in the output of blast via shell  
>>>>>>>>> and in
>>>>>>>>> the output
>>>>>>>>> of blast via Biopython.
>>>>>>>>>
>>>>>>>>> The problem, according to me, is in the blastpgp algorithm and
>>>>>>>>> maybe
>>>>>>>>> could be managed in biopython (as i did in my code), cutting  
>>>>>>>>> out
>>>>>>>>> these
>>>>>>>>> "Query 0" lines, because from the point of view of the  
>>>>>>>>> alignments,
>>>>>>>>> they don't have any sense. It seems that blastpgp, wants to  
>>>>>>>>> show
>>>>>>>>> which is the part of the target sequence align to the query
>>>>>>>>> before the
>>>>>>>>> starting point of the query itself (something like opening a  
>>>>>>>>> gap,
>>>>>>>>> at the
>>>>>>>>> beginning of the query).
>>>>>>>>> And this happens "sometimes", and without any apparent reason.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Andrea - do you have any small example output files with this
>>>>>>>> problem? If it does occur fairly often (1 to 2% of the time),  
>>>>>>>> then
>>>>>>>> we should try and update the parser to cope. Miguel's example
>>>>>>>> is useful for testing while working on a bug fix, but too big  
>>>>>>>> to
>>>>>>>> include as part the unit tests.
>>>>>>>>
>>>>>>>>
>>>>>>> mmm... i've to search. I've some "cache" of gzipped blastpgp
>>>>>>> outputs.
>>>>>>> But I'm not
>>>>>>> sure i've the original (maybe already patched).... waht I'm  
>>>>>>> sure, is
>>>>>>> that in the
>>>>>>> next month I'm going to run almost 100.000 blasptpg so I'll  
>>>>>>> for sure
>>>>>>> find
>>>>>>> something small. ;-)
>>>>>>>>> What i think, is that there aren't any problem with  
>>>>>>>>> biopython in
>>>>>>>>> wrapping
>>>>>>>>> the blastpgp process and maybe, but i'm not sure, the
>>>>>>>>> difference in
>>>>>>>>> the
>>>>>>>>> output could be related to small differences in the  
>>>>>>>>> parameter of
>>>>>>>>> the process
>>>>>>>>> (or in the environment... or in the .ncbirc file).
>>>>>>>>>
>>>>>>>>> I always was able to  observe  the identity  between the  
>>>>>>>>> blastpgp
>>>>>>>>> output
>>>>>>>>> via shell (bash) and the output of the popen wrapper.
>>>>>>>>>
>>>>>>>>
>>>>>>>> If you saw "Query 0" output at the command line (shell), then
>>>>>>>> that is
>>>>>>>> worth knowing.
>>>>>>
>>>>>> All I can say is that this is not what I observe.
>>>>>> 1. When I send directly from the shell exactly the same blastpgp
>>>>>> search ( I capture the full command line issued in the  
>>>>>> background by
>>>>>> the python script with a 'ps -a | grep blastpgp' ) I have never  
>>>>>> find
>>>>>> the 'Query: 0' lines.
>>>>>> 2. When I send the search from within the python script and use
>>>>>> 'nohup', the problem is reproducible, not random.
>>>>> yes, i'm sure is reproducible. I  mean that what I've observed  
>>>>> wasn't
>>>>> random on one sequence, but maybe along
>>>>> many sequences...
>>>>>> 3. If the script is sent without 'nohup', that is, if the shell  
>>>>>> keeps
>>>>>> full control of both standard error and output, then again, the
>>>>>> problem seems to disappear. I say 'seems' because I haven't tried
>>>>>> with
>>>>>> my longest ( more than 1300 aa ) sequences.
>>>>>> 4. When, from within the python script I use, as Peter  
>>>>>> suggested, the
>>>>>> BlastpgpCommandline class to ask blastpgp to send the output to a
>>>>>> file
>>>>>> ( the -o option ) the problem disappears irrespectively whether  
>>>>>> I use
>>>>>> or not 'nohup'.
>>>>>>
>>>>>> Therefore, in my opinion, the problem is not with blastpgp but  
>>>>>> with
>>>>>> the handling of its output by python or biopython.
>>>>>>
>>>>> I'm really curious. What you have is very strange, but i believe  
>>>>> you
>>>>> fully.
>>>>>
>>>>> Is there the possibility to have:
>>>>> your database,
>>>>> your .bashrc
>>>>> the sequence
>>>>> the exact command line.
>>>>> the versione of blastpgp
>>>>> the versione of blastpgp (2.2.18 ?)
>>>>> the other things you use (matrix.... )
>>>>> the different possibilities you try....( nohup/python/shell )
>>>>> I should be reprodcible.
>>>>>
>>>>> Have you tried to observe the behaviour of the blastpgp process  
>>>>> with a
>>>>> "strace" expecially at the
>>>>> beginning?
>>>>>
>>>>>
>>>>>>>>
>>>>>>> i think so.
>>>>>>>>> Miguel, could you check if really everything is identical?
>>>>>>>>> Because i'm
>>>>>>>>> really surprised of such a strange behaviour....
>>>>>>>>
>>>>>>>> Maybe the environment variables are different or something?
>>>>>>
>>>>>> Command options are absolutely the same, see above. I am  
>>>>>> surprised
>>>>>> too, but I don't think blastpgp is sensitive to any environment
>>>>>> variable and I don't see how they could change from an in- 
>>>>>> script to a
>>>>>> standalone run.
>>>>> I think only to .bashrc.
>>>>>>
>>>>>>>>
>>>>>>>>> Despite, according to me there aren't any problem in  
>>>>>>>>> biopython,
>>>>>>>>> and
>>>>>>>>> maybe,
>>>>>>>>> Miguel will be able to discover some differences in the way
>>>>>>>>> blastpgp is
>>>>>>>>> launched, i would suggest to develop a patch (i could submit
>>>>>>>>> mine),
>>>>>>>>> that
>>>>>>>>> could remove "Query 0" lines.
>>>>>>
>>>>>> I couldn't find any differences, so I'm afraid I can't help...  
>>>>>> I'm
>>>>>> still testing the script, I will let you know if I find again  
>>>>>> this
>>>>>> problem.
>>>>> I will try to find the problem in my sequences (but i could say
>>>>> that is
>>>>> quite common)... and if i will
>>>>> find i will try with the same parameters and the shell...
>>>>>>
>>>>>>>>>
>>>>>>>> Could you upload your "Query 0" patch to Bug 2927?
>>>>>>>> http://bugzilla.open-bio.org/show_bug.cgi?id=2927
>>>>>>>>
>>>>>>> Now i'm wuite busy, because i'm working on a different project  
>>>>>>> and
>>>>>>> i've
>>>>>>> to manage deliveries...
>>>>>>> but i will for sure upload my patch ASAP.
>>>>>>>>
>>>>>>>>> I aplogize if i understanded the problem wrongly and for the  
>>>>>>>>> fact
>>>>>>>>> that
>>>>>>>>> i'm entering in the discussion in this moment (maybe when the
>>>>>>>>> discussion is finished)...
>>>>>>>>>
>>>>>>>>
>>>>>>>> Well I don't (yet) understand what the problem is either ;)
>>>>>>>>
>>>>>>>> Peter
>>>>>>>>
>>>>>>> Ciao
>>>>>>> andrea
>>>>>>
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>>
>>>>>>
>>>>>> -- Miguel
>>>>>>
>>>>>>
>>>>> thanks.
>>>>> Ciao
>>>>> Andrea
>>>>
>>>> Hi!
>>>>
>>>> Some new findings that contradict my previous perception of the
>>>> problem.
>>>> Tonight my script failed again after stumbling upon the same  
>>>> problem
>>>> for a different sequence. I have now investigated more carefully  
>>>> and
>>>> found:
>>>>
>>>> 1. The problem (a line with 'Query: 0 ---' that impaired parsing of
>>>> the blastpgp output) was encountered in all these cases:
>>>>
>>>> a) nohup myscript.py [some script options] sequences.fasta >&
>>>> myscript.log &
>>>> b) myscript.py [some script options] sequences.fasta
>>>> c) /usr/local/blast-2.2.22/bin/blastpgp -d /opt/BlastDBs/nr -i
>>>> U7.fasta -m 0 -o tmp.bl.txt -v 500 -b 1000 -a 6 -Q U7.nr.5.pssm - 
>>>> j 5
>>>> -h 0.001 -p blastpgp
>>>>
>>>> That is, for the first time I was able to reproduce the problem  
>>>> from a
>>>> standalone run of blastpgp.
>>>>
>>>> 2. The problem disappears with a previous version of blastpgp
>>>> (2.2.18). Using this version, all these cases work:
>>>>
>>>> a) nohup myscript.py [some script options] sequences.fasta >&
>>>> myscript.log &
>>>> b) myscript.py [some script options] sequences.fasta
>>>> c) /usr/local/blast-2.2.18/bin/blastpgp -d /opt/BlastDBs/nr -i
>>>> U7.fasta -m 0 -o tmp.bl.txt -v 500 -b 1000 -a 6 -Q U7.nr.5.pssm - 
>>>> j 5
>>>> -h 0.001 -p blastpgp
>>>>
>>>> So, it would seem that, as Andrea suggested, this is a bug in
>>>> blastpgp, to be more precise, after blastpgp-2.2.18.
>>>>
>>>> 3. In this particular case, I notice that the problem happens  
>>>> with a
>>>> sequence containing low complexity region(s). Now, I had thought  
>>>> that
>>>> the default in blastpgp was to filter those sequences out! I'm  
>>>> running
>>>> the original script again with blastpgp-2.2.22 with the filter on  
>>>> (-F
>>>> T) to see if the problem persists.
>>>>
>>>> I will write to the blast-help address at the ncbi to let them know
>>>> about the problem.
>>>>
>>>> Best,
>>>>
>>>>
>>>> -- Miguel
>>>>
>>>>
>>> Hi,
>>> Thanks for your updates!!!. I can say one thing:
>>> I've used in the past these three versione of blastpgp:
>>> - 2.2.15
>>> - 2.2.18
>>> - 2.2.19
>>> and i found the "Query 0" problem in all of them, but, if one
>>> of them fails (i mean, gives "Query 0" output) the other may not  
>>> fail
>>> at all (they most probably not give the "Query 0" output).
>>>
>>> Another interesting things is that, with the three version, the same
>>> database, and the same parameters, the output is quite different...
>>> ...sorry.. very different...
>>>
>>> I'm also sure that it could happens also with the complexity  
>>> region(s)
>>> filter "True".
>>> What i observe, is that there aren't parameters that make it
>>> disappear. It
>>> just disappear from a sequence, and it will appear in another.... in
>>> other
>>> word, changing parameters, make it "moving"  between sequences.
>>>
>>> I've never used blastpgp 2.2.22. So i cannot say anything about it.
>>>
>>> Thanks
>>> Andrea
>>
>>
>> Then it looks like something more weird than what I thought...
>> Andrea, would you mind if I send your e-mail to the blast people? Or
>> perhaps you can do it yourself... I wrote to blast-help at ncbi.nlm.nih.gov
> If you can, for me is an help. I hope they will reply.
> I can also send and email, buti f you have....

I will do that, no problem

>>
>> I suspect they will tell us to use the XML output, but then, not all
>> info I need seems to go there...
> i think the same, and i suspect the XML output doesn't suffer of the
> same problem.

For me the XML is a no issue, since the NCBIXML parser does not really  
support PSI-BLAST searches:
it can't get information on the rounds, convergence... If you have a  
look to NCBIXML.py you see a lot of XXX TODO PSI...

>>
>> Thanks a lot!
>>
>>
> To you!!
>> -- Miguel
>>
>>
> And for my patch, is not a patch.I've checked now. To be fully  
> independent
> from NcbiStandalone.py i didn't write a patch for it. I wrote a patch
> in the sense that actually i remove from the blastpgp output, four
> lines, starting
> from the "Query 0" one, and then i submit the "new output" to the  
> parser.
> In this way i'm reading the file twice (so it's not a good idea),  
> but i
> don't mind
> if the NcbiStandalone.py change, because I'm fully independent from  
> it.
>
> This is my "simple code":
>
> ## THIS IS NOT A PATCH. BUT IT WORKS.
> ## THIS MEANS THAT IF WE FIND THE WAY
> ## TO REMOVE FOUR LINES STARTING
> ## FROM "Query 0" THE PROBLEM IS REALLY
> ## SOLVED (NOW I DON'T HAVE PARSER
> ## PROBLEMS AT ALL).
> ## lines is a list derived from a  readlines() call of the
> ## output of blastpgp.
> ## newlines has to be reconverted into an handle
> ## object.
> def removeQuery0lines(lines):
>        newlines = []
>        count = 0
>        for l in lines:
>                if count == 4: count = 0
>                if count != 0: count+=1
>                if l.startswith('Query: 0'): count = 1
>                if count == 0: newlines.append(l)
>        return newlines
>

Thanks!

>
> It should be interesting to develope a patch that  works  inside the  
> parser.
> I will try to work on it, in November, becaue now i cannot.
> The right function to manipulate it should be (inside  
> NCBIStandalone.py):
>
> def _scan_hsp_alignment(self, uhandle, consumer):
>        # Query: 11
> GRGVSACA-------TCDGFFYRNQKVAVIGGGNTAVEEALYLSNIASEVHLIHRRDGF
>        #           GRGVS+         TC    Y  + + V GGG+ + EE   L      
> +   I R+
>        # Sbjct: 12
> GRGVSSVVRRCIHKPTCKE--YAVKIIDVTGGGSFSAEEVQELREATLKEVDILRKVSG
>        #
>        # Query: 64 AEKILIKR 71
>        #              I +K
>        # Sbjct: 70 PNIIQLKD 77
>        #
>
>        while 1:
>            # Blastn adds an extra line filled with spaces before Query
>            attempt_read_and_call(uhandle, consumer.noevent,  
> start='     ')
>            read_and_call(uhandle, consumer.query, start='Query')
>            read_and_call(uhandle, consumer.align, start='     ')
>            read_and_call(uhandle, consumer.sbjct, start='Sbjct')
>            read_and_call_while(uhandle, consumer.noevent, blank=1)
>            line = safe_peekline(uhandle)
>            # Alignment continues if I see a 'Query' or the spaces for
> Blastn.
>            if not (line.startswith('Query') or line.startswith('      
> ')):
>                break
>
> changing it in:
>
> def _scan_hsp_alignment(self, uhandle, consumer):
>        # Query: 11
> GRGVSACA-------TCDGFFYRNQKVAVIGGGNTAVEEALYLSNIASEVHLIHRRDGF
>        #           GRGVS+         TC    Y  + + V GGG+ + EE   L      
> +   I R+
>        # Sbjct: 12
> GRGVSSVVRRCIHKPTCKE--YAVKIIDVTGGGSFSAEEVQELREATLKEVDILRKVSG
>        #
>        # Query: 64 AEKILIKR 71
>        #              I +K
>        # Sbjct: 70 PNIIQLKD 77
>        #
>        while 1:
>            # Blastn adds an extra line filled with spaces before Query
>            attempt_read_and_call(uhandle, consumer.noevent,  
> start='     ')
>            # Remove Query 0 start (It is only at the beginning...)
>            q0_count = attempt_read_and_call(uhandle, consumer.noevent,
> start='Query: 0')
>            if q0_count:
>                  # if "Query 0" remove its alignment
>                  read_and_call(uhandle, consumer.noevent,  
> start='     ')
>                  read_and_call(uhandle, consumer.noevent,  
> start='Sbjct')
>                  read_and_call_while(uhandle, consumer.noevent,  
> blank=1)
>            # Remove Query 0 end
>            read_and_call(uhandle, consumer.query, start='Query')
>            read_and_call(uhandle, consumer.align, start='     ')
>            read_and_call(uhandle, consumer.sbjct, start='Sbjct')
>            read_and_call_while(uhandle, consumer.noevent, blank=1)
>            line = safe_peekline(uhandle)
>            # Alignment continues if I see a 'Query' or the spaces for
> Blastn.
>            if not (line.startswith('Query') or line.startswith('      
> ')):
>                break
>
> BUT, i'm not sure of the patch and i didn't try at all... so i cannot
> submit... It needs to be tryed and tested!!!!
> And i'm also not sure if it is the right place to patch....!!!!
>
>
>
>
> I hope this could help....
> Miguel, have you time to try and test?
>

I'm afraid not in the next 6 weeks...

Best,



-- Miguel







More information about the Biopython mailing list