[Biopython] Problems parsing with PSIBlastParser

Andrea andrea at biodec.com
Thu Oct 15 15:39:48 UTC 2009


Miguel Ortiz Lombardia ha scritto:
>
> Le 15 oct. 09 à 17:03, Andrea a écrit :
>
>> Miguel Ortiz Lombardia ha scritto:
>>>
>>> Le 15 oct. 09 à 15:54, Andrea a écrit :
>>>
>>>> Miguel Ortiz Lombardia ha scritto:
>>>>>
>>>>> Le 14 oct. 09 à 23:43, Andrea a écrit :
>>>>>
>>>>>> Miguel Ortiz Lombardia ha scritto:
>>>>>>> Le 14 oct. 09 à 17:02, Andrea a écrit :
>>>>>>>> Peter ha scritto:
>>>>>>>>> On Wed, Oct 14, 2009 at 3:28 PM, Andrea <andrea at biodec.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi to everybody,
>>>>>>>>>> I work with blast quite often and i could say i run hundreds of
>>>>>>>>>> thousand
>>>>>>>>>> of blastpgp. The "Query 0" outpt of blastpgp, is quite common
>>>>>>>>>> for
>>>>>>>>>> me, and
>>>>>>>>>> i wrote a patch to my code, to remove these "nasty" lines,
>>>>>>>>>> before
>>>>>>>>>> passing
>>>>>>>>>> the output to the parser.
>>>>>>>>>>
>>>>>>>>>> I found these type of lines in at least 1-2% of my runs. And i'm
>>>>>>>>>> fully sure
>>>>>>>>>> that i found them either in the output of blast via shell and in
>>>>>>>>>> the output
>>>>>>>>>> of blast via Biopython.
>>>>>>>>>>
>>>>>>>>>> The problem, according to me, is in the blastpgp algorithm and
>>>>>>>>>> maybe
>>>>>>>>>> could be managed in biopython (as i did in my code), cutting out
>>>>>>>>>> these
>>>>>>>>>> "Query 0" lines, because from the point of view of the
>>>>>>>>>> alignments,
>>>>>>>>>> they don't have any sense. It seems that blastpgp, wants to show
>>>>>>>>>> which is the part of the target sequence align to the query
>>>>>>>>>> before the
>>>>>>>>>> starting point of the query itself (something like opening a
>>>>>>>>>> gap,
>>>>>>>>>> at the
>>>>>>>>>> beginning of the query).
>>>>>>>>>> And this happens "sometimes", and without any apparent reason.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Andrea - do you have any small example output files with this
>>>>>>>>> problem? If it does occur fairly often (1 to 2% of the time),
>>>>>>>>> then
>>>>>>>>> we should try and update the parser to cope. Miguel's example
>>>>>>>>> is useful for testing while working on a bug fix, but too big to
>>>>>>>>> include as part the unit tests.
>>>>>>>>>
>>>>>>>>>
>>>>>>>> mmm... i've to search. I've some "cache" of gzipped blastpgp
>>>>>>>> outputs.
>>>>>>>> But I'm not
>>>>>>>> sure i've the original (maybe already patched).... waht I'm
>>>>>>>> sure, is
>>>>>>>> that in the
>>>>>>>> next month I'm going to run almost 100.000 blasptpg so I'll for
>>>>>>>> sure
>>>>>>>> find
>>>>>>>> something small. ;-)
>>>>>>>>>> What i think, is that there aren't any problem with biopython in
>>>>>>>>>> wrapping
>>>>>>>>>> the blastpgp process and maybe, but i'm not sure, the
>>>>>>>>>> difference in
>>>>>>>>>> the
>>>>>>>>>> output could be related to small differences in the parameter of
>>>>>>>>>> the process
>>>>>>>>>> (or in the environment... or in the .ncbirc file).
>>>>>>>>>>
>>>>>>>>>> I always was able to  observe  the identity  between the
>>>>>>>>>> blastpgp
>>>>>>>>>> output
>>>>>>>>>> via shell (bash) and the output of the popen wrapper.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> If you saw "Query 0" output at the command line (shell), then
>>>>>>>>> that is
>>>>>>>>> worth knowing.
>>>>>>>
>>>>>>> All I can say is that this is not what I observe.
>>>>>>> 1. When I send directly from the shell exactly the same blastpgp
>>>>>>> search ( I capture the full command line issued in the
>>>>>>> background by
>>>>>>> the python script with a 'ps -a | grep blastpgp' ) I have never
>>>>>>> find
>>>>>>> the 'Query: 0' lines.
>>>>>>> 2. When I send the search from within the python script and use
>>>>>>> 'nohup', the problem is reproducible, not random.
>>>>>> yes, i'm sure is reproducible. I  mean that what I've observed
>>>>>> wasn't
>>>>>> random on one sequence, but maybe along
>>>>>> many sequences...
>>>>>>> 3. If the script is sent without 'nohup', that is, if the shell
>>>>>>> keeps
>>>>>>> full control of both standard error and output, then again, the
>>>>>>> problem seems to disappear. I say 'seems' because I haven't tried
>>>>>>> with
>>>>>>> my longest ( more than 1300 aa ) sequences.
>>>>>>> 4. When, from within the python script I use, as Peter
>>>>>>> suggested, the
>>>>>>> BlastpgpCommandline class to ask blastpgp to send the output to a
>>>>>>> file
>>>>>>> ( the -o option ) the problem disappears irrespectively whether
>>>>>>> I use
>>>>>>> or not 'nohup'.
>>>>>>>
>>>>>>> Therefore, in my opinion, the problem is not with blastpgp but with
>>>>>>> the handling of its output by python or biopython.
>>>>>>>
>>>>>> I'm really curious. What you have is very strange, but i believe you
>>>>>> fully.
>>>>>>
>>>>>> Is there the possibility to have:
>>>>>> your database,
>>>>>> your .bashrc
>>>>>> the sequence
>>>>>> the exact command line.
>>>>>> the versione of blastpgp
>>>>>> the versione of blastpgp (2.2.18 ?)
>>>>>> the other things you use (matrix.... )
>>>>>> the different possibilities you try....( nohup/python/shell )
>>>>>> I should be reprodcible.
>>>>>>
>>>>>> Have you tried to observe the behaviour of the blastpgp process
>>>>>> with a
>>>>>> "strace" expecially at the
>>>>>> beginning?
>>>>>>
>>>>>>
>>>>>>>>>
>>>>>>>> i think so.
>>>>>>>>>> Miguel, could you check if really everything is identical?
>>>>>>>>>> Because i'm
>>>>>>>>>> really surprised of such a strange behaviour....
>>>>>>>>>
>>>>>>>>> Maybe the environment variables are different or something?
>>>>>>>
>>>>>>> Command options are absolutely the same, see above. I am surprised
>>>>>>> too, but I don't think blastpgp is sensitive to any environment
>>>>>>> variable and I don't see how they could change from an in-script
>>>>>>> to a
>>>>>>> standalone run.
>>>>>> I think only to .bashrc.
>>>>>>>
>>>>>>>>>
>>>>>>>>>> Despite, according to me there aren't any problem in biopython,
>>>>>>>>>> and
>>>>>>>>>> maybe,
>>>>>>>>>> Miguel will be able to discover some differences in the way
>>>>>>>>>> blastpgp is
>>>>>>>>>> launched, i would suggest to develop a patch (i could submit
>>>>>>>>>> mine),
>>>>>>>>>> that
>>>>>>>>>> could remove "Query 0" lines.
>>>>>>>
>>>>>>> I couldn't find any differences, so I'm afraid I can't help... I'm
>>>>>>> still testing the script, I will let you know if I find again this
>>>>>>> problem.
>>>>>> I will try to find the problem in my sequences (but i could say
>>>>>> that is
>>>>>> quite common)... and if i will
>>>>>> find i will try with the same parameters and the shell...
>>>>>>>
>>>>>>>>>>
>>>>>>>>> Could you upload your "Query 0" patch to Bug 2927?
>>>>>>>>> http://bugzilla.open-bio.org/show_bug.cgi?id=2927
>>>>>>>>>
>>>>>>>> Now i'm wuite busy, because i'm working on a different project and
>>>>>>>> i've
>>>>>>>> to manage deliveries...
>>>>>>>> but i will for sure upload my patch ASAP.
>>>>>>>>>
>>>>>>>>>> I aplogize if i understanded the problem wrongly and for the
>>>>>>>>>> fact
>>>>>>>>>> that
>>>>>>>>>> i'm entering in the discussion in this moment (maybe when the
>>>>>>>>>> discussion is finished)...
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Well I don't (yet) understand what the problem is either ;)
>>>>>>>>>
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>> Ciao
>>>>>>>> andrea
>>>>>>>
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -- Miguel
>>>>>>>
>>>>>>>
>>>>>> thanks.
>>>>>> Ciao
>>>>>> Andrea
>>>>>
>>>>> Hi!
>>>>>
>>>>> Some new findings that contradict my previous perception of the
>>>>> problem.
>>>>> Tonight my script failed again after stumbling upon the same problem
>>>>> for a different sequence. I have now investigated more carefully and
>>>>> found:
>>>>>
>>>>> 1. The problem (a line with 'Query: 0 ---' that impaired parsing of
>>>>> the blastpgp output) was encountered in all these cases:
>>>>>
>>>>> a) nohup myscript.py [some script options] sequences.fasta >&
>>>>> myscript.log &
>>>>> b) myscript.py [some script options] sequences.fasta
>>>>> c) /usr/local/blast-2.2.22/bin/blastpgp -d /opt/BlastDBs/nr -i
>>>>> U7.fasta -m 0 -o tmp.bl.txt -v 500 -b 1000 -a 6 -Q U7.nr.5.pssm -j 5
>>>>> -h 0.001 -p blastpgp
>>>>>
>>>>> That is, for the first time I was able to reproduce the problem
>>>>> from a
>>>>> standalone run of blastpgp.
>>>>>
>>>>> 2. The problem disappears with a previous version of blastpgp
>>>>> (2.2.18). Using this version, all these cases work:
>>>>>
>>>>> a) nohup myscript.py [some script options] sequences.fasta >&
>>>>> myscript.log &
>>>>> b) myscript.py [some script options] sequences.fasta
>>>>> c) /usr/local/blast-2.2.18/bin/blastpgp -d /opt/BlastDBs/nr -i
>>>>> U7.fasta -m 0 -o tmp.bl.txt -v 500 -b 1000 -a 6 -Q U7.nr.5.pssm -j 5
>>>>> -h 0.001 -p blastpgp
>>>>>
>>>>> So, it would seem that, as Andrea suggested, this is a bug in
>>>>> blastpgp, to be more precise, after blastpgp-2.2.18.
>>>>>
>>>>> 3. In this particular case, I notice that the problem happens with a
>>>>> sequence containing low complexity region(s). Now, I had thought that
>>>>> the default in blastpgp was to filter those sequences out! I'm
>>>>> running
>>>>> the original script again with blastpgp-2.2.22 with the filter on (-F
>>>>> T) to see if the problem persists.
>>>>>
>>>>> I will write to the blast-help address at the ncbi to let them know
>>>>> about the problem.
>>>>>
>>>>> Best,
>>>>>
>>>>>
>>>>> -- Miguel
>>>>>
>>>>>
>>>> Hi,
>>>> Thanks for your updates!!!. I can say one thing:
>>>> I've used in the past these three versione of blastpgp:
>>>> - 2.2.15
>>>> - 2.2.18
>>>> - 2.2.19
>>>> and i found the "Query 0" problem in all of them, but, if one
>>>> of them fails (i mean, gives "Query 0" output) the other may not fail
>>>> at all (they most probably not give the "Query 0" output).
>>>>
>>>> Another interesting things is that, with the three version, the same
>>>> database, and the same parameters, the output is quite different...
>>>> ...sorry.. very different...
>>>>
>>>> I'm also sure that it could happens also with the complexity region(s)
>>>> filter "True".
>>>> What i observe, is that there aren't parameters that make it
>>>> disappear. It
>>>> just disappear from a sequence, and it will appear in another.... in
>>>> other
>>>> word, changing parameters, make it "moving"  between sequences.
>>>>
>>>> I've never used blastpgp 2.2.22. So i cannot say anything about it.
>>>>
>>>> Thanks
>>>> Andrea
>>>
>>>
>>> Then it looks like something more weird than what I thought...
>>> Andrea, would you mind if I send your e-mail to the blast people? Or
>>> perhaps you can do it yourself... I wrote to
>>> blast-help at ncbi.nlm.nih.gov
>> If you can, for me is an help. I hope they will reply.
>> I can also send and email, buti f you have....
>
> I will do that, no problem
>
>>>
>>> I suspect they will tell us to use the XML output, but then, not all
>>> info I need seems to go there...
>> i think the same, and i suspect the XML output doesn't suffer of the
>> same problem.
>
> For me the XML is a no issue, since the NCBIXML parser does not really
> support PSI-BLAST searches:
> it can't get information on the rounds, convergence... If you have a
> look to NCBIXML.py you see a lot of XXX TODO PSI...
>
>>>
>>> Thanks a lot!
>>>
>>>
>> To you!!
>>> -- Miguel
>>>
>>>
>> And for my patch, is not a patch.I've checked now. To be fully
>> independent
>> from NcbiStandalone.py i didn't write a patch for it. I wrote a patch
>> in the sense that actually i remove from the blastpgp output, four
>> lines, starting
>> from the "Query 0" one, and then i submit the "new output" to the
>> parser.
>> In this way i'm reading the file twice (so it's not a good idea), but i
>> don't mind
>> if the NcbiStandalone.py change, because I'm fully independent from it.
>>
>> This is my "simple code":
>>
>> ## THIS IS NOT A PATCH. BUT IT WORKS.
>> ## THIS MEANS THAT IF WE FIND THE WAY
>> ## TO REMOVE FOUR LINES STARTING
>> ## FROM "Query 0" THE PROBLEM IS REALLY
>> ## SOLVED (NOW I DON'T HAVE PARSER
>> ## PROBLEMS AT ALL).
>> ## lines is a list derived from a  readlines() call of the
>> ## output of blastpgp.
>> ## newlines has to be reconverted into an handle
>> ## object.
>> def removeQuery0lines(lines):
>>        newlines = []
>>        count = 0
>>        for l in lines:
>>                if count == 4: count = 0
>>                if count != 0: count+=1
>>                if l.startswith('Query: 0'): count = 1
>>                if count == 0: newlines.append(l)
>>        return newlines
>>
>
> Thanks!
>
>>
>> It should be interesting to develope a patch that  works  inside the
>> parser.
>> I will try to work on it, in November, becaue now i cannot.
>> The right function to manipulate it should be (inside
>> NCBIStandalone.py):
>>
>> def _scan_hsp_alignment(self, uhandle, consumer):
>>        # Query: 11
>> GRGVSACA-------TCDGFFYRNQKVAVIGGGNTAVEEALYLSNIASEVHLIHRRDGF
>>        #           GRGVS+         TC    Y  + + V GGG+ + EE   L    
>> +   I R+
>>        # Sbjct: 12
>> GRGVSSVVRRCIHKPTCKE--YAVKIIDVTGGGSFSAEEVQELREATLKEVDILRKVSG
>>        #
>>        # Query: 64 AEKILIKR 71
>>        #              I +K
>>        # Sbjct: 70 PNIIQLKD 77
>>        #
>>
>>        while 1:
>>            # Blastn adds an extra line filled with spaces before Query
>>            attempt_read_and_call(uhandle, consumer.noevent,
>> start='     ')
>>            read_and_call(uhandle, consumer.query, start='Query')
>>            read_and_call(uhandle, consumer.align, start='     ')
>>            read_and_call(uhandle, consumer.sbjct, start='Sbjct')
>>            read_and_call_while(uhandle, consumer.noevent, blank=1)
>>            line = safe_peekline(uhandle)
>>            # Alignment continues if I see a 'Query' or the spaces for
>> Blastn.
>>            if not (line.startswith('Query') or line.startswith('    
>> ')):
>>                break
>>
>> changing it in:
>>
>> def _scan_hsp_alignment(self, uhandle, consumer):
>>        # Query: 11
>> GRGVSACA-------TCDGFFYRNQKVAVIGGGNTAVEEALYLSNIASEVHLIHRRDGF
>>        #           GRGVS+         TC    Y  + + V GGG+ + EE   L    
>> +   I R+
>>        # Sbjct: 12
>> GRGVSSVVRRCIHKPTCKE--YAVKIIDVTGGGSFSAEEVQELREATLKEVDILRKVSG
>>        #
>>        # Query: 64 AEKILIKR 71
>>        #              I +K
>>        # Sbjct: 70 PNIIQLKD 77
>>        #
>>        while 1:
>>            # Blastn adds an extra line filled with spaces before Query
>>            attempt_read_and_call(uhandle, consumer.noevent,
>> start='     ')
>>            # Remove Query 0 start (It is only at the beginning...)
>>            q0_count = attempt_read_and_call(uhandle, consumer.noevent,
>> start='Query: 0')
>>            if q0_count:
>>                  # if "Query 0" remove its alignment
>>                  read_and_call(uhandle, consumer.noevent, start='     ')
>>                  read_and_call(uhandle, consumer.noevent, start='Sbjct')
>>                  read_and_call_while(uhandle, consumer.noevent, blank=1)
>>            # Remove Query 0 end
>>            read_and_call(uhandle, consumer.query, start='Query')
>>            read_and_call(uhandle, consumer.align, start='     ')
>>            read_and_call(uhandle, consumer.sbjct, start='Sbjct')
>>            read_and_call_while(uhandle, consumer.noevent, blank=1)
>>            line = safe_peekline(uhandle)
>>            # Alignment continues if I see a 'Query' or the spaces for
>> Blastn.
>>            if not (line.startswith('Query') or line.startswith('    
>> ')):
>>                break
>>
>> BUT, i'm not sure of the patch and i didn't try at all... so i cannot
>> submit... It needs to be tryed and tested!!!!
>> And i'm also not sure if it is the right place to patch....!!!!
>>
>>
>>
>>
>> I hope this could help....
>> Miguel, have you time to try and test?
>>
>
> I'm afraid not in the next 6 weeks...
>
> Best,
>
>
>
> -- Miguel
>
>
So i will try in 3 weeks.. ;-)
And, as suggested from Peter, we will move the discussion to

http://bugzilla.open-bio.org/show_bug.cgi?id=2927

with some examples....

Ciao
Andrea





More information about the Biopython mailing list