[BioPython] import Standalone problems

Wed Jul 19 05:00:00 UTC 2006

Hi.
I ran into similar difficulties over the past few days and have made
some improvements to the XML parser.  More precisely, it now works with
blastall output; I have made no effort to parse output from the other
blast programs.  I do not expect that I have done any harm to other
parsing, however.

Attached are Record.py, NCBIStandalone.py, and NCBIXML.py.  I have not
yet spent significant time to clean up my changes.  Without getting into
specific modifications, I have made an effort to make consistent the
variables in Record and NCBIXML, focusing primarily on what I needed
this week.

One part I am not settled on is the reinitialization of Record.Blast at
every call to iterator.next() and, by extension, BlastParser.parse().
See NCBIXML.py, line 114.  Without reinitializing this class, we run
the risk of retaining portions of a Record from previously parsed
queries, which is what causes bug 1970, mentioned below.  Unfortunately,
this reinitialization exacts a significant performance penalty: at
least a factor of 10 by some rough measures.  I would appreciate any
suggestions for improvement here.
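
To make the state problem concrete, here is a minimal sketch.  The
classes BlastRecord, ReusingParser and FreshParser are hypothetical
stand-ins, not the real Record.Blast or NCBIXML code; they only show
why a record object that outlives one parse() call accumulates hits
from earlier queries, and why building a fresh one avoids that.

```python
# Hypothetical stand-ins for Record.Blast and the XML parser. These are
# NOT the real Biopython classes, only an illustration of the problem.

class BlastRecord(object):
    def __init__(self):
        self.alignments = []


class ReusingParser(object):
    """Keeps one record for its whole lifetime (the leaky pattern)."""

    def __init__(self):
        self._record = BlastRecord()

    def parse(self, titles):
        # Hits are appended to the same list on every call.
        for title in titles:
            self._record.alignments.append(title)
        return self._record


class FreshParser(object):
    """Builds a new record on every parse() call."""

    def parse(self, titles):
        record = BlastRecord()
        record.alignments.extend(titles)
        return record


queries = [["P1 hit"], ["P2 hit"], ["P3 hit"]]

reusing = ReusingParser()
leaky = [list(reusing.parse(q).alignments) for q in queries]
# leaky == [['P1 hit'], ['P1 hit', 'P2 hit'], ['P1 hit', 'P2 hit', 'P3 hit']]

fresh = FreshParser()
clean = [list(fresh.parse(q).alignments) for q in queries]
# clean == [['P1 hit'], ['P2 hit'], ['P3 hit']]
```

The leaky result has the same P1, P1+P2, P1+P2+P3 shape as the report
quoted below.  Whether a cheaper reset (clearing only the list-valued
fields, say) would be enough for the real class is something I have
not verified.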

I do apologize for not being more specific about my changes.  When I get
a chance (next week?), I will package them up as a proper patch and file
a bug.  Perhaps what I have done so far will be of use until then.

FYI, I have done all of my testing with Blast 2.2.13.  Version 2.2.14
does not seem to have separate <?xml> blocks within its output, so it
would require a different method of iteration.
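
For reference, iterating over output that does contain separate <?xml>
blocks can be sketched roughly as below.  This is not the code in
NCBIXML.py; split_xml_documents is a hypothetical helper, and it
assumes each block starts on a line beginning with "<?xml".

```python
def split_xml_documents(lines):
    """Yield each concatenated XML document as one string.

    Assumption: every document begins with a line starting with '<?xml'.
    """
    chunk = []
    for line in lines:
        # A new declaration closes the document collected so far.
        if line.startswith("<?xml") and chunk:
            yield "".join(chunk)
            chunk = []
        chunk.append(line)
    if chunk:
        yield "".join(chunk)


# Made-up input standing in for two concatenated result blocks.
sample = [
    '<?xml version="1.0"?>\n',
    "<BlastOutput>query 1</BlastOutput>\n",
    '<?xml version="1.0"?>\n',
    "<BlastOutput>query 2</BlastOutput>\n",
]
documents = list(split_xml_documents(sample))
```

Each yielded string could then be handed to an XML parser on its own.
Output with a single declaration comes back as one document, so
this approach alone would not separate queries there.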

-Jacob

Peter wrote:
> Rohini Damle wrote:
>> Hi,
>> I have an XML file with 4 blast records (for proteins P1, P2, P3, P4)
>> I am trying to extract alignment information for each of them.
>> So I wrote the following code:
>>
>>  E_VALUE_THRESH = 20
>>  for b_record in b_iterator:
>>      for alignment in b_record.alignments:
>>          for hsp in alignment.hsps:
>>              # report only HSPs below the E-value threshold
>>              if hsp.expect < E_VALUE_THRESH:
>>                  print '****Alignment****'
>>                  print 'sequence:', alignment.title.split()[0]
>>
>> With this code, I am getting information for P1,
>> then information for P1 + P2
>> then for P1+P2 +P3
>> and finally for P1+P2+P3+P4
>> Why is this so?
>> Is there something wrong with the looping?
> 
> I'm aware of something funny with the XML parsing, Bug 1970, which might 
> well be the same issue:
> 
> http://bugzilla.open-bio.org/show_bug.cgi?id=1970
> 
> I confess I haven't looked into exactly what is going wrong here - too 
> many other demands on my time to learn about XML and how BioPython 
> parses it.
> 
> Does the workaround on the bug report help?  Depending on which version
> of standalone blast you have installed, you might have better luck with
> plain text output - the trouble is this is a moving target and the NCBI
> keeps tweaking it.
> 
> Peter

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: NCBIStandalone.py
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20060719/1e57c9bb/attachment.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: NCBIXML.py
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20060719/1e57c9bb/attachment-0001.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Record.py
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20060719/1e57c9bb/attachment-0002.ksh>