[BioPython] import Standalone problems
Jacob Joseph
jmjoseph at andrew.cmu.edu
Wed Jul 19 05:00:00 UTC 2006
Hi.
I encountered similar difficulties over the past few days myself and
have made some improvements to the XML parser. Well, that is, it now
functions with blastall, but I have made no effort to parse the other
blast programs. I do not expect I have done any harm to other parsing,
however.
Attached are Record.py, NCBIStandalone.py, and NCBIXML.py. I have not
yet spent significant time to clean up my changes. Without getting into
specific modifications, I have made an effort to make consistent the
variables in Record and NCBIXML, focusing primarily on what I needed
this week.
One portion I am not settled on is the re-initialization of Record.Blast at
every call to iterator.next(), and, by extension, BlastParser.parse().
See NCBIXML.py, line 114. Without re-initializing this class, we run
the risk of retaining portions of a Record from previously parsed
queries. This causes bug 1970, mentioned below. Unfortunately,
this re-initialization exacts a significant performance penalty of at
least a factor of 10 by some rough measures. I would appreciate any
suggestions for improvement here.
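To make the trade-off concrete, here is a toy sketch of the failure mode. It
is not the code in NCBIXML.py: ToyRecord, ReusingParser and FreshParser are
hypothetical stand-ins, and the "hits" are plain strings. It only shows why a
record object that outlives one parse() call accumulates results, and why
building a new one per call avoids that.

```python
class ToyRecord:
    """Hypothetical stand-in for a parsed BLAST record (not Record.Blast)."""

    def __init__(self):
        self.alignments = []


class ReusingParser:
    """Keeps a single record for its whole lifetime, so results pile up."""

    def __init__(self):
        self._record = ToyRecord()

    def parse(self, hits):
        # The old alignments are still in the list when new ones are added.
        self._record.alignments.extend(hits)
        return self._record


class FreshParser:
    """Builds a new record on every parse() call."""

    def parse(self, hits):
        record = ToyRecord()
        record.alignments.extend(hits)
        return record


# Three made-up queries, one hit each.
queries = [["P1-hit"], ["P2-hit"], ["P3-hit"]]

reusing = ReusingParser()
reused_counts = [len(reusing.parse(q).alignments) for q in queries]

fresh = FreshParser()
fresh_counts = [len(fresh.parse(q).alignments) for q in queries]

print(reused_counts)  # [1, 2, 3]
print(fresh_counts)   # [1, 1, 1]
```

The growing counts in the first case match the P1, then P1+P2, then P1+P2+P3
pattern reported below. The sketch says nothing about the cost of
re-initializing the real class.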
I do apologize for not being more specific about my changes. When I get
a chance (next week?), I will package them up as a proper patch and file
a bug. Perhaps what I have done so far will be of use until then.
FYI, I have done all of my testing with BLAST 2.2.13. Version 2.2.14 does
not seem to have separate <?xml> blocks within its output, requiring a
different method of iteration.
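For illustration, this is roughly what iterating over separate <?xml> blocks
can look like. It is a sketch only, not the approach taken in the attached
files: split_xml_documents is a hypothetical helper, and the sample string is
made up rather than real blastall output.

```python
import re


def split_xml_documents(text):
    """Yield each XML document found in a string of concatenated documents.

    A new document is assumed to start at every "<?xml" declaration.
    Anything before the first declaration is dropped.
    """
    starts = [m.start() for m in re.finditer(r"<\?xml", text)]
    if not starts:
        # No declaration at all: treat non-empty input as one document.
        if text.strip():
            yield text
        return
    ends = starts[1:] + [len(text)]
    for begin, end in zip(starts, ends):
        yield text[begin:end]


# Made-up sample: two documents back to back, one per query.
sample = (
    '<?xml version="1.0"?>\n<BlastOutput><q>P1</q></BlastOutput>\n'
    '<?xml version="1.0"?>\n<BlastOutput><q>P2</q></BlastOutput>\n'
)

chunks = list(split_xml_documents(sample))
print(len(chunks))  # 2
```

If the output contains only one <?xml> declaration, as 2.2.14 appears to
produce, this yields a single chunk, so the per-query boundaries would have to
be found some other way.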
-Jacob
Peter wrote:
> Rohini Damle wrote:
>> Hi,
>> I have an XML file with 4 blast records (for proteins P1, P2, P3, P4)
>> I am trying to extract alignment information for each of them.
>> So I wrote the following code:
>>
>> for b_record in b_iterator:
>>     E_VALUE_THRESH = 20
>>     for alignment in b_record.alignments:
>>         for hsp in alignment.hsps:
>>             if hsp.expect < E_VALUE_THRESH:
>>                 print '****Alignment****'
>>                 print 'sequence:', alignment.title.split()[0]
>>
>> With this code, I am getting information for P1,
>> then information for P1+P2,
>> then for P1+P2+P3,
>> and finally for P1+P2+P3+P4.
>> Why is this so?
>> Is there something wrong with the looping?
>
> I'm aware of something funny with the XML parsing, Bug 1970, which might
> well be the same issue:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=1970
>
> I confess I haven't looked into exactly what is going wrong here - too
> many other demands on my time to learn about XML and how BioPython
> parses it.
>
> Does the workaround on the bug report help? Depending on which version
> of standalone blast you have installed, you might have better luck with
> plain text output - the trouble is this is a moving target and the NCBI
> keeps tweaking it.
>
> Peter
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: NCBIStandalone.py
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20060719/1e57c9bb/attachment.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: NCBIXML.py
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20060719/1e57c9bb/attachment-0001.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Record.py
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20060719/1e57c9bb/attachment-0002.ksh>