wrote:
> Hi.
> I encountered similar difficulties over the past few days myself and
> have made some improvements to the XML parser. Well, that is, it now
> functions with blastall, but I have made no effort to parse the other
> blast programs. I do not expect I have done any harm to other parsing,
> however.
>
> Attached are Record.py, NCBIStandalone.py, and NCBIXML.py. I have not
> yet spent significant time to clean up my changes. Without getting into
> specific modifications, I have made an effort to make consistent the
> variables in Record and NCBIXML, focusing primarily on what I needed
> this week.
>
> One portion I am not settled on is the reinitialization of Record.Blast
> at every call to iterator.next(), and, by extension, BlastParser.parse().
> See NCBIXML.py, line 114. Without re-initializing this class, we run
> the risk of retaining portions of a Record from previously parsed
> queries. This causes Bug 1970, mentioned below. Unfortunately,
> this re-initialization exacts a significant performance penalty of at
> least a factor of 10 by some rough measures. I would appreciate any
> suggestions for improvement here.
>
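To make the trade-off concrete, here is a minimal sketch of the behaviour described above. FakeRecord, ReusingParser and FreshParser are hypothetical stand-ins, not Biopython classes, and the code is written for a current Python interpreter rather than the Python 2 of the attached module. A parser that keeps one record object alive appends every new query's hits to the same list, which matches the P1, P1+P2, ... pattern reported below; building a new record on each parse() call does not.

```python
class FakeRecord:
    """Hypothetical stand-in for Record.Blast; holds only a list of hits."""

    def __init__(self):
        self.alignments = []


class ReusingParser:
    """Keeps a single record for its whole lifetime (the leaky variant)."""

    def __init__(self):
        self._record = FakeRecord()

    def parse(self, hits):
        # Hits from earlier calls are still in the list.
        self._record.alignments.extend(hits)
        return self._record


class FreshParser:
    """Creates a new record on every call (the re-initializing variant)."""

    def parse(self, hits):
        record = FakeRecord()
        record.alignments.extend(hits)
        return record


queries = [["P1-hit"], ["P2-hit"], ["P3-hit"]]

reusing = ReusingParser()
leaky = [list(reusing.parse(hits).alignments) for hits in queries]

fresh = FreshParser()
clean = [list(fresh.parse(hits).alignments) for hits in queries]

print(leaky)
print(clean)
```

If constructing the full record turns out to be the expensive part, resetting only the per-query list attributes might be cheaper. That is a guess and has not been measured here.
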
> I do apologize for not being more specific about my changes. When I get
> a chance (next week?), I will package them up as a proper patch and file
> a bug. Perhaps what I have done so far will be of use until then.
>
> FYI, I have done all of my testing with Blast 2.2.13. 2.2.14 does not
> seem to have separate blocks within its output, which requires a
> different method of iteration.
>
> -Jacob
>
> Peter wrote:
> > Rohini Damle wrote:
> >> Hi,
> >> I have an XML file with 4 blast records (for proteins P1, P2, P3, P4)
> >> I am trying to extract alignment information for each of them.
> >> So I wrote the following code:
> >>
> >> for b_record in b_iterator:
> >>
> >>     E_VALUE_THRESH = 20
> >>     for alignment in b_record.alignments:
> >>         for hsp in alignment.hsps:
> >>             if hsp.expect < E_VALUE_THRESH:
> >>
> >>                 print '****Alignment****'
> >>                 print 'sequence:', alignment.title.split()[0]
> >>
> >> With this code, I am getting information for P1,
> >> then information for P1 + P2,
> >> then for P1 + P2 + P3,
> >> and finally for P1 + P2 + P3 + P4.
> >> Why is this so?
> >> Is there something wrong with the looping?
> >
> > I'm aware of something funny with the XML parsing, Bug 1970, which might
> > well be the same issue:
> >
> > http://bugzilla.open-bio.org/show_bug.cgi?id=1970
> >
> > I confess I haven't looked into exactly what is going wrong here - too
> > many other demands on my time to learn about XML and how BioPython
> > parses it.
> >
> > Does the workaround on the bug report help? Depending on which version
> > of standalone blast you have installed, you might have better luck with
> > plain text output - the trouble is this is a moving target and the NCBI
> > keeps tweaking it.
> >
> > Peter
> >
>
>
>
> # Copyright 1999-2000 by Jeffrey Chang. All rights reserved.
> # This code is part of the Biopython distribution and governed by its
> # license. Please see the LICENSE file that should have been included
> # as part of this package.
> # Patches by Mike Poidinger to support multiple databases.
>
> """
> This module provides code to work with the standalone version of
> BLAST, either blastall or blastpgp, provided by the NCBI.
> http://www.ncbi.nlm.nih.gov/BLAST/
>
> Classes:
> LowQualityBlastError Exception that indicates low quality query sequences.
> BlastParser Parses output from blast.
> BlastErrorParser Parses output and tries to diagnose possible errors.
> PSIBlastParser Parses output from psi-blast.
> Iterator Iterates over a file of blast results.
>
> _Scanner Scans output from standalone BLAST.
> _BlastConsumer Consumes output from blast.
> _PSIBlastConsumer Consumes output from psi-blast.
> _HeaderConsumer Consumes header information.
> _DescriptionConsumer Consumes description information.
> _AlignmentConsumer Consumes alignment information.
> _HSPConsumer Consumes hsp information.
> _DatabaseReportConsumer Consumes database report information.
> _ParametersConsumer Consumes parameters information.
>
> Functions:
> blastall Execute blastall.
> blastpgp Execute blastpgp.
> rpsblast Execute rpsblast.
>
> """
>
> from __future__ import generators
> import os
> import re
>
> from Bio import File
> from Bio.ParserSupport import *
> from Bio.Blast import Record
>
>
> class LowQualityBlastError(Exception):
> """Error caused by running a low quality sequence through BLAST.
>
> When low quality sequences (like GenBank entries containing only
> stretches of a single nucleotide) are BLASTed, BLAST generates an
> error and cannot perform the search. This error should be raised
> for the BLAST reports produced in this case.
> """
> pass
>
> class ShortQueryBlastError(Exception):
> """Error caused by running a short query sequence through BLAST.
>
> If the query sequence is too short, BLAST outputs warnings and errors:
> Searching[blastall] WARNING: [000.000] AT1G08320: SetUpBlastSearch failed.
> [blastall] ERROR: [000.000] AT1G08320: Blast:
> [blastall] ERROR: [000.000] AT1G08320: Blast: Query must be at least wordsize
> done
>
> This exception is raised when that condition is detected.
>
> """
> pass
>
>
> class _Scanner:
> """Scan BLAST output from blastall or blastpgp.
>
> Tested with blastall and blastpgp v2.0.10, v2.0.11
>
> Methods:
> feed Feed data into the scanner.
>
> """
> def feed(self, handle, consumer):
> """S.feed(handle, consumer)
>
> Feed in a BLAST report for scanning. handle is a file-like
> object that contains the BLAST report. consumer is a Consumer
> object that will receive events as the report is scanned.
>
> """
> if isinstance(handle, File.UndoHandle):
> uhandle = handle
> else:
> uhandle = File.UndoHandle(handle)
>
> # Try to fast-forward to the beginning of the blast report.
> read_and_call_until(uhandle, consumer.noevent, contains='BLAST')
> # Now scan the BLAST report.
> self._scan_header(uhandle, consumer)
> self._scan_rounds(uhandle, consumer)
> self._scan_database_report(uhandle, consumer)
> self._scan_parameters(uhandle, consumer)
>
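For anyone new to this module, the scanner/consumer split used by feed() can be sketched in a few lines. TinyScanner and TinyConsumer are hypothetical names, and this is only an illustration of the event pattern, not the real _Scanner logic: the scanner decides what each line is and calls the matching consumer method, and the consumer decides what to store.

```python
import io


class TinyConsumer:
    """Hypothetical consumer: records each event it receives, in order."""

    def __init__(self):
        self.events = []

    def start_header(self):
        self.events.append(("start_header", None))

    def version(self, line):
        # e.g. "BLASTP 2.0.10 [Aug-26-1999]" -> "2.0.10"
        self.events.append(("version", line.split()[1]))

    def end_header(self):
        self.events.append(("end_header", None))


class TinyScanner:
    """Hypothetical scanner: finds the version line and emits events."""

    def feed(self, handle, consumer):
        for line in handle:
            # Fast-forward to the first line that mentions BLAST.
            if "BLAST" in line:
                consumer.start_header()
                consumer.version(line)
                consumer.end_header()
                return
        raise ValueError("no BLAST line found")


report = io.StringIO("some preamble\nBLASTP 2.0.10 [Aug-26-1999]\n")
consumer = TinyConsumer()
TinyScanner().feed(report, consumer)
print(consumer.events)
```

Because the consumer owns all the stored state, it is also the natural place to look when data from one report leaks into the next.
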
> def _scan_header(self, uhandle, consumer):
> # BLASTP 2.0.10 [Aug-26-1999]
> #
> #
> # Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaf
> # Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
> # "Gapped BLAST and PSI-BLAST: a new generation of protein database sea
> # programs", Nucleic Acids Res. 25:3389-3402.
> #
> # Query= test
> # (140 letters)
> #
> # Database: sdqib40-1.35.seg.fa
> # 1323 sequences; 223,339 total letters
> #
>
> consumer.start_header()
>
> read_and_call(uhandle, consumer.version, contains='BLAST')
> read_and_call_while(uhandle, consumer.noevent, blank=1)
>
> # Read the reference lines and the following blank line.
> # There might be a <pre> line, for qblast output.
> attempt_read_and_call(uhandle, consumer.noevent, start="<pre>")
> read_and_call(uhandle, consumer.reference, start='Reference')
> while 1:
> line = uhandle.readline()
> if is_blank_line(line) or line.startswith("RID"):
> consumer.noevent(line)
> read_and_call_while(uhandle, consumer.noevent, blank=1)
> break
> consumer.reference(line)
>
> # blastpgp has a Reference for composition-based statistics.
> if attempt_read_and_call(
> uhandle, consumer.reference, start="Reference"):
> read_and_call_until(uhandle, consumer.reference, blank=1)
> read_and_call_while(uhandle, consumer.noevent, blank=1)
>
> # Read the Query lines and the following blank line.
> read_and_call(uhandle, consumer.query_info, start='Query=')
> read_and_call_until(uhandle, consumer.query_info, blank=1)
> read_and_call_while(uhandle, consumer.noevent, blank=1)
>
> # Read the database lines and the following blank line.
> read_and_call_until(uhandle, consumer.database_info, end='total letters')
> read_and_call(uhandle, consumer.database_info, contains='sequences')
> read_and_call_while(uhandle, consumer.noevent, blank=1)
>
> consumer.end_header()
>
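The header scan leans on helpers that read a line, test it, and push it back if it does not match. Below is a simplified sketch of that idea. TinyUndoHandle and tiny_attempt_read_and_call are hypothetical stand-ins and make no claim to reproduce File.UndoHandle or Bio.ParserSupport exactly.

```python
import io


class TinyUndoHandle:
    """Simplified stand-in for an undo handle: lines can be pushed back."""

    def __init__(self, handle):
        self._handle = handle
        self._saved = []

    def readline(self):
        # Serve pushed-back lines first, most recent first.
        if self._saved:
            return self._saved.pop()
        return self._handle.readline()

    def saveline(self, line):
        self._saved.append(line)

    def peekline(self):
        line = self.readline()
        self.saveline(line)
        return line


def tiny_attempt_read_and_call(uhandle, method, start):
    """Call method(line) if the next line starts with `start`.

    Returns True on a match; otherwise pushes the line back and
    returns False, so the caller can try a different pattern.
    """
    line = uhandle.readline()
    if line.startswith(start):
        method(line)
        return True
    uhandle.saveline(line)
    return False


seen = []
uhandle = TinyUndoHandle(io.StringIO("Reference: Altschul\nQuery= test\n"))
found_marker = tiny_attempt_read_and_call(uhandle, seen.append, start="ALIGNMENTS")
found_reference = tiny_attempt_read_and_call(uhandle, seen.append, start="Reference")
next_line = uhandle.peekline()
print(found_marker, found_reference, seen, next_line)
```

One thing to keep in mind with prefix tests of this kind: str.startswith("") is true for every string, so an empty prefix would accept whatever line happens to come next.
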
> def _scan_rounds(self, uhandle, consumer):
> # Scan a bunch of rounds.
> # Each round begins with either a "Searching......" line
> # or a 'Score E' line followed by descriptions and alignments.
> # The email server doesn't give the "Searching....." line.
> # If there is no 'Searching.....' line then you'll first see a
> # 'Results from round' line
>
> while 1:
> line = safe_peekline(uhandle)
> if (not line.startswith('Searching') and
> not line.startswith('Results from round') and
> re.search(r"Score +E", line) is None and
> line.find('No hits found') == -1):
> break
>
> self._scan_descriptions(uhandle, consumer)
> self._scan_alignments(uhandle, consumer)
>
> def _scan_descriptions(self, uhandle, consumer):
> # Searching..................................................done
> # Results from round 2
> #
> #
> # Sc
> # Sequences producing significant alignments: (b
> # Sequences used in model and found again:
> #
> # d1tde_2 3.4.1.4.4 (119-244) Thioredoxin reductase [Escherichia ...
> # d1tcob_ 1.31.1.5.16 Calcineurin regulatory subunit (B-chain) [B...
> # d1symb_ 1.31.1.2.2 Calcyclin (S100) [RAT (RATTUS NORVEGICUS)]
> #
> # Sequences not found previously or not previously below threshold:
> #
> # d1osa__ 1.31.1.5.11 Calmodulin [Paramecium tetraurelia]
> # d1aoza3 2.5.1.3.3 (339-552) Ascorbate oxidase [zucchini (Cucurb...
> #
>
> # If PSI-BLAST, may also have:
> #
> # CONVERGED!
>
> consumer.start_descriptions()
>
> # Read 'Searching'
> # This line seems to be missing in BLASTN 2.1.2 (others?)
> attempt_read_and_call(uhandle, consumer.noevent, start='Searching')
>
> # blastpgp 2.0.10 from NCBI 9/19/99 for Solaris sometimes crashes here.
> # If this happens, the handle will yield no more information.
> if not uhandle.peekline():
> raise SyntaxError, "Unexpected end of blast report. " + \
> "Looks suspiciously like a PSI-BLAST crash."
>
> # BLASTN 2.2.3 sometimes spews a bunch of warnings and errors here:
> # Searching[blastall] WARNING: [000.000] AT1G08320: SetUpBlastSearch
> # [blastall] ERROR: [000.000] AT1G08320: Blast:
> # [blastall] ERROR: [000.000] AT1G08320: Blast: Query must be at leas
> # done
> # Reported by David Weisman.
> # Check for these error lines and ignore them for now. Let
> # the BlastErrorParser deal with them.
> line = uhandle.peekline()
> if line.find("ERROR:") != -1 or line.startswith("done"):
> read_and_call_while(uhandle, consumer.noevent, contains="ERROR:")
> read_and_call(uhandle, consumer.noevent, start="done")
>
> # Check to see if this is PSI-BLAST.
> # If it is, the 'Searching' line will be followed by:
> # (version 2.0.10)
> # Searching.............................
> # Results from round 2
> # or (version 2.0.11)
> # Searching.............................
> #
> #
> # Results from round 2
>
> # Skip a bunch of blank lines.
> read_and_call_while(uhandle, consumer.noevent, blank=1)
> # Check for the results line if it's there.
> if attempt_read_and_call(uhandle, consumer.round, start='Results'):
> read_and_call_while(uhandle, consumer.noevent, blank=1)
>
> # Three things can happen here:
> # 1. line contains 'Score E'
> # 2. line contains "No hits found"
> # 3. no descriptions
> # The first one begins a bunch of descriptions. The last two
> # indicate that no descriptions follow, and we should go straight
> # to the alignments.
> if not attempt_read_and_call(
> uhandle, consumer.description_header,
> has_re=re.compile(r'Score +E')):
> # Either case 2 or 3. Look for "No hits found".
> attempt_read_and_call(uhandle, consumer.no_hits,
> contains='No hits found')
> read_and_call_while(uhandle, consumer.noevent, blank=1)
>
> # PSI-BLAST can repeat the Searching...No hits found section
> if attempt_read_and_call(uhandle, consumer.noevent,
> start='Searching'):
> read_and_call_while(uhandle, consumer.noevent, blank=1)
> read_and_call(uhandle, consumer.noevent,
> contains='No hits found')
> read_and_call_while(uhandle, consumer.noevent, blank=1)
>
> consumer.end_descriptions()
> # Stop processing.
> return
>
> # Read the score header lines
> read_and_call(uhandle, consumer.description_header,
> start='Sequences producing')
>
> # If PSI-BLAST, read the 'Sequences used in model' line.
> attempt_read_and_call(uhandle, consumer.model_sequences,
> start='Sequences used in model')
> read_and_call_while(uhandle, consumer.noevent, blank=1)
>
> # Read the descriptions and the following blank lines, making
> # sure that there are descriptions.
> if not uhandle.peekline().startswith('Sequences not found'):
> read_and_call_until(uhandle, consumer.description, blank=1)
> read_and_call_while(uhandle, consumer.noevent, blank=1)
>
> # If PSI-BLAST, read the 'Sequences not found' line followed
> # by more descriptions. However, I need to watch out for the
> # case where there were no sequences not found previously, in
> # which case there will be no more descriptions.
> if attempt_read_and_call(uhandle, consumer.nonmodel_sequences,
> start='Sequences not found'):
> # Read the descriptions and the following blank lines.
> read_and_call_while(uhandle, consumer.noevent, blank=1)
> l = safe_peekline(uhandle)
> # Brad -- added check for QUERY. On some PSI-BLAST outputs
> # there will be a 'Sequences not found' line followed by no
> # descriptions. Check for this case since the first thing you'll
> # get is a blank line and then 'QUERY'
> if not l.startswith('CONVERGED') and l[0] != '>' \
> and not l.startswith('QUERY'):
> read_and_call_until(uhandle, consumer.description, blank=1)
> read_and_call_while(uhandle, consumer.noevent, blank=1)
>
> attempt_read_and_call(uhandle, consumer.converged, start='CONVERGED')
> read_and_call_while(uhandle, consumer.noevent, blank=1)
>
> consumer.end_descriptions()
>
> def _scan_alignments(self, uhandle, consumer):
> # qblast inserts a helpful line here.
> attempt_read_and_call(uhandle, consumer.noevent, start="ALIGNMENTS")
>
> # First, check to see if I'm at the database report.
> line = safe_peekline(uhandle)
> if line.startswith(' Database'):
> return
> elif line[0] == '>':
> # XXX make a better check here between pairwise and masterslave
> self._scan_pairwise_alignments(uhandle, consumer)
> else:
> # XXX put in a check to make sure I'm in a masterslave alignment
> self._scan_masterslave_alignment(uhandle, consumer)
>
> def _scan_pairwise_alignments(self, uhandle, consumer):
> while 1:
> line = safe_peekline(uhandle)
> if line[0] != '>':
> break
> self._scan_one_pairwise_alignment(uhandle, consumer)
>
> def _scan_one_pairwise_alignment(self, uhandle, consumer):
> consumer.start_alignment()
>
> self._scan_alignment_header(uhandle, consumer)
>
> # Scan a bunch of score/alignment pairs.
> while 1:
> line = safe_peekline(uhandle)
> if not line.startswith(' Score'):
> break
> self._scan_hsp(uhandle, consumer)
> consumer.end_alignment()
>
> def _scan_alignment_header(self, uhandle, consumer):
> # >d1rip__ 2.24.7.1.1 Ribosomal S17 protein [Bacillus
> # stearothermophilus]
> # Length = 81
> #
> read_and_call(uhandle, consumer.title, start='>')
> while 1:
> line = safe_readline(uhandle)
> if line.lstrip().startswith('Length ='):
> consumer.length(line)
> break
> elif is_blank_line(line):
> # Check to make sure I haven't missed the Length line
> raise SyntaxError, "I missed the Length in an alignment header"
> consumer.title(line)
>
> # Older versions of BLAST will have a line with some spaces.
> # Version 2.0.14 (maybe 2.0.13?) and above print a true blank line.
> if not attempt_read_and_call(uhandle, consumer.noevent,
> start=' '):
> read_and_call(uhandle, consumer.noevent, blank=1)
>
> def _scan_hsp(self, uhandle, consumer):
> consumer.start_hsp()
> self._scan_hsp_header(uhandle, consumer)
> self._scan_hsp_alignment(uhandle, consumer)
> consumer.end_hsp()
>
> def _scan_hsp_header(self, uhandle, consumer):
> # Score = 22.7 bits (47), Expect = 2.5
> # Identities = 10/36 (27%), Positives = 18/36 (49%)
> # Strand = Plus / Plus
> # Frame = +3
> #
>
> read_and_call(uhandle, consumer.score, start=' Score')
> read_and_call(uhandle, consumer.identities, start=' Identities')
> # BLASTN
> attempt_read_and_call(uhandle, consumer.strand, start = ' Strand')
> # BLASTX, TBLASTN, TBLASTX
> attempt_read_and_call(uhandle, consumer.frame, start = ' Frame')
> read_and_call(uhandle, consumer.noevent, blank=1)
>
> def _scan_hsp_alignment(self, uhandle, consumer):
> # Query: 11 GRGVSACA-------TCDGFFYRNQKVAVIGGGNTAVEEALYLSNIASEVHLIHRRDGF
> # GRGVS+ TC Y + + V GGG+ + EE L + I R+
> # Sbjct: 12 GRGVSSVVRRCIHKPTCKE--YAVKIIDVTGGGSFSAEEVQELREATLKEVDILRKVSG
> #
> # Query: 64 AEKILIKR 71
> # I +K
> # Sbjct: 70 PNIIQLKD 77
> #
>
> while 1:
> # Blastn adds an extra line filled with spaces before Query
> attempt_read_and_call(uhandle, consumer.noevent, start=' ')
> read_and_call(uhandle, consumer.query, start='Query')
> read_and_call(uhandle, consumer.align, start=' ')
> read_and_call(uhandle, consumer.sbjct, start='Sbjct')
> read_and_call_while(uhandle, consumer.noevent, blank=1)
> line = safe_peekline(uhandle)
> # Alignment continues if I see a 'Query' or the spaces for Blastn.
> if not (line.startswith('Query') or line.startswith(' ')):
> break
>
> def _scan_masterslave_alignment(self, uhandle, consumer):
> consumer.start_alignment()
> while 1:
> line = safe_readline(uhandle)
> # Check to see whether I'm finished reading the alignment.
> # This is indicated by 1) database section, 2) next psi-blast
> # round, which can also be a 'Results from round' if no
> # searching line is present
> # patch by chapmanb
> if line.startswith('Searching') or \
> line.startswith('Results from round'):
> uhandle.saveline(line)
> break
> elif line.startswith(' Database'):
> uhandle.saveline(line)
> break
> elif is_blank_line(line):
> consumer.noevent(line)
> else:
> consumer.multalign(line)
> read_and_call_while(uhandle, consumer.noevent, blank=1)
> consumer.end_alignment()
>
> def _scan_database_report(self, uhandle, consumer):
> # Database: sdqib40-1.35.seg.fa
> # Posted date: Nov 1, 1999 4:25 PM
> # Number of letters in database: 223,339
> # Number of sequences in database: 1323
> #
> # Lambda K H
> # 0.322 0.133 0.369
> #
> # Gapped
> # Lambda K H
> # 0.270 0.0470 0.230
> #
>
> consumer.start_database_report()
>
> # Subset of the database(s) listed below
> # Number of letters searched: 562,618,960
> # Number of sequences searched: 228,924
> if attempt_read_and_call(uhandle, consumer.noevent, start=" Subset"):
> read_and_call(uhandle, consumer.noevent, contains="letters")
> read_and_call(uhandle, consumer.noevent, contains="sequences")
> read_and_call(uhandle, consumer.noevent, start=" ")
>
> # Sameet Mehta reported seeing output from BLASTN 2.2.9 that
> # was missing the "Database" stanza completely.
> while attempt_read_and_call(uhandle, consumer.database,
> start=' Database'):
> # BLAT output ends abruptly here, without any of the other
> # information. Check to see if this is the case. If so,
> # then end the database report here gracefully.
> if not uhandle.peekline():
> consumer.end_database_report()
> return
>
> # Database can span multiple lines.
> read_and_call_until(uhandle, consumer.database, start=' Posted')
> read_and_call(uhandle, consumer.posted_date, start=' Posted')
> read_and_call(uhandle, consumer.num_letters_in_database,
> start=' Number of letters')
> read_and_call(uhandle, consumer.num_sequences_in_database,
> start=' Number of sequences')
> read_and_call(uhandle, consumer.noevent, start=' ')
>
> line = safe_readline(uhandle)
> uhandle.saveline(line)
> if line.find('Lambda') != -1:
> break
>
> read_and_call(uhandle, consumer.noevent, start='Lambda')
> read_and_call(uhandle, consumer.ka_params)
> read_and_call(uhandle, consumer.noevent, blank=1)
>
> # not BLASTP
> attempt_read_and_call(uhandle, consumer.gapped, start='Gapped')
> # not TBLASTX
> if attempt_read_and_call(uhandle, consumer.noevent, start='Lambda'):
> read_and_call(uhandle, consumer.ka_params_gap)
>
> # Blast 2.2.4 can sometimes skip the whole parameter section.
> # Thus, I need to be careful not to read past the end of the
> # file.
> try:
> read_and_call_while(uhandle, consumer.noevent, blank=1)
> except SyntaxError, x:
> if str(x) != "Unexpected end of stream.":
> raise
> consumer.end_database_report()
>
> def _scan_parameters(self, uhandle, consumer):
> # Matrix: BLOSUM62
> # Gap Penalties: Existence: 11, Extension: 1
> # Number of Hits to DB: 50604
> # Number of Sequences: 1323
> # Number of extensions: 1526
> # Number of successful extensions: 6
> # Number of sequences better than 10.0: 5
> # Number of HSP's better than 10.0 without gapping: 5
> # Number of HSP's successfully gapped in prelim test: 0
> # Number of HSP's that attempted gapping in prelim test: 1
> # Number of HSP's gapped (non-prelim): 5
> # length of query: 140
> # length of database: 223,339
> # effective HSP length: 39
> # effective length of query: 101
> # effective length of database: 171,742
> # effective search space: 17345942
> # effective search space used: 17345942
> # T: 11
> # A: 40
> # X1: 16 ( 7.4 bits)
> # X2: 38 (14.8 bits)
> # X3: 64 (24.9 bits)
> # S1: 41 (21.9 bits)
> # S2: 42 (20.8 bits)
>
> # Blast 2.2.4 can sometimes skip the whole parameter section.
> # Thus, check to make sure that the parameter section really
> # exists.
> if not uhandle.peekline():
> return
>
> # BLASTN 2.2.9 looks like it reverses the "Number of Hits" and
> # "Number of Sequences" lines.
> consumer.start_parameters()
>
> # Matrix line may be missing in BLASTN 2.2.9
> attempt_read_and_call(uhandle, consumer.matrix, start='Matrix')
> # not TBLASTX
> attempt_read_and_call(uhandle, consumer.gap_penalties, start='Gap')
>
> attempt_read_and_call(uhandle, consumer.num_sequences,
> start='Number of Sequences')
> read_and_call(uhandle, consumer.num_hits,
> start='Number of Hits')
> attempt_read_and_call(uhandle, consumer.num_sequences,
> start='Number of Sequences')
> read_and_call(uhandle, consumer.num_extends,
> start='Number of extensions')
> read_and_call(uhandle, consumer.num_good_extends,
> start='Number of successful')
>
> read_and_call(uhandle, consumer.num_seqs_better_e,
> start='Number of sequences')
>
> # not BLASTN, TBLASTX
> if attempt_read_and_call(uhandle, consumer.hsps_no_gap,
> start="Number of HSP's better"):
> # BLASTN 2.2.9
> if attempt_read_and_call(uhandle, consumer.noevent,
> start="Number of HSP's gapped:"):
> read_and_call(uhandle, consumer.noevent,
> start="Number of HSP's successfully")
> read_and_call(uhandle, consumer.noevent,
> start="Number of extra gapped extensions")
> else:
> read_and_call(uhandle, consumer.hsps_prelim_gapped,
> start="Number of HSP's successfully")
> read_and_call(uhandle, consumer.hsps_prelim_gap_attempted,
> start="Number of HSP's that")
> read_and_call(uhandle, consumer.hsps_gapped,
> start="Number of HSP's gapped")
> # not in blastx 2.2.1
> attempt_read_and_call(uhandle, consumer.query_length,
> has_re=re.compile(r"[Ll]ength of query"))
> read_and_call(uhandle, consumer.database_length,
> has_re=re.compile(r"[Ll]ength of \s*[Dd]atabase"))
>
> # BLASTN 2.2.9
> attempt_read_and_call(uhandle, consumer.noevent,
> start="Length adjustment")
> attempt_read_and_call(uhandle, consumer.effective_hsp_length,
> start='effective HSP')
> # Not in blastx 2.2.1
> attempt_read_and_call(
> uhandle, consumer.effective_query_length,
> has_re=re.compile(r'[Ee]ffective length of query'))
> read_and_call(
> uhandle, consumer.effective_database_length,
> has_re=re.compile(r'[Ee]ffective length of \s*[Dd]atabase'))
> # Not in blastx 2.2.1, added a ':' to distinguish between
> # this and the 'effective search space used' line
> attempt_read_and_call(
> uhandle, consumer.effective_search_space,
> has_re=re.compile(r'[Ee]ffective search space:'))
> # Does not appear in BLASTP 2.0.5
> attempt_read_and_call(
> uhandle, consumer.effective_search_space_used,
> has_re=re.compile(r'[Ee]ffective search space used'))
>
> # BLASTX, TBLASTN, TBLASTX
> attempt_read_and_call(uhandle, consumer.frameshift, start='frameshift')
> # not in BLASTN 2.2.9
> attempt_read_and_call(uhandle, consumer.threshold, start='T')
> read_and_call(uhandle, consumer.window_size, start='A')
> read_and_call(uhandle, consumer.dropoff_1st_pass, start='X1')
> read_and_call(uhandle, consumer.gap_x_dropoff, start='X2')
> # not BLASTN, TBLASTX
> attempt_read_and_call(uhandle, consumer.gap_x_dropoff_final,
> start='X3')
> read_and_call(uhandle, consumer.gap_trigger, start='S1')
> # not in blastx 2.2.1
> # first we make sure we have additional lines to work with, if
> # not then the file is done and we don't have a final S2
> if not is_blank_line(uhandle.peekline(), allow_spaces=1):
> read_and_call(uhandle, consumer.blast_cutoff, start='S2')
>
> consumer.end_parameters()
>
> class BlastParser(AbstractParser):
> """Parses BLAST data into a Record.Blast object.
>
> """
> def __init__(self):
> """__init__(self)"""
> self._scanner = _Scanner()
> self._consumer = _BlastConsumer()
>
> def parse(self, handle):
> """parse(self, handle)"""
> self._scanner.feed(handle, self._consumer)
> return self._consumer.data
>
> class PSIBlastParser(AbstractParser):
> """Parses BLAST data into a Record.PSIBlast object.
>
> """
> def __init__(self):
> """__init__(self)"""
> self._scanner = _Scanner()
> self._consumer = _PSIBlastConsumer()
>
> def parse(self, handle):
> """parse(self, handle)"""
> self._scanner.feed(handle, self._consumer)
> return self._consumer.data
>
> class _HeaderConsumer:
> def start_header(self):
> self._header = Record.Header()
>
> def version(self, line):
> c = line.split()
> self._header.application = c[0]
> self._header.version = c[1]
> self._header.date = c[2][1:-1]
>
> def reference(self, line):
> if line.startswith('Reference: '):
> self._header.reference = line[11:]
> else:
> self._header.reference = self._header.reference + line
>
> def query_info(self, line):
> if line.startswith('Query= '):
> self._header.query = line[7:]
> elif not line.startswith(' '): # continuation of query_info
> self._header.query = "%s%s" % (self._header.query, line)
> else:
> letters, = _re_search(
> r"([0-9,]+) letters", line,
> "I could not find the number of letters in line\n%s" % line)
> self._header.query_letters = _safe_int(letters)
>
> def database_info(self, line):
> line = line.rstrip()
> if line.startswith('Database: '):
> self._header.database = line[10:]
> elif not line.endswith('total letters'):
> self._header.database = self._header.database + line.strip()
> else:
> sequences, letters =_re_search(
> r"([0-9,]+) sequences; ([0-9,-]+) total letters", line,
> "I could not find the sequences and letters in line\n%s" %line)
> self._header.database_sequences = _safe_int(sequences)
> self._header.database_letters = _safe_int(letters)
>
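As a quick check of the database_info pattern, here is a small sketch that applies the same regular expression to the sample line from the header comment. to_int is a hypothetical helper for this example. The module's own _safe_int is not shown in this part of the file, so its exact behaviour is not assumed.

```python
import re


def to_int(text):
    """Hypothetical helper: '223,339' -> 223339 (drops thousands separators)."""
    return int(text.replace(",", ""))


line = "           1323 sequences; 223,339 total letters"
match = re.search(r"([0-9,]+) sequences; ([0-9,-]+) total letters", line)
if match is None:
    raise ValueError("I could not find the sequences and letters in line\n%s" % line)

sequences, letters = (to_int(group) for group in match.groups())
print(sequences, letters)
```
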
> def end_header(self):
> # Get rid of the trailing newlines
> self._header.reference = self._header.reference.rstrip()
> self._header.query = self._header.query.rstrip()
>
> class _DescriptionConsumer:
> def start_descriptions(self):
> self._descriptions = []
> self._model_sequences = []
> self._nonmodel_sequences = []
> self._converged = 0
> self._type = None
> self._roundnum = None
>
> self.__has_n = 0 # Does the description line contain an N value?
>
> def description_header(self, line):
> if line.startswith('Sequences producing'):
> cols = line.split()
> if cols[-1] == 'N':
> self.__has_n = 1
>
> def description(self, line):
> dh = self._parse(line)
> if self._type == 'model':
> self._model_sequences.append(dh)
> elif self._type == 'nonmodel':
> self._nonmodel_sequences.append(dh)
> else:
> self._descriptions.append(dh)
>
> def model_sequences(self, line):
> self._type = 'model'
>
> def nonmodel_sequences(self, line):
> self._type = 'nonmodel'
>
> def converged(self, line):
> self._converged = 1
>
> def no_hits(self, line):
> pass
>
> def round(self, line):
> if not line.startswith('Results from round'):
> raise SyntaxError, "I didn't understand the round line\n%s" % line
> self._roundnum = _safe_int(line[18:].strip())
>
> def end_descriptions(self):
> pass
>
> def _parse(self, description_line):
> line = description_line # for convenience
> dh = Record.Description()
>
> # I need to separate the score and p-value from the title.
> # sp|P21297|FLBT_CAUCR FLBT PROTEIN [snip] 284 7e-77
> # sp|P21297|FLBT_CAUCR FLBT PROTEIN [snip] 284 7e-77 1
> # special cases to handle:
> # - title must be preserved exactly (including whitespaces)
> # - score could be equal to e-value (not likely, but what if??)
> # - sometimes there's an "N" score of '1'.
> cols = line.split()
> if len(cols) < 3:
> raise SyntaxError, \
> "Line does not appear to contain description:\n%s" % line
> if self.__has_n:
> i = line.rfind(cols[-1]) # find start of N
> i = line.rfind(cols[-2], 0, i) # find start of p-value
> i = line.rfind(cols[-3], 0, i) # find start of score
> else:
> i = line.rfind(cols[-1]) # find start of p-value
> i = line.rfind(cols[-2], 0, i) # find start of score
> if self.__has_n:
> dh.title, dh.score, dh.e, dh.num_alignments = \
> line[:i].rstrip(), cols[-3], cols[-2], cols[-1]
> else:
> dh.title, dh.score, dh.e, dh.num_alignments = \
> line[:i].rstrip(), cols[-2], cols[-1], 1
> dh.num_alignments = _safe_int(dh.num_alignments)
> dh.score = _safe_int(dh.score)
> dh.e = _safe_float(dh.e)
> return dh
>
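The right-to-left search in _parse is easy to misread, so here is a standalone sketch of the same idea. split_description is a hypothetical function written for this note. It returns a plain tuple instead of a Record.Description and raises ValueError rather than SyntaxError. Searching from the right means a number that also appears inside the title cannot be mistaken for the score column.

```python
def split_description(line, has_n=False):
    """Split a description line into (title, score, e_value, num_alignments).

    The trailing columns are located from the right so that the title,
    including its internal whitespace, is preserved as written.
    """
    cols = line.split()
    if len(cols) < 3:
        raise ValueError("Line does not appear to contain description:\n%s" % line)

    n_fields = 3 if has_n else 2
    i = len(line)
    # Walk backwards: N (if present), then e-value, then score.
    for col in reversed(cols[-n_fields:]):
        i = line.rfind(col, 0, i)

    title = line[:i].rstrip()
    if has_n:
        score, e_value, num = cols[-3], cols[-2], cols[-1]
    else:
        score, e_value, num = cols[-2], cols[-1], "1"
    return title, int(score), float(e_value), int(num)


plain = split_description("sp|P21297|FLBT_CAUCR FLBT PROTEIN    284  7e-77")
with_n = split_description("sp|P21297|FLBT_CAUCR FLBT PROTEIN    284  7e-77  1",
                           has_n=True)
print(plain)
print(with_n)
```
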
> class _AlignmentConsumer:
> # This is a little bit tricky. An alignment can either be a
> # pairwise alignment or a multiple alignment. Since it's difficult
> # to know a priori which one the blast record will contain, I'm going
> # to make one class that can parse both of them.
> def start_alignment(self):
> self._alignment = Record.Alignment()
> self._multiple_alignment = Record.MultipleAlignment()
>
> def title(self, line):
> self._alignment.title = "%s%s" % (self._alignment.title,
> line.lstrip())
>
> def length(self, line):
> self._alignment.length = line.split()[2]
> self._alignment.length = _safe_int(self._alignment.length)
>
> def multalign(self, line):
> # Standalone version uses 'QUERY', while WWW version uses blast_tmp.
> if line.startswith('QUERY') or line.startswith('blast_tmp'):
> # If this is the first line of the multiple alignment,
> # then I need to figure out how the line is formatted.
>
> # Format of line is:
> # QUERY 1 acttg...gccagaggtggtttattcagtctccataagagaggggacaaacg 60
> try:
> name, start, seq, end = line.split()
> except ValueError:
> raise SyntaxError, "I do not understand the line\n%s" \
> % line
> self._start_index = line.index(start, len(name))
> self._seq_index = line.index(seq,
> self._start_index+len(start))
> # subtract 1 for the space
> self._name_length = self._start_index - 1
> self._start_length = self._seq_index - self._start_index - 1
> self._seq_length = line.rfind(end) - self._seq_index - 1
>
> #self._seq_index = line.index(seq)
> ## subtract 1 for the space
> #self._seq_length = line.rfind(end) - self._seq_index - 1
> #self._start_index = line.index(start)
> #self._start_length = self._seq_index - self._start_index - 1
> #self._name_length = self._start_index
>
> # Extract the information from the line
> name = line[:self._name_length]
> name = name.rstrip()
> start = line[self._start_index:self._start_index+self._start_length]
> start = start.rstrip()
> if start:
> start = _safe_int(start)
> end = line[self._seq_index+self._seq_length:].rstrip()
> if end:
> end = _safe_int(end)
> seq = line[self._seq_index:self._seq_index+self._seq_length].rstrip()
> # right pad the sequence with spaces if necessary
> if len(seq) < self._seq_length:
> seq = seq + ' '*(self._seq_length-len(seq))
>
> # I need to make sure the sequence is aligned correctly with the query.
> # First, I will find the length of the query. Then, if necessary,
> # I will pad my current sequence with spaces so that they will line
> # up correctly.
>
> # Two possible things can happen:
> # QUERY
> # 504
> #
> # QUERY
> # 403
> #
> # Sequence 504 will need padding at the end. Since I won't know
> # this until the end of the alignment, this will be handled in
> # end_alignment.
> # Sequence 403 will need padding before being added to the alignment.
>
> align = self._multiple_alignment.alignment # for convenience
> align.append((name, start, seq, end))
>
> # This is old code that tried to line up all the sequences
> # in a multiple alignment by using the sequence titles as
> # identifiers. The problem with this is that BLAST assigns
> # different HSPs from the same sequence the same id. Thus,
> # in one alignment block, there may be multiple sequences with
> # the same id. I'm not sure how to handle this, so I'm not
> # going to.
>
> # # If the sequence is the query, then just add it.
> # if name == 'QUERY':
> # if len(align) == 0:
> # align.append((name, start, seq))
> # else:
> # aname, astart, aseq = align[0]
> # if name != aname:
> # raise SyntaxError, "Query is not the first sequence"
> # aseq = aseq + seq
> # align[0] = aname, astart, aseq
> # else:
> # if len(align) == 0:
> # raise SyntaxError, "I could not find the query sequence"
> # qname, qstart, qseq = align[0]
> #
> # # Now find my sequence in the multiple alignment.
> # for i in range(1, len(align)):
> # aname, astart, aseq = align[i]
> # if name == aname:
> # index = i
> # break
> # else:
> # # If I couldn't find it, then add a new one.
> # align.append((None, None, None))
> # index = len(align)-1
> # # Make sure to left-pad it.
> # aname, astart, aseq = name, start, ' '*(len(qseq)-len(seq))
> #
> # if len(qseq) != len(aseq) + len(seq):
> # # If my sequences are shorter than the query sequence,
> # # then I will need to pad some spaces to make them line up.
> # # Since I've already right padded seq, that means aseq
> # # must be too short.
> # aseq = aseq + ' '*(len(qseq)-len(aseq)-len(seq))
> # aseq = aseq + seq
> # if astart is None:
> # astart = start
> # align[index] = aname, astart, aseq
>
> def end_alignment(self):
> # Remove trailing newlines
> if self._alignment:
> self._alignment.title = self._alignment.title.rstrip()
>
> # This code is also obsolete. See note above.
> # If there's a multiple alignment, I will need to make sure
> # all the sequences are aligned. That is, I may need to
> # right-pad the sequences.
> # if self._multiple_alignment is not None:
> # align = self._multiple_alignment.alignment
> # seqlen = None
> # for i in range(len(align)):
> # name, start, seq = align[i]
> # if seqlen is None:
> # seqlen = len(seq)
> # else:
> # if len(seq) < seqlen:
> # seq = seq + ' '*(seqlen - len(seq))
> # align[i] = name, start, seq
> # elif len(seq) > seqlen:
> # raise SyntaxError, \
> # "Sequence %s is longer than the query" % name
>
> # Clean up some variables, if they exist.
> try:
> del self._seq_index
> del self._seq_length
> del self._start_index
> del self._start_length
> del self._name_length
> except AttributeError:
> pass
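
The column arithmetic in multalign is easier to follow on a concrete line. Below is a minimal standalone sketch of the same idea, written for Python 3 rather than the Python 2 of the quoted module. The function names column_layout and slice_line and the sample lines are made up for illustration and are not part of NCBIStandalone.py.

```python
def column_layout(first_line):
    """Derive fixed column offsets from the first line of an alignment block."""
    name, start, seq, end = first_line.split()
    # Search from the end of the previous field so that a digit inside the
    # name cannot be mistaken for the start coordinate.
    start_index = first_line.index(start, len(name))
    seq_index = first_line.index(seq, start_index + len(start))
    # Subtract 1 for the space between the sequence and the end coordinate.
    seq_length = first_line.rfind(end) - seq_index - 1
    return start_index, seq_index, seq_length


def slice_line(line, layout):
    """Cut a later line of the same block at the offsets found above."""
    start_index, seq_index, seq_length = layout
    name = line[:start_index - 1].rstrip()
    start = line[start_index:seq_index - 1].strip()
    # Right-pad short sequences so every row has the same width.
    seq = line[seq_index:seq_index + seq_length].rstrip().ljust(seq_length)
    end = line[seq_index + seq_length:].strip()
    return name, start, seq, end


# Hypothetical two-line block: the query line fixes the layout,
# the hit line is sliced at the same columns.
layout = column_layout("QUERY   1   acttggcc 8")
row = slice_line("gi123   3   --ttgg   6", layout)
```

Slicing by fixed offsets, rather than calling split() on every line, is what lets a hit line with leading or trailing blanks in the sequence column keep its alignment with the query.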
>
> class _HSPConsumer:
> def start_hsp(self):
> self._hsp = Record.HSP()
>
> def score(self, line):
> self._hsp.bits, self._hsp.score = _re_search(
> r"Score =\s*([0-9.e+]+) bits \(([0-9]+)\)", line,
> "I could not find the score in line\n%s" % line)
> self._hsp.score = _safe_float(self._hsp.score)
> self._hsp.bits = _safe_float(self._hsp.bits)
>
> x, y = _re_search(
> r"Expect\(?(\d*)\)? = +([0-9.e\-|\+]+)", line,
> "I could not find the expect in line\n%s" % line)
> if x:
> self._hsp.num_alignments = _safe_int(x)
> else:
> self._hsp.num_alignments = 1
> self._hsp.expect = _safe_float(y)
>
> def identities(self, line):
> x, y = _re_search(
> r"Identities = (\d+)\/(\d+)", line,
> "I could not find the identities in line\n%s" % line)
> self._hsp.identities = _safe_int(x), _safe_int(y)
>
> if line.find('Positives') != -1:
> x, y = _re_search(
> r"Positives = (\d+)\/(\d+)", line,
> "I could not find the positives in line\n%s" % line)
> self._hsp.positives = _safe_int(x), _safe_int(y)
>
> if line.find('Gaps') != -1:
> x, y = _re_search(
> r"Gaps = (\d+)\/(\d+)", line,
> "I could not find the gaps in line\n%s" % line)
> self._hsp.gaps = _safe_int(x), _safe_int(y)
>
>
> def strand(self, line):
> self._hsp.strand = _re_search(
> r"Strand = (\w+) / (\w+)", line,
> "I could not find the strand in line\n%s" % line)
>
> def frame(self, line):
> # Frame can be in formats:
> # Frame = +1
> # Frame = +2 / +2
> if line.find('/') != -1:
> self._hsp.frame = _re_search(
> r"Frame = ([-+][123]) / ([-+][123])", line,
> "I could not find the frame in line\n%s" % line)
> else:
> self._hsp.frame = _re_search(
> r"Frame = ([-+][123])", line,
> "I could not find the frame in line\n%s" % line)
>
> # Match a space, if one is available. Masahir Ishikawa found a
> # case where there's no space between the start and the sequence:
> # Query: 100tt 101
> # line below modified by Yair Benita, Sep 2004
> _query_re = re.compile(r"Query: \s*(\d+)\s*(.+) (\d+)")
> def query(self, line):
> m = self._query_re.search(line)
> if m is None:
> raise SyntaxError, "I could not find the query in line\n%s" % line
>
> # line below modified by Yair Benita, Sep 2004.
> # added the end attribute for the query
> start, seq, end = m.groups()
> self._hsp.query = self._hsp.query + seq
> if self._hsp.query_start is None:
> self._hsp.query_start = _safe_int(start)
>
> # line below added by Yair Benita, Sep 2004.
> # added the end attribute for the query
> self._hsp.query_end = _safe_int(end)
> self._query_start_index = m.start(2)
> self._query_len = len(seq)
>
> def align(self, line):
> seq = line[self._query_start_index:].rstrip()
> if len(seq) < self._query_len:
> # Make sure the alignment is the same length as the query
> seq = seq + ' ' * (self._query_len-len(seq))
> elif len(seq) > self._query_len:
> raise SyntaxError, "Match is longer than the query in line\n%s" % \
> line
> self._hsp.match = self._hsp.match + seq
>
> def sbjct(self, line):
> # line below modified by Yair Benita, Sep 2004
> # added the end group and the -? to allow parsing
> # of BLAT output in BLAST format.
> start, seq, end = _re_search(
> r"Sbjct: (-?\d+)\s*(.+) (-?\d+)", line,
> "I could not find the sbjct in line\n%s" % line)
> #mikep 26/9/00
> #On occasion, there is a blast hit with no subject match
> #so far, it only occurs with 1-line short "matches"
> #I have decided to let these pass as they appear
> if not seq.strip():
> seq = ' ' * self._query_len
> self._hsp.sbjct = self._hsp.sbjct + seq
> if self._hsp.sbjct_start is None:
> self._hsp.sbjct_start = _safe_int(start)
>
> self._hsp.sbjct_end = _safe_int(end)
> if len(seq) != self._query_len:
> raise SyntaxError, \
> "QUERY and SBJCT sequence lengths don't match in line\n%s" \
> % line
>
> del self._query_start_index # clean up unused variables
> del self._query_len
>
> def end_hsp(self):
> pass
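
To see what the two regular expressions in score() accept, here is a small Python 3 sketch that applies the same patterns to two sample lines. The patterns are copied from the quoted code. The helper name parse_score_line, the dictionary it returns, and the sample lines are assumptions for illustration only.

```python
import re

# Patterns copied from the quoted score() method.
SCORE_RE = re.compile(r"Score =\s*([0-9.e+]+) bits \(([0-9]+)\)")
EXPECT_RE = re.compile(r"Expect\(?(\d*)\)? = +([0-9.e\-|\+]+)")


def parse_score_line(line):
    """Pull bits, raw score, alignment count and expect from one HSP line."""
    score_match = SCORE_RE.search(line)
    expect_match = EXPECT_RE.search(line)
    if score_match is None or expect_match is None:
        raise ValueError("I could not find the score or expect in line\n%s" % line)
    bits, score = score_match.groups()
    count, expect = expect_match.groups()
    # BLAST sometimes prints "e-04" without the leading 1.
    if expect[0] in "eE":
        expect = "1" + expect
    return {
        "bits": float(bits),
        "score": float(score),
        # "Expect(2)" carries an alignment count; plain "Expect" means 1.
        "num_alignments": int(count) if count else 1,
        "expect": float(expect),
    }


plain = parse_score_line(" Score = 87.0 bits (214), Expect = 2e-18")
grouped = parse_score_line(" Score = 31.2 bits (69), Expect(2) = e-04")
```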
>
> class _DatabaseReportConsumer:
>
> def start_database_report(self):
> self._dr = Record.DatabaseReport()
>
> def database(self, line):
> m = re.search(r"Database: (.+)$", line)
> if m:
> self._dr.database_name.append(m.group(1))
> elif self._dr.database_name:
> # This must be a continuation of the previous name.
> self._dr.database_name[-1] = "%s%s" % (self._dr.database_name[-1],
> line.strip())
>
> def posted_date(self, line):
> self._dr.posted_date.append(_re_search(
> r"Posted date:\s*(.+)$", line,
> "I could not find the posted date in line\n%s" % line))
>
> def num_letters_in_database(self, line):
> letters, = _get_cols(
> line, (-1,), ncols=6, expected={2:"letters", 4:"database:"})
> self._dr.num_letters_in_database.append(_safe_int(letters))
>
> def num_sequences_in_database(self, line):
> sequences, = _get_cols(
> line, (-1,), ncols=6, expected={2:"sequences", 4:"database:"})
> self._dr.num_sequences_in_database.append(_safe_int(sequences))
>
> def ka_params(self, line):
> x = line.split()
> self._dr.ka_params = map(_safe_float, x)
>
> def gapped(self, line):
> self._dr.gapped = 1
>
> def ka_params_gap(self, line):
> x = line.split()
> self._dr.ka_params_gap = map(_safe_float, x)
>
> def end_database_report(self):
> pass
>
> class _ParametersConsumer:
> def start_parameters(self):
> self._params = Record.Parameters()
>
> def matrix(self, line):
> self._params.matrix = line[8:].rstrip()
>
> def gap_penalties(self, line):
> x = _get_cols(
> line, (3, 5), ncols=6, expected={2:"Existence:", 4:"Extension:"})
> self._params.gap_penalties = map(_safe_float, x)
>
> def num_hits(self, line):
> if line.find('1st pass') != -1:
> x, = _get_cols(line, (-4,), ncols=11, expected={2:"Hits"})
> self._params.num_hits = _safe_int(x)
> else:
> x, = _get_cols(line, (-1,), ncols=6, expected={2:"Hits"})
> self._params.num_hits = _safe_int(x)
>
> def num_sequences(self, line):
> if line.find('1st pass') != -1:
> x, = _get_cols(line, (-4,), ncols=9, expected={2:"Sequences:"})
> self._params.num_sequences = _safe_int(x)
> else:
> x, = _get_cols(line, (-1,), ncols=4, expected={2:"Sequences:"})
> self._params.num_sequences = _safe_int(x)
>
> def num_extends(self, line):
> if line.find('1st pass') != -1:
> x, = _get_cols(line, (-4,), ncols=9, expected={2:"extensions:"})
> self._params.num_extends = _safe_int(x)
> else:
> x, = _get_cols(line, (-1,), ncols=4, expected={2:"extensions:"})
> self._params.num_extends = _safe_int(x)
>
> def num_good_extends(self, line):
> if line.find('1st pass') != -1:
> x, = _get_cols(line, (-4,), ncols=10, expected={3:"extensions:"})
> self._params.num_good_extends = _safe_int(x)
> else:
> x, = _get_cols(line, (-1,), ncols=5, expected={3:"extensions:"})
> self._params.num_good_extends = _safe_int(x)
>
> def num_seqs_better_e(self, line):
> self._params.num_seqs_better_e, = _get_cols(
> line, (-1,), ncols=7, expected={2:"sequences"})
> self._params.num_seqs_better_e = _safe_int(
> self._params.num_seqs_better_e)
>
> def hsps_no_gap(self, line):
> self._params.hsps_no_gap, = _get_cols(
> line, (-1,), ncols=9, expected={3:"better", 7:"gapping:"})
> self._params.hsps_no_gap = _safe_int(self._params.hsps_no_gap)
>
> def hsps_prelim_gapped(self, line):
> self._params.hsps_prelim_gapped, = _get_cols(
> line, (-1,), ncols=9, expected={4:"gapped", 6:"prelim"})
> self._params.hsps_prelim_gapped = _safe_int(
> self._params.hsps_prelim_gapped)
>
> def hsps_prelim_gapped_attempted(self, line):
> self._params.hsps_prelim_gapped_attempted, = _get_cols(
> line, (-1,), ncols=10, expected={4:"attempted", 7:"prelim"})
> self._params.hsps_prelim_gapped_attempted = _safe_int(
> self._params.hsps_prelim_gapped_attempted)
>
> def hsps_gapped(self, line):
> self._params.hsps_gapped, = _get_cols(
> line, (-1,), ncols=6, expected={3:"gapped"})
> self._params.hsps_gapped = _safe_int(self._params.hsps_gapped)
>
> def query_length(self, line):
> self._params.query_length, = _get_cols(
> line.lower(), (-1,), ncols=4, expected={0:"length", 2:"query:"})
> self._params.query_length = _safe_int(self._params.query_length)
>
> def database_length(self, line):
> self._params.database_length, = _get_cols(
> line.lower(), (-1,), ncols=4, expected={0:"length", 2:"database:"})
> self._params.database_length = _safe_int(self._params.database_length)
>
> def effective_hsp_length(self, line):
> self._params.effective_hsp_length, = _get_cols(
> line, (-1,), ncols=4, expected={1:"HSP", 2:"length:"})
> self._params.effective_hsp_length = _safe_int(
> self._params.effective_hsp_length)
>
> def effective_query_length(self, line):
> self._params.effective_query_length, = _get_cols(
> line, (-1,), ncols=5, expected={1:"length", 3:"query:"})
> self._params.effective_query_length = _safe_int(
> self._params.effective_query_length)
>
> def effective_database_length(self, line):
> self._params.effective_database_length, = _get_cols(
> line.lower(), (-1,), ncols=5, expected={1:"length", 3:"database:"})
> self._params.effective_database_length = _safe_int(
> self._params.effective_database_length)
>
> def effective_search_space(self, line):
> self._params.effective_search_space, = _get_cols(
> line, (-1,), ncols=4, expected={1:"search"})
> self._params.effective_search_space = _safe_int(
> self._params.effective_search_space)
>
> def effective_search_space_used(self, line):
> self._params.effective_search_space_used, = _get_cols(
> line, (-1,), ncols=5, expected={1:"search", 3:"used:"})
> self._params.effective_search_space_used = _safe_int(
> self._params.effective_search_space_used)
>
> def frameshift(self, line):
> self._params.frameshift = _get_cols(
> line, (4, 5), ncols=6, expected={0:"frameshift", 2:"decay"})
>
> def threshold(self, line):
> self._params.threshold, = _get_cols(
> line, (1,), ncols=2, expected={0:"T:"})
> self._params.threshold = _safe_int(self._params.threshold)
>
> def window_size(self, line):
> self._params.window_size, = _get_cols(
> line, (1,), ncols=2, expected={0:"A:"})
> self._params.window_size = _safe_int(self._params.window_size)
>
> def dropoff_1st_pass(self, line):
> score, bits = _re_search(
> r"X1: (\d+) \(\s*([0-9,.]+) bits\)", line,
> "I could not find the dropoff in line\n%s" % line)
> self._params.dropoff_1st_pass = _safe_int(score), _safe_float(bits)
>
> def gap_x_dropoff(self, line):
> score, bits = _re_search(
> r"X2: (\d+) \(\s*([0-9,.]+) bits\)", line,
> "I could not find the gap dropoff in line\n%s" % line)
> self._params.gap_x_dropoff = _safe_int(score), _safe_float(bits)
>
> def gap_x_dropoff_final(self, line):
> score, bits = _re_search(
> r"X3: (\d+) \(\s*([0-9,.]+) bits\)", line,
> "I could not find the gap dropoff final in line\n%s" % line)
> self._params.gap_x_dropoff_final = _safe_int(score), _safe_float(bits)
>
> def gap_trigger(self, line):
> score, bits = _re_search(
> r"S1: (\d+) \(\s*([0-9,.]+) bits\)", line,
> "I could not find the gap trigger in line\n%s" % line)
> self._params.gap_trigger = _safe_int(score), _safe_float(bits)
>
> def blast_cutoff(self, line):
> score, bits = _re_search(
> r"S2: (\d+) \(\s*([0-9,.]+) bits\)", line,
> "I could not find the blast cutoff in line\n%s" % line)
> self._params.blast_cutoff = _safe_int(score), _safe_float(bits)
>
> def end_parameters(self):
> pass
>
>
> class _BlastConsumer(AbstractConsumer,
> _HeaderConsumer,
> _DescriptionConsumer,
> _AlignmentConsumer,
> _HSPConsumer,
> _DatabaseReportConsumer,
> _ParametersConsumer
> ):
> # This Consumer inherits from many other consumer classes that handle
> # the actual dirty work. An alternate way to do it is to create objects
> # of those classes and then delegate the parsing tasks to them in a
> # decorator-type pattern. The disadvantage of that is that the method
> # names will need to be resolved in this class. However, using
> # a decorator will retain more control in this class (which may or
> # may not be a bad thing). In addition, having each sub-consumer as
> # its own object prevents this object's dictionary from being cluttered
> # with members and reduces the chance of member collisions.
> def __init__(self):
> self.data = None
>
> def round(self, line):
> # Make sure nobody's trying to pass me PSI-BLAST data!
> raise ValueError, \
> "This consumer doesn't handle PSI-BLAST data"
>
> def start_header(self):
> self.data = Record.Blast()
> _HeaderConsumer.start_header(self)
>
> def end_header(self):
> _HeaderConsumer.end_header(self)
> self.data.__dict__.update(self._header.__dict__)
>
> def end_descriptions(self):
> self.data.descriptions = self._descriptions
>
> def end_alignment(self):
> _AlignmentConsumer.end_alignment(self)
> if self._alignment.hsps:
> self.data.alignments.append(self._alignment)
> if self._multiple_alignment.alignment:
> self.data.multiple_alignment = self._multiple_alignment
>
> def end_hsp(self):
> _HSPConsumer.end_hsp(self)
> try:
> self._alignment.hsps.append(self._hsp)
> except AttributeError:
> raise SyntaxError, "Found an HSP before an alignment"
>
> def end_database_report(self):
> _DatabaseReportConsumer.end_database_report(self)
> self.data.__dict__.update(self._dr.__dict__)
>
> def end_parameters(self):
> _ParametersConsumer.end_parameters(self)
> self.data.__dict__.update(self._params.__dict__)
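
The comment at the top of _BlastConsumer mentions delegation as an alternative to multiple inheritance. A rough Python 3 sketch of what that could look like follows. All class and method names here (HeaderPart, HSPPart, DelegatingConsumer, version_line, strand_line) are hypothetical and are not taken from the quoted module.

```python
class HeaderPart:
    """Hypothetical sub-consumer that handles header events."""

    def __init__(self):
        self.version = None

    def version_line(self, line):
        # e.g. "BLASTP 2.2.13 [Nov-27-2005]" -> "2.2.13"
        self.version = line.split()[1]


class HSPPart:
    """Hypothetical sub-consumer that handles HSP events."""

    def __init__(self):
        self.strand = None

    def strand_line(self, line):
        # e.g. " Strand = Plus / Minus" -> ("Plus", "Minus")
        left, right = line.split("=")[1].split("/")
        self.strand = (left.strip(), right.strip())


class DelegatingConsumer:
    """Owns its sub-consumers instead of inheriting from them."""

    def __init__(self):
        self.header = HeaderPart()
        self.hsp = HSPPart()
        self._parts = (self.header, self.hsp)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails. Private names
        # are never delegated, which also avoids recursion on _parts.
        if name.startswith("_"):
            raise AttributeError(name)
        for part in self._parts:
            if hasattr(part, name):
                return getattr(part, name)
        raise AttributeError(name)


consumer = DelegatingConsumer()
consumer.version_line("BLASTP 2.2.13 [Nov-27-2005]")
consumer.strand_line(" Strand = Plus / Minus")
```

Each part keeps its own attributes, so two parts can use the same private name without colliding. The cost is the name resolution in __getattr__ that the quoted comment warns about.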
>
> class _PSIBlastConsumer(AbstractConsumer,
> _HeaderConsumer,
> _DescriptionConsumer,
> _AlignmentConsumer,
> _HSPConsumer,
> _DatabaseReportConsumer,
> _ParametersConsumer
> ):
> def __init__(self):
> self.data = None
>
> def start_header(self):
> self.data = Record.PSIBlast()
> _HeaderConsumer.start_header(self)
>
> def end_header(self):
> _HeaderConsumer.end_header(self)
> self.data.__dict__.update(self._header.__dict__)
>
> def start_descriptions(self):
> self._round = Record.Round()
> self.data.rounds.append(self._round)
> _DescriptionConsumer.start_descriptions(self)
>
> def end_descriptions(self):
> _DescriptionConsumer.end_descriptions(self)
> self._round.number = self._roundnum
> if self._descriptions:
> self._round.new_seqs.extend(self._descriptions)
> self._round.reused_seqs.extend(self._model_sequences)
> self._round.new_seqs.extend(self._nonmodel_sequences)
> if self._converged:
> self.data.converged = 1
>
> def end_alignment(self):
> _AlignmentConsumer.end_alignment(self)
> if self._alignment.hsps:
> self._round.alignments.append(self._alignment)
> if self._multiple_alignment:
> self._round.multiple_alignment = self._multiple_alignment
>
> def end_hsp(self):
> _HSPConsumer.end_hsp(self)
> try:
> self._alignment.hsps.append(self._hsp)
> except AttributeError:
> raise SyntaxError, "Found an HSP before an alignment"
>
> def end_database_report(self):
> _DatabaseReportConsumer.end_database_report(self)
> self.data.__dict__.update(self._dr.__dict__)
>
> def end_parameters(self):
> _ParametersConsumer.end_parameters(self)
> self.data.__dict__.update(self._params.__dict__)
>
> class Iterator:
> """Iterates over a file of multiple BLAST results.
>
> Methods:
> next Return the next record from the stream, or None.
>
> """
> def __init__(self, handle, parser=None):
> """__init__(self, handle, parser=None)
>
> Create a new iterator. handle is a file-like object. parser
> is an optional Parser object to change the results into another form.
> If set to None, then the raw contents of the file will be returned.
>
> """
> try:
> handle.readline
> except AttributeError:
> raise ValueError(
> "I expected a file handle or file-like object, got %s"
> % type(handle))
> self._uhandle = File.UndoHandle(handle)
> self._parser = parser
>
> def next(self):
> """next(self) -> object
>
> Return the next Blast record from the file. If no more records,
> return None.
>
> """
> lines = []
> while 1:
> line = self._uhandle.readline()
> if not line:
> break
> # If I've reached the next one, then put the line back and stop.
> if lines and (line.startswith('BLAST')
> or line.startswith('BLAST', 1)
> or line.startswith('<?xml ')):
> self._uhandle.saveline(line)
> break
> lines.append(line)
>
> if not lines:
> return None
>
> data = ''.join(lines)
> if self._parser is not None:
> return self._parser.parse(File.StringHandle(data))
> return data
>
> def __iter__(self):
> return iter(self.next, None)
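
The record splitting in Iterator.next() can be tried without the rest of the module. This is a Python 3 sketch under the assumption that every report starts with a line that has 'BLAST' at column 0 or column 1 (for example BLASTP or TBLASTN). It handles only those two plain-text cases, and the name split_reports is made up for this example.

```python
import io


def split_reports(handle):
    """Yield one string per BLAST report found in a text handle."""
    lines = []
    for line in handle:
        # startswith('BLAST', 1) catches program names such as TBLASTN,
        # where 'BLAST' begins at the second character.
        is_header = line.startswith("BLAST") or line.startswith("BLAST", 1)
        if lines and is_header:
            yield "".join(lines)
            lines = []
        lines.append(line)
    if lines:
        yield "".join(lines)


text = "BLASTP 2.2.13\nQuery= P1\nTBLASTN 2.2.13\nQuery= P2\n"
reports = list(split_reports(io.StringIO(text)))
```

Because a fresh list is started for every report, nothing from an earlier query can leak into a later one at this stage.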
>
> def blastall(blastcmd, program, database, infile, **keywds):
> """blastall(blastcmd, program, database, infile, **keywds) ->
> read, error Undohandles
>
> Execute and retrieve data from blastall. blastcmd is the command
> used to launch the 'blastall' executable. program is the blast program
> to use, e.g. 'blastp', 'blastn', etc. database is the path to the database
> to search against. infile is the path to the file containing
> the sequence to search with.
>
> You may pass more parameters to **keywds to change the behavior of
> the search. Otherwise, optional values will be chosen by blastall.
>
> Scoring
> matrix Matrix to use.
> gap_open Gap open penalty.
> gap_extend Gap extension penalty.
> nuc_match Nucleotide match reward. (BLASTN)
> nuc_mismatch Nucleotide mismatch penalty. (BLASTN)
> query_genetic_code Genetic code for Query.
> db_genetic_code Genetic code for database. (TBLAST[NX])
>
> Algorithm
> gapped Whether to do a gapped alignment. T/F (not for TBLASTX)
> expectation Expectation value cutoff.
> wordsize Word size.
> strands Query strands to search against database.([T]BLAST[NX])
> keep_hits Number of best hits from a region to keep.
> xdrop Dropoff value (bits) for gapped alignments.
> hit_extend Threshold for extending hits.
> region_length Length of region used to judge hits.
> db_length Effective database length.
> search_length Effective length of search space.
>
> Processing
> filter Filter query sequence? T/F
> believe_query Believe the query defline. T/F
> restrict_gi Restrict search to these GI's.
> nprocessors Number of processors to use.
> oldengine Force use of old engine [T/F]
>
> Formatting
> html Produce HTML output? T/F
> descriptions Number of one-line descriptions.
> alignments Number of alignments.
> align_view Alignment view. Integer 0-6.
> show_gi Show GI's in deflines? T/F
> seqalign_file seqalign file to output.
>
> """
> att2param = {
> 'matrix' : '-M',
> 'gap_open' : '-G',
> 'gap_extend' : '-E',
> 'nuc_match' : '-r',
> 'nuc_mismatch' : '-q',
> 'query_genetic_code' : '-Q',
> 'db_genetic_code' : '-D',
>
> 'gapped' : '-g',
> 'expectation' : '-e',
> 'wordsize' : '-W',
> 'strands' : '-S',
> 'keep_hits' : '-K',
> 'xdrop' : '-X',
> 'hit_extend' : '-f',
> 'region_length' : '-L',
> 'db_length' : '-z',
> 'search_length' : '-Y',
>
> 'program' : '-p',
> 'database' : '-d',
> 'infile' : '-i',
> 'filter' : '-F',
> 'believe_query' : '-J',
> 'restrict_gi' : '-l',
> 'nprocessors' : '-a',
> 'oldengine' : '-V',
>
> 'html' : '-T',
> 'descriptions' : '-v',
> 'alignments' : '-b',
> 'align_view' : '-m',
> 'show_gi' : '-I',
> 'seqalign_file' : '-O'
> }
>
> if not os.path.exists(blastcmd):
> raise ValueError, "blastall does not exist at %s" % blastcmd
>
> params = []
>
> params.extend([att2param['program'], program])
> params.extend([att2param['database'], database])
> params.extend([att2param['infile'], infile])
>
> for attr in keywds.keys():
> params.extend([att2param[attr], str(keywds[attr])])
>
> w, r, e = os.popen3(" ".join([blastcmd] + params))
> w.close()
> return File.UndoHandle(r), File.UndoHandle(e)
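
The parameter handling above boils down to turning keyword arguments into a flag list. Here is a Python 3 sketch of that step alone, using a small subset of the att2param table. The function name build_blastall_args and the example paths are assumptions, and the search itself is not run.

```python
# Subset of the att2param table from the quoted blastall() function.
ATT2PARAM = {
    "program": "-p",
    "database": "-d",
    "infile": "-i",
    "expectation": "-e",
    "align_view": "-m",
}


def build_blastall_args(blastcmd, program, database, infile, **keywds):
    """Return the command as a list of arguments, one item per token."""
    args = [
        blastcmd,
        ATT2PARAM["program"], program,
        ATT2PARAM["database"], database,
        ATT2PARAM["infile"], infile,
    ]
    # Sort the optional keywords so the result is reproducible.
    for attr in sorted(keywds):
        if attr not in ATT2PARAM:
            raise ValueError("Unknown blastall option: %s" % attr)
        args.extend([ATT2PARAM[attr], str(keywds[attr])])
    return args


# Hypothetical paths, for illustration only.
args = build_blastall_args(
    "/usr/local/bin/blastall", "blastp", "nr", "query.fasta",
    expectation=0.001, align_view=7,
)
```

A list like this can be passed to subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE) without going through a shell. That is a different technique from the os.popen3 call in the quoted code, which joins everything into one string and so breaks on paths that contain spaces.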
>
>
> def blastpgp(blastcmd, database, infile, **keywds):
> """blastpgp(blastcmd, database, infile, **keywds) ->
> read, error Undohandles
>
> Execute and retrieve data from blastpgp. blastcmd is the command
> used to launch the 'blastpgp' executable. database is the path to the
> database to search against. infile is the path to the file containing
> the sequence to search with.
>
> You may pass more parameters to **keywds to change the behavior of
> the search. Otherwise, optional values will be chosen by blastpgp.
>
> Scoring
> matrix Matrix to use.
> gap_open Gap open penalty.
> gap_extend Gap extension penalty.
> window_size Multiple hits window size.
> npasses Number of passes.
> passes Hits/passes. Integer 0-2.
>
> Algorithm
> gapped Whether to do a gapped alignment. T/F
> expectation Expectation value cutoff.
> wordsize Word size.
> keep_hits Number of best hits from a region to keep.
> xdrop Dropoff value (bits) for gapped alignments.
> hit_extend Threshold for extending hits.
> region_length Length of region used to judge hits.
> db_length Effective database length.
> search_length Effective length of search space.
> nbits_gapping Number of bits to trigger gapping.
> pseudocounts Pseudocounts constants for multiple passes.
> xdrop_final X dropoff for final gapped alignment.
> xdrop_extension Dropoff for blast extensions.
> model_threshold E-value threshold to include in multipass model.
> required_start Start of required region in query.
> required_end End of required region in query.
>
> Processing
> XXX should document default values
> program The blast program to use. (PHI-BLAST)
> filter Filter query sequence with SEG? T/F
> believe_query Believe the query defline? T/F
> nprocessors Number of processors to use.
>
> Formatting
> html Produce HTML output? T/F
> descriptions Number of one-line descriptions.
> alignments Number of alignments.
> align_view Alignment view. Integer 0-6.
> show_gi Show GI's in deflines? T/F
> seqalign_file seqalign file to output.
> align_outfile Output file for alignment.
> checkpoint_outfile Output file for PSI-BLAST checkpointing.
> restart_infile Input file for PSI-BLAST restart.
> hit_infile Hit file for PHI-BLAST.
> matrix_outfile Output file for PSI-BLAST matrix in ASCII.
> align_infile Input alignment file for PSI-BLAST restart.
>
> """
> att2param = {
> 'matrix' : '-M',
> 'gap_open' : '-G',
> 'gap_extend' : '-E',
> 'window_size' : '-A',
> 'npasses' : '-j',
> 'passes' : '-P',
>
> 'gapped' : '-g',
> 'expectation' : '-e',
> 'wordsize' : '-W',
> 'keep_hits' : '-K',
> 'xdrop' : '-X',
> 'hit_extend' : '-f',
> 'region_length' : '-L',
> 'db_length' : '-z',
> 'search_length' : '-Y',
> 'nbits_gapping' : '-N',
> 'pseudocounts' : '-c',
> 'xdrop_final' : '-Z',
> 'xdrop_extension' : '-y',
> 'model_threshold' : '-h',
> 'required_start' : '-S',
> 'required_end' : '-H',
>
> 'program' : '-p',
> 'database' : '-d',
> 'infile' : '-i',
> 'filter' : '-F',
> 'believe_query' : '-J',
> 'nprocessors' : '-a',
>
> 'html' : '-T',
> 'descriptions' : '-v',
> 'alignments' : '-b',
> 'align_view' : '-m',
> 'show_gi' : '-I',
> 'seqalign_file' : '-O',
> 'align_outfile' : '-o',
> 'checkpoint_outfile' : '-C',
> 'restart_infile' : '-R',
> 'hit_infile' : '-k',
> 'matrix_outfile' : '-Q',
> 'align_infile' : '-B'
> }
>
> if not os.path.exists(blastcmd):
> raise ValueError, "blastpgp does not exist at %s" % blastcmd
>
> params = []
>
> params.extend([att2param['database'], database])
> params.extend([att2param['infile'], infile])
>
> for attr in keywds.keys():
> params.extend([att2param[attr], str(keywds[attr])])
>
> w, r, e = os.popen3(" ".join([blastcmd] + params))
> w.close()
> return File.UndoHandle(r), File.UndoHandle(e)
>
>
> def rpsblast(blastcmd, database, infile, align_view="7", **keywds):
> """rpsblast(blastcmd, database, infile, align_view="7", **keywds) ->
> read, error Undohandles
>
> Execute and retrieve data from standalone RPS-BLAST. blastcmd is the
> command used to launch the 'rpsblast' executable. database is the path
> to the database to search against. infile is the path to the file
> containing the sequence to search with.
>
> You may pass more parameters to **keywds to change the behavior of
> the search. Otherwise, optional values will be chosen by rpsblast.
>
> Please note that this function will give XML output by default, by
> setting align_view to seven (i.e. command line option -m 7).
> You should use the NCBIXML.BlastParser() to read the resulting output.
> This is because NCBIStandalone.BlastParser() does not understand the
> plain text output format from rpsblast.
>
> WARNING - The following text and associated parameter handling has not
> received extensive testing. Please report any errors we might have made...
>
> Algorithm/Scoring
> gapped Whether to do a gapped alignment. T/F
> multihit 0 for multiple hit (default), 1 for single hit
> expectation Expectation value cutoff.
> range_restriction Range restriction on query sequence (Format: start,stop) blastp only
> 0 in 'start' refers to the beginning of the sequence
> 0 in 'stop' refers to the end of the sequence
> Default = 0,0
> xdrop Dropoff value (bits) for gapped alignments.
> xdrop_final X dropoff for final gapped alignment (in bits).
> xdrop_extension Dropoff for blast extensions (in bits).
> search_length Effective length of search space.
> nbits_gapping Number of bits to trigger gapping.
> protein Query sequence is protein. T/F
> db_length Effective database length.
>
> Processing
> filter Filter query sequence with SEG? T/F
> case_filter Use lower case filtering of FASTA sequence T/F, default F
> believe_query Believe the query defline. T/F
> nprocessors Number of processors to use.
> logfile Name of log file to use, default rpsblast.log
>
> Formatting
> html Produce HTML output? T/F
> descriptions Number of one-line descriptions.
> alignments Number of alignments.
> align_view Alignment view. Integer 0-9.
> show_gi Show GI's in deflines? T/F
> seqalign_file seqalign file to output.
> align_outfile Output file for alignment.
>
> """
> att2param = {
> 'multihit' : '-P',
> 'gapped' : '-g',
> 'expectation' : '-e',
> 'range_restriction' : '-L',
> 'xdrop' : '-X',
> 'xdrop_final' : '-Z',
> 'xdrop_extension' : '-y',
> 'search_length' : '-Y',
> 'nbits_gapping' : '-N',
> 'protein' : '-p',
> 'db_length' : '-z',
>
> 'database' : '-d',
> 'infile' : '-i',
> 'filter' : '-F',
> 'case_filter' : '-U',
> 'believe_query' : '-J',
> 'nprocessors' : '-a',
> 'logfile' : '-l',
>
> 'html' : '-T',
> 'descriptions' : '-v',
> 'alignments' : '-b',
> 'align_view' : '-m',
> 'show_gi' : '-I',
> 'seqalign_file' : '-O',
> 'align_outfile' : '-o'
> }
>
> if not os.path.exists(blastcmd):
> raise ValueError, "rpsblast does not exist at %s" % blastcmd
>
> params = []
>
> params.extend([att2param['database'], database])
> params.extend([att2param['infile'], infile])
> params.extend([att2param['align_view'], align_view])
>
> for attr in keywds.keys():
> params.extend([att2param[attr], str(keywds[attr])])
>
> w, r, e = os.popen3(" ".join([blastcmd] + params))
> w.close()
> return File.UndoHandle(r), File.UndoHandle(e)
>
> def _re_search(regex, line, error_msg):
> m = re.search(regex, line)
> if not m:
> raise SyntaxError, error_msg
> return m.groups()
>
> def _get_cols(line, cols_to_get, ncols=None, expected={}):
> cols = line.split()
>
> # Check to make sure number of columns is correct
> if ncols is not None and len(cols) != ncols:
> raise SyntaxError, "I expected %d columns (got %d) in line\n%s" % \
> (ncols, len(cols), line)
>
> # Check to make sure columns contain the correct data
> for k in expected.keys():
> if cols[k] != expected[k]:
> raise SyntaxError, "I expected '%s' in column %d in line\n%s" % (
> expected[k], k, line)
>
> # Construct the answer tuple
> results = []
> for c in cols_to_get:
> results.append(cols[c])
> return tuple(results)
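
For anyone reading the many _get_cols calls in _ParametersConsumer, here is a standalone Python 3 sketch of the same check-then-extract idea. The name get_cols and the sample line are assumptions for illustration. It raises ValueError where the quoted code raises SyntaxError, and it uses None rather than a shared {} as the default for expected.

```python
def get_cols(line, cols_to_get, ncols=None, expected=None):
    """Split a line on whitespace, validate it, and return chosen columns."""
    cols = line.split()

    # Check that the number of columns is what the caller expects.
    if ncols is not None and len(cols) != ncols:
        raise ValueError("I expected %d columns (got %d) in line\n%s"
                         % (ncols, len(cols), line))

    # Check that the marker words sit in the expected columns.
    for index, word in (expected or {}).items():
        if cols[index] != word:
            raise ValueError("I expected '%s' in column %d in line\n%s"
                             % (word, index, line))

    return tuple(cols[c] for c in cols_to_get)


# Hypothetical statistics line shaped like the one num_hits() parses.
hits, = get_cols("Number of Hits to DB: 1,234,567", (-1,),
                 ncols=6, expected={2: "Hits"})
```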
>
> def _safe_int(str):
> try:
> return int(str)
> except ValueError:
> # Something went wrong. Try to clean up the string.
> # Remove all commas from the string
> str = str.replace(',', '')
> try:
> # try again.
> return int(str)
> except ValueError:
> pass
> # If it fails again, maybe it's too long?
> # XXX why converting to float?
> return long(float(str))
>
> def _safe_float(str):
> # Thomas Rosleff Soerensen (rosleff at mpiz-koeln.mpg.de) noted that
> # float('e-172') does not produce an error on his platform. Thus,
> # we need to check the string for this condition.
>
> # Sometimes BLAST leaves off the '1' in front of an exponent.
> if str and str[0] in ['E', 'e']:
> str = '1' + str
> try:
> return float(str)
> except ValueError:
> # Remove all commas from the string
> str = str.replace(',', '')
> # try again.
> return float(str)
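
The two conversion helpers are small enough to try on their own. This is a Python 3 rendering of the same fallbacks, with int standing in for long and the parameter renamed so it does not shadow the built-in str. The names safe_int and safe_float are chosen for this sketch.

```python
def safe_int(text):
    """Convert BLAST's integer spellings, e.g. '1,234,567' or '1e3'."""
    try:
        return int(text)
    except ValueError:
        pass
    # Remove thousands separators and try again.
    text = text.replace(",", "")
    try:
        return int(text)
    except ValueError:
        # Last resort: the number may be written in exponent form.
        return int(float(text))


def safe_float(text):
    """Convert BLAST's float spellings, e.g. 'e-172' or '1,024.5'."""
    # BLAST sometimes leaves off the '1' in front of an exponent.
    if text and text[0] in "Ee":
        text = "1" + text
    try:
        return float(text)
    except ValueError:
        return float(text.replace(",", ""))
```

One caveat about the last fallback in safe_int: going through float loses precision for integers beyond 2**53, which may be what the "XXX why converting to float?" remark in the quoted code is hinting at.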
>
> class _BlastErrorConsumer(_BlastConsumer):
> def __init__(self):
> _BlastConsumer.__init__(self)
> def noevent(self, line):
> if line.find("Query must be at least wordsize") != -1:
> raise ShortQueryBlastError, "Query must be at least wordsize"
> # Now pass the line back up to the superclass.
> method = getattr(_BlastConsumer, 'noevent',
> _BlastConsumer.__getattr__(self, 'noevent'))
> method(line)
>
> class BlastErrorParser(AbstractParser):
> """Attempt to catch and diagnose BLAST errors while parsing.
>
> This utilizes the BlastParser module but adds an additional layer
> of complexity on top of it by attempting to diagnose SyntaxError's
> that may actually indicate problems during BLAST parsing.
>
> Current BLAST problems this detects are:
> o LowQualityBlastError - When BLASTing really low quality sequences
> (i.e. some GenBank entries which are just short stretches of a single
> nucleotide), BLAST will report an error with the sequence and be
> unable to search with this. This will lead to a badly formatted
> BLAST report that the parsers choke on. The parser will convert the
> SyntaxError to a LowQualityBlastError and attempt to provide useful
> information.
>
> """
> def __init__(self, bad_report_handle = None):
> """Initialize a parser that tries to catch BlastErrors.
>
> Arguments:
> o bad_report_handle - An optional argument specifying a handle
> where bad reports should be sent. This would allow you to save
> all of the bad reports to a file, for instance. If no handle
> is specified, the bad reports will not be saved.
> """
> self._bad_report_handle = bad_report_handle
>
> #self._b_parser = BlastParser()
> self._scanner = _Scanner()
> self._consumer = _BlastErrorConsumer()
>
> def parse(self, handle):
> """Parse a handle, attempting to diagnose errors.
> """
> results = handle.read()
>
> try:
> self._scanner.feed(File.StringHandle(results), self._consumer)
> except SyntaxError, msg:
> # if we have a bad_report_file, save the info to it first
> if self._bad_report_handle:
> # send the info to the error handle
> self._bad_report_handle.write(results)
>
> # now we want to try and diagnose the error
> self._diagnose_error(
> File.StringHandle(results), self._consumer.data)
>
> # if we got here we can't figure out the problem
> # so we should pass along the syntax error we got
> raise
> return self._consumer.data
>
> def _diagnose_error(self, handle, data_record):
> """Attempt to diagnose an error in the passed handle.
>
> Arguments:
> o handle - The handle potentially containing the error
> o data_record - The data record partially created by the consumer.
> """
> line = handle.readline()
>
> while line:
> # 'Searchingdone' instead of 'Searching......done' seems
> # to indicate a failure to perform the BLAST due to
> # low quality sequence
> if line.startswith('Searchingdone'):
> raise LowQualityBlastError("Blast failure occurred on query: ",
> data_record.query)
> line = handle.readline()
>
>
>
>
> # BLAST XML parsing
> """This module provides code to work with the BLAST XML output
> following the DTD available on the NCBI FTP
> ftp://ftp.ncbi.nlm.nih.gov/blast/documents/xml/NCBI_BlastOutput.dtd
>
> Classes:
> BlastParser Parses XML output from BLAST.
>
> _XMLparser Generic SAX parser.
> """
> from Bio.Blast import Record
> import xml.sax
> from xml.sax.handler import ContentHandler
>
> class _XMLparser(ContentHandler):
> """Generic SAX Parser
>
> Just a very basic SAX parser.
>
> Redefine the methods startElement, characters and endElement.
> """
> def __init__(self):
> """Constructor
> """
> self._tag = []
> self._value = ''
>
> def _secure_name(self, name):
> """Removes 'dangerous' characters from tag names
>
> name -- name to be 'secured'
> """
> # Replace '-' with '_' in XML tag names
> return name.replace('-', '_')
>
> def startElement(self, name, attr):
> """Found XML start tag
>
> No real need of attr, BLAST DTD doesn't use them
>
> name -- name of the tag
>
> attr -- tag attributes
> """
> self._tag.append(name)
>
> # Try to call a method
> try:
> eval(self._secure_name('self._start_' + name))()
> except AttributeError:
> # Doesn't exist (yet)
> pass
>
> def characters(self, ch):
> """Found some text
>
> ch -- characters read
> """
> self._value += ch # You don't ever get the whole string
>
> def endElement(self, name):
> """Found XML end tag
>
> name -- tag name
> """
> # Strip character buffer
> self._value = self._value.strip()
>
> # Try to call a method (defined in subclasses)
> try:
> eval(self._secure_name('self._end_' + name))()
> except AttributeError: # Method doesn't exist (yet ?)
> pass
>
> # Reset character buffer
> self._value = ''
>
> class BlastParser(_XMLparser):
> """Parse XML BLAST data into a Record.Blast object
>
> Methods:
> parse Parses BLAST XML data.
>
> All XML 'action' methods are private methods and may be:
> _start_TAG called when the start tag is found
> _end_TAG called when the end tag is found
> """
>
> def __init__(self):
> """Constructor
> """
> # Calling superclass method
> _XMLparser.__init__(self)
>
> self._parser = xml.sax.make_parser()
> self._parser.setContentHandler(self)
>
> # To avoid ValueError: unknown url type: NCBI_BlastOutput.dtd
> self._parser.setFeature(xml.sax.handler.feature_validation, 0)
> self._parser.setFeature(xml.sax.handler.feature_namespaces, 0)
> self._parser.setFeature(xml.sax.handler.feature_external_pes, 0)
> self._parser.setFeature(xml.sax.handler.feature_external_ges, 0)
>
> self._blast = Record.Blast()
>
> def parse(self, handler):
> """Parses the XML data
>
> handler -- file handler or StringIO
> """
> # initialize a new Blast Record
> #self._blast = Record.Blast()
> # FIXME: very slow?
> self._blast.__init__()
>
> # bugfix: changed `filename` to `handler`. Iddo 12/20/2004
> self._parser.parse(handler)
> return self._blast
>
> # Header
> def _end_BlastOutput_program(self):
> """BLAST program, e.g., blastp, blastn, etc.
> """
> self._blast.application = self._value.upper()
>
> def _end_BlastOutput_version(self):
> """version number of the BLAST engine (e.g., 2.1.2)
> """
> self._blast.version = self._value.split()[1]
> self._blast.date = self._value.split()[2][1:-1]
>
> def _end_BlastOutput_reference(self):
> """a reference to the article describing the algorithm
> """
> self._blast.reference = self._value
>
> def _end_BlastOutput_db(self):
> """the database(s) searched
> """
> self._blast.database = self._value
>
> def _end_BlastOutput_query_ID(self):
> """the identifier of the query
> """
> self._blast.query_id = self._value
>
> def _end_BlastOutput_query_def(self):
> """the definition line of the query
> """
> self._blast.query = self._value
>
> def _end_BlastOutput_query_len(self):
> """the length of the query
> """
> self._blast.query_length = int(self._value)
>
> ## def _end_BlastOutput_query_seq(self):
> ## """the query sequence
> ## """
> ## pass # XXX Missing in Record.Blast ?
>
> ## def _end_BlastOutput_iter_num(self):
> ## """the psi-blast iteration number
> ## """
> ## pass # XXX TODO PSI
>
> # non-existent in blastall 2.2.13 output
> def _end_BlastOutput_hits(self):
> """hits to the database sequences, one for every sequence
> """
> self._blast.num_hits = int(self._value)
>
> ## def _end_BlastOutput_message(self):
> ## """error messages
> ## """
> ## pass # XXX What to do ?
>
> # Parameters
> def _end_Parameters_matrix(self):
> """matrix used (-M)
> """
> self._blast.matrix = self._value
>
> def _end_Parameters_expect(self):
> """expect values cutoff (-e)
> """
> self._blast.expect = self._value
>
> ## def _end_Parameters_include(self):
> ## """inclusion threshold for a psi-blast iteration (-h)
> ## """
> ## pass # XXX TODO PSI
>
> def _end_Parameters_sc_match(self):
> """match score for nucleotide-nucleotide comparison (-r)
> """
> self._blast.sc_match = int(self._value)
>
> def _end_Parameters_sc_mismatch(self):
> """mismatch penalty for nucleotide-nucleotide comparison (-q)
> """
> self._blast.sc_mismatch = int(self._value)
>
> def _end_Parameters_gap_open(self):
> """gap existence cost (-G)
> """
> self._blast.gap_penalties[0] = int(self._value)
>
> def _end_Parameters_gap_extend(self):
> """gap extension cost (-E)
> """
> self._blast.gap_penalties[1] = int(self._value)
>
> def _end_Parameters_filter(self):
> """filtering options (-F)
> """
> self._blast.filter = self._value
>
> ## def _end_Parameters_pattern(self):
> ## """pattern used for phi-blast search
> ## """
> ## pass # XXX TODO PSI
>
> ## def _end_Parameters_entrez_query(self):
> ## """entrez query used to limit search
> ## """
> ## pass # XXX TODO PSI
>
> # Hits
> def _start_Hit(self):
> self._blast.alignments.append(Record.Alignment())
> self._blast.descriptions.append(Record.Description())
> self._blast.multiple_alignment = []
> self._hit = self._blast.alignments[-1]
> self._descr = self._blast.descriptions[-1]
> self._descr.num_alignments = 0
>
> # Hit_num is useless
>
> def _end_Hit_id(self):
> """identifier of the matched database sequence
> """
> self._hit.title_id = self._value
> self._descr.title_id = self._value
>
> def _end_Hit_def(self):
> """definition line (title) of the database sequence
> """
> self._hit.title = self._value
> self._descr.title = self._value
>
> # not necessary?
> def _end_Hit_accession(self):
> """accession of the database sequence
> """
> self._hit.accession = self._value
> self._descr.accession = self._value
>
> def _end_Hit_len(self):
> self._hit.length = int(self._value)
>
> # HSPs
> def _start_Hsp(self):
> self._hit.hsps.append(Record.HSP())
> self._hsp = self._hit.hsps[-1]
> self._descr.num_alignments += 1
> self._blast.multiple_alignment.append(Record.MultipleAlignment())
> self._mult_al = self._blast.multiple_alignment[-1]
>
> # Hsp_num is useless
> def _end_Hsp_score(self):
> """raw score of HSP
> """
> self._hsp.score = float(self._value)
> # hits are in order of best score to worst. keep best
> if self._descr.score == None:
> self._descr.score = float(self._value)
>
> def _end_Hsp_bit_score(self):
> """bit score of HSP
> """
> self._hsp.bit_score = float(self._value)
> # hits are in order of best score to worst. keep best
> if self._descr.bit_score == None:
> self._descr.bit_score = float(self._value)
>
> def _end_Hsp_evalue(self):
> """expect value of the HSP
> """
> self._hsp.expect = float(self._value)
> if self._descr.evalue == None:
> self._descr.evalue = float(self._value)
>
> def _end_Hsp_query_from(self):
> """offset of query at the start of the alignment (one-offset)
> """
> self._hsp.query_start = int(self._value)
>
> def _end_Hsp_query_to(self):
> """offset of query at the end of the alignment (one-offset)
> """
> self._hsp.query_stop = int(self._value)
>
> def _end_Hsp_hit_from(self):
> """offset of the database at the start of the alignment (one-offset)
> """
> self._hsp.sbjct_start = int(self._value)
>
> def _end_Hsp_hit_to(self):
> """offset of the database at the end of the alignment (one-offset)
> """
> self._hsp.sbjct_stop = int(self._value)
>
>
> ## def _end_Hsp_pattern_from(self):
> ## """start of phi-blast pattern on the query (one-offset)
> ## """
> ## pass # XXX TODO PSI
>
> ## def _end_Hsp_pattern_to(self):
> ## """end of phi-blast pattern on the query (one-offset)
> ## """
> ## pass # XXX TODO PSI
>
> def _end_Hsp_query_frame(self):
> """frame of the query if applicable
> """
> self._hsp.frame = (int(self._value),)
>
> def _end_Hsp_hit_frame(self):
> """frame of the database sequence if applicable
> """
> self._hsp.frame += (int(self._value),)
>
> def _end_Hsp_identity(self):
> """number of identities in the alignment
> """
> self._hsp.identities = (int(self._value), None)
>
> def _end_Hsp_positive(self):
> """number of positive (conservative) substitutions in the alignment
> """
> self._hsp.positives = (int(self._value), None)
>
> def _end_Hsp_align_len(self):
> """length of the alignment
> """
> self._hsp.align_length = int(self._value)
>
> def _end_Hsp_gaps(self):
> """number of gaps in the alignment
> """
> self._hsp.gaps = (int(self._value),None)
>
> ## def _end_Hsp_density(self):
> ## """score density
> ## """
> ## pass # XXX ???
>
> def _end_Hsp_qseq(self):
> """alignment string for the query
> """
> self._hsp.query = self._value
>
> def _end_Hsp_hseq(self):
> """alignment string for the database
> """
> self._hsp.sbjct = self._value
>
> def _end_Hsp_midline(self):
> """Formatting middle line as normally seen in BLAST report
> """
> self._hsp.match = self._value
>
> # Statistics
> def _end_Statistics_db_num(self):
> """number of sequences in the database
> """
> self._blast.num_sequences_in_database = int(self._value)
>
> def _end_Statistics_db_len(self):
> """number of letters in the database
> """
> self._blast.num_letters_in_database = int(self._value)
>
> def _end_Statistics_hsp_len(self):
> """the effective HSP length
> """
> self._blast.effective_hsp_length = int(self._value)
>
> def _end_Statistics_eff_space(self):
> """the effective search space
> """
> self._blast.effective_search_space = float(self._value)
>
> def _end_Statistics_kappa(self):
> """Karlin-Altschul parameter K
> """
> self._blast.ka_params[0] = float(self._value)
>
> def _end_Statistics_lambda(self):
> """Karlin-Altschul parameter Lambda
> """
> self._blast.ka_params[1] = float(self._value)
>
> def _end_Statistics_entropy(self):
> """Karlin-Altschul parameter H
> """
> self._blast.ka_params[2] = float(self._value)
>
>
> if __name__ == '__main__':
> import sys
> p = BlastParser()
> r = p.parse(sys.argv[1])
>
> # Small test
> print 'Blast of', r.query
> print 'Found %s alignments with a total of %s HSPs' % (len(r.alignments),
> reduce(lambda a,b: a+b,
> [len(a.hsps) for a in r.alignments]))
>
> for al in r.alignments:
> print al.title[:50], al.length, 'bp', len(al.hsps), 'HSPs'
>
> # Cookbook example
> E_VALUE_THRESH = 0.04
> for alignment in r.alignments:
> for hsp in alignment.hsps:
> if hsp.expect < E_VALUE_THRESH:
> print '*****'
> print 'sequence', alignment.title
> print 'length', alignment.length
> print 'e value', hsp.expect
> print hsp.query[:75] + '...'
> print hsp.match[:75] + '...'
> print hsp.sbjct[:75] + '...'
>
>
>
> # Copyright 1999-2000 by Jeffrey Chang. All rights reserved.
> # This code is part of the Biopython distribution and governed by its
> # license. Please see the LICENSE file that should have been included
> # as part of this package.
>
> """Record classes to hold BLAST output.
>
> Classes:
> Blast Holds all the information from a blast search.
> PSIBlast Holds all the information from a psi-blast search.
>
> Header Holds information from the header.
> Description Holds information about one hit description.
> Alignment Holds information about one alignment hit.
> HSP Holds information about one HSP.
> MultipleAlignment Holds information about a multiple alignment.
> DatabaseReport Holds information from the database report.
> Parameters Holds information from the parameters.
>
> """
> # XXX finish printable BLAST output
>
> import string
>
> from Bio.Align import Generic
>
> class Header:
> """Saves information from a blast header.
>
> Members:
> application The name of the BLAST flavor that generated this data.
> version Version of blast used.
> date Date this data was generated.
> reference Reference for blast.
>
> query Name of the query sequence.
> query_id Query ID (necessary?) (str)
> query_length Length of the query sequence. (int)
>
> database Name of the database.
> """
> def __init__(self):
> self.application = ''
> self.version = ''
> self.date = ''
> self.reference = ''
>
> self.query = ''
> self.query_id = ''
> self.query_length = None
>
> self.database = ''
>
> class Description:
> """Stores information about one hit in the descriptions section.
>
> Members:
> title Title of the hit.
> title_id Hit ID (necessary?). (str)
> accession Hit accession (necessary?). (str)
> score Number of bits. (int)
> evalue Expect value. (float)
> num_alignments Number of alignments for the same subject. (int)
>
> """
> def __init__(self):
> self.title = ''
> self.title_id = ''
> self.accession = ''
> self.score = None
> self.bit_score = None
> self.evalue = None
> self.num_alignments = None
> def __str__(self):
> return "%-66s %5s %s %s" % (self.title, self.score, self.bit_score, self.evalue)
>
> class Alignment:
> """Stores information about one hit in the alignments section.
>
> Members:
> title Name of the matched sequence. (str)
> title_id Hit ID (necessary?). (str)
> accession Hit accession (necessary?). (str)
> length Length of matched sequence. (int)
> hsps A list of HSP objects.
>
> """
> def __init__(self):
> self.title = ''
> self.title_id = ''
> self.accession = ''
> self.length = None
> self.hsps = []
> def __str__(self):
> lines = []
> titles = string.split(self.title, '\n')
> for i in range(len(titles)):
> if i:
> lines.append(" ")
> lines.append("%s\n" % titles[i])
> lines.append(" Length = %s\n" % self.length)
> return string.join(lines, '')
>
> class HSP:
> """Stores information about one hsp in an alignment hit.
>
> Members:
> score BLAST score of hit. (float)
> bit_score Number of bits for that score. (float)
> expect Expect value. (float)
> num_alignments Number of alignments for same subject. (int)
> identities Number of identities/total aligned. tuple of (int, int)
> positives Number of positives/total aligned. tuple of (int, int)
> gaps Number of gaps/total aligned. tuple of (int, int)
> strand Tuple of (query, target) strand.
> frame Tuple of 1 or 2 frame shifts, depending on the flavor.
>
> query The query sequence.
> query_start The start residue for the query sequence. (1-based)
> query_stop The end residue for the query sequence. (1-based)
> match The match sequence.
> sbjct The sbjct sequence.
> sbjct_start The start residue for the sbjct sequence. (1-based)
> sbjct_stop The end residue for the sbjct sequence. (1-based)
>
> align_length Length of the alignment. (int)
>
> Not all flavors of BLAST return values for every attribute:
> score expect identities positives strand frame
> BLASTP X X X X
> BLASTN X X X X X
> BLASTX X X X X X
> TBLASTN X X X X X
> TBLASTX X X X X X/X
>
> Note: for BLASTX, the query sequence is shown as a protein sequence,
> but the numbering is based on the nucleotides. Thus, the numbering
> is 3x larger than the number of amino acid residues. A similar effect
> can be seen for the sbjct sequence in TBLASTN, and for both sequences
> in TBLASTX.
>
> Also, for negative frames, the sequence numbering starts from
> query_start and counts down.
>
> """
> def __init__(self):
> self.score = None
> self.bit_score = None
> self.expect = None
> self.num_alignments = None
> self.identities = (None, None)
> self.positives = (None, None)
> self.gaps = (None, None)
> self.strand = (None, None)
> self.frame = ()
>
> self.query = ''
> self.query_start = None
> self.query_stop = None
> self.match = ''
> self.sbjct = ''
> self.sbjct_start = None
> self.sbjct_stop = None
>
> self.align_length = None
>
> class MultipleAlignment:
> """Holds information about a multiple alignment.
>
> Members:
> alignment A list of tuples (name, start residue, sequence, end residue).
>
> The start residue is 1-based. It may be blank, if that sequence is
> not aligned in the multiple alignment.
>
> """
> def __init__(self):
> self.alignment = []
>
> def to_generic(self, alphabet):
> """Retrieve generic alignment object for the given alignment.
>
> Instead of the tuples, this returns an Alignment object from
> Bio.Align.Generic, through which you can manipulate and query
> the object.
>
> alphabet is the specified alphabet for the sequences in the code (for
> example IUPAC.IUPACProtein).
>
> Thanks to James Casbon for the code.
> """
> seq_parts = []
> seq_names = []
> parse_number = 0
> n = 0
> for name, start, seq, end in self.alignment:
> if name == 'QUERY': #QUERY is the first in each alignment block
> parse_number = parse_number + 1
> n = 0
>
> if parse_number == 1: # create on first_parse, append on all others
> seq_parts.append(seq)
> seq_names.append(name)
> else:
> seq_parts[n] = seq_parts[n] + seq
> n = n + 1
>
> generic = Generic.Alignment(alphabet)
> for (name,seq) in zip(seq_names,seq_parts):
> generic.add_sequence(name, seq)
>
> return generic
>
> class Round:
> """Holds information from a PSI-BLAST round.
>
> Members:
> number Round number. (int)
> reused_seqs Sequences in model, found again. List of Description objects.
> new_seqs Sequences not found, or below threshold. List of Description.
> alignments A list of Alignment objects.
> multiple_alignment A MultipleAlignment object.
>
> """
> def __init__(self):
> self.number = None
> self.reused_seqs = []
> self.new_seqs = []
> self.alignments = []
> self.multiple_alignment = None
>
> class DatabaseReport:
> """Holds information about a database report.
>
> Members:
> database_name List of database names. (can have multiple dbs)
> num_letters_in_database Number of letters in the database. (int)
> num_sequences_in_database List of number of sequences in the database. (int)
> posted_date List of the dates the databases were posted.
> ka_params Array of [lambda, k, h] values. (floats)
> gapped # XXX this isn't set right!
> ka_params_gap Array of [lambda, k, h] values. (floats)
>
> """
> def __init__(self):
> self.database_name = []
> self.posted_date = []
> self.num_letters_in_database = None
> self.num_sequences_in_database = None
> self.ka_params = [None, None, None]
> self.gapped = 0
> self.ka_params_gap = [None, None, None]
>
> class Parameters:
> """Holds information about the parameters.
>
> Members:
> matrix Name of the matrix.
> filter Filter parameter. (str)
> gap_penalties Array of [open, extend] penalties. (floats)
> sc_match Match score for nucleotide-nucleotide comparison
> sc_mismatch Mismatch penalty for nucleotide-nucleotide comparison
> num_hits Number of hits to the database. (int)
> num_sequences Number of sequences. (int)
> num_good_extends Number of extensions. (int)
> expect Expectation value. (int)
> hsps_no_gap Number of HSP's better, without gapping. (int)
> hsps_prelim_gapped Number of HSP's gapped in prelim test. (int)
> hsps_prelim_gapped_attemped Number of HSP's attempted in prelim. (int)
> hsps_gapped Total number of HSP's gapped. (int)
> query_length Length of the query. (int)
> database_length Number of letters in the database. (int)
> effective_hsp_length Effective HSP length. (int)
> effective_query_length Effective length of query. (int)
> effective_database_length Effective length of database. (int)
> effective_search_space Effective search space. (int)
> effective_search_space_used Effective search space used. (int)
> frameshift Frameshift window. Tuple of (int, float)
> threshold Threshold. (int)
> window_size Window size. (int)
> dropoff_1st_pass Tuple of (score, bits). (int, float)
> gap_x_dropoff Tuple of (score, bits). (int, float)
> gap_x_dropoff_final Tuple of (score, bits). (int, float)
> gap_trigger Tuple of (score, bits). (int, float)
> blast_cutoff Tuple of (score, bits). (int, float)
> """
> def __init__(self):
> self.matrix = ''
> self.filter = ''
> self.gap_penalties = [None, None]
> self.sc_match = None
> self.sc_mismatch = None
> self.num_hits = None
> self.num_sequences = None
> self.num_good_extends = None
> self.expect = None
> self.hsps_no_gap = None
> self.hsps_prelim_gapped = None
> self.hsps_prelim_gapped_attemped = None
> self.hsps_gapped = None
> self.query_length = None
> self.database_length = None
> self.effective_hsp_length = None
> self.effective_query_length = None
> self.effective_database_length = None
> self.effective_search_space = None
> self.effective_search_space_used = None
> self.frameshift = (None, None)
> self.threshold = None
> self.window_size = None
> self.dropoff_1st_pass = (None, None)
> self.gap_x_dropoff = (None, None)
> self.gap_x_dropoff_final = (None, None)
> self.gap_trigger = (None, None)
> self.blast_cutoff = (None, None)
>
> class Blast(Header, DatabaseReport, Parameters):
> """Saves the results from a blast search.
>
> Members:
> descriptions A list of Description objects.
> alignments A list of Alignment objects.
> multiple_alignment A MultipleAlignment object.
> + members inherited from base classes
>
> """
> def __init__(self):
> Header.__init__(self)
> DatabaseReport.__init__(self)
> Parameters.__init__(self)
> self.descriptions = []
> self.alignments = []
> self.multiple_alignment = None
>
> class PSIBlast(Header, DatabaseReport, Parameters):
> """Saves the results from a blastpgp search.
>
> Members:
> rounds A list of Round objects.
> converged Whether the search converged.
> + members inherited from base classes
>
> """
> def __init__(self):
> Header.__init__(self)
> DatabaseReport.__init__(self)
> Parameters.__init__(self)
> self.rounds = []
> self.converged = 0
>
>
>
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
>
>
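A note on the `_XMLparser` class quoted above: it finds its handlers by
building a string such as `self._end_Hit_id` and passing it to `eval`, then
swallowing `AttributeError`. Because the `try` also covers the call itself, an
`AttributeError` raised inside a handler (a misspelled method name, for
example) is silently ignored as well. The sketch below is not the Biopython
implementation; it is a minimal Python 3 illustration of the same dispatch
idea using `getattr`, and the names `DispatchingHandler` and `_dispatch` are
hypothetical.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class DispatchingHandler(ContentHandler):
    """Route SAX events to _start_TAG / _end_TAG methods when they exist."""

    def __init__(self):
        super().__init__()
        self._value = ""      # character buffer for the current element
        self.program = None   # filled in by _end_BlastOutput_program

    def _dispatch(self, prefix, name):
        # Look the handler up by name; only the lookup may "fail" quietly.
        method = getattr(self, prefix + name.replace("-", "_"), None)
        if method is not None:
            method()  # errors raised inside the handler propagate normally

    def startElement(self, name, attrs):
        self._dispatch("_start_", name)

    def characters(self, content):
        # SAX may deliver text in several chunks, so accumulate it.
        self._value += content

    def endElement(self, name):
        self._value = self._value.strip()
        self._dispatch("_end_", name)
        self._value = ""

    def _end_BlastOutput_program(self):
        self.program = self._value.upper()


handler = DispatchingHandler()
xml.sax.parseString(
    b"<BlastOutput><BlastOutput_program>blastp</BlastOutput_program></BlastOutput>",
    handler,
)
print(handler.program)
```

With this arrangement a missing handler is still skipped, but a typo inside an
existing handler surfaces as an exception instead of leaving the record field
unset.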
From jmjoseph at andrew.cmu.edu Wed Jul 19 15:18:12 2006
From: jmjoseph at andrew.cmu.edu (Jacob Joseph)
Date: Wed, 19 Jul 2006 15:18:12 -0400
Subject: [BioPython] import Standalone problems
In-Reply-To:
References: <44B4FF41.9070608@gmail.com>
<44B58141.5080804@maubp.freeserve.co.uk>
<44B639D6.2010503@gmail.com>
<44B6BC39.4050606@maubp.freeserve.co.uk>
<44BD473A.6030903@maubp.freeserve.co.uk>
<44BDB9B8.2060808@jjoseph.org>
Message-ID: <44BE8574.1020205@andrew.cmu.edu>
I do not believe the current version of the parser will work with
multiple queries using a recent version of BLAST, regardless of the output
format. I do know that XML output from blastall 2.2.13 parses correctly
with the corrections I attached previously. I have attached a further
updated NCBIXML.py, which fixes the performance issues in parse() that I
mentioned.
-Jacob
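
For readers following the thread, the accumulation reported earlier (records
for P1, then P1+P2, and so on) is what happens when a parser keeps one record
object alive across calls. The sketch below is a toy Python 3 illustration,
not the NCBIXML code: `Record`, `ReusingParser` and `FreshParser` are
hypothetical stand-ins, and each "query" is just a list of hit names.

```python
class Record:
    """Hypothetical stand-in for Record.Blast: only holds alignments."""

    def __init__(self):
        self.alignments = []


class ReusingParser:
    """Keeps a single Record for its whole lifetime (the problematic pattern)."""

    def __init__(self):
        self._record = Record()

    def parse(self, hits):
        for hit in hits:
            self._record.alignments.append(hit)
        return self._record


class FreshParser:
    """Creates a new Record on every parse() call."""

    def parse(self, hits):
        record = Record()
        for hit in hits:
            record.alignments.append(hit)
        return record


queries = [["P1"], ["P2"], ["P3"], ["P4"]]

reusing = ReusingParser()
# Copy the list each time, because the same Record object keeps growing.
stale = [list(reusing.parse(q).alignments) for q in queries]

fresh = FreshParser()
clean = [list(fresh.parse(q).alignments) for q in queries]

print(stale)  # each result also contains the hits of all earlier queries
print(clean)  # each result contains only its own hits
```

Creating a new record per call is the simplest way to avoid carrying state
over; whether that is fast enough for large BLAST runs is a separate question
that this toy example does not measure.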
Rohini Damle wrote:
> Hi,
> Can someone tell me which version of BLAST the Biopython (text or XML)
> parser works with?
> I will download that BLAST version locally and use Biopython's parser.
> Thanks,
> Rohini
-------------- next part --------------
A non-text attachment was scrubbed...
Name: NCBIXML.py.gz
Type: application/x-gzip
Size: 3209 bytes
Desc: not available
Url : http://lists.open-bio.org/pipermail/biopython/attachments/20060719/a4f5167e/attachment.gz
From rohini.damle at gmail.com Thu Jul 20 12:39:12 2006
From: rohini.damle at gmail.com (Rohini Damle)
Date: Thu, 20 Jul 2006 09:39:12 -0700
Subject: [BioPython] import Standalone problems
In-Reply-To: <44BE8574.1020205@andrew.cmu.edu>
References: <44B4FF41.9070608@gmail.com> <44B639D6.2010503@gmail.com>
<44B6BC39.4050606@maubp.freeserve.co.uk>
<44BD473A.6030903@maubp.freeserve.co.uk>
<44BDB9B8.2060808@jjoseph.org>
<44BE8574.1020205@andrew.cmu.edu>
Message-ID:
Hi,
When I tried your NCBIXML.py in place of the original one, I got the
following error message:
File "C:\Python24\lib\site-packages\Bio\Blast\NCBIXML.py", line 210,
in _end_Parameters_gap_open
self._blast.gap_penalties[0] = int(self._value)
TypeError: object does not support item assignment
In the original version there is no "[0]" after self._blast.gap_penalties.
What might be causing this error?
-Rohini
On 7/19/06, Jacob Joseph wrote:
> I do not believe the current version of the parser will work with
> multiple queries using recent version of blast, regardless of the output
> format. I do know that blastall 2.2.13 with XML functions with the
> parser corrections previously attached. I have attached a further
> updated NCBIXML.py, fixing the performance issues in parse() that I
> mentioned.
>
> -Jacob
>
> Rohini Damle wrote:
> > Hi,
> > Can someone suggest me for which version of Blast, the Biopython's
> > (text or xml) parser works fine?
> > I will download that blast version locally and can use biopython's parser.
> > thanx,
> > Rohini
> >
> > On 7/18/06, Jacob Joseph wrote:
> >> Hi.
> >> I encountered similar difficulties over the past few days myself and
> >> have made some improvements to the XML parser. Well, that is, it now
> >> functions with blastall, but I have made no effort to parse the other
> >> blast programs. I do not expect I have done any harm to other parsing,
> >> however.
> >>
> >> Attached are Record.py, NCBIStandalone.py, and NCBIXML.py. I have not
> >> yet spent significant time to clean up my changes. Without getting into
> >> specific modifications, I have made an effort to make consistent the
> >> variables in Record and NCBIXML, focusing primarily on what I needed
> >> this week.
> >>
> >> One portion I am not settled on reinitialization of Record.Blast at
> >> every call to iterator.next(), and, by extension, BlastParser.parse().
> >> See NCBIXML.py, line 114. Without re-initializing this class, we run
> >> the risk of retaining portions of a Record from previously parsed
> >> queries. This causes the bug 1970, mentioned below. Unfortunately,
> >> this re-initialization exacts a significant performance penalty of at
> >> least a factor of 10 by some rough measures. I would appreciate any
> >> suggestions for improvement here.
> >>
> >> I do apologize for not being more specific about my changes. When I get
> >> a chance(next week?), I will package them up as a proper patch and file
> >> a bug. Perhaps what I have done so far will be of use until then.
> >>
> >> fyi, I have done all of my testing with Blast 2.2.13. 2.2.14 seems to
> >> not have separate blocks within its output, requiring a different
> >> method of iteration.
> >>
> >> -Jacob
> >>
> >> Peter wrote:
> >> > Rohini Damle wrote:
> >> >> Hi,
> >> >> I have a XML file with 4 blast records (for proteins P1, P2, P3, P4)
> >> >> I am trying to extract alignment information for each of them.
> >> >> So I wrote the following code:
> >> >>
> >> >> for b_record in b_iterator :
> >> >>
> >> >> E_VALUE_THRESH =20
> >> >> for alignment in b_record.alignments:
> >> >> for hsp in alignment.hsps:
> >> >> if hsp.expect< E_VALUE_THRESH:
> >> >>
> >> >> print '****Alignment****'
> >> >> print 'sequence:',
> >> alignment.title.split()[0]
> >> >>
> >> >> With this code, I am getting information for P1,
> >> >> then information for P1 + P2
> >> >> then for P1+P2 +P3
> >> >> and finally for P1+P2+P3+P4
> >> >> why this is so?
> >> >> is there something wrong with the looping?
> >> >
> >> > I'm aware of something funny with the XML parsing, Bug 1970, which
> >> might
> >> > well be the same issue:
> >> >
> >> > http://bugzilla.open-bio.org/show_bug.cgi?id=1970
> >> >
> >> > I confess I haven't looked into exactly what is going wrong here - too
> >> > many other demands on my time to learn about XML and how BioPython
> >> > parses it.
> >> >
> >> > Does the work around on the bug report help? Depending on which
> >> version
> >> > of standalone blast you have installed, you might have better luck with
> >> > plain text output - the trouble is this is a moving target and the NBCI
> >> > keeps tweaking it.
> >> >
> >> > Peter
> >> >
> >> > _______________________________________________
> >> > BioPython mailing list - BioPython at lists.open-bio.org
> >> > http://lists.open-bio.org/mailman/listinfo/biopython
From jmjoseph at andrew.cmu.edu Thu Jul 20 14:08:15 2006
From: jmjoseph at andrew.cmu.edu (Jacob Joseph)
Date: Thu, 20 Jul 2006 14:08:15 -0400
Subject: [BioPython] import Standalone problems
In-Reply-To:
References: <44B4FF41.9070608@gmail.com> <44B639D6.2010503@gmail.com>
<44B6BC39.4050606@maubp.freeserve.co.uk>
<44BD473A.6030903@maubp.freeserve.co.uk>
<44BDB9B8.2060808@jjoseph.org>
<44BE8574.1020205@andrew.cmu.edu>
Message-ID: <44BFC68F.8030802@andrew.cmu.edu>
Hi. I suspect you are not using my updated Record.py. You'll notice
that, at least for the moment, I have changed _blast.gap_penalties to an
array to allow assignment per item without worrying about the order of
entries within the xml file. There are other ways this could be
accomplished while still using a tuple.
-Jacob
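
The tuple-versus-list point above can be sketched as follows. This is a minimal illustration, not the code from the attached Record.py: the class BlastRecord and both helper functions are hypothetical stand-ins, and only the gap_penalties attribute name comes from the thread.

```python
class BlastRecord:
    """Hypothetical stand-in for the record class; only gap_penalties matters."""

    def __init__(self, mutable):
        # (gap open, gap extend); None until the parser sees each XML element
        self.gap_penalties = [None, None] if mutable else (None, None)


def set_gap_open(record, value):
    """Item assignment: works for a list, raises TypeError for a tuple."""
    record.gap_penalties[0] = int(value)


def set_gap_open_tuple(record, value):
    """Tuple-friendly alternative: rebuild the pair, keeping the other slot."""
    record.gap_penalties = (int(value), record.gap_penalties[1])


# With a list, the parser can fill either slot in whatever order the XML gives.
list_record = BlastRecord(mutable=True)
set_gap_open(list_record, "11")

# With a tuple, item assignment fails, as in the traceback quoted below.
tuple_record = BlastRecord(mutable=False)
try:
    set_gap_open(tuple_record, "11")
    tuple_error = None
except TypeError as exc:
    tuple_error = exc

# Rebuilding the tuple is one way to keep the tuple type.
set_gap_open_tuple(tuple_record, "11")
```

Mixing an old Record.py (tuple) with the new NCBIXML.py (item assignment) would be expected to produce exactly this kind of TypeError.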
Rohini Damle wrote:
> Hi,
> When I tried your NCBIXML.py code instead of the original one, I am
> getting the following error message:
>
> File "C:\Python24\lib\site-packages\Bio\Blast\NCBIXML.py", line 210,
> in _end_Parameters_gap_open
> self._blast.gap_penalties[0] = int(self._value)
> TypeError: object does not support item assignment
>
> in the original version
> we don't have that " [0] " in self._blast.gap_penalties
>
> what might be causing this error?
> -Rohini
>
> On 7/19/06, Jacob Joseph wrote:
>> I do not believe the current version of the parser will work with
>> multiple queries using recent version of blast, regardless of the output
>> format. I do know that blastall 2.2.13 with XML functions with the
>> parser corrections previously attached. I have attached a further
>> updated NCBIXML.py, fixing the performance issues in parse() that I
>> mentioned.
>>
>> -Jacob
>>
>> Rohini Damle wrote:
>> > Hi,
>> > Can someone tell me which version of Blast the Biopython
>> > (text or xml) parser works fine with?
>> > I will download that blast version locally so that I can use
>> > biopython's parser.
>> > thanx,
>> > Rohini
From rohini.damle at gmail.com Thu Jul 20 14:59:52 2006
From: rohini.damle at gmail.com (Rohini Damle)
Date: Thu, 20 Jul 2006 11:59:52 -0700
Subject: [BioPython] import Standalone problems
In-Reply-To: <44BFC68F.8030802@andrew.cmu.edu>
References: <44B4FF41.9070608@gmail.com>
<44B6BC39.4050606@maubp.freeserve.co.uk>
<44BD473A.6030903@maubp.freeserve.co.uk>
<44BDB9B8.2060808@jjoseph.org>
<44BE8574.1020205@andrew.cmu.edu>
<44BFC68F.8030802@andrew.cmu.edu>
Message-ID:
Hi,
Now I used your updated Record.py, NCBIXML.py and NcbiStandalone.py
(all updated)
I am not getting that previous error.
BUT I am still not getting the desired output ...
Here is my code
blast_out = open("C:/Documents and Settings/rdamle/My Documents/Rohini's Documents/Blast Parsing/onlymouse4proteinblastout.xml", "r")
b_parser = NCBIXML.BlastParser()
b_iterator = NCBIStandalone.Iterator(blast_out, b_parser)
E_VALUE_THRESH = 22
for b_record in b_iterator :
    for alignment in b_record.alignments:
        for hsp in alignment.hsps:
            if (hsp.expect< E_VALUE_THRESH):
                print b_record.query.split()[0]
                print '****Alignment****'
                print 'sequence:', alignment.title.split()[0]
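with this code I was expecting to get all the alignments with
hsp.expect < 22,
BUT I AM GETTING ALL the alignments not just the one with evalue <22
-Rohini.

On 7/20/06, Jacob Joseph wrote: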
> Hi. I suspect you are not using my updated Record.py. You'll notice
> that, at least for the moment, I have changed _blast.gap_penalties to an
> array to allow assignment per item without worrying about the order of
> entries within the xml file. There are other ways this could be
> accomplished while still using a tuple.
>
> -Jacob
From rohini.damle at gmail.com Thu Jul 20 15:05:14 2006
From: rohini.damle at gmail.com (Rohini Damle)
Date: Thu, 20 Jul 2006 12:05:14 -0700
Subject: [BioPython] import Standalone problems
In-Reply-To:
References: <44B4FF41.9070608@gmail.com>
<44BD473A.6030903@maubp.freeserve.co.uk>
<44BDB9B8.2060808@jjoseph.org>
<44BE8574.1020205@andrew.cmu.edu>
<44BFC68F.8030802@andrew.cmu.edu>
Message-ID:
Hi,
I used hsp.evalue instead of hsp.expect and I am getting the desired output.
Thank you very much for your help, efforts, and all those modified files.
Rohini
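
A note on why the hsp.expect test may have let everything through. The thread does not confirm the cause, so this is an assumption: if the patched parser stored the value in hsp.evalue and left hsp.expect at None, then under Python 2 the comparison None < 22 is always true, and every HSP passes the filter. The sketch below (Python 3, with a hypothetical FakeHSP class that is not Biopython's own) makes a missing value an explicit error instead of a silent match.

```python
class FakeHSP:
    """Hypothetical stand-in for a parsed HSP; not Biopython's class."""

    def __init__(self, expect=None, evalue=None):
        self.expect = expect
        self.evalue = evalue


def passes_threshold(hsp, threshold, attr="evalue"):
    """Return True if the named e-value attribute is set and below threshold."""
    value = getattr(hsp, attr, None)
    if value is None:
        # Fail loudly rather than let an unset attribute slip through the filter.
        raise ValueError("HSP has no %r value; check which attribute "
                         "the parser fills in" % attr)
    return value < threshold


hsps = [FakeHSP(evalue=1e-30), FakeHSP(evalue=50.0)]
kept = [h for h in hsps if passes_threshold(h, 22)]
```

Which attribute name applies depends on the parser version in use, so it is worth printing one HSP's attributes before relying on either name.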
On 7/20/06, Rohini Damle wrote:
> Hi,
> Now I used your updated Record.py, NCBIXML.py and NcbiStandalone.py
> (all updated)
> I am not getting that previous error.
> BUT I am still not getting the desired output ...
> Here is my code
>
> blast_out = open("C:/Documents and Settings/rdamle/My
> Documents/Rohini's Documents/Blast
> Parsing/onlymouse4proteinblastout.xml", "r")
>
> b_parser = NCBIXML.BlastParser()
> b_iterator = NCBIStandalone.Iterator(blast_out, b_parser)
> E_VALUE_THRESH = 22
>
> for b_record in b_iterator :
> for alignment in b_record.alignments:
> for hsp in alignment.hsps:
> if (hsp.expect< E_VALUE_THRESH):
> print b_record.query.split()[0]
> print '****Alignment****'
> print 'sequence:',
> alignment.title.split()[0]
>
>
> with this code I was expecting to get all the alignments with
> hsp.expect < 22,
> BUT I AM GETTING ALL the alignments not just the one with evalue <22
> -Rohini.
From jmjoseph at andrew.cmu.edu Thu Jul 20 16:05:14 2006
From: jmjoseph at andrew.cmu.edu (Jacob Joseph)
Date: Thu, 20 Jul 2006 16:05:14 -0400
Subject: [BioPython] import Standalone problems
In-Reply-To:
References: <44B4FF41.9070608@gmail.com>
<44BD473A.6030903@maubp.freeserve.co.uk>
<44BDB9B8.2060808@jjoseph.org>
<44BE8574.1020205@andrew.cmu.edu>
<44BFC68F.8030802@andrew.cmu.edu>
Message-ID: <44BFE1FA.4040500@andrew.cmu.edu>
Great!
Can someone point me to the current maintainer of the Blast parsing package?
-Jacob
Rohini Damle wrote:
> Hi,
> I used hsp.evalue instead of hsp.expect and I am getting the desired
> output.
>
> Thank you very much for your help, efforts, and all those modified files.
> Rohini
From karbak at gmail.com Sun Jul 23 02:12:51 2006
From: karbak at gmail.com (K. Arun)
Date: Sun, 23 Jul 2006 02:12:51 -0400
Subject: [BioPython] KDTree optional compilation
Message-ID: <162452a10607222312j70ca3ae7h103a8d36c04427@mail.gmail.com>
Hello,
I just updated to version 1.42 (thanks !), and noticed I had to
uncomment the KDTree entry in setup.py in the NUMPY_PACKAGES section
to use Bio.PDB.NeighborSearch. Since KDTree is essential to this
module, would it be possible to add a note about this in the README, or
print a message during compilation to the effect that setup.py needs
to be modified?
-arun
From mdehoon at c2b2.columbia.edu Sun Jul 23 12:11:52 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sun, 23 Jul 2006 12:11:52 -0400
Subject: [BioPython] import Standalone problems
In-Reply-To: <44BFE1FA.4040500@andrew.cmu.edu>
References: <44B4FF41.9070608@gmail.com> <44BD473A.6030903@maubp.freeserve.co.uk> <44BDB9B8.2060808@jjoseph.org> <44BE8574.1020205@andrew.cmu.edu> <44BFC68F.8030802@andrew.cmu.edu>
<44BFE1FA.4040500@andrew.cmu.edu>
Message-ID: <44C39FC8.3060002@c2b2.columbia.edu>
Jacob Joseph wrote:
> Great!
>
> Can someone point me to the current maintainer of the Blast parsing package?
>
> -Jacob
>
As far as I know, we don't have an official maintainer of the Blast
package. So your best bet is to submit a patch through bugzilla.
--Michiel.
From jdiezperezj at gmail.com Mon Jul 24 11:01:49 2006
From: jdiezperezj at gmail.com (Javier Díez)
Date: Mon, 24 Jul 2006 17:01:49 +0200
Subject: [BioPython] parse xml from flybase ?
In-Reply-To:
References:
Message-ID:
Hi everybody:
I'm new to Biopython (congratulations to everyone involved in its
development).
For the first time I'm trying to parse some XML documents from FlyBase, but
I could not fully parse any big document.
I am using SAX, because the documents I am parsing are bigger than
70-100 MB.
My parser always finds something ill-formed in big documents, so I also wrote a
little script to validate XML documents (in fact I copied it almost
completely from an online manual), and it tells me that they are not
well-formed either. I think that both scripts work, because I tested them with
well-formed XML documents and they worked fine.
So, I have two questions:
Has anybody worked with FlyBase report XML documents?
Second, could you tell me if there is any Python package to automate
bulk data retrieval (or big queries) from FlyBase?
Thank you
Javi
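
For readers trying the same check, a well-formedness test with the standard library's SAX parser could look like the sketch below. It streams the input, so file size should not be a problem, and it reports the position of the first error. The function name first_xml_error is made up for this example, and nothing here is specific to FlyBase.

```python
import io
import xml.sax


def first_xml_error(stream):
    """Return None if the XML is well-formed, else (line, column, message)."""
    try:
        # A bare ContentHandler ignores all content; we only want parse errors.
        xml.sax.parse(stream, xml.sax.ContentHandler())
    except xml.sax.SAXParseException as err:
        return (err.getLineNumber(), err.getColumnNumber(), err.getMessage())
    return None


good = first_xml_error(io.BytesIO(b"<a><b>text</b></a>"))
bad = first_xml_error(io.BytesIO(b"<a><b>text</a>"))  # <b> is never closed
```

For a real file, pass an open binary handle, for example first_xml_error(open("report.xml", "rb")), where report.xml is a placeholder name. The reported line and column show where to look in the large document.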
From jmjoseph at andrew.cmu.edu Mon Jul 24 16:27:39 2006
From: jmjoseph at andrew.cmu.edu (Jacob Joseph)
Date: Mon, 24 Jul 2006 16:27:39 -0400
Subject: [BioPython] import Standalone problems
In-Reply-To: <44C39FC8.3060002@c2b2.columbia.edu>
References: <44B4FF41.9070608@gmail.com> <44BD473A.6030903@maubp.freeserve.co.uk> <44BDB9B8.2060808@jjoseph.org> <44BE8574.1020205@andrew.cmu.edu> <44BFC68F.8030802@andrew.cmu.edu>
<44BFE1FA.4040500@andrew.cmu.edu>
<44C39FC8.3060002@c2b2.columbia.edu>
Message-ID: <44C52D3B.4050704@andrew.cmu.edu>
Okay. I've started bug 2051:
http://bugzilla.open-bio.org/show_bug.cgi?id=2051
I would greatly appreciate any comments, additions, and suggestions.
-Jacob
Michiel de Hoon wrote:
> Jacob Joseph wrote:
>> Great!
>>
>> Can someone point me to the current maintainer of the Blast parsing
>> package?
>>
>> -Jacob
>>
>
> As far as I know, we don't have an official maintainer of the Blast
> package. So your best bet is to submit a patch through bugzilla.
>
> --Michiel.
From manickam.muthuraman at wur.nl Wed Jul 26 11:12:12 2006
From: manickam.muthuraman at wur.nl (Muthuraman, Manickam)
Date: Wed, 26 Jul 2006 17:12:12 +0200
Subject: [BioPython] blast for more than one sequence
References: <44B4FF41.9070608@gmail.com> <44BD473A.6030903@maubp.freeserve.co.uk> <44BDB9B8.2060808@jjoseph.org> <44BE8574.1020205@andrew.cmu.edu> <44BFC68F.8030802@andrew.cmu.edu>
<44BFE1FA.4040500@andrew.cmu.edu>
<44C39FC8.3060002@c2b2.columbia.edu>
Message-ID: <4CDD243B32D07748944828EA7A29E4A3E2AFCD@salte0008.wurnet.nl>
Hi all,
I have tried to run BLAST over the internet. I can succeed with one sequence, but I need to BLAST more than one sequence over the internet.
For example,
m_cold.fasta has 5 sequences, but the code given in the book only works for the first sequence and not for the rest:
file_for_blast=open('m_cold.fasta','r')
f_iterator=Fasta.Iterator(file_for_blast)
#Does the above line loop through the file and get each sequence for blasting? If so, let me know how.
I also tried another way, but I do not want to confuse you by sending the whole script.
from
manickam
From biopython at maubp.freeserve.co.uk Wed Jul 26 11:51:03 2006
From: biopython at maubp.freeserve.co.uk (Peter (BioPython List))
Date: Wed, 26 Jul 2006 16:51:03 +0100
Subject: [BioPython] blast for more than one sequence
In-Reply-To: <4CDD243B32D07748944828EA7A29E4A3E2AFCD@salte0008.wurnet.nl>
References: <44B4FF41.9070608@gmail.com> <44BD473A.6030903@maubp.freeserve.co.uk> <44BDB9B8.2060808@jjoseph.org> <44BE8574.1020205@andrew.cmu.edu> <44BFC68F.8030802@andrew.cmu.edu> <44BFE1FA.4040500@andrew.cmu.edu> <44C39FC8.3060002@c2b2.columbia.edu>
<4CDD243B32D07748944828EA7A29E4A3E2AFCD@salte0008.wurnet.nl>
Message-ID: <44C78F67.9090901@maubp.freeserve.co.uk>
Muthuraman, Manickam wrote:
>
> hai all
>
> I have tried to blast through the internet, i can suceed with one
> sequence but i need to blast more than one sequences through
> internet.
I think you can only blast one sequence at a time. If you really want
to do many at once, then I personally would use standalone blast and
give it the fasta file as input.
> for example
> m_cold.fasta has 5 sequences but the code which is been stated in book
> can do only for the first sequence and rest it does not
>
> file_for_blast=open('m_cold.fasta','r')
> f_iterator=Fasta.Iterator(file_for_blast)
> #does the above line loop through the file and get each sequence for
> blasting, if so let me know how
You could use the iterator to get each FASTA sequence in turn, and run a
separate blast for it. Something like this...
from Bio import Fasta
from Bio.Blast import NCBIWWW
file_for_blast=open('m_cold.fasta','r')
f_iterator=Fasta.Iterator(file_for_blast)
for f_record in f_iterator :
    #do blast with f_record...
    #using NCBIWWW.qblast('blastn', 'nr', f_record)
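The "one search per record" pattern can also be sketched without any Biopython imports. The helpers iter_fasta and blast_each below are hypothetical names, and the run_blast callback is only a stand-in for whatever actually submits the query; nothing here is Biopython API. The sketch is written for Python 3.

```python
from io import StringIO


def iter_fasta(handle):
    """Yield (title, sequence) for each record in a FASTA file handle."""
    title, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif line and title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


def blast_each(handle, run_blast):
    """Call run_blast once per record and collect the results in order."""
    return [run_blast(title, seq) for title, seq in iter_fasta(handle)]


example = StringIO(">seq1\nACGT\nAC\n>seq2\nGGCC\n")
# Stand-in for a real BLAST call: just report each sequence length.
results = blast_each(example, lambda title, seq: (title, len(seq)))
print(results)
```

The point of the callback is that each record triggers its own independent search, so a five-sequence file yields five results rather than only the first.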
Peter
From persikan at gmail.com Thu Jul 27 15:16:10 2006
From: persikan at gmail.com (Anton)
Date: Thu, 27 Jul 2006 19:16:10 +0000 (UTC)
Subject: [BioPython] KDTree optional compilation
References: <162452a10607222312j70ca3ae7h103a8d36c04427@mail.gmail.com>
Message-ID:
K. Arun gmail.com> writes:
>
> Hello,
>
> I just updated to version 1.42 (thanks !), and noticed I had to
> uncomment the KDTree entry in setup.py in the NUMPY_PACKAGES section
> to use Bio.PDB.NeighborSearch. Since KDTree is essential to this
> module, would it be possible to add a note about this in the README, or
> print a message during compilation to the effect that setup.py needs
> to be modified ?
>
> -arun
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
>
I installed BioPython using the Windows installer, and as a result the
NeighborSearch module does not work (it requires KDTree). There was no way
to add this module during the installation.
Can I install KDTree separately, or do I have to reinstall BioPython from the
source code after uncommenting the KDTree entry?
I would like to avoid installing from source.
Any suggestions?
Thanks!
From mdehoon at c2b2.columbia.edu Thu Jul 27 18:01:32 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Thu, 27 Jul 2006 18:01:32 -0400
Subject: [BioPython] KDTree optional compilation
In-Reply-To:
References: <162452a10607222312j70ca3ae7h103a8d36c04427@mail.gmail.com>
Message-ID: <44C937BC.7020505@c2b2.columbia.edu>
Anton wrote:
> K. Arun gmail.com> writes:
>> I just updated to version 1.42 (thanks !), and noticed I had to
>> uncomment the KDTree entry in setup.py in the NUMPY_PACKAGES section
>> to use Bio.PDB.NeighborSearch. Since KDTree is essential to this
>> module, would it be possible to add a note about this in the README, or
>> print a message during compilation to the effect that setup.py needs
>> to be modified ?
Yes. Can you write a patch for setup.py?
> I installed my BioPython using the Windows installer. Therefore the
> NeighborSearch module is not working (requires KDTree). In fact there was no way
> during the installation to add this module.
I thought I had included KDTree with the Windows installer, but it
turned out that setup.py automatically skips KDTree on Windows. This is
to avoid compilation problems with the C++ code in KDTree. It works on
some platforms, but not on others.
Luckily, MinGW can compile C++ code for Python 2.4 (but not for Python
2.3). I put a new Windows installer for Python 2.4 on the website (same
file name, but file size is now 1,074 Kb).
--Michiel.
--
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032
From idoerg at burnham.org Mon Jul 3 17:52:44 2006
From: idoerg at burnham.org (Iddo Friedberg)
Date: Mon, 03 Jul 2006 10:52:44 -0700
Subject: [BioPython] [Fwd: [OBF] Call For Birds of a Feather Suggestions]
Message-ID: <44A9596C.90208@burnham.org>
The BOSC organizing committee is currently seeking suggestions for Birds
of a Feather meeting ideas. Birds of a Feather meetings are one of the
more popular activities at BOSC, occurring at the end of each day's
session. These are free-form meetings organized by the attendees
themselves to discuss one or a few topics of interest in greater detail.
BOFs have been formed to allow developers and users of individual OBF
software to meet each other face-to-face to discuss the project, or to
discuss completely new ideas, and even start new software development
projects. These meetings offer a unique opportunity for individuals to
explore more about the activities of the various Open Source Projects,
and, in some cases, even take an active role influencing the future of
Open Source Software development. If you would like to create a BOF,
just sign up for a wiki account, log in, and edit the BOSC 2006 Birds of
a Feather page.
_______________________________________________
Open-Bioinformatics-Foundation mailing list
Open-Bioinformatics-Foundation at lists.open-bio.org
This is a broadcast-only announce list used to distribute emails to people who subscribe to OBF hosted email discussion or announce lists. To prevent our most active members from getting many duplicate copies of important announcements we created this list today so that only one email gets sent to each subscribed email address. You do not need to subscribe/unsubscribe from this list. Problems or Concerns? -- send an email to the OBF mailteam at: mailteam at open-bio.org
--
Iddo Friedberg, Ph.D.
Burnham Institute for Medical Research
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
Tel: +1 (858) 646 3100 x3516
Fax: +1 (858) 713 9949
http://iddo-friedberg.org
http://BioFunctionPrediction.org
From g.abraham at ms.unimelb.edu.au Sat Jul 8 09:12:34 2006
From: g.abraham at ms.unimelb.edu.au (Gad Abraham)
Date: Sat, 08 Jul 2006 19:12:34 +1000
Subject: [BioPython] PDBConstructionException message
Message-ID: <44AF7702.2040808@ms.unimelb.edu.au>
Hi,
I apologise in advance if this question is trivial, but I'm a Python and
BioPython novice, and the mailing list search facility isn't working :)
I'm using Bio.PDB to parse the PDB file 1LUC, and write output to stdout
(running python inside a script, not shown here).
PDBParser seems to be throwing an exception, and its message goes to
stdout, even though I'm trying to catch it:
Python 2.4.3 (#2, Apr 27 2006, 14:43:58)
[GCC 4.0.3 (Ubuntu 4.0.3-1ubuntu5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.PDB import *
>>> parser = PDBParser()
>>> try:
...     s = parser.get_structure('1LUC', 'pdb1luc')
... except:
...     print "Caught exception"
...
PDBConstructionException: Atom O defined twice in residue at line 2930.
Exception ignored.
Some atoms or residues will be missing in the data structure.
>>>
(biopython 1.41)
How can I either catch this exception or prevent the message from being
printed to stdout?
Thanks,
Gad
--
Gad Abraham
Department of Mathematics and Statistics
University of Melbourne
Victoria 3010, Australia
email: g.abraham at ms.unimelb.edu.au
web: http://www.ms.unimelb.edu.au/~gabraham
From mdehoon at c2b2.columbia.edu Sat Jul 8 17:42:41 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sat, 08 Jul 2006 13:42:41 -0400
Subject: [BioPython] PDBConstructionException message
In-Reply-To: <44AF7702.2040808@ms.unimelb.edu.au>
References: <44AF7702.2040808@ms.unimelb.edu.au>
Message-ID: <44AFEE91.6060404@c2b2.columbia.edu>
From looking at the Bio.PDB.PDBParser source code, it appears that
>>> parser = PDBParser(PERMISSIVE=0)
should solve your problem.
Use
>>> from Bio.PDB import *
>>> help(PDBParser)
for more information.
--Michiel.
Gad Abraham wrote:
> Hi,
>
> I apologise in advance if this question is trivial, but I'm a Python and
> BioPython novice, and the mailing list search facility isn't working :)
>
> I'm using Bio.PDB to parse the PDB file 1LUC, and write output to stdout
> (running python inside a script, not shown here).
>
>
>
> PDBParser seems to be throwing an exception, and its message goes to
> stdout, even though I'm trying to catch it:
>
> Python 2.4.3 (#2, Apr 27 2006, 14:43:58)
> [GCC 4.0.3 (Ubuntu 4.0.3-1ubuntu5)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from Bio.PDB import *
> >>> parser = PDBParser()
> >>> try:
> ...     s = parser.get_structure('1LUC', 'pdb1luc')
> ... except:
> ...     print "Caught exception"
> ...
> PDBConstructionException: Atom O defined twice in residue het= resseq=355 icode= > at line 2930.
> Exception ignored.
> Some atoms or residues will be missing in the data structure.
> >>>
>
> (biopython 1.41)
>
>
>
> How can I either catch this exception or prevent the message from being
> printed to stdout?
>
>
> Thanks,
> Gad
>
From g.abraham at ms.unimelb.edu.au Mon Jul 10 10:12:34 2006
From: g.abraham at ms.unimelb.edu.au (Gad Abraham)
Date: Mon, 10 Jul 2006 20:12:34 +1000
Subject: [BioPython] PDBConstructionException message
In-Reply-To: <44AFEE91.6060404@c2b2.columbia.edu>
References: <44AF7702.2040808@ms.unimelb.edu.au>
<44AFEE91.6060404@c2b2.columbia.edu>
Message-ID: <44B22812.40004@ms.unimelb.edu.au>
Michiel de Hoon wrote:
> From looking at the Bio.PDB.PDBParser source code, it appears that
> >>> parser = PDBParser(PERMISSIVE=0)
> should solve your problem.
>
> Use
> >>> from Bio.PDB import *
> >>> help(PDBParser)
> for more information.
>
> --Michiel.
>
The docs say:
o PERMISSIVE - int, if this is 0 exceptions in constructing the
SMCRA data structure are fatal. If 1 (DEFAULT), the exceptions are
caught, but some residues or atoms will be missing. THESE EXCEPTIONS
ARE DUE TO PROBLEMS IN THE PDB FILE!.
however, I don't mind if the chains are somewhat broken, as long as I
can catch the exception and prevent it from printing to stdout (which is
what the except clause is trying to do).
Why doesn't the except clause catch the exception?
Thanks,
Gad
--
Gad Abraham
Department of Mathematics and Statistics
University of Melbourne
Victoria 3010, Australia
email: g.abraham at ms.unimelb.edu.au
web: http://www.ms.unimelb.edu.au/~gabraham
From f.schlesinger at iu-bremen.de Mon Jul 10 12:33:59 2006
From: f.schlesinger at iu-bremen.de (Felix Schlesinger)
Date: Mon, 10 Jul 2006 14:33:59 +0200
Subject: [BioPython] PDBConstructionException message
In-Reply-To: <7317d50c0607100424y17559da1x4eff48d180c4e3b@mail.gmail.com>
References: <44AF7702.2040808@ms.unimelb.edu.au>
<44AFEE91.6060404@c2b2.columbia.edu>
<44B22812.40004@ms.unimelb.edu.au>
<7317d50c0607100424y17559da1x4eff48d180c4e3b@mail.gmail.com>
Message-ID: <7317d50c0607100533p746d7d29lc8c0d5e5c43aa2e6@mail.gmail.com>
Hi.
> > >>> parser = PDBParser(PERMISSIVE=0)
> > should solve your problem.
> Why doesn't the except clause catch the exception?
From a brief look, it seems that with PERMISSIVE=1 the exception is
caught inside the parser and handled by printing a warning. If you
switch to PERMISSIVE=0 the exception will propagate and you can catch
it.
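To make the difference concrete, here is a toy sketch of the pattern being described. ToyParser and ConstructionError are invented stand-ins, not the real Bio.PDB classes, and the duplicate-atom check is only there to have something to raise. It is written for Python 3.

```python
class ConstructionError(Exception):
    """Stand-in for PDBConstructionException (hypothetical)."""


class ToyParser:
    def __init__(self, permissive=True):
        self.permissive = permissive

    def parse(self, atoms):
        seen, kept = set(), []
        for name in atoms:
            try:
                if name in seen:
                    raise ConstructionError("Atom %s defined twice" % name)
                seen.add(name)
                kept.append(name)
            except ConstructionError as err:
                if not self.permissive:
                    raise  # strict mode: the caller's try/except sees it
                # permissive mode: handled here, so the caller never sees it
                print("Warning: %s (ignored)" % err)
        return kept


permissive_result = ToyParser(permissive=True).parse(["N", "CA", "O", "O"])
try:
    ToyParser(permissive=False).parse(["N", "CA", "O", "O"])
    caught = None
except ConstructionError as err:
    caught = str(err)
print(permissive_result, caught)
```

In permissive mode the exception is raised and caught entirely inside parse(), so an except clause wrapped around the call has nothing left to catch. Only the printed warning shows that anything happened.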
Felix
From biopython at maubp.freeserve.co.uk Mon Jul 10 11:25:06 2006
From: biopython at maubp.freeserve.co.uk (Peter (BioPython List))
Date: Mon, 10 Jul 2006 12:25:06 +0100
Subject: [BioPython] PDBConstructionException message
In-Reply-To: <44B22812.40004@ms.unimelb.edu.au>
References: <44AF7702.2040808@ms.unimelb.edu.au> <44AFEE91.6060404@c2b2.columbia.edu>
<44B22812.40004@ms.unimelb.edu.au>
Message-ID: <44B23912.2020006@maubp.freeserve.co.uk>
Hi Gad,
I assume you've read Michiel's post, but to recap:
If using PERMISSIVE=True (or 1 if you really like integers) then the
parser will catch "warning" exceptions internally, print them to screen,
and continue.
e.g.
from Bio.PDB.PDBParser import PDBParser
pdb = "1LUC"
filename = "1LUC.pdb"
parser = PDBParser(PERMISSIVE=True)
s = parser.get_structure(pdb, filename)
results in:
PDBConstructionException: Atom O defined twice in residue at line 2930.
Exception ignored.
Some atoms or residues will be missing in the data structure.
If you use the PERMISSIVE=False (or 0 if you like integers) then the
parser will stop at the exception and you can catch it:
from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.PDBExceptions import PDBConstructionException
pdb = "1LUC"
filename = "1LUC.pdb"
parser = PDBParser(PERMISSIVE=False)
try :
    s = parser.get_structure(pdb, filename)
except PDBConstructionException, message:
    print "Exception:"
    print message
Giving:
Exception:
Atom O defined twice in residue
at line 2930.
However, this doesn't actually load the whole PDB file - which I think
you wanted, based on your original question.
Gad Abraham wrote:
> I'm using Bio.PDB to parse the PDB file 1LUC, and write output to
> stdout (running python inside a script, not shown here).
I think you want to use the permissive parser (i.e. accept this invalid
but understandable file) WITHOUT printing the warning to screen.
This isn't currently possible - you will have to edit the module to do
this yourself. Open /Bio/PDB/PDBParser.py and find the section
_handle_PDB_exception and change this bit (edited for line widths):
if self.PERMISSIVE:
    # just print a warning - some residues/atoms will be missing
    print "PDBConstructionException: %s" % message
    print "Exception ignored.\nSome atoms or residues will be ..."
else:
    # exceptions are fatal - raise again with new message (...)
    raise PDBConstructionException, message
To something like this:
if self.PERMISSIVE:
    #QUICK HACK, don't print a warning - just carry on.
    #Note some residues/atoms will be missing
    pass
else:
    # exceptions are fatal - raise again with new message (...)
    raise PDBConstructionException, message
Thinking long term, maybe we should add an option not to print these
warnings... or send to stderr instead of stdout perhaps?
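One possible shape for such an option, sketched with the standard warnings module rather than any existing Bio.PDB code: messages issued through warnings.warn go to stderr by default, and the caller can silence or collect them. The names ConstructionWarning, report_problem, parse_quietly and parse_and_collect are all hypothetical. Python 3 syntax.

```python
import warnings


class ConstructionWarning(UserWarning):
    """Hypothetical warning category for recoverable parse problems."""


def report_problem(message):
    # warnings.warn writes to stderr by default and respects filters.
    warnings.warn(message, ConstructionWarning)


def parse_quietly(problems):
    """Run the 'parser' while suppressing only ConstructionWarning."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", ConstructionWarning)
        for message in problems:
            report_problem(message)
    return len(problems)


def parse_and_collect(problems):
    """Run the 'parser' and return the warning texts instead of printing."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always", ConstructionWarning)
        for message in problems:
            report_problem(message)
    return [str(w.message) for w in caught]


handled = parse_quietly(["Atom O defined twice"])
collected = parse_and_collect(["Atom O defined twice"])
print(handled, collected)
```

This keeps stdout clean for the script's real output and leaves the decision about the warnings to the caller, with no edits to the installed module.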
Peter
From g.abraham at ms.unimelb.edu.au Mon Jul 10 13:00:41 2006
From: g.abraham at ms.unimelb.edu.au (Gad Abraham)
Date: Mon, 10 Jul 2006 23:00:41 +1000
Subject: [BioPython] PDBConstructionException message
In-Reply-To: <44B23912.2020006@maubp.freeserve.co.uk>
References: <44AF7702.2040808@ms.unimelb.edu.au>
<44AFEE91.6060404@c2b2.columbia.edu>
<44B22812.40004@ms.unimelb.edu.au>
<44B23912.2020006@maubp.freeserve.co.uk>
Message-ID: <44B24F79.4020501@ms.unimelb.edu.au>
Peter (BioPython List) wrote:
> I think you want to use the permissive parser (i.e. accept this invalid
> but understandable file) WITHOUT printing the warning to screen.
That's right.
>
> This isn't currently possible - you will have to edit the module to do
> this yourself. Open /Bio/PDB/PDBParser.py and find the section
> _handle_PDB_exception and change this bit (edited for line widths):
>
> if self.PERMISSIVE:
>     # just print a warning - some residues/atoms will be missing
>     print "PDBConstructionException: %s" % message
>     print "Exception ignored.\nSome atoms or residues will be ..."
> else:
>     # exceptions are fatal - raise again with new message (...)
>     raise PDBConstructionException, message
>
> To something like this:
>
> if self.PERMISSIVE:
>     #QUICK HACK, don't print a warning - just carry on.
>     #Note some residues/atoms will be missing
>     pass
> else:
>     # exceptions are fatal - raise again with new message (...)
>     raise PDBConstructionException, message
>
> Thinking long term, maybe we should add an option not to print these
> warnings... or send to stderr instead of stdout perhaps?
>
I was just about to send off a message suggesting using sys.stderr.write
instead of print in PDBParser :)
After editing PDBParser.py as you suggested, I still get the same
behaviour (under Ubuntu it's at
/usr/lib/python2.4/site-packages/Bio/PDB/PDBParser.py).
Do I need to generate new pyc/pyo files?
Thanks,
Gad
--
Gad Abraham
Department of Mathematics and Statistics
University of Melbourne
Victoria 3010, Australia
email: g.abraham at ms.unimelb.edu.au
web: http://www.ms.unimelb.edu.au/~gabraham
From g.abraham at ms.unimelb.edu.au Mon Jul 10 13:18:15 2006
From: g.abraham at ms.unimelb.edu.au (Gad Abraham)
Date: Mon, 10 Jul 2006 23:18:15 +1000
Subject: [BioPython] PDBConstructionException message
In-Reply-To: <44B24F79.4020501@ms.unimelb.edu.au>
References: <44AF7702.2040808@ms.unimelb.edu.au>
<44AFEE91.6060404@c2b2.columbia.edu>
<44B22812.40004@ms.unimelb.edu.au>
<44B23912.2020006@maubp.freeserve.co.uk>
<44B24F79.4020501@ms.unimelb.edu.au>
Message-ID: <44B25397.8050804@ms.unimelb.edu.au>
Gad Abraham wrote:
> After editing PDBParser.py as you suggested, I still get the same
> behaviour (under Ubuntu it's at
> /usr/lib/python2.4/site-packages/Bio/PDB/PDBParser.py).
> Do I need to generate new pyc/pyo files?
>
> Thanks,
> Gad
>
Please ignore my previous question - I found I had another biopython
installation on the PYTHONPATH which was getting in the way. It works
fine now!
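For anyone hitting the same symptom, a quick way to see which copy of a package Python will actually import is sketched below. The helper names are made up, and "json" is used only because it ships with Python; substitute "Bio" on a machine with Biopython installed. Python 3 syntax.

```python
import importlib.util
import os
import sys


def module_location(name):
    """Return the file Python would load for `name`, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


def shadowing_candidates(name):
    """List every sys.path entry that contains a package or module `name`."""
    hits = []
    for entry in sys.path:
        base = os.path.join(entry or ".", name)
        if os.path.isdir(base) or os.path.isfile(base + ".py"):
            hits.append(entry)
    return hits


# The first hit in sys.path order is the one that wins at import time.
print(module_location("json"))
print(shadowing_candidates("json"))
```

If shadowing_candidates returns more than one directory, edits made to a copy further down the list will have no effect, which matches the behaviour described above.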
Thanks,
Gad
--
Gad Abraham
Department of Mathematics and Statistics
University of Melbourne
Victoria 3010, Australia
email: g.abraham at ms.unimelb.edu.au
web: http://www.ms.unimelb.edu.au/~gabraham
From admin.cluster at gmail.com Wed Jul 12 13:52:56 2006
From: admin.cluster at gmail.com (Anthony)
Date: Wed, 12 Jul 2006 15:52:56 +0200
Subject: [BioPython] (no subject)
Message-ID: <44B4FEB8.30807@gmail.com>
From admin.cluster at gmail.com Wed Jul 12 13:55:13 2006
From: admin.cluster at gmail.com (Anthony)
Date: Wed, 12 Jul 2006 15:55:13 +0200
Subject: [BioPython] import Standalone problemss
Message-ID: <44B4FF41.9070608@gmail.com>
Hello,
i have a problem with Standalone..
$
from Bio.Blast import Standalone
ImportError: cannot import name Standalone
while the
from Bio.Blast import NCBIStandalone
passes ...
any suggestion why ?
how can i install Standalone.
From biopython at maubp.freeserve.co.uk Wed Jul 12 23:09:53 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 13 Jul 2006 00:09:53 +0100
Subject: [BioPython] import Standalone problemss
In-Reply-To: <44B4FF41.9070608@gmail.com>
References: <44B4FF41.9070608@gmail.com>
Message-ID: <44B58141.5080804@maubp.freeserve.co.uk>
Anthony wrote:
> Hello,
> i have a problem with Standalone..
>
> $
> from Bio.Blast import Standalone
> ImportError: cannot import name Standalone
>
> while the
> from Bio.Blast import NCBIStandalone
> passes ...
>
> any suggestion why ?
> how can i install Standalone.
Why are you trying to? Are you trying to use a non-NCBI version of
Blast maybe?
The second version (which, as you say, works) is correct for using
BioPython with the standalone (i.e. on your own machine) version of the
NCBI's Blast program:
from Bio.Blast import NCBIStandalone
Peter
From admin.cluster at gmail.com Thu Jul 13 12:17:26 2006
From: admin.cluster at gmail.com (Anthony)
Date: Thu, 13 Jul 2006 14:17:26 +0200
Subject: [BioPython] import Standalone problemss
In-Reply-To: <44B58141.5080804@maubp.freeserve.co.uk>
References: <44B4FF41.9070608@gmail.com>
<44B58141.5080804@maubp.freeserve.co.uk>
Message-ID: <44B639D6.2010503@gmail.com>
Peter wrote:
> Anthony wrote:
>
>> Hello,
>> i have a problem with Standalone..
>>
>> $
>> from Bio.Blast import Standalone
>> ImportError: cannot import name Standalone
>>
>> while the
>> from Bio.Blast import NCBIStandalone
>> passes ...
>>
>> any suggestion why ?
>> how can i install Standalone.
>
>
> Why are you trying to? Are you trying to use a non-NBCI version of
> Blast maybe?
>
Actually,
we are debugging an old program that was written by somebody who doesn't
work here anymore, and in the code it was written:
from Bio.Blast import Standalone
from Bio.Blast import NCBIStandalone
and this program is not working anymore.
> The second version (which as you say, works) is correct for using
> BioPython with the standalone (i.e. on you own machine) version of the
> NCBI's Blast program:
>
> from Bio.Blast import NCBIStandalone
>
> Peter
>
>
From rohini.damle at gmail.com Thu Jul 13 17:56:48 2006
From: rohini.damle at gmail.com (Rohini Damle)
Date: Thu, 13 Jul 2006 10:56:48 -0700
Subject: [BioPython] import Standalone problemss
In-Reply-To: <44B639D6.2010503@gmail.com>
References: <44B4FF41.9070608@gmail.com>
<44B58141.5080804@maubp.freeserve.co.uk> <44B639D6.2010503@gmail.com>
Message-ID:
Hi,
I am trying to use Biopython's XML parser to parse my BLAST results.
Because of NCBI's changed XML output, the current Biopython parser
fails.
So I was advised to run an older version of BLAST locally, get the
XML output in the older format, and then use the current XML parser.
But I am getting an error:
SAXParseException: :553:0: junk after document element
What might be the reason?
Thank you in advance.
From biopython at maubp.freeserve.co.uk Thu Jul 13 21:33:45 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 13 Jul 2006 22:33:45 +0100
Subject: [BioPython] import Standalone problems
In-Reply-To:
References: <44B4FF41.9070608@gmail.com> <44B58141.5080804@maubp.freeserve.co.uk>
<44B639D6.2010503@gmail.com>
Message-ID: <44B6BC39.4050606@maubp.freeserve.co.uk>
Rohini Damle wrote:
> Hi,
> I am trying to use biopyton's xml parser to parse my blast result.
> Because of NCBI's changed XML output, the current biopython's parser
> fails.
>
> So I got an advice to run older version of blast locally and get the
> xml out put in older version and then use current XML parser.
> but I am getting an error
>
> SAXParseException: :553:0: junk after document element
>
> what might be the reson?
>
Not off hand. We would usually ask to see the problem XML file...
Do you get the same problem if you run Blast by hand, saving the output
to a file, and then trying to parse the XML file with BioPython?
Also, which version of standalone blast are you using?
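As background, one way this particular message can arise (not confirmed as the cause in this case) is a file that holds more than one XML document back to back, since an XML document may have only one root element. The sketch below reproduces the message with the standard library; the tiny BlastOutput documents are invented for illustration. Python 3 syntax.

```python
import xml.sax

one_document = b"<BlastOutput><hit/></BlastOutput>"
# Two complete documents concatenated into a single stream.
two_documents = one_document + b"\n" + one_document


def parse_error(data):
    """Return the SAX error message for `data`, or None if it parses."""
    try:
        xml.sax.parseString(data, xml.sax.ContentHandler())
    except xml.sax.SAXParseException as err:
        return err.getMessage()
    return None


print(parse_error(one_document))
print(parse_error(two_documents))
```

The line number in the original error (553) would then point at the place where the second root element begins, which is a cheap thing to check in the file by hand.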
Peter
From rohini.damle at gmail.com Fri Jul 14 19:17:17 2006
From: rohini.damle at gmail.com (Rohini Damle)
Date: Fri, 14 Jul 2006 12:17:17 -0700
Subject: [BioPython] import Standalone problems
In-Reply-To: <44B6BC39.4050606@maubp.freeserve.co.uk>
References: <44B4FF41.9070608@gmail.com>
<44B58141.5080804@maubp.freeserve.co.uk> <44B639D6.2010503@gmail.com>
<44B6BC39.4050606@maubp.freeserve.co.uk>
Message-ID:
Hi,
I updated the NCBIStandalone.py file and now the program works fine.
Thanks.
Rohini.
On 7/13/06, Peter wrote:
> Rohini Damle wrote:
> > Hi,
> > I am trying to use biopyton's xml parser to parse my blast result.
> > Because of NCBI's changed XML output, the current biopython's parser
> > fails.
> >
> > So I got an advice to run older version of blast locally and get the
> > xml out put in older version and then use current XML parser.
> > but I am getting an error
> >
> > SAXParseException: :553:0: junk after document element
> >
> > what might be the reson?
> >
>
> Not off hand. We would usually ask to see the problem XML file...
>
> Do you get the same problem if you run Blast by hand, saving the output
> to a file, and then trying to parse the XML file with BioPython?
>
> Also, which version of standalone blast are you using?
>
> Peter
>
>
From cariaso at yahoo.com Sat Jul 15 06:23:01 2006
From: cariaso at yahoo.com (Mike Cariaso)
Date: Fri, 14 Jul 2006 23:23:01 -0700 (PDT)
Subject: [BioPython] patch attached. a change in the eutils dtd
Message-ID: <20060715062301.66920.qmail@web52703.mail.yahoo.com>
NCBI recently changed the EUtils
http://www.ncbi.nlm.nih.gov/entrez/query/static/esoap_help.html
Could someone please review and commit the attached patch?
It replaces Bio/EUtils/DTDs/eSearch_020511 with the new version found at
http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd
as well as updating the auto-generated .py file.
This resolves the error message:
Undefined element tag: QueryTranslation
--
Mike Cariaso
Bioinformatics software http://cariaso.com/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: eSearch1_5a.patch
Type: application/octet-stream
Size: 3679 bytes
Desc: not available
URL:
From biopython at maubp.freeserve.co.uk Sat Jul 15 09:02:14 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 15 Jul 2006 10:02:14 +0100
Subject: [BioPython] patch attached. a change in the eutils dtd
In-Reply-To: <20060715062301.66920.qmail@web52703.mail.yahoo.com>
References: <20060715062301.66920.qmail@web52703.mail.yahoo.com>
Message-ID: <44B8AF16.3030709@maubp.freeserve.co.uk>
Mike Cariaso wrote:
> NCBI recently changed the EUtils
> http://www.ncbi.nlm.nih.gov/entrez/query/static/esoap_help.html
>
> Could someone please review and commit the attached patch?
>
> It replaces Bio/EUtils/DTDs/eSearch_020511 with the new version found
> at http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd as
> well as updating the auto-generated .py file.
>
> This resolves the error message: Undefined element tag:
> QueryTranslation
I see on their website:
> Changes in last update (1.5a):
> - QueryTranslation field is added to eSearchResult; this field
> contains an actual query executed by the search engine.
> QueryTranslation is recommended for use instead of
> TranslationStack.
> - TranslationStackType is changed to follow eSearch DTD.
Does this mean we should make a further change to the BioPython code to
use this new field QueryTranslation instead of TranslationStack?
(Note I've not looked at the BioPython code for Entrez)
Peter
From mdehoon at c2b2.columbia.edu Sun Jul 16 18:35:07 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Sun, 16 Jul 2006 14:35:07 -0400
Subject: [BioPython] patch attached. a change in the eutils dtd
In-Reply-To: <20060715062301.66920.qmail@web52703.mail.yahoo.com>
References: <20060715062301.66920.qmail@web52703.mail.yahoo.com>
Message-ID: <44BA86DB.8050604@c2b2.columbia.edu>
I have committed this patch to CVS; it will be in the upcoming Biopython
release.
--Michiel.
Mike Cariaso wrote:
> NCBI recently changed the EUtils
> http://www.ncbi.nlm.nih.gov/entrez/query/static/esoap_help.html
>
> Could someone please review and commit the attached patch?
>
> It replaces Bio/EUtils/DTDs/eSearch_020511 with the new version found at
> http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd
> as well as updating the auto-generated .py file.
>
> This resolves the error message:
> Undefined element tag: QueryTranslation
>
>
> --
> Mike Cariaso
> Bioinformatics software http://cariaso.com/
>
--
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032
From mdehoon at c2b2.columbia.edu Mon Jul 17 00:16:42 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Sun, 16 Jul 2006 20:16:42 -0400
Subject: [BioPython] Biopython release 1.42
Message-ID: <44BAD6EA.6050800@c2b2.columbia.edu>
Dear biopythoneers,
We are pleased to announce the release of Biopython 1.42. This release
includes a brand-new Genbank parser in Bio.GenBank by Peter Cock,
numerous updates to Bio.Nexus by Frank Kauff and to Bio.Geo by Peter,
lots of bug fixes by scores of contributors through BugZilla, and
Bio.Cluster became object-oriented.
Source distributions and Windows installers are available from our
spiffy new Wiki-based website at http://biopython.org. My thanks to all
code contributors who made this new release possible.
--Michiel on behalf of the Biopython developers
--
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032
From idoerg at burnham.org Mon Jul 17 03:57:20 2006
From: idoerg at burnham.org (Iddo Friedberg)
Date: Sun, 16 Jul 2006 20:57:20 -0700
Subject: [BioPython] Biopython release 1.42
In-Reply-To: <44BAD6EA.6050800@c2b2.columbia.edu>
References: <44BAD6EA.6050800@c2b2.columbia.edu>
Message-ID: <44BB0AA0.2090609@burnham.org>
Great job Michiel. Thanks for all your hard work on this. 1.42 was
sorely needed. Thanks for seeing this through.
Best,
Iddo
Michiel Jan Laurens de Hoon wrote:
> Dear biopythoneers,
>
> We are pleased to announce the release of Biopython 1.42. This release
> includes a brand-new Genbank parser in Bio.GenBank by Peter Cock,
> numerous updates to Bio.Nexus by Frank Kauff and to Bio.Geo by Peter,
> lots of bug fixes by scores of contributers through BugZilla, and
> Bio.Cluster became object-oriented.
>
> Source distributions and Windows installers are available from our
> spiffy new Wiki-based website at http://biopython.org. My thanks to all
> code contributers who made this new release possible.
>
> --Michiel on behalf of the Biopython developers
>
>
>
--
Iddo Friedberg, Ph.D.
Burnham Institute for Medical Research
10901 N. Torrey Pines Rd.
La Jolla, CA 92037
Tel: (858) 646 3100 x3516
Fax: (858) 795 5249 ** NEW **
http://iddo-friedberg.org
http://BioFunctionPrediction.org
From rohini.damle at gmail.com Tue Jul 18 19:38:42 2006
From: rohini.damle at gmail.com (Rohini Damle)
Date: Tue, 18 Jul 2006 12:38:42 -0700
Subject: [BioPython] import Standalone problems
In-Reply-To:
References: <44B4FF41.9070608@gmail.com>
<44B58141.5080804@maubp.freeserve.co.uk> <44B639D6.2010503@gmail.com>
<44B6BC39.4050606@maubp.freeserve.co.uk>
Message-ID:
Hi,
I have a XML file with 4 blast records (for proteins P1, P2, P3, P4)
I am trying to extract alignment information for each of them.
So I wrote the following code:
for b_record in b_iterator :
    E_VALUE_THRESH =20
    for alignment in b_record.alignments:
        for hsp in alignment.hsps:
            if hsp.expect< E_VALUE_THRESH:
                print '****Alignment****'
                print 'sequence:', alignment.title.split()[0]
With this code, I am getting information for P1,
then information for P1+P2,
then for P1+P2+P3,
and finally for P1+P2+P3+P4.
Why is this so?
Is there something wrong with the looping?
Thank you for the help.
Rohini.
On 7/14/06, Rohini Damle wrote:
> Hi,
> I updated the NcbiStandalone.py file and then the program works fine.
> Thanks.
> Rohini.
>
> On 7/13/06, Peter wrote:
> > Rohini Damle wrote:
> > > Hi,
> > > I am trying to use biopyton's xml parser to parse my blast result.
> > > Because of NCBI's changed XML output, the current biopython's parser
> > > fails.
> > >
> > > So I got an advice to run older version of blast locally and get the
> > > xml out put in older version and then use current XML parser.
> > > but I am getting an error
> > >
> > > SAXParseException: :553:0: junk after document element
> > >
> > > what might be the reson?
> > >
> >
> > Not off hand. We would usually ask to see the problem XML file...
> >
> > Do you get the same problem if you run Blast by hand, saving the output
> > to a file, and then trying to parse the XML file with BioPython?
> >
> > Also, which version of standalone blast are you using?
> >
> > Peter
> >
> >
>
From biopython at maubp.freeserve.co.uk Tue Jul 18 20:40:26 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 18 Jul 2006 21:40:26 +0100
Subject: [BioPython] import Standalone problems
In-Reply-To:
References: <44B4FF41.9070608@gmail.com> <44B58141.5080804@maubp.freeserve.co.uk>
<44B639D6.2010503@gmail.com> <44B6BC39.4050606@maubp.freeserve.co.uk>
Message-ID: <44BD473A.6030903@maubp.freeserve.co.uk>
Rohini Damle wrote:
> Hi,
> I have an XML file with 4 blast records (for proteins P1, P2, P3, P4).
> I am trying to extract alignment information for each of them.
> So I wrote the following code:
>
> for b_record in b_iterator:
>     E_VALUE_THRESH = 20
>     for alignment in b_record.alignments:
>         for hsp in alignment.hsps:
>             if hsp.expect < E_VALUE_THRESH:
>                 print '****Alignment****'
>                 print 'sequence:', alignment.title.split()[0]
>
> With this code, I am getting information for P1,
> then information for P1 + P2,
> then for P1 + P2 + P3,
> and finally for P1 + P2 + P3 + P4.
> Why is this so?
> Is there something wrong with the looping?
I'm aware of something funny with the XML parsing, Bug 1970, which might
well be the same issue:
http://bugzilla.open-bio.org/show_bug.cgi?id=1970
I confess I haven't looked into exactly what is going wrong here - too
many other demands on my time to learn about XML and how BioPython
parses it.
Does the workaround on the bug report help? Depending on which version
of standalone blast you have installed, you might have better luck with
plain text output - the trouble is this is a moving target and the NCBI
keeps tweaking it.
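
Independently of whatever the bug report suggests, one generic way to cope with records that accumulate earlier results is to remember how many alignments have already been seen and only process the new tail. The following is a minimal sketch, not Biopython code: `FakeRecord` and `new_alignments` are hypothetical names, and it assumes the parser keeps appending to one ever-growing alignment list.

```python
class FakeRecord:
    """Hypothetical stand-in for a parsed BLAST record (not Bio.Blast.Record)."""

    def __init__(self, alignments):
        self.alignments = alignments


def new_alignments(records):
    """Yield, per record, only the alignments not seen in earlier records.

    Assumes the new entries are always appended at the end of the list.
    """
    seen = 0
    for record in records:
        fresh = record.alignments[seen:]
        seen = len(record.alignments)
        yield fresh


# Simulate the reported symptom: P1, then P1+P2, then P1+P2+P3.
shared = []
accumulated = []
for name in ("P1", "P2", "P3"):
    shared.append(name)
    accumulated.append(FakeRecord(list(shared)))

per_query = list(new_alignments(accumulated))
print(per_query)
```

With a real iterator one would pass `b_iterator` instead of the simulated list, provided the assumption about appending actually holds for the installed parser.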
Peter
From stautxie at gmail.com Wed Jul 19 03:54:29 2006
From: stautxie at gmail.com (Xie Wei)
Date: Tue, 18 Jul 2006 23:54:29 -0400
Subject: [BioPython] Question on biopython 1.42 test mode
Message-ID: <29463e260607182054o49f124eas898ecfe19c95cedb@mail.gmail.com>
Dear Biopython developers,
I recently downloaded and installed biopython 1.42 on my linux workstation.
The trouble occurred when I tried to run the test suite by
python setup.py test
The script froze there for more than 10 minutes without proceeding when
testing cluster module, and CPU usage is like 99%+. I do not know whether
the cluster module test suite really should take that much time. I do not
remember having this kind of problem with the last biopython release.
Thanks,
Wei
From mdehoon at c2b2.columbia.edu Wed Jul 19 04:58:04 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Wed, 19 Jul 2006 00:58:04 -0400
Subject: [BioPython] Question on biopython 1.42 test mode
In-Reply-To: <29463e260607182054o49f124eas898ecfe19c95cedb@mail.gmail.com>
References: <29463e260607182054o49f124eas898ecfe19c95cedb@mail.gmail.com>
Message-ID: <44BDBBDC.2090609@c2b2.columbia.edu>
Xie Wei wrote:
> The script froze there for more than 10 minutes without proceeding when
> testing cluster module, and CPU usage is like 99%+. I do not know whether
> the cluster module test suite really should take that much time.
It shouldn't.
I wasn't able to replicate this problem (I tried on Mac OS X and on
Linux). Do you get the same problem if you run
python run_tests.py --no-gui
from the biopython-1.42/Tests directory?
It may also be informative to run
python test_Cluster.py
from the same directory; you may be able to find out at what point the
script hangs.
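
If the script never returns on its own, it can help to run it under a time limit and look at what it printed before stalling. A minimal sketch in modern Python 3, where `run_with_timeout` is a hypothetical helper and the child process is a stand-in for the real test script:

```python
import subprocess
import sys


def run_with_timeout(args, seconds):
    """Run a command, killing it if it exceeds `seconds`.

    Returns (finished, output), where output is whatever the command
    had written to stdout/stderr before it ended or was killed.
    """
    try:
        done = subprocess.run(
            args,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=seconds,
        )
        return True, done.stdout.decode(errors="replace")
    except subprocess.TimeoutExpired as err:
        # err.output holds the bytes read before the timeout, or None.
        partial = err.output or b""
        return False, partial.decode(errors="replace")


# Demo with a stand-in child process that prints one line and then stalls.
child = "import time; print('step 1', flush=True); time.sleep(30)"
finished, output = run_with_timeout([sys.executable, "-c", child], seconds=2)
print("finished:", finished)
print("output before the hang:", output.strip())
```

For the case at hand one would pass something like `[sys.executable, "test_Cluster.py"]` from the Tests directory; the last lines of output then hint at which test was running when the script stalled.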
Sorry for the trouble.
--Michiel.
From jacob at jjoseph.org Wed Jul 19 04:48:56 2006
From: jacob at jjoseph.org (Jacob Joseph)
Date: Wed, 19 Jul 2006 00:48:56 -0400
Subject: [BioPython] import Standalone problems
In-Reply-To: <44BD473A.6030903@maubp.freeserve.co.uk>
References: <44B4FF41.9070608@gmail.com> <44B58141.5080804@maubp.freeserve.co.uk> <44B639D6.2010503@gmail.com> <44B6BC39.4050606@maubp.freeserve.co.uk>
<44BD473A.6030903@maubp.freeserve.co.uk>
Message-ID: <44BDB9B8.2060808@jjoseph.org>
Hi.
I encountered similar difficulties over the past few days myself and
have made some improvements to the XML parser. Well, that is, it now
functions with blastall, but I have made no effort to parse the other
blast programs. I do not expect I have done any harm to other parsing,
however.
Attached are Record.py, NCBIStandalone.py, and NCBIXML.py. I have not
yet spent significant time to clean up my changes. Without getting into
specific modifications, I have made an effort to make consistent the
variables in Record and NCBIXML, focusing primarily on what I needed
this week.
One portion I am not settled on is the reinitialization of Record.Blast at
every call to iterator.next(), and, by extension, BlastParser.parse().
See NCBIXML.py, line 114. Without re-initializing this class, we run
the risk of retaining portions of a Record from previously parsed
queries. This causes bug 1970, mentioned below. Unfortunately,
this re-initialization exacts a significant performance penalty of at
least a factor of 10 by some rough measures. I would appreciate any
suggestions for improvement here.
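
To make the failure mode concrete, here is a minimal sketch of a parser that reuses one record versus one that builds a fresh record per call. The classes `Record`, `ReusingParser` and `FreshParser` are hypothetical stand-ins written for illustration; they are not the code in NCBIXML.py.

```python
class Record:
    """Hypothetical stand-in for Record.Blast: holds per-query results."""

    def __init__(self):
        self.alignments = []


class ReusingParser:
    """Keeps one Record for its whole lifetime, so results pile up."""

    def __init__(self):
        self._record = Record()

    def parse(self, hits):
        self._record.alignments.extend(hits)
        return self._record


class FreshParser:
    """Builds a new Record on every parse() call, so nothing leaks."""

    def parse(self, hits):
        record = Record()
        record.alignments.extend(hits)
        return record


queries = [["P1"], ["P2"], ["P3"]]

reusing = ReusingParser()
leaky = [list(reusing.parse(q).alignments) for q in queries]

fresh = FreshParser()
clean = [list(fresh.parse(q).alignments) for q in queries]

print(leaky)
print(clean)
```

If constructing the full record is what costs the time, resetting only its list-valued attributes between queries might be cheaper. That is a guess and has not been measured here.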
I do apologize for not being more specific about my changes. When I get
a chance (next week?), I will package them up as a proper patch and file
a bug. Perhaps what I have done so far will be of use until then.
fyi, I have done all of my testing with Blast 2.2.13. 2.2.14 seems to
not have separate blocks within its output, requiring a different
method of iteration.
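
As a rough illustration of what such iteration could look like, the sketch below walks a single XML document and yields one result per `<Iteration>` element. It assumes that all queries sit in one document as sibling `<Iteration>` elements; the sample text is hand-written and trimmed, not real blastall output, and `iter_queries` is a hypothetical helper built on the standard library rather than on Biopython.

```python
import io
import xml.etree.ElementTree as ET

# Hand-written, heavily trimmed sample (an assumption about the layout).
SAMPLE = """<?xml version="1.0"?>
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>P1</Iteration_query-def>
      <Iteration_hits>
        <Hit><Hit_def>hitA</Hit_def></Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_query-def>P2</Iteration_query-def>
      <Iteration_hits>
        <Hit><Hit_def>hitB</Hit_def></Hit>
        <Hit><Hit_def>hitC</Hit_def></Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def iter_queries(handle):
    """Yield (query name, list of hit names) for each Iteration element."""
    for _event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == "Iteration":
            query = elem.findtext("Iteration_query-def")
            hits = [hit.findtext("Hit_def") for hit in elem.iter("Hit")]
            yield query, hits
            # Free the finished subtree so state cannot carry over.
            elem.clear()


results = list(iter_queries(io.BytesIO(SAMPLE.encode("utf-8"))))
print(results)
```

Because each result is built from its own element, nothing from an earlier query can carry over into a later one.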
-Jacob
Peter wrote:
> Rohini Damle wrote:
>> Hi,
>> I have an XML file with 4 blast records (for proteins P1, P2, P3, P4).
>> I am trying to extract alignment information for each of them.
>> So I wrote the following code:
>>
>> for b_record in b_iterator:
>>     E_VALUE_THRESH = 20
>>     for alignment in b_record.alignments:
>>         for hsp in alignment.hsps:
>>             if hsp.expect < E_VALUE_THRESH:
>>                 print '****Alignment****'
>>                 print 'sequence:', alignment.title.split()[0]
>>
>> With this code, I am getting information for P1,
>> then information for P1 + P2,
>> then for P1 + P2 + P3,
>> and finally for P1 + P2 + P3 + P4.
>> Why is this so?
>> Is there something wrong with the looping?
>
> I'm aware of something funny with the XML parsing, Bug 1970, which might
> well be the same issue:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=1970
>
> I confess I haven't looked into exactly what is going wrong here - too
> many other demands on my time to learn about XML and how BioPython
> parses it.
>
> Does the workaround on the bug report help? Depending on which version
> of standalone blast you have installed, you might have better luck with
> plain text output - the trouble is this is a moving target and the NCBI
> keeps tweaking it.
>
> Peter
>
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: NCBIStandalone.py
URL:
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: NCBIXML.py
URL:
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Record.py
URL:
From rohini.damle at gmail.com Wed Jul 19 19:09:06 2006
From: rohini.damle at gmail.com (Rohini Damle)
Date: Wed, 19 Jul 2006 12:09:06 -0700
Subject: [BioPython] import Standalone problems
In-Reply-To: <44BDB9B8.2060808@jjoseph.org>
References: <44B4FF41.9070608@gmail.com>
<44B58141.5080804@maubp.freeserve.co.uk> <44B639D6.2010503@gmail.com>
<44B6BC39.4050606@maubp.freeserve.co.uk>
<44BD473A.6030903@maubp.freeserve.co.uk>
<44BDB9B8.2060808@jjoseph.org>
Message-ID:
Hi,
Can someone suggest which version of Blast the Biopython (text or XML)
parser works with?
I will install that Blast version locally so that I can use Biopython's parser.
Thanks,
Rohini
On 7/18/06, Jacob Joseph wrote:
> Hi.
> I encountered similar difficulties over the past few days myself and
> have made some improvements to the XML parser. Well, that is, it now
> functions with blastall, but I have made no effort to parse the other
> blast programs. I do not expect I have done any harm to other parsing,
> however.
>
> Attached are Record.py, NCBIStandalone.py, and NCBIXML.py. I have not
> yet spent significant time to clean up my changes. Without getting into
> specific modifications, I have made an effort to make consistent the
> variables in Record and NCBIXML, focusing primarily on what I needed
> this week.
>
> One portion I am not settled on is the reinitialization of Record.Blast at
> every call to iterator.next(), and, by extension, BlastParser.parse().
> See NCBIXML.py, line 114. Without re-initializing this class, we run
> the risk of retaining portions of a Record from previously parsed
> queries. This causes bug 1970, mentioned below. Unfortunately,
> this re-initialization exacts a significant performance penalty of at
> least a factor of 10 by some rough measures. I would appreciate any
> suggestions for improvement here.
>
> I do apologize for not being more specific about my changes. When I get
> a chance (next week?), I will package them up as a proper patch and file
> a bug. Perhaps what I have done so far will be of use until then.
>
> fyi, I have done all of my testing with Blast 2.2.13. 2.2.14 seems to
> not have separate blocks within its output, requiring a different
> method of iteration.
>
> -Jacob
>
> Peter wrote:
> > Rohini Damle wrote:
> >> Hi,
> >> I have an XML file with 4 blast records (for proteins P1, P2, P3, P4).
> >> I am trying to extract alignment information for each of them.
> >> So I wrote the following code:
> >>
> >> for b_record in b_iterator:
> >>     E_VALUE_THRESH = 20
> >>     for alignment in b_record.alignments:
> >>         for hsp in alignment.hsps:
> >>             if hsp.expect < E_VALUE_THRESH:
> >>                 print '****Alignment****'
> >>                 print 'sequence:', alignment.title.split()[0]
> >>
> >> With this code, I am getting information for P1,
> >> then information for P1 + P2,
> >> then for P1 + P2 + P3,
> >> and finally for P1 + P2 + P3 + P4.
> >> Why is this so?
> >> Is there something wrong with the looping?
> >
> > I'm aware of something funny with the XML parsing, Bug 1970, which might
> > well be the same issue:
> >
> > http://bugzilla.open-bio.org/show_bug.cgi?id=1970
> >
> > I confess I haven't looked into exactly what is going wrong here - too
> > many other demands on my time to learn about XML and how BioPython
> > parses it.
> >
> > Does the workaround on the bug report help? Depending on which version
> > of standalone blast you have installed, you might have better luck with
> > plain text output - the trouble is this is a moving target and the NCBI
> > keeps tweaking it.
> >
> > Peter
> >
> > _______________________________________________
> > BioPython mailing list - BioPython at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biopython
>
>
>
> # Copyright 1999-2000 by Jeffrey Chang. All rights reserved.
> # This code is part of the Biopython distribution and governed by its
> # license. Please see the LICENSE file that should have been included
> # as part of this package.
> # Patches by Mike Poidinger to support multiple databases.
>
> """
> This module provides code to work with the standalone version of
> BLAST, either blastall or blastpgp, provided by the NCBI.
> http://www.ncbi.nlm.nih.gov/BLAST/
>
> Classes:
> LowQualityBlastError     Exception that indicates low quality query sequences.
> BlastParser              Parses output from blast.
> BlastErrorParser         Parses output and tries to diagnose possible errors.
> PSIBlastParser           Parses output from psi-blast.
> Iterator                 Iterates over a file of blast results.
>
> _Scanner                 Scans output from standalone BLAST.
> _BlastConsumer           Consumes output from blast.
> _PSIBlastConsumer        Consumes output from psi-blast.
> _HeaderConsumer          Consumes header information.
> _DescriptionConsumer     Consumes description information.
> _AlignmentConsumer       Consumes alignment information.
> _HSPConsumer             Consumes hsp information.
> _DatabaseReportConsumer  Consumes database report information.
> _ParametersConsumer      Consumes parameters information.
>
> Functions:
> blastall                 Execute blastall.
> blastpgp                 Execute blastpgp.
> rpsblast                 Execute rpsblast.
>
> """
>
> from __future__ import generators
> import os
> import re
>
> from Bio import File
> from Bio.ParserSupport import *
> from Bio.Blast import Record
>
>
> class LowQualityBlastError(Exception):
>     """Error caused by running a low quality sequence through BLAST.
>
>     When low quality sequences (like GenBank entries containing only
>     stretches of a single nucleotide) are BLASTed, they will result in
>     BLAST generating an error and not being able to perform the BLAST
>     search. This error should be raised for the BLAST reports produced
>     in this case.
>     """
>     pass
>
> class ShortQueryBlastError(Exception):
>     """Error caused by running a short query sequence through BLAST.
>
>     If the query sequence is too short, BLAST outputs warnings and errors:
>     Searching[blastall] WARNING: [000.000] AT1G08320: SetUpBlastSearch failed.
>     [blastall] ERROR: [000.000] AT1G08320: Blast:
>     [blastall] ERROR: [000.000] AT1G08320: Blast: Query must be at least wordsize
>     done
>
>     This exception is raised when that condition is detected.
>
>     """
>     pass
>
>
> class _Scanner:
>     """Scan BLAST output from blastall or blastpgp.
>
>     Tested with blastall and blastpgp v2.0.10, v2.0.11
>
>     Methods:
>     feed     Feed data into the scanner.
>
>     """
>     def feed(self, handle, consumer):
>         """S.feed(handle, consumer)
>
>         Feed in a BLAST report for scanning. handle is a file-like
>         object that contains the BLAST report. consumer is a Consumer
>         object that will receive events as the report is scanned.
>
>         """
>         if isinstance(handle, File.UndoHandle):
>             uhandle = handle
>         else:
>             uhandle = File.UndoHandle(handle)
>
>         # Try to fast-forward to the beginning of the blast report.
>         read_and_call_until(uhandle, consumer.noevent, contains='BLAST')
>         # Now scan the BLAST report.
>         self._scan_header(uhandle, consumer)
>         self._scan_rounds(uhandle, consumer)
>         self._scan_database_report(uhandle, consumer)
>         self._scan_parameters(uhandle, consumer)
>
>     def _scan_header(self, uhandle, consumer):
>         # BLASTP 2.0.10 [Aug-26-1999]
>         #
>         #
>         # Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaf
>         # Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
>         # "Gapped BLAST and PSI-BLAST: a new generation of protein database sea
>         # programs", Nucleic Acids Res. 25:3389-3402.
>         #
>         # Query= test
>         # (140 letters)
>         #
>         # Database: sdqib40-1.35.seg.fa
>         # 1323 sequences; 223,339 total letters
>         #
>
>         consumer.start_header()
>
>         read_and_call(uhandle, consumer.version, contains='BLAST')
>         read_and_call_while(uhandle, consumer.noevent, blank=1)
>
>         # Read the reference lines and the following blank line.
>         # There might be a line, for qblast output.
>         attempt_read_and_call(uhandle, consumer.noevent, start="")
>         read_and_call(uhandle, consumer.reference, start='Reference')
>         while 1:
>             line = uhandle.readline()
>             if is_blank_line(line) or line.startswith("RID"):
>                 consumer.noevent(line)
>                 read_and_call_while(uhandle, consumer.noevent, blank=1)
>                 break
>             consumer.reference(line)
>
>         # blastpgp has a Reference for composition-based statistics.
>         if attempt_read_and_call(
>                 uhandle, consumer.reference, start="Reference"):
>             read_and_call_until(uhandle, consumer.reference, blank=1)
>             read_and_call_while(uhandle, consumer.noevent, blank=1)
>
>         # Read the Query lines and the following blank line.
>         read_and_call(uhandle, consumer.query_info, start='Query=')
>         read_and_call_until(uhandle, consumer.query_info, blank=1)
>         read_and_call_while(uhandle, consumer.noevent, blank=1)
>
>         # Read the database lines and the following blank line.
>         read_and_call_until(uhandle, consumer.database_info, end='total letters')
>         read_and_call(uhandle, consumer.database_info, contains='sequences')
>         read_and_call_while(uhandle, consumer.noevent, blank=1)
>
>         consumer.end_header()
>
>     def _scan_rounds(self, uhandle, consumer):
>         # Scan a bunch of rounds.
>         # Each round begins with either a "Searching......" line
>         # or a 'Score E' line followed by descriptions and alignments.
>         # The email server doesn't give the "Searching....." line.
>         # If there is no 'Searching.....' line then you'll first see a
>         # 'Results from round' line
>
>         while 1:
>             line = safe_peekline(uhandle)
>             if (not line.startswith('Searching') and
>                     not line.startswith('Results from round') and
>                     re.search(r"Score +E", line) is None and
>                     line.find('No hits found') == -1):
>                 break
>
>             self._scan_descriptions(uhandle, consumer)
>             self._scan_alignments(uhandle, consumer)
>
>     def _scan_descriptions(self, uhandle, consumer):
>         # Searching..................................................done
>         # Results from round 2
>         #
>         #
>         # Sc
>         # Sequences producing significant alignments: (b
>         # Sequences used in model and found again:
>         #
>         # d1tde_2 3.4.1.4.4 (119-244) Thioredoxin reductase [Escherichia ...
>         # d1tcob_ 1.31.1.5.16 Calcineurin regulatory subunit (B-chain) [B...
>         # d1symb_ 1.31.1.2.2 Calcyclin (S100) [RAT (RATTUS NORVEGICUS)]
>         #
>         # Sequences not found previously or not previously below threshold:
>         #
>         # d1osa__ 1.31.1.5.11 Calmodulin [Paramecium tetraurelia]
>         # d1aoza3 2.5.1.3.3 (339-552) Ascorbate oxidase [zucchini (Cucurb...
>         #
>
>         # If PSI-BLAST, may also have:
>         #
>         # CONVERGED!
>
>         consumer.start_descriptions()
>
>         # Read 'Searching'
>         # This line seems to be missing in BLASTN 2.1.2 (others?)
>         attempt_read_and_call(uhandle, consumer.noevent, start='Searching')
>
>         # blastpgp 2.0.10 from NCBI 9/19/99 for Solaris sometimes crashes here.
>         # If this happens, the handle will yield no more information.
>         if not uhandle.peekline():
>             raise SyntaxError, "Unexpected end of blast report. " + \
>                   "Looks suspiciously like a PSI-BLAST crash."
>
>         # BLASTN 2.2.3 sometimes spews a bunch of warnings and errors here:
>         # Searching[blastall] WARNING: [000.000] AT1G08320: SetUpBlastSearch
>         # [blastall] ERROR: [000.000] AT1G08320: Blast:
>         # [blastall] ERROR: [000.000] AT1G08320: Blast: Query must be at leas
>         # done
>         # Reported by David Weisman.
>         # Check for these error lines and ignore them for now. Let
>         # the BlastErrorParser deal with them.
>         line = uhandle.peekline()
>         if line.find("ERROR:") != -1 or line.startswith("done"):
>             read_and_call_while(uhandle, consumer.noevent, contains="ERROR:")
>             read_and_call(uhandle, consumer.noevent, start="done")
>
>         # Check to see if this is PSI-BLAST.
>         # If it is, the 'Searching' line will be followed by:
>         # (version 2.0.10)
>         # Searching.............................
>         # Results from round 2
>         # or (version 2.0.11)
>         # Searching.............................
>         #
>         #
>         # Results from round 2
>
>         # Skip a bunch of blank lines.
>         read_and_call_while(uhandle, consumer.noevent, blank=1)
>         # Check for the results line if it's there.
>         if attempt_read_and_call(uhandle, consumer.round, start='Results'):
>             read_and_call_while(uhandle, consumer.noevent, blank=1)
>
>         # Three things can happen here:
>         # 1. line contains 'Score E'
>         # 2. line contains "No hits found"
>         # 3. no descriptions
>         # The first one begins a bunch of descriptions. The last two
>         # indicate that no descriptions follow, and we should go straight
>         # to the alignments.
>         if not attempt_read_and_call(
>                 uhandle, consumer.description_header,
>                 has_re=re.compile(r'Score +E')):
>             # Either case 2 or 3. Look for "No hits found".
>             attempt_read_and_call(uhandle, consumer.no_hits,
>                                   contains='No hits found')
>             read_and_call_while(uhandle, consumer.noevent, blank=1)
>
>             # Psiblast can repeat the Searching...No hits found section
>             if attempt_read_and_call(uhandle, consumer.noevent,
>                                      start='Searching'):
>                 read_and_call_while(uhandle, consumer.noevent, blank=1)
>                 read_and_call(uhandle, consumer.noevent,
>                               contains='No hits found')
>                 read_and_call_while(uhandle, consumer.noevent, blank=1)
>
>             consumer.end_descriptions()
>             # Stop processing.
>             return
>
>         # Read the score header lines
>         read_and_call(uhandle, consumer.description_header,
>                       start='Sequences producing')
>
>         # If PSI-BLAST, read the 'Sequences used in model' line.
>         attempt_read_and_call(uhandle, consumer.model_sequences,
>                               start='Sequences used in model')
>         read_and_call_while(uhandle, consumer.noevent, blank=1)
>
>         # Read the descriptions and the following blank lines, making
>         # sure that there are descriptions.
>         if not uhandle.peekline().startswith('Sequences not found'):
>             read_and_call_until(uhandle, consumer.description, blank=1)
>             read_and_call_while(uhandle, consumer.noevent, blank=1)
>
>         # If PSI-BLAST, read the 'Sequences not found' line followed
>         # by more descriptions. However, I need to watch out for the
>         # case where there were no sequences not found previously, in
>         # which case there will be no more descriptions.
>         if attempt_read_and_call(uhandle, consumer.nonmodel_sequences,
>                                  start='Sequences not found'):
>             # Read the descriptions and the following blank lines.
>             read_and_call_while(uhandle, consumer.noevent, blank=1)
>             l = safe_peekline(uhandle)
>             # Brad -- added check for QUERY. On some PSI-BLAST outputs
>             # there will be a 'Sequences not found' line followed by no
>             # descriptions. Check for this case since the first thing you'll
>             # get is a blank line and then 'QUERY'
>             if not l.startswith('CONVERGED') and l[0] != '>' \
>                     and not l.startswith('QUERY'):
>                 read_and_call_until(uhandle, consumer.description, blank=1)
>                 read_and_call_while(uhandle, consumer.noevent, blank=1)
>
>         attempt_read_and_call(uhandle, consumer.converged, start='CONVERGED')
>         read_and_call_while(uhandle, consumer.noevent, blank=1)
>
>         consumer.end_descriptions()
>
>     def _scan_alignments(self, uhandle, consumer):
>         # qblast inserts a helpful line here.
>         attempt_read_and_call(uhandle, consumer.noevent, start="ALIGNMENTS")
>
>         # First, check to see if I'm at the database report.
>         line = safe_peekline(uhandle)
>         if line.startswith(' Database'):
>             return
>         elif line[0] == '>':
>             # XXX make a better check here between pairwise and masterslave
>             self._scan_pairwise_alignments(uhandle, consumer)
>         else:
>             # XXX put in a check to make sure I'm in a masterslave alignment
>             self._scan_masterslave_alignment(uhandle, consumer)
>
>     def _scan_pairwise_alignments(self, uhandle, consumer):
>         while 1:
>             line = safe_peekline(uhandle)
>             if line[0] != '>':
>                 break
>             self._scan_one_pairwise_alignment(uhandle, consumer)
>
>     def _scan_one_pairwise_alignment(self, uhandle, consumer):
>         consumer.start_alignment()
>
>         self._scan_alignment_header(uhandle, consumer)
>
>         # Scan a bunch of score/alignment pairs.
>         while 1:
>             line = safe_peekline(uhandle)
>             if not line.startswith(' Score'):
>                 break
>             self._scan_hsp(uhandle, consumer)
>         consumer.end_alignment()
>
>     def _scan_alignment_header(self, uhandle, consumer):
>         # >d1rip__ 2.24.7.1.1 Ribosomal S17 protein [Bacillus
>         # stearothermophilus]
>         # Length = 81
>         #
>         read_and_call(uhandle, consumer.title, start='>')
>         while 1:
>             line = safe_readline(uhandle)
>             if line.lstrip().startswith('Length ='):
>                 consumer.length(line)
>                 break
>             elif is_blank_line(line):
>                 # Check to make sure I haven't missed the Length line
>                 raise SyntaxError, "I missed the Length in an alignment header"
>             consumer.title(line)
>
>         # Older versions of BLAST will have a line with some spaces.
>         # Version 2.0.14 (maybe 2.0.13?) and above print a true blank line.
>         if not attempt_read_and_call(uhandle, consumer.noevent,
>                                      start=' '):
>             read_and_call(uhandle, consumer.noevent, blank=1)
>
>     def _scan_hsp(self, uhandle, consumer):
>         consumer.start_hsp()
>         self._scan_hsp_header(uhandle, consumer)
>         self._scan_hsp_alignment(uhandle, consumer)
>         consumer.end_hsp()
>
>     def _scan_hsp_header(self, uhandle, consumer):
>         # Score = 22.7 bits (47), Expect = 2.5
>         # Identities = 10/36 (27%), Positives = 18/36 (49%)
>         # Strand = Plus / Plus
>         # Frame = +3
>         #
>
>         read_and_call(uhandle, consumer.score, start=' Score')
>         read_and_call(uhandle, consumer.identities, start=' Identities')
>         # BLASTN
>         attempt_read_and_call(uhandle, consumer.strand, start=' Strand')
>         # BLASTX, TBLASTN, TBLASTX
>         attempt_read_and_call(uhandle, consumer.frame, start=' Frame')
>         read_and_call(uhandle, consumer.noevent, blank=1)
>
>     def _scan_hsp_alignment(self, uhandle, consumer):
>         # Query: 11 GRGVSACA-------TCDGFFYRNQKVAVIGGGNTAVEEALYLSNIASEVHLIHRRDGF
>         # GRGVS+ TC Y + + V GGG+ + EE L + I R+
>         # Sbjct: 12 GRGVSSVVRRCIHKPTCKE--YAVKIIDVTGGGSFSAEEVQELREATLKEVDILRKVSG
>         #
>         # Query: 64 AEKILIKR 71
>         # I +K
>         # Sbjct: 70 PNIIQLKD 77
>         #
>
>         while 1:
>             # Blastn adds an extra line filled with spaces before Query
>             attempt_read_and_call(uhandle, consumer.noevent, start=' ')
>             read_and_call(uhandle, consumer.query, start='Query')
>             read_and_call(uhandle, consumer.align, start=' ')
>             read_and_call(uhandle, consumer.sbjct, start='Sbjct')
>             read_and_call_while(uhandle, consumer.noevent, blank=1)
>             line = safe_peekline(uhandle)
>             # Alignment continues if I see a 'Query' or the spaces for Blastn.
>             if not (line.startswith('Query') or line.startswith(' ')):
>                 break
>
>     def _scan_masterslave_alignment(self, uhandle, consumer):
>         consumer.start_alignment()
>         while 1:
>             line = safe_readline(uhandle)
>             # Check to see whether I'm finished reading the alignment.
>             # This is indicated by 1) database section, 2) next psi-blast
>             # round, which can also be a 'Results from round' if no
>             # searching line is present
>             # patch by chapmanb
>             if line.startswith('Searching') or \
>                     line.startswith('Results from round'):
>                 uhandle.saveline(line)
>                 break
>             elif line.startswith(' Database'):
>                 uhandle.saveline(line)
>                 break
>             elif is_blank_line(line):
>                 consumer.noevent(line)
>             else:
>                 consumer.multalign(line)
>         read_and_call_while(uhandle, consumer.noevent, blank=1)
>         consumer.end_alignment()
>
>     def _scan_database_report(self, uhandle, consumer):
>         # Database: sdqib40-1.35.seg.fa
>         # Posted date: Nov 1, 1999 4:25 PM
>         # Number of letters in database: 223,339
>         # Number of sequences in database: 1323
>         #
>         # Lambda K H
>         # 0.322 0.133 0.369
>         #
>         # Gapped
>         # Lambda K H
>         # 0.270 0.0470 0.230
>         #
>
>         consumer.start_database_report()
>
>         # Subset of the database(s) listed below
>         # Number of letters searched: 562,618,960
>         # Number of sequences searched: 228,924
>         if attempt_read_and_call(uhandle, consumer.noevent, start=" Subset"):
>             read_and_call(uhandle, consumer.noevent, contains="letters")
>             read_and_call(uhandle, consumer.noevent, contains="sequences")
>             read_and_call(uhandle, consumer.noevent, start=" ")
>
>         # Sameet Mehta reported seeing output from BLASTN 2.2.9 that
>         # was missing the "Database" stanza completely.
>         while attempt_read_and_call(uhandle, consumer.database,
>                                     start=' Database'):
>             # BLAT output ends abruptly here, without any of the other
>             # information. Check to see if this is the case. If so,
>             # then end the database report here gracefully.
>             if not uhandle.peekline():
>                 consumer.end_database_report()
>                 return
>
>             # Database can span multiple lines.
>             read_and_call_until(uhandle, consumer.database, start=' Posted')
>             read_and_call(uhandle, consumer.posted_date, start=' Posted')
>             read_and_call(uhandle, consumer.num_letters_in_database,
>                           start=' Number of letters')
>             read_and_call(uhandle, consumer.num_sequences_in_database,
>                           start=' Number of sequences')
>             read_and_call(uhandle, consumer.noevent, start=' ')
>
>             line = safe_readline(uhandle)
>             uhandle.saveline(line)
>             if line.find('Lambda') != -1:
>                 break
>
>         read_and_call(uhandle, consumer.noevent, start='Lambda')
>         read_and_call(uhandle, consumer.ka_params)
>         read_and_call(uhandle, consumer.noevent, blank=1)
>
>         # not BLASTP
>         attempt_read_and_call(uhandle, consumer.gapped, start='Gapped')
>         # not TBLASTX
>         if attempt_read_and_call(uhandle, consumer.noevent, start='Lambda'):
>             read_and_call(uhandle, consumer.ka_params_gap)
>
>         # Blast 2.2.4 can sometimes skip the whole parameter section.
>         # Thus, I need to be careful not to read past the end of the
>         # file.
>         try:
>             read_and_call_while(uhandle, consumer.noevent, blank=1)
>         except SyntaxError, x:
>             if str(x) != "Unexpected end of stream.":
>                 raise
>         consumer.end_database_report()
>
>     def _scan_parameters(self, uhandle, consumer):
>         # Matrix: BLOSUM62
>         # Gap Penalties: Existence: 11, Extension: 1
>         # Number of Hits to DB: 50604
>         # Number of Sequences: 1323
>         # Number of extensions: 1526
>         # Number of successful extensions: 6
>         # Number of sequences better than 10.0: 5
>         # Number of HSP's better than 10.0 without gapping: 5
>         # Number of HSP's successfully gapped in prelim test: 0
>         # Number of HSP's that attempted gapping in prelim test: 1
>         # Number of HSP's gapped (non-prelim): 5
>         # length of query: 140
>         # length of database: 223,339
>         # effective HSP length: 39
>         # effective length of query: 101
>         # effective length of database: 171,742
>         # effective search space: 17345942
>         # effective search space used: 17345942
>         # T: 11
>         # A: 40
>         # X1: 16 ( 7.4 bits)
>         # X2: 38 (14.8 bits)
>         # X3: 64 (24.9 bits)
>         # S1: 41 (21.9 bits)
>         # S2: 42 (20.8 bits)
>
>         # Blast 2.2.4 can sometimes skip the whole parameter section.
>         # Thus, check to make sure that the parameter section really
>         # exists.
>         if not uhandle.peekline():
>             return
>
>         # BLASTN 2.2.9 looks like it reverses the "Number of Hits" and
>         # "Number of Sequences" lines.
>         consumer.start_parameters()
>
>         # Matrix line may be missing in BLASTN 2.2.9
>         attempt_read_and_call(uhandle, consumer.matrix, start='Matrix')
>         # not TBLASTX
>         attempt_read_and_call(uhandle, consumer.gap_penalties, start='Gap')
>
>         attempt_read_and_call(uhandle, consumer.num_sequences,
>                               start='Number of Sequences')
>         read_and_call(uhandle, consumer.num_hits,
>                       start='Number of Hits')
>         attempt_read_and_call(uhandle, consumer.num_sequences,
>                               start='Number of Sequences')
>         read_and_call(uhandle, consumer.num_extends,
>                       start='Number of extensions')
>         read_and_call(uhandle, consumer.num_good_extends,
>                       start='Number of successful')
>
>         read_and_call(uhandle, consumer.num_seqs_better_e,
>                       start='Number of sequences')
>
>         # not BLASTN, TBLASTX
>         if attempt_read_and_call(uhandle, consumer.hsps_no_gap,
>                                  start="Number of HSP's better"):
>             # BLASTN 2.2.9
>             if attempt_read_and_call(uhandle, consumer.noevent,
> start="Number of HSP's gapped:"):
> read_and_call(uhandle, consumer.noevent,
> start="Number of HSP's successfully")
> read_and_call(uhandle, consumer.noevent,
> start="Number of extra gapped extensions")
> else:
> read_and_call(uhandle, consumer.hsps_prelim_gapped,
> start="Number of HSP's successfully")
> read_and_call(uhandle, consumer.hsps_prelim_gapped_attempted,
> start="Number of HSP's that")
> read_and_call(uhandle, consumer.hsps_gapped,
> start="Number of HSP's gapped")
> # not in blastx 2.2.1
> attempt_read_and_call(uhandle, consumer.query_length,
> has_re=re.compile(r"[Ll]ength of query"))
> read_and_call(uhandle, consumer.database_length,
> has_re=re.compile(r"[Ll]ength of \s*[Dd]atabase"))
>
> # BLASTN 2.2.9
> attempt_read_and_call(uhandle, consumer.noevent,
> start="Length adjustment")
> attempt_read_and_call(uhandle, consumer.effective_hsp_length,
> start='effective HSP')
> # Not in blastx 2.2.1
> attempt_read_and_call(
> uhandle, consumer.effective_query_length,
> has_re=re.compile(r'[Ee]ffective length of query'))
> read_and_call(
> uhandle, consumer.effective_database_length,
> has_re=re.compile(r'[Ee]ffective length of \s*[Dd]atabase'))
> # Not in blastx 2.2.1, added a ':' to distinguish between
> # this and the 'effective search space used' line
> attempt_read_and_call(
> uhandle, consumer.effective_search_space,
> has_re=re.compile(r'[Ee]ffective search space:'))
> # Does not appear in BLASTP 2.0.5
> attempt_read_and_call(
> uhandle, consumer.effective_search_space_used,
> has_re=re.compile(r'[Ee]ffective search space used'))
>
> # BLASTX, TBLASTN, TBLASTX
> attempt_read_and_call(uhandle, consumer.frameshift, start='frameshift')
> # not in BLASTN 2.2.9
> attempt_read_and_call(uhandle, consumer.threshold, start='T')
> read_and_call(uhandle, consumer.window_size, start='A')
> read_and_call(uhandle, consumer.dropoff_1st_pass, start='X1')
> read_and_call(uhandle, consumer.gap_x_dropoff, start='X2')
> # not BLASTN, TBLASTX
> attempt_read_and_call(uhandle, consumer.gap_x_dropoff_final,
> start='X3')
> read_and_call(uhandle, consumer.gap_trigger, start='S1')
> # not in blastx 2.2.1
> # first we make sure we have additional lines to work with, if
> # not then the file is done and we don't have a final S2
> if not is_blank_line(uhandle.peekline(), allow_spaces=1):
> read_and_call(uhandle, consumer.blast_cutoff, start='S2')
>
> consumer.end_parameters()
>
> class BlastParser(AbstractParser):
> """Parses BLAST data into a Record.Blast object.
>
> """
> def __init__(self):
> """__init__(self)"""
> self._scanner = _Scanner()
> self._consumer = _BlastConsumer()
>
> def parse(self, handle):
> """parse(self, handle)"""
> self._scanner.feed(handle, self._consumer)
> return self._consumer.data
>
> class PSIBlastParser(AbstractParser):
> """Parses BLAST data into a Record.PSIBlast object.
>
> """
> def __init__(self):
> """__init__(self)"""
> self._scanner = _Scanner()
> self._consumer = _PSIBlastConsumer()
>
> def parse(self, handle):
> """parse(self, handle)"""
> self._scanner.feed(handle, self._consumer)
> return self._consumer.data
>
> class _HeaderConsumer:
> def start_header(self):
> self._header = Record.Header()
>
> def version(self, line):
> c = line.split()
> self._header.application = c[0]
> self._header.version = c[1]
> self._header.date = c[2][1:-1]
>
> def reference(self, line):
> if line.startswith('Reference: '):
> self._header.reference = line[11:]
> else:
> self._header.reference = self._header.reference + line
>
> def query_info(self, line):
> if line.startswith('Query= '):
> self._header.query = line[7:]
> elif not line.startswith(' '): # continuation of query_info
> self._header.query = "%s%s" % (self._header.query, line)
> else:
> letters, = _re_search(
> r"([0-9,]+) letters", line,
> "I could not find the number of letters in line\n%s" % line)
> self._header.query_letters = _safe_int(letters)
>
> def database_info(self, line):
> line = line.rstrip()
> if line.startswith('Database: '):
> self._header.database = line[10:]
> elif not line.endswith('total letters'):
> self._header.database = self._header.database + line.strip()
> else:
> sequences, letters =_re_search(
> r"([0-9,]+) sequences; ([0-9,-]+) total letters", line,
> "I could not find the sequences and letters in line\n%s" %line)
> self._header.database_sequences = _safe_int(sequences)
> self._header.database_letters = _safe_int(letters)
>
> def end_header(self):
> # Get rid of the trailing newlines
> self._header.reference = self._header.reference.rstrip()
> self._header.query = self._header.query.rstrip()
>
> class _DescriptionConsumer:
> def start_descriptions(self):
> self._descriptions = []
> self._model_sequences = []
> self._nonmodel_sequences = []
> self._converged = 0
> self._type = None
> self._roundnum = None
>
> self.__has_n = 0 # Does the description line contain an N value?
>
> def description_header(self, line):
> if line.startswith('Sequences producing'):
> cols = line.split()
> if cols[-1] == 'N':
> self.__has_n = 1
>
> def description(self, line):
> dh = self._parse(line)
> if self._type == 'model':
> self._model_sequences.append(dh)
> elif self._type == 'nonmodel':
> self._nonmodel_sequences.append(dh)
> else:
> self._descriptions.append(dh)
>
> def model_sequences(self, line):
> self._type = 'model'
>
> def nonmodel_sequences(self, line):
> self._type = 'nonmodel'
>
> def converged(self, line):
> self._converged = 1
>
> def no_hits(self, line):
> pass
>
> def round(self, line):
> if not line.startswith('Results from round'):
> raise SyntaxError, "I didn't understand the round line\n%s" % line
> self._roundnum = _safe_int(line[18:].strip())
>
> def end_descriptions(self):
> pass
>
> def _parse(self, description_line):
> line = description_line # for convenience
> dh = Record.Description()
>
> # I need to separate the score and p-value from the title.
> # sp|P21297|FLBT_CAUCR FLBT PROTEIN [snip] 284 7e-77
> # sp|P21297|FLBT_CAUCR FLBT PROTEIN [snip] 284 7e-77 1
> # special cases to handle:
> # - title must be preserved exactly (including whitespaces)
> # - score could be equal to e-value (not likely, but what if??)
> # - sometimes there's an "N" score of '1'.
> cols = line.split()
> if len(cols) < 3:
> raise SyntaxError, \
> "Line does not appear to contain description:\n%s" % line
> if self.__has_n:
> i = line.rfind(cols[-1]) # find start of N
> i = line.rfind(cols[-2], 0, i) # find start of p-value
> i = line.rfind(cols[-3], 0, i) # find start of score
> else:
> i = line.rfind(cols[-1]) # find start of p-value
> i = line.rfind(cols[-2], 0, i) # find start of score
> if self.__has_n:
> dh.title, dh.score, dh.e, dh.num_alignments = \
> line[:i].rstrip(), cols[-3], cols[-2], cols[-1]
> else:
> dh.title, dh.score, dh.e, dh.num_alignments = \
> line[:i].rstrip(), cols[-2], cols[-1], 1
> dh.num_alignments = _safe_int(dh.num_alignments)
> dh.score = _safe_int(dh.score)
> dh.e = _safe_float(dh.e)
> return dh
>
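
The right-to-left search in _parse above can be sketched on its own. This is a minimal standalone version written for Python 3, not the module's code: the name split_description is hypothetical, and plain int()/float() stand in for _safe_int/_safe_float, which are not shown in this excerpt.

```python
def split_description(line, has_n=False):
    """Split a one-line BLAST description into (title, score, e_value, n).

    Hypothetical helper for illustration only.
    """
    cols = line.split()
    needed = 4 if has_n else 3  # title plus score, e-value and optional N
    if len(cols) < needed:
        raise ValueError(
            "Line does not appear to contain description:\n%s" % line)

    # Search backwards from the end of the line, one trailing column at a
    # time, so that a title containing the same text as the score or the
    # e-value is not cut short.
    i = len(line)
    for col in reversed(cols[-(needed - 1):]):
        i = line.rfind(col, 0, i)
    title = line[:i].rstrip()

    if has_n:
        score, e_value, n = cols[-3], cols[-2], int(cols[-1])
    else:
        score, e_value, n = cols[-2], cols[-1], 1
    return title, int(score), float(e_value), n
```

Because each rfind is bounded by the position found for the column to its right, a score that happens to equal the e-value text is still located correctly.
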
> class _AlignmentConsumer:
> # This is a little bit tricky. An alignment can either be a
> # pairwise alignment or a multiple alignment. Since it's difficult
> # to know a-priori which one the blast record will contain, I'm going
> # to make one class that can parse both of them.
> def start_alignment(self):
> self._alignment = Record.Alignment()
> self._multiple_alignment = Record.MultipleAlignment()
>
> def title(self, line):
> self._alignment.title = "%s%s" % (self._alignment.title,
> line.lstrip())
>
> def length(self, line):
> self._alignment.length = line.split()[2]
> self._alignment.length = _safe_int(self._alignment.length)
>
> def multalign(self, line):
> # Standalone version uses 'QUERY', while WWW version uses blast_tmp.
> if line.startswith('QUERY') or line.startswith('blast_tmp'):
> # If this is the first line of the multiple alignment,
> # then I need to figure out how the line is formatted.
>
> # Format of line is:
> # QUERY 1 acttg...gccagaggtggtttattcagtctccataagagaggggacaaacg 60
> try:
> name, start, seq, end = line.split()
> except ValueError:
> raise SyntaxError, "I do not understand the line\n%s" \
> % line
> self._start_index = line.index(start, len(name))
> self._seq_index = line.index(seq,
> self._start_index+len(start))
> # subtract 1 for the space
> self._name_length = self._start_index - 1
> self._start_length = self._seq_index - self._start_index - 1
> self._seq_length = line.rfind(end) - self._seq_index - 1
>
> #self._seq_index = line.index(seq)
> ## subtract 1 for the space
> #self._seq_length = line.rfind(end) - self._seq_index - 1
> #self._start_index = line.index(start)
> #self._start_length = self._seq_index - self._start_index - 1
> #self._name_length = self._start_index
>
> # Extract the information from the line
> name = line[:self._name_length]
> name = name.rstrip()
> start = line[self._start_index:self._start_index+self._start_length]
> start = start.rstrip()
> if start:
> start = _safe_int(start)
> end = line[self._seq_index+self._seq_length:].rstrip()
> if end:
> end = _safe_int(end)
> seq = line[self._seq_index:self._seq_index+self._seq_length].rstrip()
> # right pad the sequence with spaces if necessary
> if len(seq) < self._seq_length:
> seq = seq + ' '*(self._seq_length-len(seq))
>
> # I need to make sure the sequence is aligned correctly with the query.
> # First, I will find the length of the query. Then, if necessary,
> # I will pad my current sequence with spaces so that they will line
> # up correctly.
>
> # Two possible things can happen:
> # QUERY
> # 504
> #
> # QUERY
> # 403
> #
> # Sequence 504 will need padding at the end. Since I won't know
> # this until the end of the alignment, this will be handled in
> # end_alignment.
> # Sequence 403 will need padding before being added to the alignment.
>
> align = self._multiple_alignment.alignment # for convenience
> align.append((name, start, seq, end))
>
> # This is old code that tried to line up all the sequences
> # in a multiple alignment by using the sequence titles as
> # identifiers. The problem with this is that BLAST assigns
> # different HSP's from the same sequence the same id. Thus,
> # in one alignment block, there may be multiple sequences with
> # the same id. I'm not sure how to handle this, so I'm not
> # going to.
>
> # # If the sequence is the query, then just add it.
> # if name == 'QUERY':
> # if len(align) == 0:
> # align.append((name, start, seq))
> # else:
> # aname, astart, aseq = align[0]
> # if name != aname:
> # raise SyntaxError, "Query is not the first sequence"
> # aseq = aseq + seq
> # align[0] = aname, astart, aseq
> # else:
> # if len(align) == 0:
> # raise SyntaxError, "I could not find the query sequence"
> # qname, qstart, qseq = align[0]
> #
> # # Now find my sequence in the multiple alignment.
> # for i in range(1, len(align)):
> # aname, astart, aseq = align[i]
> # if name == aname:
> # index = i
> # break
> # else:
> # # If I couldn't find it, then add a new one.
> # align.append((None, None, None))
> # index = len(align)-1
> # # Make sure to left-pad it.
> # aname, astart, aseq = name, start, ' '*(len(qseq)-len(seq))
> #
> # if len(qseq) != len(aseq) + len(seq):
> # # If my sequences are shorter than the query sequence,
> # # then I will need to pad some spaces to make them line up.
> # # Since I've already right padded seq, that means aseq
> # # must be too short.
> # aseq = aseq + ' '*(len(qseq)-len(aseq)-len(seq))
> # aseq = aseq + seq
> # if astart is None:
> # astart = start
> # align[index] = aname, astart, aseq
>
> def end_alignment(self):
> # Remove trailing newlines
> if self._alignment:
> self._alignment.title = self._alignment.title.rstrip()
>
> # This code is also obsolete. See note above.
> # If there's a multiple alignment, I will need to make sure
> # all the sequences are aligned. That is, I may need to
> # right-pad the sequences.
> # if self._multiple_alignment is not None:
> # align = self._multiple_alignment.alignment
> # seqlen = None
> # for i in range(len(align)):
> # name, start, seq = align[i]
> # if seqlen is None:
> # seqlen = len(seq)
> # else:
> # if len(seq) < seqlen:
> # seq = seq + ' '*(seqlen - len(seq))
> # align[i] = name, start, seq
> # elif len(seq) > seqlen:
> # raise SyntaxError, \
> # "Sequence %s is longer than the query" % name
>
> # Clean up some variables, if they exist.
> try:
> del self._seq_index
> del self._seq_length
> del self._start_index
> del self._start_length
> del self._name_length
> except AttributeError:
> pass
>
> class _HSPConsumer:
> def start_hsp(self):
> self._hsp = Record.HSP()
>
> def score(self, line):
> self._hsp.bits, self._hsp.score = _re_search(
> r"Score =\s*([0-9.e+]+) bits \(([0-9]+)\)", line,
> "I could not find the score in line\n%s" % line)
> self._hsp.score = _safe_float(self._hsp.score)
> self._hsp.bits = _safe_float(self._hsp.bits)
>
> x, y = _re_search(
> r"Expect\(?(\d*)\)? = +([0-9.e\-|\+]+)", line,
> "I could not find the expect in line\n%s" % line)
> if x:
> self._hsp.num_alignments = _safe_int(x)
> else:
> self._hsp.num_alignments = 1
> self._hsp.expect = _safe_float(y)
>
> def identities(self, line):
> x, y = _re_search(
> r"Identities = (\d+)\/(\d+)", line,
> "I could not find the identities in line\n%s" % line)
> self._hsp.identities = _safe_int(x), _safe_int(y)
>
> if line.find('Positives') != -1:
> x, y = _re_search(
> r"Positives = (\d+)\/(\d+)", line,
> "I could not find the positives in line\n%s" % line)
> self._hsp.positives = _safe_int(x), _safe_int(y)
>
> if line.find('Gaps') != -1:
> x, y = _re_search(
> r"Gaps = (\d+)\/(\d+)", line,
> "I could not find the gaps in line\n%s" % line)
> self._hsp.gaps = _safe_int(x), _safe_int(y)
>
>
> def strand(self, line):
> self._hsp.strand = _re_search(
> r"Strand = (\w+) / (\w+)", line,
> "I could not find the strand in line\n%s" % line)
>
> def frame(self, line):
> # Frame can be in formats:
> # Frame = +1
> # Frame = +2 / +2
> if line.find('/') != -1:
> self._hsp.frame = _re_search(
> r"Frame = ([-+][123]) / ([-+][123])", line,
> "I could not find the frame in line\n%s" % line)
> else:
> self._hsp.frame = _re_search(
> r"Frame = ([-+][123])", line,
> "I could not find the frame in line\n%s" % line)
>
> # Match a space, if one is available. Masahir Ishikawa found a
> # case where there's no space between the start and the sequence:
> # Query: 100tt 101
> # line below modified by Yair Benita, Sep 2004
> _query_re = re.compile(r"Query: \s*(\d+)\s*(.+) (\d+)")
> def query(self, line):
> m = self._query_re.search(line)
> if m is None:
> raise SyntaxError, "I could not find the query in line\n%s" % line
>
> # line below modified by Yair Benita, Sep 2004.
> # added the end attribute for the query
> start, seq, end = m.groups()
> self._hsp.query = self._hsp.query + seq
> if self._hsp.query_start is None:
> self._hsp.query_start = _safe_int(start)
>
> # line below added by Yair Benita, Sep 2004.
> # added the end attribute for the query
> self._hsp.query_end = _safe_int(end)
> self._query_start_index = m.start(2)
> self._query_len = len(seq)
>
> def align(self, line):
> seq = line[self._query_start_index:].rstrip()
> if len(seq) < self._query_len:
> # Make sure the alignment is the same length as the query
> seq = seq + ' ' * (self._query_len-len(seq))
> elif len(seq) > self._query_len:
> raise SyntaxError, "Match is longer than the query in line\n%s" % \
> line
> self._hsp.match = self._hsp.match + seq
>
> def sbjct(self, line):
> # line below modified by Yair Benita, Sep 2004
> # added the end group and the -? to allow parsing
> # of BLAT output in BLAST format.
> start, seq, end = _re_search(
> r"Sbjct: (-?\d+)\s*(.+) (-?\d+)", line,
> "I could not find the sbjct in line\n%s" % line)
> #mikep 26/9/00
> #On occasion, there is a blast hit with no subject match
> #so far, it only occurs with 1-line short "matches"
> #I have decided to let these pass as they appear
> if not seq.strip():
> seq = ' ' * self._query_len
> self._hsp.sbjct = self._hsp.sbjct + seq
> if self._hsp.sbjct_start is None:
> self._hsp.sbjct_start = _safe_int(start)
>
> self._hsp.sbjct_end = _safe_int(end)
> if len(seq) != self._query_len:
> raise SyntaxError, \
> "QUERY and SBJCT sequence lengths don't match in line\n%s" \
> % line
>
> del self._query_start_index # clean up unused variables
> del self._query_len
>
> def end_hsp(self):
> pass
>
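
To see what the two patterns in the score method above pick out of an HSP header line, here is a small standalone sketch for Python 3. The function name parse_score_line is hypothetical, and plain float()/int() stand in for _safe_float/_safe_int, which are not shown in this excerpt, so unusual number formats may behave differently here.

```python
import re

# Same patterns as in the quoted score method.
_score_re = re.compile(r"Score =\s*([0-9.e+]+) bits \(([0-9]+)\)")
_expect_re = re.compile(r"Expect\(?(\d*)\)? = +([0-9.e\-|\+]+)")


def parse_score_line(line):
    """Return (bits, score, num_alignments, expect) from an HSP score line.

    Hypothetical helper for illustration only.
    """
    m = _score_re.search(line)
    if m is None:
        raise ValueError("I could not find the score in line\n%s" % line)
    bits, score = float(m.group(1)), float(m.group(2))

    m = _expect_re.search(line)
    if m is None:
        raise ValueError("I could not find the expect in line\n%s" % line)
    n, expect = m.groups()
    # "Expect(2) = ..." carries the number of alignments; a bare
    # "Expect = ..." is treated as a single alignment.
    return bits, score, int(n) if n else 1, float(expect)
```
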
> class _DatabaseReportConsumer:
>
> def start_database_report(self):
> self._dr = Record.DatabaseReport()
>
> def database(self, line):
> m = re.search(r"Database: (.+)$", line)
> if m:
> self._dr.database_name.append(m.group(1))
> elif self._dr.database_name:
> # This must be a continuation of the previous name.
> self._dr.database_name[-1] = "%s%s" % (self._dr.database_name[-1],
> line.strip())
>
> def posted_date(self, line):
> self._dr.posted_date.append(_re_search(
> r"Posted date:\s*(.+)$", line,
> "I could not find the posted date in line\n%s" % line))
>
> def num_letters_in_database(self, line):
> letters, = _get_cols(
> line, (-1,), ncols=6, expected={2:"letters", 4:"database:"})
> self._dr.num_letters_in_database.append(_safe_int(letters))
>
> def num_sequences_in_database(self, line):
> sequences, = _get_cols(
> line, (-1,), ncols=6, expected={2:"sequences", 4:"database:"})
> self._dr.num_sequences_in_database.append(_safe_int(sequences))
>
> def ka_params(self, line):
> x = line.split()
> self._dr.ka_params = map(_safe_float, x)
>
> def gapped(self, line):
> self._dr.gapped = 1
>
> def ka_params_gap(self, line):
> x = line.split()
> self._dr.ka_params_gap = map(_safe_float, x)
>
> def end_database_report(self):
> pass
>
> class _ParametersConsumer:
> def start_parameters(self):
> self._params = Record.Parameters()
>
> def matrix(self, line):
> self._params.matrix = line[8:].rstrip()
>
> def gap_penalties(self, line):
> x = _get_cols(
> line, (3, 5), ncols=6, expected={2:"Existence:", 4:"Extension:"})
> self._params.gap_penalties = map(_safe_float, x)
>
> def num_hits(self, line):
> if line.find('1st pass') != -1:
> x, = _get_cols(line, (-4,), ncols=11, expected={2:"Hits"})
> self._params.num_hits = _safe_int(x)
> else:
> x, = _get_cols(line, (-1,), ncols=6, expected={2:"Hits"})
> self._params.num_hits = _safe_int(x)
>
> def num_sequences(self, line):
> if line.find('1st pass') != -1:
> x, = _get_cols(line, (-4,), ncols=9, expected={2:"Sequences:"})
> self._params.num_sequences = _safe_int(x)
> else:
> x, = _get_cols(line, (-1,), ncols=4, expected={2:"Sequences:"})
> self._params.num_sequences = _safe_int(x)
>
> def num_extends(self, line):
> if line.find('1st pass') != -1:
> x, = _get_cols(line, (-4,), ncols=9, expected={2:"extensions:"})
> self._params.num_extends = _safe_int(x)
> else:
> x, = _get_cols(line, (-1,), ncols=4, expected={2:"extensions:"})
> self._params.num_extends = _safe_int(x)
>
> def num_good_extends(self, line):
> if line.find('1st pass') != -1:
> x, = _get_cols(line, (-4,), ncols=10, expected={3:"extensions:"})
> self._params.num_good_extends = _safe_int(x)
> else:
> x, = _get_cols(line, (-1,), ncols=5, expected={3:"extensions:"})
> self._params.num_good_extends = _safe_int(x)
>
> def num_seqs_better_e(self, line):
> self._params.num_seqs_better_e, = _get_cols(
> line, (-1,), ncols=7, expected={2:"sequences"})
> self._params.num_seqs_better_e = _safe_int(
> self._params.num_seqs_better_e)
>
> def hsps_no_gap(self, line):
> self._params.hsps_no_gap, = _get_cols(
> line, (-1,), ncols=9, expected={3:"better", 7:"gapping:"})
> self._params.hsps_no_gap = _safe_int(self._params.hsps_no_gap)
>
> def hsps_prelim_gapped(self, line):
> self._params.hsps_prelim_gapped, = _get_cols(
> line, (-1,), ncols=9, expected={4:"gapped", 6:"prelim"})
> self._params.hsps_prelim_gapped = _safe_int(
> self._params.hsps_prelim_gapped)
>
> def hsps_prelim_gapped_attempted(self, line):
> self._params.hsps_prelim_gapped_attempted, = _get_cols(
> line, (-1,), ncols=10, expected={4:"attempted", 7:"prelim"})
> self._params.hsps_prelim_gapped_attempted = _safe_int(
> self._params.hsps_prelim_gapped_attempted)
>
> def hsps_gapped(self, line):
> self._params.hsps_gapped, = _get_cols(
> line, (-1,), ncols=6, expected={3:"gapped"})
> self._params.hsps_gapped = _safe_int(self._params.hsps_gapped)
>
> def query_length(self, line):
> self._params.query_length, = _get_cols(
> line.lower(), (-1,), ncols=4, expected={0:"length", 2:"query:"})
> self._params.query_length = _safe_int(self._params.query_length)
>
> def database_length(self, line):
> self._params.database_length, = _get_cols(
> line.lower(), (-1,), ncols=4, expected={0:"length", 2:"database:"})
> self._params.database_length = _safe_int(self._params.database_length)
>
> def effective_hsp_length(self, line):
> self._params.effective_hsp_length, = _get_cols(
> line, (-1,), ncols=4, expected={1:"HSP", 2:"length:"})
> self._params.effective_hsp_length = _safe_int(
> self._params.effective_hsp_length)
>
> def effective_query_length(self, line):
> self._params.effective_query_length, = _get_cols(
> line, (-1,), ncols=5, expected={1:"length", 3:"query:"})
> self._params.effective_query_length = _safe_int(
> self._params.effective_query_length)
>
> def effective_database_length(self, line):
> self._params.effective_database_length, = _get_cols(
> line.lower(), (-1,), ncols=5, expected={1:"length", 3:"database:"})
> self._params.effective_database_length = _safe_int(
> self._params.effective_database_length)
>
> def effective_search_space(self, line):
> self._params.effective_search_space, = _get_cols(
> line, (-1,), ncols=4, expected={1:"search"})
> self._params.effective_search_space = _safe_int(
> self._params.effective_search_space)
>
> def effective_search_space_used(self, line):
> self._params.effective_search_space_used, = _get_cols(
> line, (-1,), ncols=5, expected={1:"search", 3:"used:"})
> self._params.effective_search_space_used = _safe_int(
> self._params.effective_search_space_used)
>
> def frameshift(self, line):
> self._params.frameshift = _get_cols(
> line, (4, 5), ncols=6, expected={0:"frameshift", 2:"decay"})
>
> def threshold(self, line):
> self._params.threshold, = _get_cols(
> line, (1,), ncols=2, expected={0:"T:"})
> self._params.threshold = _safe_int(self._params.threshold)
>
> def window_size(self, line):
> self._params.window_size, = _get_cols(
> line, (1,), ncols=2, expected={0:"A:"})
> self._params.window_size = _safe_int(self._params.window_size)
>
> def dropoff_1st_pass(self, line):
> score, bits = _re_search(
> r"X1: (\d+) \(\s*([0-9,.]+) bits\)", line,
> "I could not find the dropoff in line\n%s" % line)
> self._params.dropoff_1st_pass = _safe_int(score), _safe_float(bits)
>
> def gap_x_dropoff(self, line):
> score, bits = _re_search(
> r"X2: (\d+) \(\s*([0-9,.]+) bits\)", line,
> "I could not find the gap dropoff in line\n%s" % line)
> self._params.gap_x_dropoff = _safe_int(score), _safe_float(bits)
>
> def gap_x_dropoff_final(self, line):
> score, bits = _re_search(
> r"X3: (\d+) \(\s*([0-9,.]+) bits\)", line,
> "I could not find the gap dropoff final in line\n%s" % line)
> self._params.gap_x_dropoff_final = _safe_int(score), _safe_float(bits)
>
> def gap_trigger(self, line):
> score, bits = _re_search(
> r"S1: (\d+) \(\s*([0-9,.]+) bits\)", line,
> "I could not find the gap trigger in line\n%s" % line)
> self._params.gap_trigger = _safe_int(score), _safe_float(bits)
>
> def blast_cutoff(self, line):
> score, bits = _re_search(
> r"S2: (\d+) \(\s*([0-9,.]+) bits\)", line,
> "I could not find the blast cutoff in line\n%s" % line)
> self._params.blast_cutoff = _safe_int(score), _safe_float(bits)
>
> def end_parameters(self):
> pass
>
>
> class _BlastConsumer(AbstractConsumer,
> _HeaderConsumer,
> _DescriptionConsumer,
> _AlignmentConsumer,
> _HSPConsumer,
> _DatabaseReportConsumer,
> _ParametersConsumer
> ):
> # This Consumer inherits from many other consumer classes that handle
> # the actual dirty work. An alternate way to do it is to create objects
> # of those classes and then delegate the parsing tasks to them in a
> # decorator-type pattern. The disadvantage of that is that the method
> # names will need to be resolved in this class. However, using
> # a decorator will retain more control in this class (which may or
> # may not be a bad thing). In addition, having each sub-consumer as
> # its own object prevents this object's dictionary from being cluttered
> # with members and reduces the chance of member collisions.
> def __init__(self):
> self.data = None
>
> def round(self, line):
> # Make sure nobody's trying to pass me PSI-BLAST data!
> raise ValueError, \
> "This consumer doesn't handle PSI-BLAST data"
>
> def start_header(self):
> self.data = Record.Blast()
> _HeaderConsumer.start_header(self)
>
> def end_header(self):
> _HeaderConsumer.end_header(self)
> self.data.__dict__.update(self._header.__dict__)
>
> def end_descriptions(self):
> self.data.descriptions = self._descriptions
>
> def end_alignment(self):
> _AlignmentConsumer.end_alignment(self)
> if self._alignment.hsps:
> self.data.alignments.append(self._alignment)
> if self._multiple_alignment.alignment:
> self.data.multiple_alignment = self._multiple_alignment
>
> def end_hsp(self):
> _HSPConsumer.end_hsp(self)
> try:
> self._alignment.hsps.append(self._hsp)
> except AttributeError:
> raise SyntaxError, "Found an HSP before an alignment"
>
> def end_database_report(self):
> _DatabaseReportConsumer.end_database_report(self)
> self.data.__dict__.update(self._dr.__dict__)
>
> def end_parameters(self):
> _ParametersConsumer.end_parameters(self)
> self.data.__dict__.update(self._params.__dict__)
>
> class _PSIBlastConsumer(AbstractConsumer,
> _HeaderConsumer,
> _DescriptionConsumer,
> _AlignmentConsumer,
> _HSPConsumer,
> _DatabaseReportConsumer,
> _ParametersConsumer
> ):
> def __init__(self):
> self.data = None
>
> def start_header(self):
> self.data = Record.PSIBlast()
> _HeaderConsumer.start_header(self)
>
> def end_header(self):
> _HeaderConsumer.end_header(self)
> self.data.__dict__.update(self._header.__dict__)
>
> def start_descriptions(self):
> self._round = Record.Round()
> self.data.rounds.append(self._round)
> _DescriptionConsumer.start_descriptions(self)
>
> def end_descriptions(self):
> _DescriptionConsumer.end_descriptions(self)
> self._round.number = self._roundnum
> if self._descriptions:
> self._round.new_seqs.extend(self._descriptions)
> self._round.reused_seqs.extend(self._model_sequences)
> self._round.new_seqs.extend(self._nonmodel_sequences)
> if self._converged:
> self.data.converged = 1
>
> def end_alignment(self):
> _AlignmentConsumer.end_alignment(self)
> if self._alignment.hsps:
> self._round.alignments.append(self._alignment)
> if self._multiple_alignment.alignment:
> self._round.multiple_alignment = self._multiple_alignment
>
> def end_hsp(self):
> _HSPConsumer.end_hsp(self)
> try:
> self._alignment.hsps.append(self._hsp)
> except AttributeError:
> raise SyntaxError, "Found an HSP before an alignment"
>
> def end_database_report(self):
> _DatabaseReportConsumer.end_database_report(self)
> self.data.__dict__.update(self._dr.__dict__)
>
> def end_parameters(self):
> _ParametersConsumer.end_parameters(self)
> self.data.__dict__.update(self._params.__dict__)
>
> class Iterator:
> """Iterates over a file of multiple BLAST results.
>
> Methods:
> next Return the next record from the stream, or None.
>
> """
> def __init__(self, handle, parser=None):
> """__init__(self, handle, parser=None)
>
> Create a new iterator. handle is a file-like object. parser
> is an optional Parser object to change the results into another form.
> If set to None, then the raw contents of the file will be returned.
>
> """
> try:
> handle.readline
> except AttributeError:
> raise ValueError(
> "I expected a file handle or file-like object, got %s"
> % type(handle))
> self._uhandle = File.UndoHandle(handle)
> self._parser = parser
>
> def next(self):
> """next(self) -> object
>
> Return the next Blast record from the file. If no more records,
> return None.
>
> """
> lines = []
> while 1:
> line = self._uhandle.readline()
> if not line:
> break
> # If I've reached the next one, then put the line back and stop.
> if lines and (line.startswith('BLAST')
> or line.startswith('BLAST', 1)
> or line.startswith('<?xml ')):
> self._uhandle.saveline(line)
> break
> lines.append(line)
>
> if not lines:
> return None
>
> data = ''.join(lines)
> if self._parser is not None:
> return self._parser.parse(File.StringHandle(data))
> return data
>
> def __iter__(self):
> return iter(self.next, None)
>
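
The record-splitting idea behind Iterator.next() can be shown without the UndoHandle machinery. The sketch below is a Python 3 generator, not the module's API: the name iter_blast_records is hypothetical, and it only checks for the 'BLAST' markers, ignoring any other record delimiters.

```python
def iter_blast_records(handle):
    """Yield each BLAST report found in handle as one string.

    Hypothetical helper for illustration only.
    """
    lines = []
    for line in handle:
        # A new report starts at a line beginning with 'BLAST', possibly
        # after one leading character.
        starts_report = (line.startswith("BLAST")
                         or line.startswith("BLAST", 1))
        if lines and starts_report:
            # Emit the report collected so far and start a new one.
            yield "".join(lines)
            lines = []
        lines.append(line)
    if lines:
        yield "".join(lines)
```

A generator avoids the "put the line back" step: the line that opens the next report simply becomes the first line of the next buffer.
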
> def blastall(blastcmd, program, database, infile, **keywds):
> """blastall(blastcmd, program, database, infile, **keywds) ->
> read, error Undohandles
>
> Execute and retrieve data from blastall. blastcmd is the command
> used to launch the 'blastall' executable. program is the blast program
> to use, e.g. 'blastp', 'blastn', etc. database is the path to the database
> to search against. infile is the path to the file containing
> the sequence to search with.
>
> You may pass more parameters to **keywds to change the behavior of
> the search. Otherwise, optional values will be chosen by blastall.
>
> Scoring
> matrix Matrix to use.
> gap_open Gap open penalty.
> gap_extend Gap extension penalty.
> nuc_match Nucleotide match reward. (BLASTN)
> nuc_mismatch Nucleotide mismatch penalty. (BLASTN)
> query_genetic_code Genetic code for Query.
> db_genetic_code Genetic code for database. (TBLAST[NX])
>
> Algorithm
> gapped Whether to do a gapped alignment. T/F (not for TBLASTX)
> expectation Expectation value cutoff.
> wordsize Word size.
> strands Query strands to search against database.([T]BLAST[NX])
> keep_hits Number of best hits from a region to keep.
> xdrop Dropoff value (bits) for gapped alignments.
> hit_extend Threshold for extending hits.
> region_length Length of region used to judge hits.
> db_length Effective database length.
> search_length Effective length of search space.
>
> Processing
> filter Filter query sequence? T/F
> believe_query Believe the query defline. T/F
> restrict_gi Restrict search to these GI's.
> nprocessors Number of processors to use.
> oldengine Force use of old engine [T/F]
>
> Formatting
> html Produce HTML output? T/F
> descriptions Number of one-line descriptions.
> alignments Number of alignments.
> align_view Alignment view. Integer 0-6.
> show_gi Show GI's in deflines? T/F
> seqalign_file seqalign file to output.
>
> """
> att2param = {
> 'matrix' : '-M',
> 'gap_open' : '-G',
> 'gap_extend' : '-E',
> 'nuc_match' : '-r',
> 'nuc_mismatch' : '-q',
> 'query_genetic_code' : '-Q',
> 'db_genetic_code' : '-D',
>
> 'gapped' : '-g',
> 'expectation' : '-e',
> 'wordsize' : '-W',
> 'strands' : '-S',
> 'keep_hits' : '-K',
> 'xdrop' : '-X',
> 'hit_extend' : '-f',
> 'region_length' : '-L',
> 'db_length' : '-z',
> 'search_length' : '-Y',
>
> 'program' : '-p',
> 'database' : '-d',
> 'infile' : '-i',
> 'filter' : '-F',
> 'believe_query' : '-J',
> 'restrict_gi' : '-l',
> 'nprocessors' : '-a',
> 'oldengine' : '-V',
>
> 'html' : '-T',
> 'descriptions' : '-v',
> 'alignments' : '-b',
> 'align_view' : '-m',
> 'show_gi' : '-I',
> 'seqalign_file' : '-O'
> }
>
> if not os.path.exists(blastcmd):
> raise ValueError, "blastall does not exist at %s" % blastcmd
>
> params = []
>
> params.extend([att2param['program'], program])
> params.extend([att2param['database'], database])
> params.extend([att2param['infile'], infile])
>
> for attr in keywds.keys():
> params.extend([att2param[attr], str(keywds[attr])])
>
> w, r, e = os.popen3(" ".join([blastcmd] + params))
> w.close()
> return File.UndoHandle(r), File.UndoHandle(e)
>
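The keyword-to-flag translation in blastall() can be sketched as a small pure function. This is an assumption-laden illustration (Python 3; build_command and the OPTIONS subset are hypothetical names, not part of NCBIStandalone). It returns an argument list instead of a joined string, so that paths containing spaces would survive if the list were handed to subprocess.Popen without a shell.

```python
# A small subset of the att2param table quoted above.
OPTIONS = {
    'program': '-p',
    'database': '-d',
    'infile': '-i',
    'expectation': '-e',
    'align_view': '-m',
}


def build_command(blastcmd, att2param, required, **keywds):
    """Return [blastcmd, flag, value, ...] for the required and optional settings."""
    params = []
    for name, value in required:
        params.extend([att2param[name], str(value)])
    # Sort the optional keywords so the command line is reproducible.
    for name in sorted(keywds):
        if name not in att2param:
            raise ValueError("unknown option: %s" % name)
        params.extend([att2param[name], str(keywds[name])])
    return [blastcmd] + params


command = build_command(
    '/usr/bin/blastall', OPTIONS,
    [('program', 'blastp'), ('database', 'nr'), ('infile', 'q.fasta')],
    expectation=0.001, align_view=7)
```

Rejecting unknown keywords with a ValueError gives a clearer message than the KeyError the quoted loop would raise.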
>
> def blastpgp(blastcmd, database, infile, **keywds):
> """blastpgp(blastcmd, database, infile, **keywds) ->
> read, error Undohandles
>
> Execute and retrieve data from blastpgp. blastcmd is the command
> used to launch the 'blastpgp' executable. database is the path to the
> database to search against. infile is the path to the file containing
> the sequence to search with.
>
> You may pass more parameters to **keywds to change the behavior of
> the search. Otherwise, optional values will be chosen by blastpgp.
>
> Scoring
> matrix Matrix to use.
> gap_open Gap open penalty.
> gap_extend Gap extension penalty.
> window_size Multiple hits window size.
> npasses Number of passes.
> passes Hits/passes. Integer 0-2.
>
> Algorithm
> gapped Whether to do a gapped alignment. T/F
> expectation Expectation value cutoff.
> wordsize Word size.
> keep_hits Number of best hits from a region to keep.
> xdrop Dropoff value (bits) for gapped alignments.
> hit_extend Threshold for extending hits.
> region_length Length of region used to judge hits.
> db_length Effective database length.
> search_length Effective length of search space.
> nbits_gapping Number of bits to trigger gapping.
> pseudocounts Pseudocounts constants for multiple passes.
> xdrop_final X dropoff for final gapped alignment.
> xdrop_extension Dropoff for blast extensions.
> model_threshold E-value threshold to include in multipass model.
> required_start Start of required region in query.
> required_end End of required region in query.
>
> Processing
> XXX should document default values
> program The blast program to use. (PHI-BLAST)
> filter Filter query sequence with SEG? T/F
> believe_query Believe the query defline? T/F
> nprocessors Number of processors to use.
>
> Formatting
> html Produce HTML output? T/F
> descriptions Number of one-line descriptions.
> alignments Number of alignments.
> align_view Alignment view. Integer 0-6.
> show_gi Show GI's in deflines? T/F
> seqalign_file seqalign file to output.
> align_outfile Output file for alignment.
> checkpoint_outfile Output file for PSI-BLAST checkpointing.
> restart_infile Input file for PSI-BLAST restart.
> hit_infile Hit file for PHI-BLAST.
> matrix_outfile Output file for PSI-BLAST matrix in ASCII.
> align_infile Input alignment file for PSI-BLAST restart.
>
> """
> att2param = {
> 'matrix' : '-M',
> 'gap_open' : '-G',
> 'gap_extend' : '-E',
> 'window_size' : '-A',
> 'npasses' : '-j',
> 'passes' : '-P',
>
> 'gapped' : '-g',
> 'expectation' : '-e',
> 'wordsize' : '-W',
> 'keep_hits' : '-K',
> 'xdrop' : '-X',
> 'hit_extend' : '-f',
> 'region_length' : '-L',
> 'db_length' : '-z',
> 'search_length' : '-Y',
> 'nbits_gapping' : '-N',
> 'pseudocounts' : '-c',
> 'xdrop_final' : '-Z',
> 'xdrop_extension' : '-y',
> 'model_threshold' : '-h',
> 'required_start' : '-S',
> 'required_end' : '-H',
>
> 'program' : '-p',
> 'database' : '-d',
> 'infile' : '-i',
> 'filter' : '-F',
> 'believe_query' : '-J',
> 'nprocessors' : '-a',
>
> 'html' : '-T',
> 'descriptions' : '-v',
> 'alignments' : '-b',
> 'align_view' : '-m',
> 'show_gi' : '-I',
> 'seqalign_file' : '-O',
> 'align_outfile' : '-o',
> 'checkpoint_outfile' : '-C',
> 'restart_infile' : '-R',
> 'hit_infile' : '-k',
> 'matrix_outfile' : '-Q',
> 'align_infile' : '-B'
> }
>
> if not os.path.exists(blastcmd):
> raise ValueError, "blastpgp does not exist at %s" % blastcmd
>
> params = []
>
> params.extend([att2param['database'], database])
> params.extend([att2param['infile'], infile])
>
> for attr in keywds.keys():
> params.extend([att2param[attr], str(keywds[attr])])
>
> w, r, e = os.popen3(" ".join([blastcmd] + params))
> w.close()
> return File.UndoHandle(r), File.UndoHandle(e)
>
>
> def rpsblast(blastcmd, database, infile, align_view="7", **keywds):
> """rpsblast(blastcmd, database, infile, **keywds) ->
> read, error Undohandles
>
> Execute and retrieve data from standalone RPS-BLAST. blastcmd is the
> command used to launch the 'rpsblast' executable. database is the path
> to the database to search against. infile is the path to the file
> containing the sequence to search with.
>
> You may pass more parameters to **keywds to change the behavior of
> the search. Otherwise, optional values will be chosen by rpsblast.
>
> Please note that this function will give XML output by default, by
> setting align_view to seven (i.e. command line option -m 7).
> You should use the NCBIXML.BlastParser() to read the resulting output.
> This is because NCBIStandalone.BlastParser() does not understand the
> plain text output format from rpsblast.
>
> WARNING - The following text and associated parameter handling has not
> received extensive testing. Please report any errors we might have made...
>
> Algorithm/Scoring
> gapped Whether to do a gapped alignment. T/F
> multihit 0 for multiple hit (default), 1 for single hit
> expectation Expectation value cutoff.
> range_restriction Range restriction on query sequence (Format: start,stop) blastp only
> 0 in 'start' refers to the beginning of the sequence
> 0 in 'stop' refers to the end of the sequence
> Default = 0,0
> xdrop Dropoff value (bits) for gapped alignments.
> xdrop_final X dropoff for final gapped alignment (in bits).
> xdrop_extension Dropoff for blast extensions (in bits).
> search_length Effective length of search space.
> nbits_gapping Number of bits to trigger gapping.
> protein Query sequence is protein. T/F
> db_length Effective database length.
>
> Processing
> filter Filter query sequence with SEG? T/F
> case_filter Use lower case filtering of FASTA sequence T/F, default F
> believe_query Believe the query defline. T/F
> nprocessors Number of processors to use.
> logfile Name of log file to use, default rpsblast.log
>
> Formatting
> html Produce HTML output? T/F
> descriptions Number of one-line descriptions.
> alignments Number of alignments.
> align_view Alignment view. Integer 0-9.
> show_gi Show GI's in deflines? T/F
> seqalign_file seqalign file to output.
> align_outfile Output file for alignment.
>
> """
> att2param = {
> 'multihit' : '-P',
> 'gapped' : '-g',
> 'expectation' : '-e',
> 'range_restriction' : '-L',
> 'xdrop' : '-X',
> 'xdrop_final' : '-Z',
> 'xdrop_extension' : '-y',
> 'search_length' : '-Y',
> 'nbits_gapping' : '-N',
> 'protein' : '-p',
> 'db_length' : '-z',
>
> 'database' : '-d',
> 'infile' : '-i',
> 'filter' : '-F',
> 'case_filter' : '-U',
> 'believe_query' : '-J',
> 'nprocessors' : '-a',
> 'logfile' : '-l',
>
> 'html' : '-T',
> 'descriptions' : '-v',
> 'alignments' : '-b',
> 'align_view' : '-m',
> 'show_gi' : '-I',
> 'seqalign_file' : '-O',
> 'align_outfile' : '-o'
> }
>
> if not os.path.exists(blastcmd):
> raise ValueError, "rpsblast does not exist at %s" % blastcmd
>
> params = []
>
> params.extend([att2param['database'], database])
> params.extend([att2param['infile'], infile])
> params.extend([att2param['align_view'], align_view])
>
> for attr in keywds.keys():
> params.extend([att2param[attr], str(keywds[attr])])
>
> w, r, e = os.popen3(" ".join([blastcmd] + params))
> w.close()
> return File.UndoHandle(r), File.UndoHandle(e)
>
> def _re_search(regex, line, error_msg):
> m = re.search(regex, line)
> if not m:
> raise SyntaxError, error_msg
> return m.groups()
>
> def _get_cols(line, cols_to_get, ncols=None, expected={}):
> cols = line.split()
>
> # Check to make sure number of columns is correct
> if ncols is not None and len(cols) != ncols:
> raise SyntaxError, "I expected %d columns (got %d) in line\n%s" % \
> (ncols, len(cols), line)
>
> # Check to make sure columns contain the correct data
> for k in expected.keys():
> if cols[k] != expected[k]:
> raise SyntaxError, "I expected '%s' in column %d in line\n%s" % (
> expected[k], k, line)
>
> # Construct the answer tuple
> results = []
> for c in cols_to_get:
> results.append(cols[c])
> return tuple(results)
>
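A short usage sketch of the column check performed by _get_cols(). The function is restated here for Python 3 so the example is self-contained; it raises ValueError rather than SyntaxError, and the sample line is invented for illustration.

```python
def get_cols(line, cols_to_get, ncols=None, expected=None):
    """Split a line on whitespace, validate it, and return the requested columns."""
    cols = line.split()
    if ncols is not None and len(cols) != ncols:
        raise ValueError("expected %d columns (got %d) in line\n%s"
                         % (ncols, len(cols), line))
    # Each entry of `expected` maps a column index to the text it must contain.
    for index, text in (expected or {}).items():
        if cols[index] != text:
            raise ValueError("expected %r in column %d in line\n%s"
                             % (text, index, line))
    return tuple(cols[c] for c in cols_to_get)


start, end = get_cols("Query: 1 MKV 3", (1, 3), ncols=4, expected={0: "Query:"})
```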
> def _safe_int(str):
> try:
> return int(str)
> except ValueError:
> # Something went wrong. Try to clean up the string.
> # Remove all commas from the string
> str = str.replace(',', '')
> try:
> # try again.
> return int(str)
> except ValueError:
> pass
> # If it fails again, maybe it's too long?
> # XXX why converting to float?
> return long(float(str))
>
> def _safe_float(str):
> # Thomas Rosleff Soerensen (rosleff at mpiz-koeln.mpg.de) noted that
> # float('e-172') does not produce an error on his platform. Thus,
> # we need to check the string for this condition.
>
> # Sometimes BLAST leaves off the '1' in front of an exponent.
> if str and str[0] in ['E', 'e']:
> str = '1' + str
> try:
> return float(str)
> except ValueError:
> # Remove all commas from the string
> str = str.replace(',', '')
> # try again.
> return float(str)
>
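The fallback order in _safe_int() and _safe_float() is easier to see with concrete inputs. The following is a sketch in Python 3 (so int replaces long, and the parameter is named text instead of shadowing str); the sample strings are assumptions chosen to exercise each branch.

```python
def safe_int(text):
    """Parse an integer, tolerating thousands separators and exponent notation."""
    try:
        return int(text)
    except ValueError:
        text = text.replace(',', '')
    try:
        return int(text)
    except ValueError:
        # Values such as '1e+10' are not valid int literals; go through float.
        return int(float(text))


def safe_float(text):
    """Parse a float, tolerating a missing leading '1' and thousands separators."""
    if text and text[0] in 'Ee':
        text = '1' + text
    try:
        return float(text)
    except ValueError:
        return float(text.replace(',', ''))


values = (safe_int("1,234"), safe_int("1e+10"),
          safe_float("e-172"), safe_float("1,234.5"))
```

Going through float answers the "why converting to float?" question in the quoted comment for inputs in exponent notation, at the cost of precision for integers too large to be represented exactly as a float.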
> class _BlastErrorConsumer(_BlastConsumer):
> def __init__(self):
> _BlastConsumer.__init__(self)
> def noevent(self, line):
> if line.find("Query must be at least wordsize") != -1:
> raise ShortQueryBlastError, "Query must be at least wordsize"
> # Now pass the line back up to the superclass.
> method = getattr(_BlastConsumer, 'noevent',
> _BlastConsumer.__getattr__(self, 'noevent'))
> method(line)
>
> class BlastErrorParser(AbstractParser):
> """Attempt to catch and diagnose BLAST errors while parsing.
>
> This utilizes the BlastParser module but adds an additional layer
> of complexity on top of it by attempting to diagnose SyntaxError's
> that may actually indicate problems during BLAST parsing.
>
> Current BLAST problems this detects are:
> o LowQualityBlastError - When BLASTing really low quality sequences
> (i.e. some GenBank entries which are just short stretches of a single
> nucleotide), BLAST will report an error with the sequence and be
> unable to search with this. This will lead to a badly formatted
> BLAST report that the parsers choke on. The parser will convert the
> SyntaxError to a LowQualityBlastError and attempt to provide useful
> information.
>
> """
> def __init__(self, bad_report_handle = None):
> """Initialize a parser that tries to catch BlastErrors.
>
> Arguments:
> o bad_report_handle - An optional argument specifying a handle
> where bad reports should be sent. This would allow you to save
> all of the bad reports to a file, for instance. If no handle
> is specified, the bad reports will not be saved.
> """
> self._bad_report_handle = bad_report_handle
>
> #self._b_parser = BlastParser()
> self._scanner = _Scanner()
> self._consumer = _BlastErrorConsumer()
>
> def parse(self, handle):
> """Parse a handle, attempting to diagnose errors.
> """
> results = handle.read()
>
> try:
> self._scanner.feed(File.StringHandle(results), self._consumer)
> except SyntaxError, msg:
> # if we have a bad_report_file, save the info to it first
> if self._bad_report_handle:
> # send the info to the error handle
> self._bad_report_handle.write(results)
>
> # now we want to try and diagnose the error
> self._diagnose_error(
> File.StringHandle(results), self._consumer.data)
>
> # if we got here we can't figure out the problem
> # so we should pass along the syntax error we got
> raise
> return self._consumer.data
>
> def _diagnose_error(self, handle, data_record):
> """Attempt to diagnose an error in the passed handle.
>
> Arguments:
> o handle - The handle potentially containing the error
> o data_record - The data record partially created by the consumer.
> """
> line = handle.readline()
>
> while line:
> # 'Searchingdone' instead of 'Searching......done' seems
> # to indicate a failure to perform the BLAST due to
> # low quality sequence
> if line.startswith('Searchingdone'):
> raise LowQualityBlastError("Blast failure occurred on query: ",
> data_record.query)
> line = handle.readline()
>
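The parse-then-diagnose pattern of BlastErrorParser can be sketched without the scanner and consumer classes. This is a simplified illustration in Python 3: parse_with_diagnosis, LowQualityError and strict_parse are hypothetical stand-ins, and ValueError takes the place of the SyntaxError used in the quoted code.

```python
import io


class LowQualityError(Exception):
    """Stand-in for LowQualityBlastError."""


def parse_with_diagnosis(text, parse):
    """Run parse() on the buffered text; on failure, rescan it for a known symptom."""
    try:
        return parse(io.StringIO(text))
    except ValueError:
        # The input was buffered as a string, so it can be scanned a second time.
        for line in io.StringIO(text):
            if line.startswith('Searchingdone'):
                raise LowQualityError('BLAST failed on a low-quality query')
        # No known symptom found: pass the original error along.
        raise


def strict_parse(handle):
    # A parser that always rejects its input, to exercise the error path.
    raise ValueError('badly formatted report')
```

Buffering the whole report before parsing is what makes the second pass possible; a handle that has already been consumed could not be rescanned.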
>
>
>
> # BLAST XML parsing
> """This module provides code to work with the BLAST XML output
> following the DTD available on the NCBI FTP
> ftp://ftp.ncbi.nlm.nih.gov/blast/documents/xml/NCBI_BlastOutput.dtd
>
> Classes:
> BlastParser Parses XML output from BLAST.
>
> _XMLparser Generic SAX parser.
> """
> from Bio.Blast import Record
> import xml.sax
> from xml.sax.handler import ContentHandler
>
> class _XMLparser(ContentHandler):
> """Generic SAX Parser
>
> Just a very basic SAX parser.
>
> Redefine the methods startElement, characters and endElement.
> """
> def __init__(self):
> """Constructor
> """
> self._tag = []
> self._value = ''
>
> def _secure_name(self, name):
> """Removes 'dangerous' characters from tag names
>
> name -- name to be 'secured'
> """
> # Replace '-' with '_' in XML tag names
> return name.replace('-', '_')
>
> def startElement(self, name, attr):
> """Found XML start tag
>
> No real need of attr, BLAST DTD doesn't use them
>
> name -- name of the tag
>
> attr -- tag attributes
> """
> self._tag.append(name)
>
> # Try to call a method
> try:
> getattr(self, self._secure_name('_start_' + name))()
> except AttributeError:
> # Doesn't exist (yet)
> pass
>
> def characters(self, ch):
> """Found some text
>
> ch -- characters read
> """
> self._value += ch # You don't ever get the whole string
>
> def endElement(self, name):
> """Found XML end tag
>
> name -- tag name
> """
> # Strip character buffer
> self._value = self._value.strip()
>
> # Try to call a method (defined in subclasses)
> try:
> getattr(self, self._secure_name('_end_' + name))()
> except AttributeError: # Method doesn't exist (yet ?)
> pass
>
> # Reset character buffer
> self._value = ''
>
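A minimal sketch of the name-based dispatch used by _XMLparser, written for Python 3. It looks the handler up with getattr and a None default, so an AttributeError raised inside a handler is not silently swallowed by the lookup. DispatchHandler, its seen dictionary and the two-element XML snippet are assumptions made for the example.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class DispatchHandler(ContentHandler):
    """Call self._end_<tag>() for every closing tag that has such a method."""

    def __init__(self):
        ContentHandler.__init__(self)
        self._value = ''
        self.seen = {}

    def characters(self, content):
        # SAX may deliver the text of one element in several pieces.
        self._value += content

    def endElement(self, name):
        self._value = self._value.strip()
        # '-' is not valid in a Python identifier, so map it to '_'.
        method = getattr(self, '_end_' + name.replace('-', '_'), None)
        if method is not None:
            method()
        self._value = ''

    def _end_Hit_len(self):
        self.seen['length'] = int(self._value)

    def _end_Hsp_bit_score(self):
        self.seen['bit_score'] = float(self._value)


handler = DispatchHandler()
xml.sax.parseString(
    b"<Hit><Hit_len>42</Hit_len><Hsp_bit-score>55.5</Hsp_bit-score></Hit>",
    handler)
```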
> class BlastParser(_XMLparser):
> """Parse XML BLAST data into a Record.Blast object
>
> Methods:
> parse Parses BLAST XML data.
>
> All XML 'action' methods are private methods and may be:
> _start_TAG called when the start tag is found
> _end_TAG called when the end tag is found
> """
>
> def __init__(self):
> """Constructor
> """
> # Calling superclass method
> _XMLparser.__init__(self)
>
> self._parser = xml.sax.make_parser()
> self._parser.setContentHandler(self)
>
> # To avoid ValueError: unknown url type: NCBI_BlastOutput.dtd
> self._parser.setFeature(xml.sax.handler.feature_validation, 0)
> self._parser.setFeature(xml.sax.handler.feature_namespaces, 0)
> self._parser.setFeature(xml.sax.handler.feature_external_pes, 0)
> self._parser.setFeature(xml.sax.handler.feature_external_ges, 0)
>
> self._blast = Record.Blast()
>
> def parse(self, handler):
> """Parses the XML data
>
> handler -- file handler or StringIO
> """
> # initialize a new Blast Record
> #self._blast = Record.Blast()
> # FIXME: very slow?
> self._blast.__init__()
>
> # bugfix: changed `filename` to `handler`. Iddo 12/20/2004
> self._parser.parse(handler)
> return self._blast
>
> # Header
> def _end_BlastOutput_program(self):
> """BLAST program, e.g., blastp, blastn, etc.
> """
> self._blast.application = self._value.upper()
>
> def _end_BlastOutput_version(self):
> """version number of the BLAST engine (e.g., 2.1.2)
> """
> self._blast.version = self._value.split()[1]
> self._blast.date = self._value.split()[2][1:-1]
>
> def _end_BlastOutput_reference(self):
> """a reference to the article describing the algorithm
> """
> self._blast.reference = self._value
>
> def _end_BlastOutput_db(self):
> """the database(s) searched
> """
> self._blast.database = self._value
>
> def _end_BlastOutput_query_ID(self):
> """the identifier of the query
> """
> self._blast.query_id = self._value
>
> def _end_BlastOutput_query_def(self):
> """the definition line of the query
> """
> self._blast.query = self._value
>
> def _end_BlastOutput_query_len(self):
> """the length of the query
> """
> self._blast.query_length = int(self._value)
>
> ## def _end_BlastOutput_query_seq(self):
> ## """the query sequence
> ## """
> ## pass # XXX Missing in Record.Blast ?
>
> ## def _end_BlastOutput_iter_num(self):
> ## """the psi-blast iteration number
> ## """
> ## pass # XXX TODO PSI
>
> # non-existent in blastall 2.2.13 output
> def _end_BlastOutput_hits(self):
> """hits to the database sequences, one for every sequence
> """
> self._blast.num_hits = int(self._value)
>
> ## def _end_BlastOutput_message(self):
> ## """error messages
> ## """
> ## pass # XXX What to do ?
>
> # Parameters
> def _end_Parameters_matrix(self):
> """matrix used (-M)
> """
> self._blast.matrix = self._value
>
> def _end_Parameters_expect(self):
> """expect values cutoff (-e)
> """
> self._blast.expect = self._value
>
> ## def _end_Parameters_include(self):
> ## """inclusion threshold for a psi-blast iteration (-h)
> ## """
> ## pass # XXX TODO PSI
>
> def _end_Parameters_sc_match(self):
> """match score for nucleotide-nucleotide comparison (-r)
> """
> self._blast.sc_match = int(self._value)
>
> def _end_Parameters_sc_mismatch(self):
> """mismatch penalty for nucleotide-nucleotide comparison (-q)
> """
> self._blast.sc_mismatch = int(self._value)
>
> def _end_Parameters_gap_open(self):
> """gap existence cost (-G)
> """
> self._blast.gap_penalties[0] = int(self._value)
>
> def _end_Parameters_gap_extend(self):
> """gap extension cost (-E)
> """
> self._blast.gap_penalties[1] = int(self._value)
>
> def _end_Parameters_filter(self):
> """filtering options (-F)
> """
> self._blast.filter = self._value
>
> ## def _end_Parameters_pattern(self):
> ## """pattern used for phi-blast search
> ## """
> ## pass # XXX TODO PSI
>
> ## def _end_Parameters_entrez_query(self):
> ## """entrez query used to limit search
> ## """
> ## pass # XXX TODO PSI
>
> # Hits
> def _start_Hit(self):
> self._blast.alignments.append(Record.Alignment())
> self._blast.descriptions.append(Record.Description())
> self._blast.multiple_alignment = []
> self._hit = self._blast.alignments[-1]
> self._descr = self._blast.descriptions[-1]
> self._descr.num_alignments = 0
>
> # Hit_num is useless
>
> def _end_Hit_id(self):
> """identifier of the matched database sequence
> """
> self._hit.title_id = self._value
> self._descr.title_id = self._value
>
> def _end_Hit_def(self):
> """definition line (title) of the database sequence
> """
> self._hit.title = self._value
> self._descr.title = self._value
>
> # not necessary?
> def _end_Hit_accession(self):
> """accession of the database sequence
> """
> self._hit.accession = self._value
> self._descr.accession = self._value
>
> def _end_Hit_len(self):
> self._hit.length = int(self._value)
>
> # HSPs
> def _start_Hsp(self):
> self._hit.hsps.append(Record.HSP())
> self._hsp = self._hit.hsps[-1]
> self._descr.num_alignments += 1
> self._blast.multiple_alignment.append(Record.MultipleAlignment())
> self._mult_al = self._blast.multiple_alignment[-1]
>
> # Hsp_num is useless
> def _end_Hsp_score(self):
> """raw score of HSP
> """
> self._hsp.score = float(self._value)
> # hits are in order of best score to worst. keep best
> if self._descr.score == None:
> self._descr.score = float(self._value)
>
> def _end_Hsp_bit_score(self):
> """bit score of HSP
> """
> self._hsp.bit_score = float(self._value)
> # hits are in order of best score to worst. keep best
> if self._descr.bit_score == None:
> self._descr.bit_score = float(self._value)
>
> def _end_Hsp_evalue(self):
> """expect value of the HSP
> """
> self._hsp.expect = float(self._value)
> if self._descr.evalue == None:
> self._descr.evalue = float(self._value)
>
> def _end_Hsp_query_from(self):
> """offset of query at the start of the alignment (one-offset)
> """
> self._hsp.query_start = int(self._value)
>
> def _end_Hsp_query_to(self):
> """offset of query at the end of the alignment (one-offset)
> """
> self._hsp.query_stop = int(self._value)
>
> def _end_Hsp_hit_from(self):
> """offset of the database at the start of the alignment (one-offset)
> """
> self._hsp.sbjct_start = int(self._value)
>
> def _end_Hsp_hit_to(self):
> """offset of the database at the end of the alignment (one-offset)
> """
> self._hsp.sbjct_stop = int(self._value)
>
>
> ## def _end_Hsp_pattern_from(self):
> ## """start of phi-blast pattern on the query (one-offset)
> ## """
> ## pass # XXX TODO PSI
>
> ## def _end_Hsp_pattern_to(self):
> ## """end of phi-blast pattern on the query (one-offset)
> ## """
> ## pass # XXX TODO PSI
>
> def _end_Hsp_query_frame(self):
> """frame of the query if applicable
> """
> self._hsp.frame = (int(self._value),)
>
> def _end_Hsp_hit_frame(self):
> """frame of the database sequence if applicable
> """
> self._hsp.frame += (int(self._value),)
>
> def _end_Hsp_identity(self):
> """number of identities in the alignment
> """
> self._hsp.identities = (int(self._value), None)
>
> def _end_Hsp_positive(self):
> """number of positive (conservative) substitutions in the alignment
> """
> self._hsp.positives = (int(self._value), None)
>
> def _end_Hsp_align_len(self):
> """length of the alignment
> """
> self._hsp.align_length = int(self._value)
>
> def _end_Hsp_gaps(self):
> """number of gaps in the alignment
> """
> self._hsp.gaps = (int(self._value),None)
>
> ## def _end_Hsp_density(self):
> ## """score density
> ## """
> ## pass # XXX ???
>
> def _end_Hsp_qseq(self):
> """alignment string for the query
> """
> self._hsp.query = self._value
>
> def _end_Hsp_hseq(self):
> """alignment string for the database
> """
> self._hsp.sbjct = self._value
>
> def _end_Hsp_midline(self):
> """Formatting middle line as normally seen in BLAST report
> """
> self._hsp.match = self._value
>
> # Statistics
> def _end_Statistics_db_num(self):
> """number of sequences in the database
> """
> self._blast.num_sequences_in_database = int(self._value)
>
> def _end_Statistics_db_len(self):
> """number of letters in the database
> """
> self._blast.num_letters_in_database = int(self._value)
>
> def _end_Statistics_hsp_len(self):
> """the effective HSP length
> """
> self._blast.effective_hsp_length = int(self._value)
>
> def _end_Statistics_eff_space(self):
> """the effective search space
> """
> self._blast.effective_search_space = float(self._value)
>
> def _end_Statistics_kappa(self):
> """Karlin-Altschul parameter K
> """
> # ka_params is documented in Record as [lambda, k, h]
> self._blast.ka_params[1] = float(self._value)
>
> def _end_Statistics_lambda(self):
> """Karlin-Altschul parameter Lambda
> """
> self._blast.ka_params[0] = float(self._value)
>
> def _end_Statistics_entropy(self):
> """Karlin-Altschul parameter H
> """
> self._blast.ka_params[2] = float(self._value)
>
>
> if __name__ == '__main__':
> import sys
> p = BlastParser()
> r = p.parse(sys.argv[1])
>
> # Small test
> print 'Blast of', r.query
> print 'Found %s alignments with a total of %s HSPs' % (len(r.alignments),
> reduce(lambda a,b: a+b,
> [len(a.hsps) for a in r.alignments]))
>
> for al in r.alignments:
> print al.title[:50], al.length, 'bp', len(al.hsps), 'HSPs'
>
> # Cookbook example
> E_VALUE_THRESH = 0.04
> for alignment in r.alignments:
> for hsp in alignment.hsps:
> if hsp.expect < E_VALUE_THRESH:
> print '*****'
> print 'sequence', alignment.title
> print 'length', alignment.length
> print 'e value', hsp.expect
> print hsp.query[:75] + '...'
> print hsp.match[:75] + '...'
> print hsp.sbjct[:75] + '...'
>
>
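One consequence of calling self._blast.__init__() inside parse(), as far as can be told from the quoted code, is that every call returns the same object, so a record kept from an earlier call is overwritten by the next one. The sketch below (Python 3) contrasts that with creating a fresh record per call. Record, ReusingParser and FreshParser are hypothetical stand-ins, and the sketch says nothing about the performance cost mentioned in the thread.

```python
class Record:
    """Stand-in for Record.Blast with a single list attribute."""

    def __init__(self):
        self.alignments = []


class ReusingParser:
    """Re-initialises one shared record, like the quoted parse()."""

    def __init__(self):
        self._record = Record()

    def parse(self, hits):
        self._record.__init__()  # resets attributes but keeps the same object
        self._record.alignments.extend(hits)
        return self._record


class FreshParser:
    """Builds a new record for every call."""

    def parse(self, hits):
        record = Record()
        record.alignments.extend(hits)
        return record


reusing = ReusingParser()
first = reusing.parse(['P1'])
second = reusing.parse(['P2'])  # 'first' now also shows ['P2']

fresh = FreshParser()
kept = fresh.parse(['P1'])
later = fresh.parse(['P2'])     # 'kept' still shows ['P1']
```

Looping over an iterator and using each record before asking for the next hides the difference; collecting the records into a list exposes it.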
>
> # Copyright 1999-2000 by Jeffrey Chang. All rights reserved.
> # This code is part of the Biopython distribution and governed by its
> # license. Please see the LICENSE file that should have been included
> # as part of this package.
>
> """Record classes to hold BLAST output.
>
> Classes:
> Blast Holds all the information from a blast search.
> PSIBlast Holds all the information from a psi-blast search.
>
> Header Holds information from the header.
> Description Holds information about one hit description.
> Alignment Holds information about one alignment hit.
> HSP Holds information about one HSP.
> MultipleAlignment Holds information about a multiple alignment.
> DatabaseReport Holds information from the database report.
> Parameters Holds information from the parameters.
>
> """
> # XXX finish printable BLAST output
>
> import string
>
> from Bio.Align import Generic
>
> class Header:
> """Saves information from a blast header.
>
> Members:
> application The name of the BLAST flavor that generated this data.
> version Version of blast used.
> date Date this data was generated.
> reference Reference for blast.
>
> query Name of the query sequence.
> query_id Query ID (necessary?) (str)
> query_length Length of the query sequence. (int)
>
> database Name of the database.
> """
> def __init__(self):
> self.application = ''
> self.version = ''
> self.date = ''
> self.reference = ''
>
> self.query = ''
> self.query_id = ''
> self.query_length = None
>
> self.database = ''
>
> class Description:
> """Stores information about one hit in the descriptions section.
>
> Members:
> title Title of the hit.
> title_id Hit ID (necessary?). (str)
> accession Hit accession (necessary?). (str)
> score Raw score of the best HSP. (float)
> bit_score Bit score of the best HSP. (float)
> evalue Expect value. (float)
> num_alignments Number of alignments for the same subject. (int)
>
> """
> def __init__(self):
> self.title = ''
> self.title_id = ''
> self.accession = ''
> self.score = None
> self.bit_score = None
> self.evalue = None
> self.num_alignments = None
> def __str__(self):
> return "%-66s %5s %s %s" % (self.title, self.score, self.bit_score, self.evalue)
>
> class Alignment:
> """Stores information about one hit in the alignments section.
>
> Members:
> title Name of the matched sequence. (str)
> title_id Hit ID (necessary?). (str)
> accession Hit accession (necessary?). (str)
> length Length of matched sequence. (int)
> hsps A list of HSP objects.
>
> """
> def __init__(self):
> self.title = ''
> self.title_id = ''
> self.accession = ''
> self.length = None
> self.hsps = []
> def __str__(self):
> lines = []
> titles = string.split(self.title, '\n')
> for i in range(len(titles)):
> if i:
> lines.append(" ")
> lines.append("%s\n" % titles[i])
> lines.append(" Length = %s\n" % self.length)
> return string.join(lines, '')
>
> class HSP:
> """Stores information about one hsp in an alignment hit.
>
> Members:
> score BLAST score of hit. (float)
> bit_score Number of bits for that score. (float)
> expect Expect value. (float)
> num_alignments Number of alignments for same subject. (int)
> identities Number of identities/total aligned. tuple of (int, int)
> positives Number of positives/total aligned. tuple of (int, int)
> gaps Number of gaps/total aligned. tuple of (int, int)
> strand Tuple of (query, target) strand.
> frame Tuple of 1 or 2 frame shifts, depending on the flavor.
>
> query The query sequence.
> query_start The start residue for the query sequence. (1-based)
> query_stop The end residue for the query sequence. (1-based)
> match The match sequence.
> sbjct The sbjct sequence.
> sbjct_start The start residue for the sbjct sequence. (1-based)
> sbjct_stop The end residue for the sbjct sequence. (1-based)
>
> align_length Length of the alignment. (int)
>
> Not all flavors of BLAST return values for every attribute:
> score expect identities positives strand frame
> BLASTP X X X X
> BLASTN X X X X X
> BLASTX X X X X X
> TBLASTN X X X X X
> TBLASTX X X X X X/X
>
> Note: for BLASTX, the query sequence is shown as a protein sequence,
> but the numbering is based on the nucleotides. Thus, the numbering
> is 3x larger than the number of amino acid residues. A similar effect
> can be seen for the sbjct sequence in TBLASTN, and for both sequences
> in TBLASTX.
>
> Also, for negative frames, the sequence numbering starts from
> query_start and counts down.
>
> """
> def __init__(self):
> self.score = None
> self.bit_score = None
> self.expect = None
> self.num_alignments = None
> self.identities = (None, None)
> self.positives = (None, None)
> self.gaps = (None, None)
> self.strand = (None, None)
> self.frame = ()
>
> self.query = ''
> self.query_start = None
> self.query_stop = None
> self.match = ''
> self.sbjct = ''
> self.sbjct_start = None
> self.sbjct_stop = None
>
> self.align_length = None
>
> class MultipleAlignment:
> """Holds information about a multiple alignment.
>
> Members:
> alignment A list of tuples (name, start residue, sequence, end residue).
>
> The start residue is 1-based. It may be blank, if that sequence is
> not aligned in the multiple alignment.
>
> """
> def __init__(self):
> self.alignment = []
>
> def to_generic(self, alphabet):
> """Retrieve generic alignment object for the given alignment.
>
> Instead of the tuples, this returns an Alignment object from
> Bio.Align.Generic, through which you can manipulate and query
> the object.
>
> alphabet is the specified alphabet for the sequences in the code (for
> example IUPAC.IUPACProtein).
>
> Thanks to James Casbon for the code.
> """
> seq_parts = []
> seq_names = []
> parse_number = 0
> n = 0
> for name, start, seq, end in self.alignment:
> if name == 'QUERY': #QUERY is the first in each alignment block
> parse_number = parse_number + 1
> n = 0
>
> if parse_number == 1: # create on first_parse, append on all others
> seq_parts.append(seq)
> seq_names.append(name)
> else:
> seq_parts[n] = seq_parts[n] + seq
> n = n + 1
>
> generic = Generic.Alignment(alphabet)
> for (name,seq) in zip(seq_names,seq_parts):
> generic.add_sequence(name, seq)
>
> return generic
>
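The block-stitching loop in to_generic() can be tried out without Bio.Align.Generic. This is a sketch in Python 3 under the same assumption the quoted code appears to make, namely that every block starts with a QUERY row and lists the same sequences in the same order. stitch_blocks and the sample rows are invented for illustration.

```python
def stitch_blocks(rows):
    """Join the per-block pieces of each sequence into one (name, sequence) pair."""
    names = []
    parts = []
    block = 0
    n = 0
    for name, start, seq, end in rows:
        if name == 'QUERY':  # QUERY is the first row of each alignment block
            block += 1
            n = 0
        if block == 1:
            # First block: create one entry per sequence.
            names.append(name)
            parts.append(seq)
        else:
            # Later blocks: append to the entry at the same position.
            parts[n] += seq
        n += 1
    return list(zip(names, parts))


rows = [
    ('QUERY', 1, 'ACG', 3),
    ('hit1', 1, 'A-G', 3),
    ('QUERY', 4, 'TT', 5),
    ('hit1', 4, 'TA', 5),
]
stitched = stitch_blocks(rows)
```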
> class Round:
> """Holds information from a PSI-BLAST round.
>
> Members:
> number Round number. (int)
> reused_seqs Sequences in model, found again. List of Description objects.
> new_seqs Sequences not found, or below threshold. List of Description.
> alignments A list of Alignment objects.
> multiple_alignment A MultipleAlignment object.
>
> """
> def __init__(self):
> self.number = None
> self.reused_seqs = []
> self.new_seqs = []
> self.alignments = []
> self.multiple_alignment = None
>
> class DatabaseReport:
> """Holds information about a database report.
>
> Members:
> database_name List of database names. (can have multiple dbs)
> num_letters_in_database Number of letters in the database. (int)
> num_sequences_in_database Number of sequences in the database. (int)
> posted_date List of the dates the databases were posted.
> ka_params Array of [lambda, k, h] values. (floats)
> gapped # XXX this isn't set right!
> ka_params_gap Array of [lambda, k, h] values. (floats)
>
> """
> def __init__(self):
> self.database_name = []
> self.posted_date = []
> self.num_letters_in_database = None
> self.num_sequences_in_database = None
> self.ka_params = [None, None, None]
> self.gapped = 0
> self.ka_params_gap = [None, None, None]
>
> class Parameters:
>     """Holds information about the parameters.
>
>     Members:
>     matrix                       Name of the matrix.
>     filter                       Filter parameter. (str)
>     gap_penalties                Array of [open, extend] penalties. (floats)
>     sc_match                     Match score for nucleotide-nucleotide comparison
>     sc_mismatch                  Mismatch penalty for nucleotide-nucleotide comparison
>     num_hits                     Number of hits to the database. (int)
>     num_sequences                Number of sequences. (int)
>     num_good_extends             Number of extensions. (int)
>     expect                       Expectation value. (int)
>     hsps_no_gap                  Number of HSP's better, without gapping. (int)
>     hsps_prelim_gapped           Number of HSP's gapped in prelim test. (int)
>     hsps_prelim_gapped_attemped  Number of HSP's attempted in prelim. (int)
>     hsps_gapped                  Total number of HSP's gapped. (int)
>     query_length                 Length of the query. (int)
>     database_length              Number of letters in the database. (int)
>     effective_hsp_length         Effective HSP length. (int)
>     effective_query_length       Effective length of query. (int)
>     effective_database_length    Effective length of database. (int)
>     effective_search_space       Effective search space. (int)
>     effective_search_space_used  Effective search space used. (int)
>     frameshift                   Frameshift window. Tuple of (int, float)
>     threshold                    Threshold. (int)
>     window_size                  Window size. (int)
>     dropoff_1st_pass             Tuple of (score, bits). (int, float)
>     gap_x_dropoff                Tuple of (score, bits). (int, float)
>     gap_x_dropoff_final          Tuple of (score, bits). (int, float)
>     gap_trigger                  Tuple of (score, bits). (int, float)
>     blast_cutoff                 Tuple of (score, bits). (int, float)
>     """
>     def __init__(self):
>         self.matrix = ''
>         self.filter = ''
>         self.gap_penalties = [None, None]
>         self.sc_match = None
>         self.sc_mismatch = None
>         self.num_hits = None
>         self.num_sequences = None
>         self.num_good_extends = None
>         self.expect = None
>         self.hsps_no_gap = None
>         self.hsps_prelim_gapped = None
>         self.hsps_prelim_gapped_attemped = None
>         self.hsps_gapped = None
>         self.query_length = None
>         self.database_length = None
>         self.effective_hsp_length = None
>         self.effective_query_length = None
>         self.effective_database_length = None
>         self.effective_search_space = None
>         self.effective_search_space_used = None
>         self.frameshift = (None, None)
>         self.threshold = None
>         self.window_size = None
>         self.dropoff_1st_pass = (None, None)
>         self.gap_x_dropoff = (None, None)
>         self.gap_x_dropoff_final = (None, None)
>         self.gap_trigger = (None, None)
>         self.blast_cutoff = (None, None)
>
> class Blast(Header, DatabaseReport, Parameters):
>     """Saves the results from a blast search.
>
>     Members:
>     descriptions        A list of Description objects.
>     alignments          A list of Alignment objects.
>     multiple_alignment  A MultipleAlignment object.
>     + members inherited from base classes
>
>     """
>     def __init__(self):
>         Header.__init__(self)
>         DatabaseReport.__init__(self)
>         Parameters.__init__(self)
>         self.descriptions = []
>         self.alignments = []
>         self.multiple_alignment = None
>
> class PSIBlast(Header, DatabaseReport, Parameters):
>     """Saves the results from a blastpgp search.
>
>     Members:
>     rounds     A list of Round objects.
>     converged  Whether the search converged.
>     + members inherited from base classes
>
>     """
>     def __init__(self):
>         Header.__init__(self)
>         DatabaseReport.__init__(self)
>         Parameters.__init__(self)
>         self.rounds = []
>         self.converged = 0
>
>
>
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
>
>
From jmjoseph at andrew.cmu.edu Wed Jul 19 19:18:12 2006
From: jmjoseph at andrew.cmu.edu (Jacob Joseph)
Date: Wed, 19 Jul 2006 15:18:12 -0400
Subject: [BioPython] import Standalone problems
In-Reply-To:
References: <44B4FF41.9070608@gmail.com>
<44B58141.5080804@maubp.freeserve.co.uk>
<44B639D6.2010503@gmail.com>
<44B6BC39.4050606@maubp.freeserve.co.uk>
<44BD473A.6030903@maubp.freeserve.co.uk>
<44BDB9B8.2060808@jjoseph.org>
Message-ID: <44BE8574.1020205@andrew.cmu.edu>
I do not believe the current version of the parser will work with
multiple queries using recent versions of BLAST, regardless of the output
format. I do know that XML output from blastall 2.2.13 works with the
parser corrections previously attached. I have attached a further
updated NCBIXML.py, fixing the performance issues in parse() that I
mentioned.
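
For readers following along, here is a toy illustration of the stale-state
problem behind Bug 1970. It is only a sketch and not the actual NCBIXML code:
ToyRecord, ReusingParser and FreshParser are made-up names. A parser that
keeps a single record object for its whole lifetime accumulates hits across
queries (P1, then P1+P2, ...), while one that builds a new record on each
parse() call does not.

```python
class ToyRecord(object):
    """Hypothetical stand-in for a BLAST record; holds only a list of hits."""
    def __init__(self):
        self.alignments = []


class ReusingParser(object):
    """Keeps one record for its lifetime, so hits pile up across queries."""
    def __init__(self):
        self._record = ToyRecord()

    def parse(self, hits):
        self._record.alignments.extend(hits)
        return self._record


class FreshParser(object):
    """Creates a new record on every parse() call, so nothing carries over."""
    def parse(self, hits):
        record = ToyRecord()
        record.alignments.extend(hits)
        return record


# Three queries, one hit each (made-up data).
queries = [["P1_hit"], ["P2_hit"], ["P3_hit"]]

reusing = ReusingParser()
leaky = [list(reusing.parse(q).alignments) for q in queries]

fresh = FreshParser()
clean = [list(fresh.parse(q).alignments) for q in queries]
```

Here `leaky` grows with every query, which matches the reported symptom,
while `clean` contains exactly one hit list per query.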
-Jacob
Rohini Damle wrote:
> Hi,
> Can someone suggest which version of BLAST Biopython's (text or XML)
> parser works with?
> I will download that BLAST version locally so that I can use Biopython's parser.
> Thanks,
> Rohini
-------------- next part --------------
A non-text attachment was scrubbed...
Name: NCBIXML.py.gz
Type: application/x-gzip
Size: 3209 bytes
Desc: not available
URL:
From rohini.damle at gmail.com Thu Jul 20 16:39:12 2006
From: rohini.damle at gmail.com (Rohini Damle)
Date: Thu, 20 Jul 2006 09:39:12 -0700
Subject: [BioPython] import Standalone problems
In-Reply-To: <44BE8574.1020205@andrew.cmu.edu>
References: <44B4FF41.9070608@gmail.com> <44B639D6.2010503@gmail.com>
<44B6BC39.4050606@maubp.freeserve.co.uk>
<44BD473A.6030903@maubp.freeserve.co.uk>
<44BDB9B8.2060808@jjoseph.org>
<44BE8574.1020205@andrew.cmu.edu>
Message-ID:
Hi,
When I tried your NCBIXML.py in place of the original one, I got the
following error message:

  File "C:\Python24\lib\site-packages\Bio\Blast\NCBIXML.py", line 210,
  in _end_Parameters_gap_open
    self._blast.gap_penalties[0] = int(self._value)
TypeError: object does not support item assignment

The original version does not have that "[0]" after
self._blast.gap_penalties.
What might be causing this error?
-Rohini
On 7/19/06, Jacob Joseph wrote:
> I do not believe the current version of the parser will work with
> multiple queries using recent version of blast, regardless of the output
> format. I do know that blastall 2.2.13 with XML functions with the
> parser corrections previously attached. I have attached a further
> updated NCBIXML.py, fixing the performance issues in parse() that I
> mentioned.
>
> -Jacob
From jmjoseph at andrew.cmu.edu Thu Jul 20 18:08:15 2006
From: jmjoseph at andrew.cmu.edu (Jacob Joseph)
Date: Thu, 20 Jul 2006 14:08:15 -0400
Subject: [BioPython] import Standalone problems
In-Reply-To:
References: <44B4FF41.9070608@gmail.com> <44B639D6.2010503@gmail.com>
<44B6BC39.4050606@maubp.freeserve.co.uk>
<44BD473A.6030903@maubp.freeserve.co.uk>
<44BDB9B8.2060808@jjoseph.org>
<44BE8574.1020205@andrew.cmu.edu>
Message-ID: <44BFC68F.8030802@andrew.cmu.edu>
Hi. I suspect you are not using my updated Record.py. You'll notice
that, at least for the moment, I have changed _blast.gap_penalties from a
tuple to a list, which allows each item to be assigned separately without
worrying about the order of entries within the XML file. There are other
ways this could be accomplished while still using a tuple.
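
A minimal sketch of the difference, assuming a cut-down stand-in for the
record class (ToyParameters and both helper functions are hypothetical names,
not part of Biopython): item assignment works on a list, raises TypeError on
a tuple, and a tuple can still be used if it is rebuilt rather than modified
in place.

```python
class ToyParameters(object):
    """Hypothetical stand-in holding only gap_penalties."""
    def __init__(self, use_list):
        # list as in the updated Record.py, tuple as in the older version
        self.gap_penalties = [None, None] if use_list else (None, None)


def set_gap_open_by_item(params, value):
    """Assign the first entry in place; only works when gap_penalties is a list."""
    params.gap_penalties[0] = int(value)


def set_gap_open_rebuild(params, value):
    """Tuple-friendly alternative: build a new tuple, keeping the extend penalty."""
    params.gap_penalties = (int(value), params.gap_penalties[1])


with_list = ToyParameters(use_list=True)
set_gap_open_by_item(with_list, "11")          # fine: lists are mutable

with_tuple = ToyParameters(use_list=False)
try:
    set_gap_open_by_item(with_tuple, "11")     # tuples do not support item assignment
    tuple_assignment_failed = False
except TypeError:
    tuple_assignment_failed = True

set_gap_open_rebuild(with_tuple, "11")         # works, whatever order the values arrive in
```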
-Jacob
Rohini Damle wrote:
> Hi,
> When I tried on your NCBIXML.py code instead of oringinal one I am
> getting following error messege:
>
> File "C:\Python24\lib\site-packages\Bio\Blast\NCBIXML.py", line 210,
> in _end_Parameters_gap_open
> self._blast.gap_penalties[0] = int(self._value)
> TypeError: object does not support item assignment
>
> in the original version
> we don't have that " [0] " in self._blast.gap_penalties
>
> what might be causing this error?
> -Rohini
From rohini.damle at gmail.com Thu Jul 20 18:59:52 2006
From: rohini.damle at gmail.com (Rohini Damle)
Date: Thu, 20 Jul 2006 11:59:52 -0700
Subject: [BioPython] import Standalone problems
In-Reply-To: <44BFC68F.8030802@andrew.cmu.edu>
References: <44B4FF41.9070608@gmail.com>
<44B6BC39.4050606@maubp.freeserve.co.uk>
<44BD473A.6030903@maubp.freeserve.co.uk>
<44BDB9B8.2060808@jjoseph.org>
<44BE8574.1020205@andrew.cmu.edu>
<44BFC68F.8030802@andrew.cmu.edu>
Message-ID:
Hi,
Now I used your updated Record.py, NCBIXML.py and NCBIStandalone.py
(all updated).
I am not getting that previous error.
BUT I am still not getting the desired output ...
Here is my code:

    from Bio.Blast import NCBIXML, NCBIStandalone

    blast_out = open("C:/Documents and Settings/rdamle/My Documents/Rohini's Documents/Blast Parsing/onlymouse4proteinblastout.xml", "r")

    b_parser = NCBIXML.BlastParser()
    b_iterator = NCBIStandalone.Iterator(blast_out, b_parser)
    E_VALUE_THRESH = 22

    for b_record in b_iterator:
        for alignment in b_record.alignments:
            for hsp in alignment.hsps:
                if hsp.expect < E_VALUE_THRESH:
                    print b_record.query.split()[0]
                    print '****Alignment****'
                    print 'sequence:', alignment.title.split()[0]

With this code I was expecting to get only the alignments with
hsp.expect < 22,
BUT I AM GETTING ALL the alignments, not just the ones with evalue < 22.
-Rohini.

On 7/20/06, Jacob Joseph wrote:
> Hi. I suspect you are not using my updated Record.py. You'll notice
> that, at least for the moment, I have changed _blast.gap_penalties to an
> array to allow assignment per item without worrying about the order of
> entries within the xml file. There are other ways this could be
> accomplished while still using a tuple.
>
> -Jacob
From rohini.damle at gmail.com Thu Jul 20 19:05:14 2006
From: rohini.damle at gmail.com (Rohini Damle)
Date: Thu, 20 Jul 2006 12:05:14 -0700
Subject: [BioPython] import Standalone problems
In-Reply-To:
References: <44B4FF41.9070608@gmail.com>
<44BD473A.6030903@maubp.freeserve.co.uk>
<44BDB9B8.2060808@jjoseph.org>
<44BE8574.1020205@andrew.cmu.edu>
<44BFC68F.8030802@andrew.cmu.edu>
Message-ID:
Hi,
I used hsp.evalue instead of hsp.expect and I am getting the desired output.
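
One possible explanation, offered as an assumption rather than something
confirmed in this thread: if the patched parser stores the value in
hsp.evalue and leaves hsp.expect unset (None), then under Python 2 the test
`None < 22` is True, so every HSP passes the filter. The sketch below uses a
made-up ToyHSP class and a hypothetical helper, passes_threshold, to show a
comparison that treats a missing value as a failed test instead.

```python
class ToyHSP(object):
    """Hypothetical stand-in for an HSP record with two E-value attributes."""
    def __init__(self, evalue=None, expect=None):
        self.evalue = evalue
        self.expect = expect


def passes_threshold(hsp, threshold, attr="evalue"):
    """Return True only if the chosen attribute is set and below the threshold."""
    value = getattr(hsp, attr, None)
    if value is None:
        # An unset value must not slip through the filter.
        return False
    return value < threshold


# Made-up data: one strong hit and one weak hit; 'expect' is never filled in.
hsps = [ToyHSP(evalue=1e-5), ToyHSP(evalue=50.0)]

kept_by_evalue = [h.evalue for h in hsps if passes_threshold(h, 22)]
kept_by_expect = [h.evalue for h in hsps if passes_threshold(h, 22, attr="expect")]
```

With the explicit None check, filtering on the unset attribute keeps nothing,
which makes the mistake obvious instead of silently printing every alignment.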
Thank you very much for your help, efforts, and all those modified files.
Rohini
On 7/20/06, Rohini Damle wrote:
> Hi,
> Now I used your updated Record.py, NCBIXML.py and NCBIStandalone.py
> (all updated).
> I am not getting that previous error.
> BUT I am still not getting the desired output ...
> Here is my code:
>
>     blast_out = open("C:/Documents and Settings/rdamle/My Documents/Rohini's Documents/Blast Parsing/onlymouse4proteinblastout.xml", "r")
>
>     b_parser = NCBIXML.BlastParser()
>     b_iterator = NCBIStandalone.Iterator(blast_out, b_parser)
>     E_VALUE_THRESH = 22
>
>     for b_record in b_iterator:
>         for alignment in b_record.alignments:
>             for hsp in alignment.hsps:
>                 if hsp.expect < E_VALUE_THRESH:
>                     print b_record.query.split()[0]
>                     print '****Alignment****'
>                     print 'sequence:', alignment.title.split()[0]
>
> With this code I was expecting to get only the alignments with
> hsp.expect < 22,
> BUT I AM GETTING ALL the alignments, not just the ones with evalue < 22.
> -Rohini.
>
>
>
>
>
> On 7/20/06, Jacob Joseph wrote:
> > Hi. I suspect you are not using my updated Record.py. You'll notice
> > that, at least for the moment, I have changed _blast.gap_penalties to an
> > array to allow assignment per item without worrying about the order of
> > entries within the xml file. There are other ways this could be
> > accomplished while still using a tuple.
> >
> > -Jacob
> >
> > Rohini Damle wrote:
> > > Hi,
> > > When I tried on your NCBIXML.py code instead of oringinal one I am
> > > getting following error messege:
> > >
> > > File "C:\Python24\lib\site-packages\Bio\Blast\NCBIXML.py", line 210,
> > > in _end_Parameters_gap_open
> > > self._blast.gap_penalties[0] = int(self._value)
> > > TypeError: object does not support item assignment
> > >
> > > in the original version
> > > we don't have that " [0] " in self._blast.gap_penalties
> > >
> > > what might be causing this error?
> > > -Rohini
> > >
> > > On 7/19/06, Jacob Joseph wrote:
> > >> I do not believe the current version of the parser will work with
> > >> multiple queries using recent version of blast, regardless of the output
> > >> format. I do know that blastall 2.2.13 with XML functions with the
> > >> parser corrections previously attached. I have attached a further
> > >> updated NCBIXML.py, fixing the performance issues in parse() that I
> > >> mentioned.
> > >>
> > >> -Jacob
> > >>
> > >> Rohini Damle wrote:
> > >> > Hi,
> > >> > Can someone suggest me for which version of Blast, the Biopython's
> > >> > (text or xml) parser works fine?
> > >> > I will download that blast version locally and can use biopython's
> > >> parser.
> > >> > thanx,
> > >> > Rohini
> > >> >
> > >> > On 7/18/06, Jacob Joseph wrote:
> > >> >> Hi.
> > >> >> I encountered similar difficulties over the past few days myself and
> > >> >> have made some improvements to the XML parser. Well, that is, it now
> > >> >> functions with blastall, but I have made no effort to parse the other
> > >> >> blast programs. I do not expect I have done any harm to other
> > >> parsing,
> > >> >> however.
> > >> >>
> > >> >> Attached are Record.py, NCBIStandalone.py, and NCBIXML.py. I have not
> > >> >> yet spent significant time to clean up my changes. Without getting
> > >> into
> > >> >> specific modifications, I have made an effort to make consistent the
> > >> >> variables in Record and NCBIXML, focusing primarily on what I needed
> > >> >> this week.
> > >> >>
> > >> >> One portion I am not settled on reinitialization of Record.Blast at
> > >> >> every call to iterator.next(), and, by extension, BlastParser.parse().
> > >> >> See NCBIXML.py, line 114. Without re-initializing this class, we run
> > >> >> the risk of retaining portions of a Record from previously parsed
> > >> >> queries. This causes the bug 1970, mentioned below. Unfortunately,
> > >> >> this re-initialization exacts a significant performance penalty of at
> > >> >> least a factor of 10 by some rough measures. I would appreciate any
> > >> >> suggestions for improvement here.
> > >> >>
> > >> >> I do apologize for not being more specific about my changes. When
> > >> I get
> > >> >> a chance(next week?), I will package them up as a proper patch and
> > >> file
> > >> >> a bug. Perhaps what I have done so far will be of use until then.
> > >> >>
> > >> >> fyi, I have done all of my testing with Blast 2.2.13. 2.2.14 seems to
> > >> >> not have separate blocks within its output, requiring a
> > >> different
> > >> >> method of iteration.
> > >> >>
> > >> >> -Jacob
> > >> >>
> > >> >> Peter wrote:
> > >> >> > Rohini Damle wrote:
> > >> >> >> Hi,
> > >> >> >> I have a XML file with 4 blast records (for proteins P1, P2, P3,
> > >> P4)
> > >> >> >> I am trying to extract alignment information for each of them.
> > >> >> >> So I wrote the following code:
> > >> >> >>
> > >> >> >> for b_record in b_iterator :
> > >> >> >>
> > >> >> >> E_VALUE_THRESH =20
> > >> >> >> for alignment in b_record.alignments:
> > >> >> >> for hsp in alignment.hsps:
> > >> >> >> if hsp.expect< E_VALUE_THRESH:
> > >> >> >>
> > >> >> >> print '****Alignment****'
> > >> >> >> print 'sequence:',
> > >> >> alignment.title.split()[0]
> > >> >> >>
> > >> >> >> With this code, I am getting information for P1,
> > >> >> >> then information for P1 + P2
> > >> >> >> then for P1+P2 +P3
> > >> >> >> and finally for P1+P2+P3+P4
> > >> >> >> why this is so?
> > >> >> >> is there something wrong with the looping?
> > >> >> >
> > >> >> > I'm aware of something funny with the XML parsing, Bug 1970, which
> > >> >> might
> > >> >> > well be the same issue:
> > >> >> >
> > >> >> > http://bugzilla.open-bio.org/show_bug.cgi?id=1970
> > >> >> >
> > >> >> > I confess I haven't looked into exactly what is going wrong here
> > >> - too
> > >> >> > many other demands on my time to learn about XML and how BioPython
> > >> >> > parses it.
> > >> >> >
> > >> >> > Does the work around on the bug report help? Depending on which
> > >> >> version
> > >> >> > of standalone blast you have installed, you might have better
> > >> luck with
> > >> >> > plain text output - the trouble is this is a moving target and
> > >> the NBCI
> > >> >> > keeps tweaking it.
> > >> >> >
> > >> >> > Peter
> > >> >> >
> > >> >> > _______________________________________________
> > >> >> > BioPython mailing list - BioPython at lists.open-bio.org
> > >> >> > http://lists.open-bio.org/mailman/listinfo/biopython
>
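The growing output described in the quoted thread (P1, then P1+P2, and so on) is what one would expect if a parser keeps a single record object alive across calls. The following is a minimal sketch of that effect and of the re-initialization fix, using hypothetical stand-in classes (Record, ReusingParser, ResettingParser); these are assumptions for illustration, not the Bio.Blast code.

```python
# Hypothetical stand-ins for illustration; not the Bio.Blast classes.
class Record:
    def __init__(self):
        self.alignments = []


class ReusingParser:
    """Keeps one Record for its whole lifetime, so results pile up."""

    def __init__(self):
        self._record = Record()

    def parse(self, titles):
        self._record.alignments.extend(titles)
        return self._record


class ResettingParser:
    """Starts from a fresh Record on every call to parse()."""

    def parse(self, titles):
        record = Record()
        record.alignments.extend(titles)
        return record


queries = [["P1 hit"], ["P2 hit"], ["P3 hit"]]

reusing = ReusingParser()
# Copy each list so later calls cannot change what was already collected.
leaky = [list(reusing.parse(q).alignments) for q in queries]

resetting = ResettingParser()
clean = [list(resetting.parse(q).alignments) for q in queries]

print(leaky)  # each entry also contains the hits of earlier queries
print(clean)  # each entry contains only its own hits
```

Creating a new object per query costs some time, which may be related to the slowdown mentioned in the thread; the sketch does not attempt to measure that.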
From jmjoseph at andrew.cmu.edu Thu Jul 20 20:05:14 2006
From: jmjoseph at andrew.cmu.edu (Jacob Joseph)
Date: Thu, 20 Jul 2006 16:05:14 -0400
Subject: [BioPython] import Standalone problems
In-Reply-To:
References: <44B4FF41.9070608@gmail.com>
<44BD473A.6030903@maubp.freeserve.co.uk>
<44BDB9B8.2060808@jjoseph.org>
<44BE8574.1020205@andrew.cmu.edu>
<44BFC68F.8030802@andrew.cmu.edu>
Message-ID: <44BFE1FA.4040500@andrew.cmu.edu>
Great!
Can someone point me to the current maintainer of the Blast parsing package?
-Jacob
Rohini Damle wrote:
> Hi,
> I used hsp.evalue instead of hsp.expect and I am getting the desired
> output.
>
> Thank you very much for your help, efforts, and all those modified files.
> Rohini
>
> On 7/20/06, Rohini Damle wrote:
>> Hi,
>> I have now used your updated Record.py, NCBIXML.py and NCBIStandalone.py
>> (all updated).
>> I am no longer getting the previous error,
>> BUT I am still not getting the desired output ...
>> Here is my code:
>>
>> blast_out = open("C:/Documents and Settings/rdamle/My Documents/Rohini's Documents/Blast Parsing/onlymouse4proteinblastout.xml", "r")
>>
>> b_parser = NCBIXML.BlastParser()
>> b_iterator = NCBIStandalone.Iterator(blast_out, b_parser)
>> E_VALUE_THRESH = 22
>>
>> for b_record in b_iterator:
>>     for alignment in b_record.alignments:
>>         for hsp in alignment.hsps:
>>             if hsp.expect < E_VALUE_THRESH:
>>                 print b_record.query.split()[0]
>>                 print '****Alignment****'
>>                 print 'sequence:', alignment.title.split()[0]
>>
>>
>> With this code I was expecting to get only the alignments with
>> hsp.expect < 22,
>> BUT I AM GETTING ALL the alignments, not just the ones with evalue < 22.
>> -Rohini.
>>
>>
>>
>>
>>
>> On 7/20/06, Jacob Joseph wrote:
>> > Hi. I suspect you are not using my updated Record.py. You'll notice
>> > that, at least for the moment, I have changed _blast.gap_penalties to
>> > an array to allow assignment per item without worrying about the order
>> > of entries within the xml file. There are other ways this could be
>> > accomplished while still using a tuple.
>> >
>> > -Jacob
>> >
>> > Rohini Damle wrote:
>> > > Hi,
>> > > When I tried your NCBIXML.py code instead of the original one, I
>> > > got the following error message:
>> > >
>> > > File "C:\Python24\lib\site-packages\Bio\Blast\NCBIXML.py", line 210,
>> > > in _end_Parameters_gap_open
>> > > self._blast.gap_penalties[0] = int(self._value)
>> > > TypeError: object does not support item assignment
>> > >
>> > > in the original version
>> > > we don't have that " [0] " in self._blast.gap_penalties
>> > >
>> > > what might be causing this error?
>> > > -Rohini
>> > >
>> > > On 7/19/06, Jacob Joseph wrote:
>> > >> I do not believe the current version of the parser will work with
>> > >> multiple queries using recent versions of blast, regardless of the
>> > >> output format. I do know that blastall 2.2.13 with XML functions with
>> > >> the parser corrections previously attached. I have attached a further
>> > >> updated NCBIXML.py, fixing the performance issues in parse() that I
>> > >> mentioned.
>> > >>
>> > >> -Jacob
>> > >>
>> > >> Rohini Damle wrote:
>> > >> > Hi,
>> > >> > Can someone tell me which version of Blast Biopython's
>> > >> > (text or XML) parser works with?
>> > >> > I will download that blast version locally so that I can use
>> > >> > Biopython's parser.
>> > >> > Thanks,
>> > >> > Rohini
>> > >> >
>> > >> > On 7/18/06, Jacob Joseph wrote:
>> > >> >> Hi.
>> > >> >> I encountered similar difficulties over the past few days myself and
>> > >> >> have made some improvements to the XML parser. Well, that is, it now
>> > >> >> functions with blastall, but I have made no effort to parse the other
>> > >> >> blast programs. I do not expect I have done any harm to other parsing,
>> > >> >> however.
>> > >> >>
>> > >> >> Attached are Record.py, NCBIStandalone.py, and NCBIXML.py. I have not
>> > >> >> yet spent significant time to clean up my changes. Without getting into
>> > >> >> specific modifications, I have made an effort to make consistent the
>> > >> >> variables in Record and NCBIXML, focusing primarily on what I needed
>> > >> >> this week.
>> > >> >>
>> > >> >> One portion I am not settled on is the reinitialization of Record.Blast
>> > >> >> at every call to iterator.next(), and, by extension, BlastParser.parse().
>> > >> >> See NCBIXML.py, line 114. Without re-initializing this class, we run
>> > >> >> the risk of retaining portions of a Record from previously parsed
>> > >> >> queries. This causes the bug 1970, mentioned below. Unfortunately,
>> > >> >> this re-initialization exacts a significant performance penalty of at
>> > >> >> least a factor of 10 by some rough measures. I would appreciate any
>> > >> >> suggestions for improvement here.
>> > >> >>
>> > >> >> I do apologize for not being more specific about my changes. When I get
>> > >> >> a chance (next week?), I will package them up as a proper patch and file
>> > >> >> a bug. Perhaps what I have done so far will be of use until then.
>> > >> >>
>> > >> >> FYI, I have done all of my testing with Blast 2.2.13. 2.2.14 seems to
>> > >> >> not have separate blocks within its output, requiring a different
>> > >> >> method of iteration.
>> > >> >>
>> > >> >> -Jacob
>> > >> >>
>> > >> >> Peter wrote:
>> > >> >> > Rohini Damle wrote:
>> > >> >> >> Hi,
>> > >> >> >> I have an XML file with 4 blast records (for proteins P1, P2, P3,
>> > >> >> >> P4). I am trying to extract alignment information for each of them.
>> > >> >> >> So I wrote the following code:
>> > >> >> >>
>> > >> >> >> for b_record in b_iterator:
>> > >> >> >>     E_VALUE_THRESH = 20
>> > >> >> >>     for alignment in b_record.alignments:
>> > >> >> >>         for hsp in alignment.hsps:
>> > >> >> >>             if hsp.expect < E_VALUE_THRESH:
>> > >> >> >>                 print '****Alignment****'
>> > >> >> >>                 print 'sequence:', alignment.title.split()[0]
>> > >> >> >>
>> > >> >> >> With this code, I am getting information for P1,
>> > >> >> >> then information for P1 + P2
>> > >> >> >> then for P1+P2 +P3
>> > >> >> >> and finally for P1+P2+P3+P4
>> > >> >> >> Why is this so?
>> > >> >> >> Is there something wrong with the looping?
>> > >> >> >
>> > >> >> > I'm aware of something funny with the XML parsing, Bug 1970, which
>> > >> >> > might well be the same issue:
>> > >> >> >
>> > >> >> > http://bugzilla.open-bio.org/show_bug.cgi?id=1970
>> > >> >> >
>> > >> >> > I confess I haven't looked into exactly what is going wrong here -
>> > >> >> > too many other demands on my time to learn about XML and how
>> > >> >> > BioPython parses it.
>> > >> >> >
>> > >> >> > Does the workaround on the bug report help? Depending on which
>> > >> >> > version of standalone blast you have installed, you might have
>> > >> >> > better luck with plain text output - the trouble is this is a
>> > >> >> > moving target and the NCBI keeps tweaking it.
>> > >> >> >
>> > >> >> > Peter
>> > >> >> >
>>
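The TypeError quoted in this thread ("object does not support item assignment") is what Python raises when code assigns to one element of a tuple. A minimal sketch of the difference follows, assuming a two-slot (gap open, gap extend) layout; the variable names and the values 11 and 1 are illustrative only and are not taken from Record.py.

```python
# Illustrative values and names only; the real attribute lives in Record.py.
penalties_as_tuple = (None, None)
try:
    penalties_as_tuple[0] = 11  # tuples are immutable, so this fails
    tuple_error = None
except TypeError as exc:
    tuple_error = exc

# A list accepts per-item assignment in whatever order values arrive.
penalties_as_list = [None, None]
penalties_as_list[1] = 1   # e.g. the "gap extend" value is seen first
penalties_as_list[0] = 11  # the "gap open" value arrives later

# One way to stay with a tuple: build a new tuple instead of mutating.
gap_open, gap_extend = penalties_as_tuple
rebuilt = (11, gap_extend)
rebuilt = (rebuilt[0], 1)

print(type(tuple_error).__name__)
print(penalties_as_list, rebuilt)
```

Rebuilding the tuple is only one of the "other ways" hinted at in the thread; which approach the parser should use is a design choice for the patch.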
From karbak at gmail.com Sun Jul 23 06:12:51 2006
From: karbak at gmail.com (K. Arun)
Date: Sun, 23 Jul 2006 02:12:51 -0400
Subject: [BioPython] KDTree optional compilation
Message-ID: <162452a10607222312j70ca3ae7h103a8d36c04427@mail.gmail.com>
Hello,
I just updated to version 1.42 (thanks !), and noticed I had to
uncomment the KDTree entry in setup.py in the NUMPY_PACKAGES section
to use Bio.PDB.NeighborSearch. Since KDTree is essential to this
module, would it be possible to add a note about this in the README, or
print a message during compilation to the effect that setup.py needs
to be modified ?
-arun
From mdehoon at c2b2.columbia.edu Sun Jul 23 16:11:52 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sun, 23 Jul 2006 12:11:52 -0400
Subject: [BioPython] import Standalone problems
In-Reply-To: <44BFE1FA.4040500@andrew.cmu.edu>
References: <44B4FF41.9070608@gmail.com> <44BD473A.6030903@maubp.freeserve.co.uk> <44BDB9B8.2060808@jjoseph.org> <44BE8574.1020205@andrew.cmu.edu> <44BFC68F.8030802@andrew.cmu.edu>
<44BFE1FA.4040500@andrew.cmu.edu>
Message-ID: <44C39FC8.3060002@c2b2.columbia.edu>
Jacob Joseph wrote:
> Great!
>
> Can someone point me to the current maintainer of the Blast parsing package?
>
> -Jacob
>
As far as I know, we don't have an official maintainer of the Blast
package. So your best bet is to submit a patch through bugzilla.
--Michiel.
From jdiezperezj at gmail.com Mon Jul 24 15:01:49 2006
From: jdiezperezj at gmail.com (Javier Díez)
Date: Mon, 24 Jul 2006 17:01:49 +0200
Subject: [BioPython] parse xml from flybase ?
In-Reply-To:
References:
Message-ID:
Hi everybody:
I'm new to Biopython (congratulations to everyone involved in its
development).
For the first time I'm trying to parse some XML documents from FlyBase, but
I could not fully parse any big document.
I am using SAX because the documents I am parsing are bigger than
70-100 MB.
My parser always finds something malformed in big documents, so I also
wrote a little script to validate XML documents (in fact I copied it almost
completely from an online manual), and it tells me that they are malformed
too. I think that both scripts work, because I tested them with
well-formed XML documents and they worked OK.
So, I have two questions:
Has anybody worked with FlyBase report XML documents?
Second, could you tell me if there is any Python package to automate
bulk data retrieval (or big queries) from FlyBase?
Thank you
Javi
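For reference, a well-formedness check of the kind described above can be sketched with the standard library's xml.sax module, which streams the input rather than loading it whole. The function name check_well_formed and the tiny sample documents are assumptions for illustration; this is not the script from the message, and it has not been tried on FlyBase reports.

```python
import io
import xml.sax


def check_well_formed(stream):
    """Return None if the XML parses, else (line, column, message)."""
    try:
        # A bare ContentHandler ignores all content; only errors matter here.
        xml.sax.parse(stream, xml.sax.ContentHandler())
    except xml.sax.SAXParseException as exc:
        return exc.getLineNumber(), exc.getColumnNumber(), exc.getMessage()
    return None


# Two small in-memory documents standing in for real report files.
good = io.BytesIO(b"<report><gene id='1'/></report>")
bad = io.BytesIO(b"<report><gene id='1'></report>")

good_result = check_well_formed(good)
bad_result = check_well_formed(bad)
print(good_result)
print(bad_result)
```

For a large file, passing an open binary file handle (or a filename) in place of the BytesIO object should work the same way, and the reported line and column give a place to start looking at the offending markup.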
From jmjoseph at andrew.cmu.edu Mon Jul 24 20:27:39 2006
From: jmjoseph at andrew.cmu.edu (Jacob Joseph)
Date: Mon, 24 Jul 2006 16:27:39 -0400
Subject: [BioPython] import Standalone problems
In-Reply-To: <44C39FC8.3060002@c2b2.columbia.edu>
References: <44B4FF41.9070608@gmail.com> <44BD473A.6030903@maubp.freeserve.co.uk> <44BDB9B8.2060808@jjoseph.org> <44BE8574.1020205@andrew.cmu.edu> <44BFC68F.8030802@andrew.cmu.edu>
<44BFE1FA.4040500@andrew.cmu.edu>
<44C39FC8.3060002@c2b2.columbia.edu>
Message-ID: <44C52D3B.4050704@andrew.cmu.edu>
Okay. I've started bug 2051:
http://bugzilla.open-bio.org/show_bug.cgi?id=2051
I would greatly appreciate any comments, additions, and suggestions.
-Jacob
Michiel de Hoon wrote:
> Jacob Joseph wrote:
>> Great!
>>
>> Can someone point me to the current maintainer of the Blast parsing
>> package?
>>
>> -Jacob
>>
>
> As far as I know, we don't have an official maintainer of the Blast
> package. So your best bet is to submit a patch through bugzilla.
>
> --Michiel.
From manickam.muthuraman at wur.nl Wed Jul 26 15:12:12 2006
From: manickam.muthuraman at wur.nl (Muthuraman, Manickam)
Date: Wed, 26 Jul 2006 17:12:12 +0200
Subject: [BioPython] blast for more than one sequence
References: <44B4FF41.9070608@gmail.com> <44BD473A.6030903@maubp.freeserve.co.uk> <44BDB9B8.2060808@jjoseph.org> <44BE8574.1020205@andrew.cmu.edu> <44BFC68F.8030802@andrew.cmu.edu>
<44BFE1FA.4040500@andrew.cmu.edu>
<44C39FC8.3060002@c2b2.columbia.edu>
Message-ID: <4CDD243B32D07748944828EA7A29E4A3E2AFCD@salte0008.wurnet.nl>
Hi all,
I have tried to blast through the internet. I can succeed with one sequence, but I need to blast more than one sequence through the internet.
For example,
m_cold.fasta has 5 sequences, but the code given in the book only works for the first sequence and not for the rest:
file_for_blast=open('m_cold.fasta','r')
f_iterator=Fasta.Iterator(file_for_blast)
#does the above line loop through the file and get each sequence for blasting? If so, let me know how.
I also tried another way, but I do not want to confuse you by sending the whole script.
from
manickam
From biopython at maubp.freeserve.co.uk Wed Jul 26 15:51:03 2006
From: biopython at maubp.freeserve.co.uk (Peter (BioPython List))
Date: Wed, 26 Jul 2006 16:51:03 +0100
Subject: [BioPython] blast for more than one sequence
In-Reply-To: <4CDD243B32D07748944828EA7A29E4A3E2AFCD@salte0008.wurnet.nl>
References: <44B4FF41.9070608@gmail.com> <44BD473A.6030903@maubp.freeserve.co.uk> <44BDB9B8.2060808@jjoseph.org> <44BE8574.1020205@andrew.cmu.edu> <44BFC68F.8030802@andrew.cmu.edu> <44BFE1FA.4040500@andrew.cmu.edu> <44C39FC8.3060002@c2b2.columbia.edu>
<4CDD243B32D07748944828EA7A29E4A3E2AFCD@salte0008.wurnet.nl>
Message-ID: <44C78F67.9090901@maubp.freeserve.co.uk>
Muthuraman, Manickam wrote:
>
> Hi all,
>
> I have tried to blast through the internet. I can succeed with one
> sequence, but I need to blast more than one sequence through the
> internet.
I think you can only blast one sequence at a time. If you really want
to do many at once, then I personally would use standalone blast and
give it the fasta file as input.
> For example,
> m_cold.fasta has 5 sequences, but the code given in the book only
> works for the first sequence and not for the rest:
>
> file_for_blast=open('m_cold.fasta','r')
> f_iterator=Fasta.Iterator(file_for_blast)
> #does the above line loop through the file and get each sequence for
> #blasting? If so, let me know how.
You could use the iterator to get each FASTA sequence in turn, and run a
separate blast for it. Something like this...
from Bio import Fasta
from Bio.Blast import NCBIWWW
file_for_blast = open('m_cold.fasta', 'r')
f_iterator = Fasta.Iterator(file_for_blast)
for f_record in f_iterator:
    # do a separate blast for each f_record...
    b_results = NCBIWWW.qblast('blastn', 'nr', f_record)
    # ...and save or parse b_results here before moving on
Peter
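The one-query-per-record loop suggested above can also be sketched without any Biopython dependency, which makes the control flow easy to test offline. In this sketch iter_fasta and blast_each are hypothetical helper names, and submit is a placeholder callback; in real use it might wrap a call such as NCBIWWW.qblast, whose exact arguments depend on the Biopython version installed.

```python
import io


def iter_fasta(handle):
    """Yield (title, sequence) for each record in a FASTA-formatted handle."""
    title, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif line and title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


def blast_each(handle, submit):
    """Call submit(title, sequence) once per record and collect the results."""
    return [(title, submit(title, seq)) for title, seq in iter_fasta(handle)]


# Offline demonstration: the "search" just reports the sequence length.
demo = io.StringIO(">s1 first\nACGT\nAC\n>s2\nGGG\n")
results = blast_each(demo, lambda title, seq: len(seq))
print(results)
```

Keeping the submission step behind a callback means the loop can be checked with a fake function first, and the real network call swapped in afterwards.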
From persikan at gmail.com Thu Jul 27 19:16:10 2006
From: persikan at gmail.com (Anton)
Date: Thu, 27 Jul 2006 19:16:10 +0000 (UTC)
Subject: [BioPython] KDTree optional compilation
References: <162452a10607222312j70ca3ae7h103a8d36c04427@mail.gmail.com>
Message-ID:
K. Arun <karbak at gmail.com> writes:
>
> Hello,
>
> I just updated to version 1.42 (thanks !), and noticed I had to
> uncomment the KDTree entry in setup.py in the NUMPY_PACKAGES section
> to use Bio.PDB.NeighborSearch. Since KDTree is essential to this
> module, would it be possible to add a note about this in the README, or
> print a message during compilation to the effect that setup.py needs
> to be modified ?
>
> -arun
I installed BioPython using the Windows installer, so the
NeighborSearch module is not working (it requires KDTree). There was no way
to add this module during the installation.
Can I install KDTree separately somehow, or do I have to reinstall BioPython
from the source code with the KDTree entry uncommented?
I'm trying to avoid installing from source.
Any suggestions?
Thanks!
From mdehoon at c2b2.columbia.edu Thu Jul 27 22:01:32 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Thu, 27 Jul 2006 18:01:32 -0400
Subject: [BioPython] KDTree optional compilation
In-Reply-To:
References: <162452a10607222312j70ca3ae7h103a8d36c04427@mail.gmail.com>