From jttkim at googlemail.com Mon Sep 3 05:31:23 2012
From: jttkim at googlemail.com (Jan T Kim)
Date: Mon, 3 Sep 2012 10:31:23 +0100
Subject: [Biopython] Start positions for local pairwise alignments?
Message-ID: <20120903093121.GA4129@paxarchia.galaxy.uni>

Dear All,

after reading a pairwise alignment computed using the EMBOSS water
program, is it possible to find out the indices of the sequences in
the local alignment within the input sequences?

As an illustration, the sequences "tttagagccc" and "ccagagc" align to

    s1   4 agagc   8
           |||||
    s2   3 agagc   7

This local alignment doesn't contain the prefixes "ttt" and "cc",
respectively. In the water output above, that's reflected by the start
indices 4 and 3, respectively. However, after reading that result with

    import Bio.AlignIO
    aStream = Bio.AlignIO.parse('s1s2_align.txt', 'emboss')
    a = aStream.next()
    print a
    print a.__dict__
    print a[0]
    print a[0].__dict__

I can't seem to find that information anywhere either in the resulting
Bio.Align.MultipleSeqAlignment object, or in the SeqRecord objects
that it contains.

So, am I looking at the wrong place?

Best regards, Jan

P.S.: For a while I was convinced that I had seen these indices but
it's now occurred to me that that was actually in the pysam.AlignedRead
class, which contains the indices of the read in the reference sequence,
in the positions instance variable...

--
 +- Jan T. Kim -------------------------------------------------------+
 |             email: jttkim at gmail.com                                |
 |             WWW:   http://www.jtkim.dreamhosters.com/              |
 *-----=< hierarchical systems are for files, not for humans >=-----*

From p.j.a.cock at googlemail.com Wed Sep 5 20:01:46 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Sep 2012 01:01:46 +0100
Subject: [Biopython] Start positions for local pairwise alignments?
In-Reply-To: <20120903093121.GA4129@paxarchia.galaxy.uni>
References: <20120903093121.GA4129@paxarchia.galaxy.uni>
Message-ID:

On Mon, Sep 3, 2012 at 10:31 AM, Jan T Kim wrote:
> Dear All,
>
> after reading a pairwise alignment computed using the EMBOSS water
> program, is it possible to find out the indices of the sequences in
> the local alignment within the input sequences?
>
> ...
>
> I can't seem to find that information anywhere either in the resulting
> Bio.Align.MultipleSeqAlignment object, or in the SeqRecord objects
> that it contains.
>
> So, am I looking at the wrong place?

No, these numbers are not currently being parsed. This applies to
some of the other file formats in AlignIO too, because we (still)
don't have an agreed way to store this in our object model.

Last time I used this parser, I was probably using needle rather
than water, where these are global alignments so you don't need
the start/end values.

Peter

From jocelyne at gmail.com Thu Sep 6 16:31:06 2012
From: jocelyne at gmail.com (Jocelyne)
Date: Thu, 6 Sep 2012 13:31:06 -0700
Subject: [Biopython] bug with pairwise2 local alignments?
In-Reply-To:
References:
Message-ID:

Hello:
First, I'd like to say that I really appreciate the effort of the community
to provide us with such a nice package.
I found some odd scoring behavior with the pairwise2 local alignment (see 5
below). I think these 2 alignments should have the same score.
First required details:

1) Which operating system and hardware (32 bit or 64 bit) you are using
Linux jocelyne-VirtualBox 3.2.0-29-generic #46-Ubuntu SMP Fri Jul 27 17:03:23 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
(basically, latest Ubuntu 64 bit through virtual box)

2) Python version
Python 2.7.3 (default, Aug 1 2012, 05:14:39) [GCC 4.6.3] on linux2

3) Biopython version (or git version/date)
(installed through repository) python-biopython/precise uptodate 1.58-1

4) Traceback that occurs (the full error message)
None

5) A data file that causes the problem
None

6) Example code that breaks
I feel the two local alignments should both score 4. I think it has to do
with how the top row and left column are filled in the score matrix.

================================================================================
>>> for a in pairwise2.align.localms("ACTGAGT", "TGC", 2, -1, -100, -100, force_generic = True):
...     print a
...     print pairwise2.format_alignment(*a)
...
('ACTGAGT', '--TGC--', 4, 2, 4)
ACTGAGT
  ||
--TGC--
  Score=4

>>> for a in pairwise2.align.localms("ACTGAGT", "CGA", 2, -1, -100, -100, force_generic = True):
...     print a
...     print pairwise2.format_alignment(*a)
...
('ACTGAGT', '--CGA--', 3, 3, 5)
ACTGAGT
   ||
--CGA--
  Score=3

================================================================================

I outputted the matrices

================================================================================
>>> score_matrix, trace_matrix = pairwise2._make_score_matrix_generic("ACTGAGT", "TGC", pairwise2.identity_match(2, -1), pairwise2.affine_penalty(-100, -100), pairwise2.affine_penalty(-100, -100), False, False,False, False)
>>> pairwise2.print_matrix(score_matrix)
-1 -1 -1
-1  0  1
 2  0  0
-1  4  0
-1  0  3
-1  1  0
 2  0  0
>>> score_matrix, trace_matrix = pairwise2._make_score_matrix_generic("ACTGAGT", "CGA", pairwise2.identity_match(2, -1), pairwise2.affine_penalty(-100, -100), pairwise2.affine_penalty(-100, -100), False, False,False, False)
>>> pairwise2.print_matrix(score_matrix)
-1 -1  2
 2  0  0
-1  1  0
-1  1  0
-1  0  3
-1  1  0
-1  0  0

Let me know if there is a quick fix I can do on my side.

Thanks!
Jocelyne

From p.j.a.cock at googlemail.com Thu Sep 6 23:59:57 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 7 Sep 2012 04:59:57 +0100
Subject: [Biopython] bug with pairwise2 local alignments?
In-Reply-To:
References:
Message-ID:

On Thu, Sep 6, 2012 at 9:31 PM, Jocelyne wrote:
> Hello:
> First, I'd like to say that I really appreciate the effort of the community
> to provide us with such a nice package.
> I found some odd scoring behavior with the pairwise2 local alignment (see 5
> below). I think these 2 alignments should have the same score.

Hmm. I'm not overly familiar with this bit of the code, but it did
occur to me it might be something related to this open issue:

https://redmine.open-bio.org/issues/2776

I was able to repeat your pairwise2.align.localms example
and the score matrix example on a Mac using the latest code
from github, and got the same answers. So (as I suspected)
this does not seem to be a platform specific issue.
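
For comparison, the scores expected for the two example pairs can be checked against a textbook Smith-Waterman recurrence. The following is a minimal sketch, not the pairwise2 implementation: `local_score` is a made-up helper, and it assumes a single linear gap penalty rather than pairwise2's open/extend penalties. With match 2, mismatch -1 and gap -100 (the values used in the report) it gives 4 for both pairs.

```python
def local_score(seq_a, seq_b, match=2, mismatch=-1, gap=-100):
    """Best local alignment score (Smith-Waterman, linear gap penalty).

    Hypothetical helper for illustration only; not part of Bio.pairwise2.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # Row 0 and column 0 stay at zero: a local alignment may start anywhere,
    # so no cell is ever allowed to carry a negative score.
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                0,                          # start a new alignment here
                score[i - 1][j - 1] + pair,  # extend diagonally
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
            best = max(best, score[i][j])
    return best


print(local_score("ACTGAGT", "TGC"))  # 4
print(local_score("ACTGAGT", "CGA"))  # 4
```

The only difference from the matrices shown in the report is the floor at zero, which here applies to every cell, including those that align the first residue of either sequence.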
Unfortunately the original author of this code (Jeff Chang) isn't active with Biopython anymore - we can try emailing him directly, but if you're willing to look into this in more detail and can propose a fix, I'm happy to take a look at merging it. Peter From jocelyne at gmail.com Fri Sep 7 02:15:59 2012 From: jocelyne at gmail.com (Jocelyne) Date: Thu, 6 Sep 2012 23:15:59 -0700 Subject: [Biopython] bug with pairwise2 local alignments? In-Reply-To: References:

Message-ID:

Hi Peter:
I added 4 lines of code in each snippet below (there are copies of the same
code). I'm pretty sure it should fix it (there are copies of lines 438-439,
with the indexes changed). Basically, the previous code allowed for negative
scores in the first row and column of the matrix, even in the case of local
alignments (in which case scores shouldn't go negative). I didn't test it,
so please make sure it works before merging.

Also, it seems that it imports _make_score_matrix_fasta from a C library
(line 851), which overrides the corresponding Python function, so that would
have to be fixed too.

Thanks!
Jocelyne

378     # The top and left borders of the matrices are special cases
379     # because there are no previously aligned characters. To simplify
380     # the main loop, handle these separately.
381     for i in range(lenA):
382         # Align the first residue in sequenceB to the ith residue in
383         # sequence A. This is like opening up i gaps at the beginning
384         # of sequence B.
385         score = match_fn(sequenceA[i], sequenceB[0])
386         if penalize_end_gaps:
387             score += gap_B_fn(0, i)
388         score_matrix[i][0] = score
+++         if not align_globally and score_matrix[i][0] < 0:
+++             score_matrix[i][0] = 0
389     for i in range(1, lenB):
390         score = match_fn(sequenceA[0], sequenceB[i])
391         if penalize_end_gaps:
392             score += gap_A_fn(0, i)
393         score_matrix[0][i] = score
+++         if not align_globally and score_matrix[0][i] < 0:
+++             score_matrix[0][i] = 0

461     # The top and left borders of the matrices are special cases
462     # because there are no previously aligned characters. To simplify
463     # the main loop, handle these separately.
464     for i in range(lenA):
465         # Align the first residue in sequenceB to the ith residue in
466         # sequence A. This is like opening up i gaps at the beginning
467         # of sequence B.
468         score = match_fn(sequenceA[i], sequenceB[0])
469         if penalize_end_gaps:
470             score += calc_affine_penalty(
471                 i, open_B, extend_B, penalize_extend_when_opening)
472         score_matrix[i][0] = score
+++         if not align_globally and score_matrix[i][0] < 0:
+++             score_matrix[i][0] = 0
473     for i in range(1, lenB):
474         score = match_fn(sequenceA[0], sequenceB[i])
475         if penalize_end_gaps:
476             score += calc_affine_penalty(
477                 i, open_A, extend_A, penalize_extend_when_opening)
478         score_matrix[0][i] = score
+++         if not align_globally and score_matrix[0][i] < 0:
+++             score_matrix[0][i] = 0

From jttkim at googlemail.com Fri Sep 7 04:51:36 2012
From: jttkim at googlemail.com (Jan T Kim)
Date: Fri, 7 Sep 2012 09:51:36 +0100
Subject: [Biopython] Start positions for local pairwise alignments?
In-Reply-To: References: <20120903093121.GA4129@paxarchia.galaxy.uni> Message-ID: <20120907085134.GA4094@paxarchia.galaxy.uni> On Thu, Sep 06, 2012 at 01:01:46AM +0100, Peter Cock wrote: > On Mon, Sep 3, 2012 at 10:31 AM, Jan T Kim wrote: > > Dear All, > > > > after reading a pairwise alignment computed using the EMBOSS water > > program, is it possible to find out the indices of the sequences in > > the local alignment within the input sequences? > > > > ... > > > > I can't seem to find that information anywhere either in the resulting > > Bio.Align.MultipleSeqAlignment object, or in the SeqRecord objects > > that it contains. > > > > So, am I looking at the wrong place? > > No, these number are not currently being parsed. This applies to > some of the other file formats in AlignIO too, because we (still) > don't have an agreed way to store this in our object model. Ok, thanks for clarifying. I think I understand, I wasn't sure whether to expect that information in the Seq, the SeqRecord or the MultipleAlignment objects. For what it's worth, it currently would seem most adequate to me if a (say) AlignedSeq subclass of Seq could provide a couple of optional additional instance variables, such as the start index of the aligned sequence within the input sequence. I'd envision this information to be optional in the sense that the instance variable would be None if the start position is not available, which would obviously be the case for some alignment formats (for most multiple alignments, in fact). > Last time I used this parser, I was probably using needle rather > than water, where these are global alignments so you don't need > the start/end values. Incidentally I initially used needle as well, but then got additional data which contained elevated levels of "junk", which required a switch to local alignments. 
In this case there was a region of interest with a subsequence that was unique, so I could figure out whether the region of interest was aligned or not, but that approach can be unreliable when repetitive regions are involved and / or definitions of the "region of interest" are subject to shifts. So I'd think having the start index where available would be useful in the long run. Best regards & have a nice weekend all, Jan -- +- Jan T. Kim -------------------------------------------------------+ | email: jttkim at gmail.com | | WWW: http://www.jtkim.dreamhosters.com/ | *-----=< hierarchical systems are for files, not for humans >=-----* From p.j.a.cock at googlemail.com Fri Sep 7 10:47:00 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 7 Sep 2012 15:47:00 +0100 Subject: [Biopython] Start positions for local pairwise alignments? In-Reply-To: <20120907085134.GA4094@paxarchia.galaxy.uni> References: <20120903093121.GA4129@paxarchia.galaxy.uni> <20120907085134.GA4094@paxarchia.galaxy.uni> Message-ID: On Fri, Sep 7, 2012 at 9:51 AM, Jan T Kim wrote: > > Ok, thanks for clarifying. I think I understand, I wasn't sure whether to > expect that information in the Seq, the SeqRecord or the MultipleAlignment > objects. > > For what it's worth, it currently would seem most adequate to me > if a (say) AlignedSeq subclass of Seq could provide a couple of > optional additional instance variables, such as the start index > of the aligned sequence within the input sequence. > > I'd envision this information to be optional in the sense that the > instance variable would be None if the start position is not > available, which would obviously be the case for some alignment > formats (for most multiple alignments, in fact). 
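
Until the parser stores these values, the positions can be read from the water report itself. The following is a minimal sketch, assuming sequence lines laid out like the example at the start of this thread (name, start, residues, end); `SEQ_LINE` and `aligned_ranges` are hypothetical helpers and not part of Bio.AlignIO. A name without coordinates simply has no entry, so `dict.get` returns None, in the spirit of the optional attribute proposed above.

```python
import re

# Sequence lines as in the example earlier in the thread: "s1   4 agagc   8".
SEQ_LINE = re.compile(r"^(\S+)\s+(\d+)\s+(\S+)\s+(\d+)\s*$")


def aligned_ranges(text):
    """Map each sequence name to (start, end), 1-based as printed by water."""
    ranges = {}
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # header and footer comment lines
        m = SEQ_LINE.match(line.strip())
        if m is None:
            continue  # markup line, blank line, or anything else
        name, start, end = m.group(1), int(m.group(2)), int(m.group(4))
        if name in ranges:
            # Later blocks of a long alignment only move the end position.
            ranges[name] = (ranges[name][0], end)
        else:
            ranges[name] = (start, end)
    return ranges


example = """\
s1   4 agagc   8
       |||||
s2   3 agagc   7
"""
ranges = aligned_ranges(example)
print(ranges)
print(ranges.get("s3"))
```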
> That's exactly what we hope to have in the next release, see: http://lists.open-bio.org/pipermail/biopython-dev/2012-September/009930.html Regards, Peter From semenko at alum.mit.edu Mon Sep 10 21:18:25 2012 From: semenko at alum.mit.edu (Nick Semenkovich) Date: Mon, 10 Sep 2012 20:18:25 -0500 Subject: [Biopython] Removing HotRand.py? Message-ID: I've submitted a pull request to deprecate Bio/HotRand.py Is anyone still using the HotRandom functions? It looks like the module is pretty old and there are some better alternatives: http://pypi.python.org/pypi/randomdotorg/ Pull Request: https://github.com/biopython/biopython/pull/69 Best, Nick Semenkovich -- Nick Semenkovich Laboratory of Dr. Jeffrey I. Gordon Medical Scientist Training Program School of Medicine Washington University in St. Louis 314.362.3963 (Lab) http://web.mit.edu/semenko/ From hernan.morales at gmail.com Tue Sep 11 09:38:12 2012 From: hernan.morales at gmail.com (=?UTF-8?Q?Hern=C3=A1n_Morales_Durand?=) Date: Tue, 11 Sep 2012 15:38:12 +0200 Subject: [Biopython] when is a SeqRecord not a SeqRecord In-Reply-To: <87y5mhkxtc.fsf@fastmail.fm> References: <87y5mhkxtc.fsf@fastmail.fm> Message-ID: 2012/7/18 Brad Chapman > > Dilara; > > > I'm trying to understand what is why when I print filtered_rec I get a > > SeqRecord but if I try to access any particular attribute of a SeqRecord > > such as letter_annotations I sometimes get an attribute error -- > > AttributeError: 'NoneType' object has no attribute > > 'letter_annotations.' 
> >
> > def check_meanQ(record, q_threshold):
> >     seqlen=len(record)
> >     quality_scores=array(record.letter_annotations["phred_quality"])
> >     if round(quality_scores.mean()) <= q_threshold:
> >         print "Discarded ", record.id, "because mean Q was",
> > round(quality_scores.mean())
> >     elif round(quality_scores.mean()) > q_threshold:
> >         return record
>
> This function returns different results based on the comparison of
> mean quality scores to your threshold:
>
> - When it is below the threshold, it returns None (since you do not
>   define an explicit return value)
> - When it is above the threshold, it returns a SeqRecord.
>
>
And of course, you may implement a Null Object Pattern here, like a NullSeqRecord.

Cheers,
Hernán

From p.j.a.cock at googlemail.com Tue Sep 11 12:03:21 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 11 Sep 2012 17:03:21 +0100
Subject: [Biopython] bug with pairwise2 local alignments?
In-Reply-To:
References:

Message-ID:

Hi Jocelyne,

The reason for the C code is speed. The pure Python code is a fall
back for systems where you can't use this - for example PyPy or Jython.

To test your (pure Python) fix you'd have to comment out the C library
import line. Ideally we'd prefer a combined fix which also updates the
C implementation to match. Are you on Windows? That does complicate
this - whereas for the Mac or Linux (re)compiling Biopython from source
should be quite easy.

Peter

From jocelyne at gmail.com Tue Sep 11 21:41:31 2012
From: jocelyne at gmail.com (Jocelyne)
Date: Tue, 11 Sep 2012 18:41:31 -0700
Subject: [Biopython] bug with pairwise2 local alignments?
In-Reply-To:
References:

Message-ID:

Hi Peter:
I understand the C code would need to be fixed too, and it should be a
fairly quick fix, but I unfortunately don't have much time on my hands at
the moment. When I have more time, I'll see about fixing the bug in both
Python and C, recompiling and testing. I thought it would be good for the
community to at least be aware of this bug.

Jocelyne

From mmokrejs at fold.natur.cuni.cz Thu Sep 13 11:20:13 2012
From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs)
Date: Thu, 13 Sep 2012 17:20:13 +0200
Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
Message-ID: <5051F9AD.104@fold.natur.cuni.cz>

Hi,
I am using "blastall -p blastn ... -m 7" to yield XML files of about 100GB
which are then parsed by

    from Bio.Blast import NCBIXML
    _blastn_fileh = open(blast_out_xml_filename)
    _blastn_iterator = NCBIXML.parse(_blastn_fileh)
    _record = _blastn_iterator.next() # fetch the very first BLAST result from generator

In my case the XML parsing seems to take longer than the blastn searches themselves. :(
I do not have timing numbers here but wonder why cElementTree is used only in Uniprot
biopython modules and not in SeqIO. What XML parsing library is my biopython-1.59 using?
Isn't there any argument when setup.py is called to discern between elementtree, cElementTree
which I think use expat ...?
I am writing this a bit off the top of my head hoping Peter ;-)
or somebody else will know right away where to look for a performance bottleneck
and where to change code to use cElementTree which always seemed the fastest to me.
Thank you for some initial advice.
Martin
P.S.: And yes, I would love to parse blastn plaintext output or some other more compact one,
the XML is really an overkill.

From w.arindrarto at gmail.com Thu Sep 13 11:40:41 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Thu, 13 Sep 2012 17:40:41 +0200
Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
In-Reply-To: <5051F9AD.104@fold.natur.cuni.cz>
References: <5051F9AD.104@fold.natur.cuni.cz>
Message-ID:

Hi Martin,

There is actually already a faster BLAST XML parser written using
cElementTree in Biopython :) (although it's yet to be included in the
main branch). It's part of Biopython's SearchIO module that I recently
wrote (the name SearchIO might change in the future). And indeed, my
early benchmarks have shown that it does perform faster.

This branch is available here:
https://github.com/bow/biopython/tree/searchio. I've also written a
draft tutorial on how to use it here:
http://bow.web.id/biopython/Tutorial.html#htoc96.

However, as it's not yet in the current branch, you need to do a
little bit of command line work to set it up:

1. Set up a new virtualenv environment (so that it doesn't clash with
your other Biopython installation) and activate it.
2. Clone the repository: `git clone
https://github.com/bow/biopython.git`, checkout the 'searchio' branch
3. Run `python setup.py develop`. This will keep the installation
in-sync with any future `git pull` you might perform on the branch.

Hope this helps :),
Bow

From mjldehoon at yahoo.com Thu Sep 13 20:37:08 2012
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Thu, 13 Sep 2012 17:37:08 -0700 (PDT)
Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
In-Reply-To: <5051F9AD.104@fold.natur.cuni.cz>
Message-ID: <1347583028.33691.YahooMailClassic@web164005.mail.gq1.yahoo.com>

--- On Thu, 9/13/12, Martin Mokrejs wrote:
> P.S.: And yes, I would love to parse blastn plaintext output
> or some other more compact one, the XML is really an overkill.

What exactly is the advantage of plain text parsing compared to XML? File size?

Best,
-Michiel.
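
The compact, tabular view asked for in the P.S. can be approximated by streaming the XML and writing one row per HSP. The following is a minimal sketch using the standard library's `iterparse`; it is not Biopython's NCBIXML or SearchIO parser. `iter_rows`, `HSP_FIELDS` and the sample document are made up for illustration, and the element names are assumed to follow NCBI's BLAST XML output.

```python
import io
import xml.etree.ElementTree as ET

# Tiny hand-written sample using BLAST XML element names; not real search output.
SAMPLE = b"""<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>query1</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_def>subjectA</Hit_def>
          <Hit_hsps>
            <Hsp>
              <Hsp_bit-score>36.2</Hsp_bit-score>
              <Hsp_query-from>1</Hsp_query-from>
              <Hsp_query-to>18</Hsp_query-to>
              <Hsp_hit-from>101</Hsp_hit-from>
              <Hsp_hit-to>118</Hsp_hit-to>
              <Hsp_identity>18</Hsp_identity>
              <Hsp_gaps>0</Hsp_gaps>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""

# Columns emitted for every HSP, after the query and hit descriptions.
HSP_FIELDS = ("Hsp_query-from", "Hsp_query-to", "Hsp_hit-from",
              "Hsp_hit-to", "Hsp_bit-score", "Hsp_identity", "Hsp_gaps")


def iter_rows(handle):
    """Yield one tuple of strings per HSP: (query, hit, *HSP_FIELDS values)."""
    for _event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag != "Iteration":
            continue  # wait until a whole query result has been read
        query = elem.findtext("Iteration_query-def", "")
        for hit in elem.iter("Hit"):
            hit_def = hit.findtext("Hit_def", "")
            for hsp in hit.iter("Hsp"):
                yield (query, hit_def) + tuple(
                    hsp.findtext(field, "") for field in HSP_FIELDS)
        # Discard the finished iteration's children before reading the next one.
        elem.clear()


rows = list(iter_rows(io.BytesIO(SAMPLE)))
for row in rows:
    print("\t".join(row))
```

Clearing each `Iteration` element once its rows have been yielded is what keeps memory use from growing with the size of the file; whether this is fast enough for multi-gigabyte output would need to be measured.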
From cjfields at illinois.edu Thu Sep 13 21:32:19 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Fri, 14 Sep 2012 01:32:19 +0000 Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? In-Reply-To: <1347583028.33691.YahooMailClassic@web164005.mail.gq1.yahoo.com> References: <1347583028.33691.YahooMailClassic@web164005.mail.gq1.yahoo.com> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF33BAEDEB@CHIMBX5.ad.uillinois.edu> On Sep 13, 2012, at 7:37 PM, Michiel de Hoon wrote: > --- On Thu, 9/13/12, Martin Mokrejs wrote: >> P.S.: And yes, I would love to parse blastn plaintext output >> or some other more compact one, the XML is really an overkill. > > What exactly is the advantage of plain text parsing compared to XML? File size? > > Best, > -Michiel. There isn't any. In fact, NCBI has consistently stated that one should never rely on parsing BLAST text output, primarily b/c they reserve the right to make changes to the output at any given point, whereas XML output should remain stable. As someone who has taken care of legacy BLAST code for a number of years (BioPerl), I can state that is fairly close to the truth (the caveat being they have made changes that break some XML parsing, but they do try to fix them). BLAST XML has simply been much easier to deal with in terms of fixing issues than text. chris From mmokrejs at fold.natur.cuni.cz Fri Sep 14 04:12:10 2012 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Fri, 14 Sep 2012 10:12:10 +0200 Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? 
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF33BAEDEB@CHIMBX5.ad.uillinois.edu> References: <1347583028.33691.YahooMailClassic@web164005.mail.gq1.yahoo.com> <118F034CF4C3EF48A96F86CE585B94BF33BAEDEB@CHIMBX5.ad.uillinois.edu> Message-ID: <5052E6DA.7080703@fold.natur.cuni.cz> Hi all, as a long-term subscriber to this list and bioperl in the past as well I do know that the plaintext output is being changed silently and that it is a hassle to maintainers. On the other hand, the XML tags and syntax is way too verbose. That in turn means lots of disc&memory IO, long parsing times and of course file size. At least if the XML tags would be scrambled to be shorter strings. ;-) Umm, I also hit a bug in legacy blastn XML output, still no answer from NCBI: https://redmine.open-bio.org/issues/3354 A real case. An SFF file has 288MB in size. An extracted FASTA file with 180271 sequences takes 60MB. Low-complexity masking takes 6 minutes. Legacy blastn search using 59 queries through dataset that takes 17 minutes and yields XML with 3957MB in size. Parsing the XML file through biopython takes 56 minutes to convert the results into my own CSV file (some overhead could be my program, sure). Doing a full Smith-Waterman search using 8 queries takes just 126 minutes. The times are from filestamps so it is a wall-clock time. I will try to find some time in a week or so and do run profiling using runsnake (http://www.vrplumber.com/programming/runsnakerun/). And test the new parser from Wibowo and report back. ;-) With plaintext I actually meant more some tabular output format which would be enough for my purposes (match and query coordinates, scores, gaps, identities). Martin Fields, Christopher J wrote: > On Sep 13, 2012, at 7:37 PM, Michiel de Hoon > wrote: > >> --- On Thu, 9/13/12, Martin Mokrejs wrote: >>> P.S.: And yes, I would love to parse blastn plaintext output >>> or some other more compact one, the XML is really an overkill. 
>> >> What exactly is the advantage of plain text parsing compared to XML? File size? >> >> Best, >> -Michiel. > > There isn't any. In fact, NCBI has consistently stated that one should never rely on parsing BLAST text output, primarily b/c they reserve the right to make changes to the output at any given point, whereas XML output should remain stable. As someone who has taken care of legacy BLAST code for a number of years (BioPerl), I can state that is fairly close to the truth (the caveat being they have made changes that break some XML parsing, but they do try to fix them). BLAST XML has simply been much easier to deal with in terms of fixing issues than text. > > chris > > From p.j.a.cock at googlemail.com Fri Sep 14 04:31:31 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 14 Sep 2012 09:31:31 +0100 Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? In-Reply-To: <5052E6DA.7080703@fold.natur.cuni.cz> References: <1347583028.33691.YahooMailClassic@web164005.mail.gq1.yahoo.com> <118F034CF4C3EF48A96F86CE585B94BF33BAEDEB@CHIMBX5.ad.uillinois.edu> <5052E6DA.7080703@fold.natur.cuni.cz> Message-ID: On Fri, Sep 14, 2012 at 9:12 AM, Martin Mokrejs wrote: > Hi all, > as a long-term subscriber to this list and bioperl in the past as well I do know > that the plaintext output is being changed silently and that it is a hassle to > maintainers. On the other hand, the XML tags and syntax is way too verbose. > That in turn means lots of disc&memory IO, long parsing times and of course file size. > At least if the XML tags would be scrambled to be shorter strings. ;-) > Umm, I also hit a bug in legacy blastn XML output, still no answer from NCBI: > https://redmine.open-bio.org/issues/3354 Earlier this week the NCBI released BLAST 2.2.27+ which might fix this... > A real case. An SFF file has 288MB in size. An extracted FASTA file with 180271 > sequences takes 60MB. Low-complexity masking takes 6 minutes. 
Legacy blastn search > using 59 queries through dataset that takes 17 minutes and yields XML with 3957MB > in size. Parsing the XML file through biopython takes 56 minutes to convert the > results into my own CSV file (some overhead could be my program, sure). Doing > a full Smith-Waterman search using 8 queries takes just 126 minutes. The times > are from filestamps so it is a wall-clock time. I will try to find some time in > a week or so and do run profiling using runsnake > (http://www.vrplumber.com/programming/runsnakerun/). > And test the new parser from Wibowo and report back. ;-) Great :) > With plaintext I actually meant more some tabular output format which would > be enough for my purposes (match and query coordinates, scores, gaps, identities). > I find the BLAST+ tabular output very useful - you can control which columns you get if the default 12 are not enough - and trivial to parse. This is also supported in Bow's SearchIO branch. Peter From mjldehoon at yahoo.com Fri Sep 14 05:27:50 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 14 Sep 2012 02:27:50 -0700 (PDT) Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? In-Reply-To: <5052E6DA.7080703@fold.natur.cuni.cz> Message-ID: <1347614870.37430.YahooMailClassic@web164006.mail.gq1.yahoo.com> Hi Martin, --- On Fri, 9/14/12, Martin Mokrejs wrote: > Legacy blastn search using 59 queries through dataset > that takes 17 minutes and yields XML with 3957MB > in size. Parsing the XML file through biopython takes 56 > minutes to convert the results into my own CSV file How does this compare to parsing human-readable plain text output? Is it significantly faster than the XML parser? > With plaintext I actually meant more some tabular > output format which would be enough for my purposes > (match and query coordinates, scores, gaps, identities). 
Maintaining the tabular Blast output parser has not been a problem, and I expect that it will continue to be supported in Biopython. On the other hand, maintaining the human-readable plain text parser has been a recurring headache. If Biopython can parse tabular Blast output, then do you still need the human-readable plain text parser? Best, -Michiel. From mmokrejs at fold.natur.cuni.cz Fri Sep 14 05:47:48 2012 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Fri, 14 Sep 2012 11:47:48 +0200 Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? In-Reply-To: <1347614870.37430.YahooMailClassic@web164006.mail.gq1.yahoo.com> References: <1347614870.37430.YahooMailClassic@web164006.mail.gq1.yahoo.com> Message-ID: <5052FD44.3040302@fold.natur.cuni.cz> Hi Michiel, Michiel de Hoon wrote: > Hi Martin, > > --- On Fri, 9/14/12, Martin Mokrejs wrote: >> Legacy blastn search using 59 queries through dataset >> that takes 17 minutes and yields XML with 3957MB >> in size. Parsing the XML file through biopython takes 56 >> minutes to convert the results into my own CSV file > > How does this compare to parsing human-readable plain text output? Is > it significantly faster than the XML parser? I don't have numbers but say mdust program (compiled from C) parsed the FASTA file in 6 minutes so I would be happy with roughly same time needed for parsing a CSV file having at about 1/5 of the lines in the FASTA file. Biopython is using generators and I do that as well in my program so the main overhead in my program is string slicing, string to int/float/list conversion. > >> With plaintext I actually meant more some tabular >> output format which would be enough for my purposes >> (match and query coordinates, scores, gaps, identities). > > Maintaining the tabular Blast output parser has not been a problem, > and I expect that it will continue to be supported in Biopython. 
On > the other hand, maintaining the human-readable plain text parser has > been a recurring headache. If Biopython can parse tabular Blast > output, then do you still need the human-readable plain text parser? Sometimes I parsed the alignment to have in hand the number of matches and mismatches (the pipes, minuses, dots), but not at this very moment. Their distribution along the alignment is important and sometimes helpful. BTW, I hate that blastn is changing the letter-casing of the sequence in its output. ;-) Martin From mmokrejs at fold.natur.cuni.cz Fri Sep 14 05:52:24 2012 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Fri, 14 Sep 2012 11:52:24 +0200 Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? In-Reply-To: References: <1347583028.33691.YahooMailClassic@web164005.mail.gq1.yahoo.com> <118F034CF4C3EF48A96F86CE585B94BF33BAEDEB@CHIMBX5.ad.uillinois.edu> <5052E6DA.7080703@fold.natur.cuni.cz> Message-ID: <5052FE58.8060000@fold.natur.cuni.cz> Hi Peter, Peter Cock wrote: > On Fri, Sep 14, 2012 at 9:12 AM, Martin Mokrejs > wrote: >> Hi all, >> as a long-term subscriber to this list and bioperl in the past as well I do know >> that the plaintext output is being changed silently and that it is a hassle to >> maintainers. On the other hand, the XML tags and syntax is way too verbose. >> That in turn means lots of disc&memory IO, long parsing times and of course file size. >> At least if the XML tags would be scrambled to be shorter strings. ;-) >> Umm, I also hit a bug in legacy blastn XML output, still no answer from NCBI: >> https://redmine.open-bio.org/issues/3354 > > Earlier this week the NCBI released BLAST 2.2.27+ which might > fix this... > ... > > I find the BLAST+ tabular output very useful - you can control which > columns you get if the default 12 are not enough - and trivial to parse. > This is also supported in Bow's SearchIO branch. Based on the 2.2.27 number you seem to talk about old/legacy blast ... 
but the plus means the new blast from NCBI? I don't like the new blast, it just gives different=bad results and I just don't have time to make up a good bug report with testcases. :(( Will see what Wibowo's code. Well, the XML result is same I think from both programs. Martin From p.j.a.cock at googlemail.com Fri Sep 14 06:00:33 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 14 Sep 2012 11:00:33 +0100 Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? In-Reply-To: <5052FE58.8060000@fold.natur.cuni.cz> References: <1347583028.33691.YahooMailClassic@web164005.mail.gq1.yahoo.com> <118F034CF4C3EF48A96F86CE585B94BF33BAEDEB@CHIMBX5.ad.uillinois.edu> <5052E6DA.7080703@fold.natur.cuni.cz> <5052FE58.8060000@fold.natur.cuni.cz> Message-ID: On Fri, Sep 14, 2012 at 10:52 AM, Martin Mokrejs wrote: > Hi Peter, > > Peter Cock wrote: >> On Fri, Sep 14, 2012 at 9:12 AM, Martin Mokrejs >> wrote: >>> Hi all, >>> as a long-term subscriber to this list and bioperl in the past as well I do know >>> that the plaintext output is being changed silently and that it is a hassle to >>> maintainers. On the other hand, the XML tags and syntax is way too verbose. >>> That in turn means lots of disc&memory IO, long parsing times and of course file size. >>> At least if the XML tags would be scrambled to be shorter strings. ;-) >>> Umm, I also hit a bug in legacy blastn XML output, still no answer from NCBI: >>> https://redmine.open-bio.org/issues/3354 >> >> Earlier this week the NCBI released BLAST 2.2.27+ which might >> fix this... >> > ... >> >> I find the BLAST+ tabular output very useful - you can control which >> columns you get if the default 12 are not enough - and trivial to parse. >> This is also supported in Bow's SearchIO branch. > > Based on the 2.2.27 number you seem to talk about old/legacy blast ... > but the plus means the new blast from NCBI? 
The NCBI call version "2.2.27" of the new C++ rewrite "BLAST v2.2.27+" (while personally I'd have called it BLAST+ v2.2.27 instead). The NCBI have now stopped updating legacy BLAST. > I don't like the new blast, it just gives different=bad results and I > just don't have time to make up a good bug report with testcases. :(( You are not alone in having problems/regressions with BLAST+ compared to legacy BLAST. I can think of several people still using 'blastall' for this reason. > Will see what Wibowo's code. Well, the XML result is same I > think from both programs. I think it is practically the same. Peter From mjldehoon at yahoo.com Fri Sep 14 22:43:12 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 14 Sep 2012 19:43:12 -0700 (PDT) Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF33BAEDEB@CHIMBX5.ad.uillinois.edu> Message-ID: <1347676992.27319.YahooMailClassic@web164005.mail.gq1.yahoo.com> Last weekend I also talked with Peter during his visit to Tokyo about the Blast (human-readable) plain-text parser. We could see three scenarios in which the plain-text parser has an advantage over the XML parser (Peter please correct me if I am missing something from our discussion): 1) The file size of Blast plain-text output may be smaller than that of Blast XML output; 2) Users may want to look at the Blast output by eye in addition to parsing it with Biopython; 3) Users may have stacks of old Blast output files in plain-text format that they still want to use. 
Each of these points can be addressed without a Blast plain-text parser: 1) After zipping, we expect little difference in file size between plain-text output and XML output; 2) If we add a function to Biopython that generates Blast plain-text output (or something close to it) from Blast XML output, then a user can generate the Blast output in XML format, parse it with Biopython, optionally filter it, and then generate the corresponding plain-text output; 3) If this is really an issue, then we could create some standalone scripts (available from the Biopython website) that parses plain-text Blast output and generates the corresponding XML output. These scripts will be much easier than the current plain-text parser in Biopython, because we can create such a script for each version of Blast separately (of course this is only done if the need actually arises). The XML output can then be parsed by Biopython. Are there any other cases in which the plain-text parser is needed? Or where our proposed solutions to the three points above are not sufficient? If not, then I suggest we implement the plain-text generator in (2), and upgrade the PendingDeprecationWarning in Bio.Blast.NCBIStandalone to a BiopythonDeprecationWarning. Best, -Michiel 
From p.j.a.cock at googlemail.com Sat Sep 15 06:37:50 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 15 Sep 2012 11:37:50 +0100 Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? In-Reply-To: <1347676992.27319.YahooMailClassic@web164005.mail.gq1.yahoo.com> References: <118F034CF4C3EF48A96F86CE585B94BF33BAEDEB@CHIMBX5.ad.uillinois.edu> <1347676992.27319.YahooMailClassic@web164005.mail.gq1.yahoo.com> Message-ID: On Sat, Sep 15, 2012 at 3:43 AM, Michiel de Hoon wrote: > Last weekend I also talked with Peter during his visit to Tokyo about the > Blast (human-readable) plain-text parser. We could see three scenarios in > which the plain-text parser has an advantage over the XML parser (Peter > please correct me if I am missing something from our discussion): > > 1) The file size of Blast plain-text output may be smaller than that of > Blast XML output; > 2) Users may want to look at the Blast output by eye in addition to > parsing it with Biopython; > 3) Users may have stacks of old Blast output files in plain-text format > that they still want to use. Maybe also (3a) The user may want plain-text BLAST output to input into another tool as well as Biopython? 
> > Each of these points can be addressed without a Blast plain-text parser: > 1) After zipping, we expect little difference in file size between > plain-text output and XML output; However there would be a speed penalty - compression, then decompression, and perhaps in XML versus text parsing. > 2) If we add a function to Biopython that generates Blast plain-text > output (or something close to it) from Blast XML output, then a user can > generate the Blast output in XML format, parse it with Biopython, optionally > filter it, and then generate the corresponding plain-text output; The new 'SearchIO' results objects str/repr should be familiar to anyone who has looked at the plain text BLAST output - but not identical. We could apply some of these improvements to the current BLAST parsers, but I favour aiming to simply deprecate them in favour of 'SearchIO' (namespace to be decided). However, we certainly could try and offer a plain-text BLAST output format from 'SearchIO', although IIRC Bow has not tried that yet. It shouldn't be too complicated - unless you aim for 100% agreement with the latest BLAST output (moving target). > 3) If this is really an issue, then we could create some standalone > scripts (available from the Biopython website) that parses plain-text Blast > output and generates the corresponding XML output. These scripts will be > much easier than the current plain-text parser in Biopython, because we can > create such a script for each version of Blast separately (of course this is > only done if the need actually arises). The XML output can then be parsed by > Biopython. I was not convinced that this would actually save any effort over continuing to tweak the current (complex but flexible) plain text parser. > Are there any other cases in which the plain-text parser is needed? > Or where our proposed solutions to the three points above are not > sufficient? 
Benchmarking of parsing (a) plain text, (b) XML, (c) gzipped XML, and (d) column-rich tabular output might be worthwhile. There may be a case for parsing plain-text on the basis of speed. > If not, then I suggest we implement the plain-text generator in (2), > I certainly think adding plain-text output to 'SearchIO' would be useful. > and upgrade the PendingDeprecationWarning in > Bio.Blast.NCBIStandalone to a BiopythonDeprecationWarning. Another idea we touched on was deprecating the current old, complex but flexible plain text parser while adding a new simpler plain text parser as part of 'SearchIO'. Here we could target only the recent BLAST+ output (and perhaps if not so different the final 'legacy' BLAST release), and not worry about all the variants the NCBI have produced over the years. I would hope this would also be faster [especially as currently 'SearchIO' supports parsing plain text BLAST on top of the existing old parser]. This boils down to a key question: How many people still want to use the plain-text output and why? I believe that for most use cases the tabular or XML output is better (covering simple needs, and full parsing of every detail respectively). e.g. It sounds like for Martin's example, the tabular output would be a perfect match. [Although, as I noted above, parsing the XML, especially if compressed, may not be as fast as parsing plain text?] While writing this email I was trying to recall when I last used the plain text output - and the only situation I could think of in the last year or so was in order to have something human readable to show a collaborator. Here XML to plain text BLAST would have been fine. Peter From p.j.a.cock at googlemail.com Sat Sep 15 06:49:59 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 15 Sep 2012 11:49:59 +0100 Subject: [Biopython] Finally deprecating the plain text BLAST parser? 
Message-ID: Hello all, I've retitled this from Martin's thread initially about the BLAST XML parser: http://lists.open-bio.org/pipermail/biopython/2012-September/008154.html ... http://lists.open-bio.org/pipermail/biopython/2012-September/008164.html http://lists.open-bio.org/pipermail/biopython/2012-September/008165.html The topic shifted and an important question raised was: Should we finally deprecate the 'obsolete' plain text BLAST parser? So - is anyone on the list still using this file format, and why? [ Speak now or forever hold your peace ;) ] Thanks, Peter 
From w.arindrarto at gmail.com Sat Sep 15 09:22:48 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Sat, 15 Sep 2012 15:22:48 +0200 Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? 
In-Reply-To: References: <118F034CF4C3EF48A96F86CE585B94BF33BAEDEB@CHIMBX5.ad.uillinois.edu> <1347676992.27319.YahooMailClassic@web164005.mail.gq1.yahoo.com> Message-ID: Hi guys, > > 2) If we add a function to Biopython that generates Blast plain-text > > output (or something close to it) from Blast XML output, then a user can > > generate the Blast output in XML format, parse it with Biopython, > > optionally > > filter it, and then generate the corresponding plain-text output; > > The new 'SearchIO' results objects str/repr should be familiar to > anyone who has looked at the plain text BLAST output - but > not identical. We could apply some of these improvements > to the current BLAST parsers, but I favour aiming to simply > deprecate them in favour of 'SearchIO' (namespace to be > decided). > > However, we certainly could try and offer a plain-text BLAST > output format from 'SearchIO', although IIRC Bow has not tried > that yet. It shouldn't be too complicated - unless you aim for > 100% agreement with the latest BLAST output (moving target). Yes, this has not been attempted, mostly because I feel that the BLAST plain text is indeed a moving target. But, if we are in favor of choosing one format from one BLAST version and always sticking to it, it sounds more reasonable. There is one missing detail that is only present in the plain text format, though: the hit-level e-values. If we do decide to write a plain text writer, we either have to demand the user supply these values, or we omit the entire hit-level e-value table, or we fill it with something else. > Another idea we touched on was deprecating the current old, > complex but flexible plain text parser while adding a new simpler > plain text parser as part of 'SearchIO'. Here we could target only > the recent BLAST+ output (and perhaps if not so different the > final 'legacy' BLAST release), and not worry about all the variants > the NCBI have produced over the years. 
I would hope this would > also be faster [especially as currently 'SearchIO' supports parsing > plain text BLAST on top of the existing old parser]. This wasn't attempted as well, mostly because I feel that a lot of people still use legacy BLAST (we've had more legacy-BLAST related emails rather than BLAST+ ones in the past few months, I think). Also, the current parser wins on flexibility. I think the test cases include BLAST versions from 2002 (10 years ago!) up to BLAST 2.2.25+. So like Peter mentioned, the current SearchIO BLAST plain text parser is actually a simple wrapper over Bio.Blast.NCBIStandalone. We might be able to create a newer, speedier parser, but making it as flexible as our current one seems difficult. regards, Bow From mjldehoon at yahoo.com Sun Sep 16 09:54:37 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sun, 16 Sep 2012 06:54:37 -0700 (PDT) Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? In-Reply-To: Message-ID: <1347803677.78564.YahooMailClassic@web164006.mail.gq1.yahoo.com> Hi Bow, Is there some documentation somewhere for the SearchIO module? I have a hard time understanding what it does and how it relates to Blast. Thanks, -Michiel. 
From w.arindrarto at gmail.com Sun Sep 16 10:21:52 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Sun, 16 Sep 2012 16:21:52 +0200 Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? In-Reply-To: <1347803677.78564.YahooMailClassic@web164006.mail.gq1.yahoo.com> References: <1347803677.78564.YahooMailClassic@web164006.mail.gq1.yahoo.com> Message-ID: Hi Michiel, We have a draft tutorial that I'm temporarily hosting here: http://bow.web.id/biopython/Tutorial.html#htoc96. The internal functions have also been documented with docstrings and quick examples (e.g. https://github.com/bow/biopython/blob/searchio/Bio/SearchIO/__init__.py). At the moment, the SearchIO API is very similar to SeqIO and AlignIO, though in the future this is still subject to change. Hope this helps :), otherwise let me know which part is specifically unclear for you. regards, Bow On Sun, Sep 16, 2012 at 3:54 PM, Michiel de Hoon wrote: > Hi Bow, > > Is there some documentation somewhere for the SearchIO module? I have a hard time understanding what it does and how it relates to Blast. > > Thanks, > -Michiel. > > --- On Sat, 9/15/12, Wibowo Arindrarto wrote: > >> From: Wibowo Arindrarto >> Subject: Re: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? 
>> To: "BioPython Mailing List" >> Date: Saturday, September 15, 2012, 9:22 AM >> Hi guys, >> >> > > 2) If we add a function to Biopython that >> generates Blast plain-text >> > > output (or something close to it) from Blast XML >> output, then a user can >> > > generate the Blast output in XML format, parse it >> with Biopython, >> > > optionally >> > > filter it, and then generate the corresponding >> plain-text output; >> > >> > The new 'SearchIO' results objects str/repr should be >> familiar to >> > anyone who has looked at the plain text BLAST output - >> but >> > not identical. We could apply some of these >> improvements >> > to the current BLAST parsers, but I favour aiming to >> simply >> > deprecate them in favour of 'SearchIO' (namespace to >> be >> > decided). >> > >> > However, we certainly could try and offer a plain-text >> BLAST >> > output format from 'SearchIO', although IIRC Bow has >> not tried >> > that yet. It shouldn't be too complicated - unless you >> aim for >> > 100% agreement with the latest BLAST output (moving >> target). >> >> Yes, this has not been attempted ~ mostly because I feel >> that the >> BLAST plain text is indeed a moving target. But, if we are >> in favor of >> choosing one format from one BLAST version and always stick >> to it, it >> sounds more reasonable. >> >> There are one missing detail that is only present in the >> plain text >> format, though: the hit-level e-values. If we do decide to >> write a >> plain text writer, we either have to demand the user supply >> these >> values, or we omit the entire hit-level e-value table, or we >> fill it >> with something else. >> >> > Another idea we touched on was deprecating the current >> old, >> > complex but flexible plain text parser while adding a >> new simpler >> > plain text parser as part of 'SearchIO'. 
Here we could >> target only >> > the recent BLAST+ output (and perhaps if not so >> different the >> > final 'legacy' BLAST release), and not worry about all >> the variants >> > the NCBI have produced over the years. I would hope >> this would >> > also be faster [especially as currently 'SearchIO' >> supports parsing >> > plain text BLAST on top of the existing old parser]. >> >> This wasn't attempted as well, mostly because I feel that a >> lot of >> people still use legacy BLAST (we've had more legacy-BLAST >> related >> emails rather than BLAST+ ones in the past few months, I >> think). Also, >> the current parser wins on flexibility. I think the test >> cases include >> BLAST versions from 2002 (10 years ago!) up to BLAST >> 2.2.25+. So like >> Peter mentioned, the current SearchIO BLAST plain text >> parser is >> actually a simple wrapper over Bio.Blast.NCBIStandalone. >> >> We might be able to create a newer, speedier parser, but >> making it as >> flexible as our current one seems difficult. >> >> regards, >> Bow >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> From mjldehoon at yahoo.com Sun Sep 16 12:24:36 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sun, 16 Sep 2012 09:24:36 -0700 (PDT) Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? In-Reply-To: Message-ID: <1347812676.94107.YahooMailClassic@web164003.mail.gq1.yahoo.com> Hi Bow, Thanks for the links! This is actually the first time I looked at the SearchIO module in detail. I noticed that there is a large overlap between the functionality in Bio.Blast and the SearchIO module. We should definitely avoid having two sets of Blast parsers; as the recent discussion shows, one set of Blast parsers is hard enough already. So I would strongly suggest to integrate the SearchIO module with Bio.Blast. 
Here "integrate" could mean as little as using the Bio.Blast name space, and making sure we don't lose any functionality. (Or we could pick a better name than Bio.Blast, since SearchIO also includes blat, exonerate, etc.; but since Blast is the most important one perhaps using Bio.Blast for all of them is OK). The final outcome would then be that the parsers currently in SearchIO will replace the parsers currently in Bio.Blast. Also I noticed that SearchIO (like Bio.Blast) uses attributes to store information. I would much rather see a dictionary-like interface. This has the advantage that we can keep the key name much closer to what is in the original file (for example, no need to replace '-' by '_'), and also users can call .keys() to find out what is stored in the object. Best, -Michiel. --- On Sun, 9/16/12, Wibowo Arindrarto wrote: > From: Wibowo Arindrarto > Subject: Re: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? > To: "Michiel de Hoon" > Cc: "BioPython Mailing List" > Date: Sunday, September 16, 2012, 10:21 AM > Hi Michiel, > > We have a draft tutorial that I'm temporarily hosting here: > http://bow.web.id/biopython/Tutorial.html#htoc96. The > internal > functions have also been documented with docstrings and > quick examples > (e.g. https://github.com/bow/biopython/blob/searchio/Bio/SearchIO/__init__.py). > > At the moment, the SearchIO API is very similar to SeqIO and > AlignIO, > though in the future this is still subject to change. > > Hope this helps :), otherwise let me know which part is > specifically > unclear for you. > > regards, > Bow > > On Sun, Sep 16, 2012 at 3:54 PM, Michiel de Hoon > wrote: > > Hi Bow, > > > > Is there some documentation somewhere for the SearchIO > module? I have a hard time understanding what it does and > how it relates to Blast. > > > > Thanks, > > -Michiel. 
> > > > --- On Sat, 9/15/12, Wibowo Arindrarto > wrote: > > > >> From: Wibowo Arindrarto > >> Subject: Re: [Biopython] Legacy blastn XML outfile > parsing is slow. What XML parser is actually used? > >> To: "BioPython Mailing List" > >> Date: Saturday, September 15, 2012, 9:22 AM > >> Hi guys, > >> > >> > > 2) If we add a function to Biopython > that > >> generates Blast plain-text > >> > > output (or something close to it) from > Blast XML > >> output, then a user can > >> > > generate the Blast output in XML format, > parse it > >> with Biopython, > >> > > optionally > >> > > filter it, and then generate the > corresponding > >> plain-text output; > >> > > >> > The new 'SearchIO' results objects str/repr > should be > >> familiar to > >> > anyone who has looked at the plain text BLAST > output - > >> but > >> > not identical. We could apply some of these > >> improvements > >> > to the current BLAST parsers, but I favour > aiming to > >> simply > >> > deprecate them in favour of 'SearchIO' > (namespace to > >> be > >> > decided). > >> > > >> > However, we certainly could try and offer a > plain-text > >> BLAST > >> > output format from 'SearchIO', although IIRC > Bow has > >> not tried > >> > that yet. It shouldn't be too complicated - > unless you > >> aim for > >> > 100% agreement with the latest BLAST output > (moving > >> target). > >> > >> Yes, this has not been attempted ~ mostly because I > feel > >> that the > >> BLAST plain text is indeed a moving target. But, if > we are > >> in favor of > >> choosing one format from one BLAST version and > always stick > >> to it, it > >> sounds more reasonable. > >> > >> There are one missing detail that is only present > in the > >> plain text > >> format, though: the hit-level e-values. If we do > decide to > >> write a > >> plain text writer, we either have to demand the > user supply > >> these > >> values, or we omit the entire hit-level e-value > table, or we > >> fill it > >> with something else. 
> >> > >> > Another idea we touched on was deprecating the > current > >> old, > >> > complex but flexible plain text parser while > adding a > >> new simpler > >> > plain text parser as part of 'SearchIO'. Here > we could > >> target only > >> > the recent BLAST+ output (and perhaps if not > so > >> different the > >> > final 'legacy' BLAST release), and not worry > about all > >> the variants > >> > the NCBI have produced over the years. I would > hope > >> this would > >> > also be faster [especially as currently > 'SearchIO' > >> supports parsing > >> > plain text BLAST on top of the existing old > parser]. > >> > >> This wasn't attempted as well, mostly because I > feel that a > >> lot of > >> people still use legacy BLAST (we've had more > legacy-BLAST > >> related > >> emails rather than BLAST+ ones in the past few > months, I > >> think). Also, > >> the current parser wins on flexibility. I think the > test > >> cases include > >> BLAST versions from 2002 (10 years ago!) up to > BLAST > >> 2.2.25+. So like > >> Peter mentioned, the current SearchIO BLAST plain > text > >> parser is > >> actually a simple wrapper over > Bio.Blast.NCBIStandalone. > >> > >> We might be able to create a newer, speedier > parser, but > >> making it as > >> flexible as our current one seems difficult. > >> > >> regards, > >> Bow > >> _______________________________________________ > >> Biopython mailing list? -? Biopython at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/biopython > >> > From w.arindrarto at gmail.com Sun Sep 16 13:44:23 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Sun, 16 Sep 2012 19:44:23 +0200 Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? In-Reply-To: <1347812676.94107.YahooMailClassic@web164003.mail.gq1.yahoo.com> References: <1347812676.94107.YahooMailClassic@web164003.mail.gq1.yahoo.com> Message-ID: Hi Michiel, > Thanks for the links! 
This is actually the first time I looked at the SearchIO module in detail. You're welcome :). > I noticed that there is a large overlap between the functionality in Bio.Blast and the SearchIO module. We should definitely avoid having two sets of Blast parsers; as the recent discussion shows, one set of Blast parsers is hard enough already. > > So I would strongly suggest to integrate the SearchIO module with Bio.Blast. Here "integrate" could mean as little as using the Bio.Blast name space, and making sure we don't lose any functionality. (Or we could pick a better name than Bio.Blast, since SearchIO also includes blat, exonerate, etc.; but since Blast is the most important one perhaps using Bio.Blast for all of them is OK). The final outcome would then be that the parsers currently in SearchIO will replace the parsers currently in Bio.Blast. The plan that Peter and I discussed was indeed to eventually deprecate Bio.Blast in favor of SearchIO. I prefer not to use Bio.Blast precisely for the reason you mentioned. I think we last discussed that we may use Bio.Seq.Search as the name (or bio.seq.search, after we settled on the namespace). Also, the bio.seq.search (or whatever we will call it) module will have wrappers for sequence search command line and web tools. Of course, this won't be for BLAST only. In another branch, I've written a draft HMMER wrapper and a partial BLAT wrapper. For the web tool, the HMMER devs also have a web service for which we could create a wrapper. > Also I noticed that SearchIO (like Bio.Blast) uses attributes to store information. I would much rather see a dictionary-like interface. This has the advantage that we can keep the key name much closer to what is in the original file (for example, no need to replace '-' by '_'), and also users can call .keys() to find out what is stored in the object. > > Best, > -Michiel. If you are talking about using the slice notation to retrieve object attributes, that could be difficult for users. 
Most of the current SearchIO objects are themselves containers of other objects (the object model is nested). I could try implementing some hacks so that the attributes are stored in a dictionary, but I think this would confuse users when they use the slice notation (am I retrieving an attribute or a nested SearchIO object?). Maybe what you have in mind is a single dictionary stored as an object attribute as the interface? For example, we could have object.attribs as the dictionary and use object.attribs['e-value']. We do gain '-' instead of '_' and `.keys()` using this, but at the cost of brevity, so I have mixed feelings about this. If users want to find out what the attributes are, they can use object.__dict__.keys(). I could try creating a common property (e.g. object.attrib_names) that returns a list of all available attribute names for a given object. But for now, this seems a bit excessive to me (could be done if more people desire otherwise, though). Thanks for taking a look, by the way. Always appreciate a new set of fresh perspectives :). regards, Bow From p.j.a.cock at googlemail.com Sun Sep 16 15:17:12 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 16 Sep 2012 20:17:12 +0100 Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? In-Reply-To: <1347812676.94107.YahooMailClassic@web164003.mail.gq1.yahoo.com> References: <1347812676.94107.YahooMailClassic@web164003.mail.gq1.yahoo.com> Message-ID: On Sun, Sep 16, 2012 at 5:24 PM, Michiel de Hoon wrote: > > Also I noticed that SearchIO (like Bio.Blast) uses attributes > to store information. I would much rather see a dictionary-like > interface. This has the advantage that we can keep the key > name much closer to what is in the original file (for example, > no need to replace '-' by '_'), and also users can call .keys() > to find out what is stored in the object.
I don't see a dictionary as being inherently easier to use. You can also use dir(obj) to see the attributes, which are more flexible as you can implement them as properties and have code behind them if needed. Another key point is that we can add docstrings to attributes/properties to give help text - and you can't do that with a dictionary key. Also, different file formats use different terms for what is really the same idea - I envisioned SearchIO as a unified parser, which means imposing a common naming convention for these key fields. I also think that certain core bits of information common to BLAST, HMMER, etc should be exposed at the property level (including query match names and co-ordinates). Here we're going to standardise start/end values to integers using Python counting, consistent strand notation etc. As in the SeqRecord and SeqFeature, a dictionary makes perfect sense for general 'free form' information. And this approach is used here too. Regards, Peter From semenko at alum.mit.edu Mon Sep 17 13:01:00 2012 From: semenko at alum.mit.edu (Nick Semenkovich) Date: Mon, 17 Sep 2012 12:01:00 -0500 Subject: [Biopython] Error parsing EMBL file Message-ID: I'm trying to extract the peptide sequences from a large collection of EMBL-formatted files (all phage & virus data from EBI). EBI provides these as large, concatenated EMBL files, so I've been using SeqIO.parse to read & then write the 'translation' key from seq_feature.qualifiers.
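For reference, the approach just described could be sketched roughly as below. This is not the poster's actual script, only a minimal illustration. The helper names (iter_translations, write_fasta, embl_to_faa) are made up for this sketch, and naming each protein by its protein_id qualifier, with a counter as fallback, is an assumption. The only Biopython call used is SeqIO.parse(handle, "embl"); it is imported inside embl_to_faa so the two helpers can be tried without Biopython installed.

```python
import sys


def iter_translations(records):
    """Yield (name, protein) for every feature with a 'translation' qualifier.

    `records` is any iterable of SeqRecord-like objects: each needs an `id`
    and a `features` list whose items carry a `qualifiers` dictionary.
    """
    for record in records:
        count = 0
        for feature in record.features:
            # Biopython stores qualifier values as lists of strings.
            for protein in feature.qualifiers.get("translation", []):
                count += 1
                # Assumption: use protein_id when present, else a counter.
                ids = feature.qualifiers.get("protein_id")
                name = ids[0] if ids else "%s_%d" % (record.id, count)
                yield name, protein


def write_fasta(pairs, handle, width=60):
    """Write (name, sequence) pairs as FASTA, wrapping the sequence lines."""
    for name, seq in pairs:
        handle.write(">%s\n" % name)
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


def embl_to_faa(in_path, out_path):
    """Convert a (possibly concatenated) EMBL file to protein FASTA."""
    # Imported here so the helpers above work without Biopython.
    from Bio import SeqIO
    with open(in_path) as in_handle, open(out_path, "w") as out_handle:
        records = SeqIO.parse(in_handle, "embl")
        write_fasta(iter_translations(records), out_handle)


if __name__ == "__main__" and len(sys.argv) == 3:
    embl_to_faa(sys.argv[1], sys.argv[2])
```

Because SeqIO.parse returns an iterator, records are handled one at a time, which matters for large concatenated files.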
Unfortunately, it looks like the parser dies on one input file: http://www.ebi.ac.uk/ena/data/view/BK000583&display=txt&expanded=true Traceback (most recent call last): File "gbk_to_faa.py", line 7, in for seq_record in SeqIO.parse(input_handle, "embl") : File "/usr/lib/pymodules/python2.7/Bio/SeqIO/__init__.py", line 541, in parse for r in i: File "/usr/lib/pymodules/python2.7/Bio/GenBank/Scanner.py", line 440, in parse_records record = self.parse(handle, do_features) File "/usr/lib/pymodules/python2.7/Bio/GenBank/Scanner.py", line 423, in parse if self.feed(handle, consumer, do_features): File "/usr/lib/pymodules/python2.7/Bio/GenBank/Scanner.py", line 391, in feed self._feed_header_lines(consumer, self.parse_header()) File "/usr/lib/pymodules/python2.7/Bio/GenBank/Scanner.py", line 692, in _feed_header_lines consumer.reference_bases("(bases %s)" % "; ".join(parts)) File "/usr/lib/pymodules/python2.7/Bio/GenBank/__init__.py", line 740, in reference_bases locations = self._split_reference_locations(ref_base_info) File "/usr/lib/pymodules/python2.7/Bio/GenBank/__init__.py", line 777, in _split_reference_locations start, end = base_info.split('to') ValueError: need more than 1 value to unpack * I might dig into this a bit more to patch, but does anyone more familiar with EMBL files know what's going on? * Also, is there are more straightforward (or even non-BioPython way) to go from EMBL->FAA? Best, Nick -- Nick Semenkovich Laboratory of Dr. Jeffrey I. Gordon Medical Scientist Training Program School of Medicine Washington University in St. 
Louis 314.362.3963 (Lab) http://web.mit.edu/semenko/ From semenko at alum.mit.edu Mon Sep 17 13:22:26 2012 From: semenko at alum.mit.edu (Nick Semenkovich) Date: Mon, 17 Sep 2012 12:22:26 -0500 Subject: [Biopython] Error parsing EMBL file In-Reply-To: References: Message-ID: Looks like it's dying at a line-wrapped location string: RN [16] RP 1-5181,6229-11775,13275-15420,18210-23250,29410-32271,34850-38580, RP 41454-41724 RX DOI; 10.1128/JB.185.4.1475-1477.2003. RX PUBMED; 12562822. RA Pedulla M.L., Ford M.E., Karthikeyan T., Houtz J.M., Hendrix R.W., RA Hatfull G.F., Poteete A.R., Gilcrease E.B., Winn-Stapley D.A., RA Casjens S.R.; RT "Corrected sequence of the bacteriophage p22 genome"; RL J. Bacteriol. 185(4):1475-1477(2003). This works if RP is just one line: RP 1-5181,6229-11775,13275-15420,18210-23250,29410-32271,34850-38580,41454-41724 On Mon, Sep 17, 2012 at 12:01 PM, Nick Semenkovich wrote: > I'm trying to extract the peptide sequences from a large collection of > EMBL-formatted files (all phage & virus data from EBI). > > EBI provides these as large, concatenated EMBL files, so I've been > using SeqIO.parse to read & then write the 'translation' key from > seq_feature.qualifiers. 
> > > Unfortunately, it looks like the parser dies on one input file: > > http://www.ebi.ac.uk/ena/data/view/BK000583&display=txt&expanded=true > > Traceback (most recent call last): > File "gbk_to_faa.py", line 7, in > for seq_record in SeqIO.parse(input_handle, "embl") : > File "/usr/lib/pymodules/python2.7/Bio/SeqIO/__init__.py", line 541, in parse > for r in i: > File "/usr/lib/pymodules/python2.7/Bio/GenBank/Scanner.py", line > 440, in parse_records > record = self.parse(handle, do_features) > File "/usr/lib/pymodules/python2.7/Bio/GenBank/Scanner.py", line 423, in parse > if self.feed(handle, consumer, do_features): > File "/usr/lib/pymodules/python2.7/Bio/GenBank/Scanner.py", line 391, in feed > self._feed_header_lines(consumer, self.parse_header()) > File "/usr/lib/pymodules/python2.7/Bio/GenBank/Scanner.py", line > 692, in _feed_header_lines > consumer.reference_bases("(bases %s)" % "; ".join(parts)) > File "/usr/lib/pymodules/python2.7/Bio/GenBank/__init__.py", line > 740, in reference_bases > locations = self._split_reference_locations(ref_base_info) > File "/usr/lib/pymodules/python2.7/Bio/GenBank/__init__.py", line > 777, in _split_reference_locations > start, end = base_info.split('to') > ValueError: need more than 1 value to unpack > > > * I might dig into this a bit more to patch, but does anyone more > familiar with EMBL files know what's going on? > > * Also, is there are more straightforward (or even non-BioPython way) > to go from EMBL->FAA? > > > Best, > Nick > > -- > Nick Semenkovich > Laboratory of Dr. Jeffrey I. Gordon > Medical Scientist Training Program > School of Medicine > Washington University in St. Louis > 314.362.3963 (Lab) > http://web.mit.edu/semenko/ -- Nick Semenkovich Laboratory of Dr. Jeffrey I. Gordon Medical Scientist Training Program School of Medicine Washington University in St. 
Louis 314.362.3963 (Lab) http://web.mit.edu/semenko/ From p.j.a.cock at googlemail.com Mon Sep 17 13:31:38 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 17 Sep 2012 18:31:38 +0100 Subject: [Biopython] Error parsing EMBL file In-Reply-To: References:

Message-ID: On Mon, Sep 17, 2012 at 6:22 PM, Nick Semenkovich wrote: > Looks like it's dying at a line-wrapped location string: > > RN [16] > RP 1-5181,6229-11775,13275-15420,18210-23250,29410-32271,34850-38580, > RP 41454-41724 > RX DOI; 10.1128/JB.185.4.1475-1477.2003. > RX PUBMED; 12562822. > RA Pedulla M.L., Ford M.E., Karthikeyan T., Houtz J.M., Hendrix R.W., > RA Hatfull G.F., Poteete A.R., Gilcrease E.B., Winn-Stapley D.A., > RA Casjens S.R.; > RT "Corrected sequence of the bacteriophage p22 genome"; > RL J. Bacteriol. 185(4):1475-1477(2003). > > > This works if RP is just one line: > RP 1-5181,6229-11775,13275-15420,18210-23250,29410-32271,34850-38580,41454-41724 Good detective work :) Can you try with this fix? https://github.com/biopython/biopython/commit/0da9d7e72a95fe788c7c32c9cbc2ac95d84bb7b7 If you installed from source, the easiest way would be to grab the latest code from git and reinstall. If you installed from a package, perhaps you might prefer to manually hack the file to make the one line change by hand? Back it up first ;) /usr/lib/pymodules/python2.7/Bio/GenBank/Scanner.py Peter From semenko at alum.mit.edu Mon Sep 17 13:36:16 2012 From: semenko at alum.mit.edu (Nick Semenkovich) Date: Mon, 17 Sep 2012 12:36:16 -0500 Subject: [Biopython] Error parsing EMBL file In-Reply-To: References:

Message-ID: Awesome -- works great! (virtualenv makes this so easy!) Thanks for the quick patch! - Nick On Mon, Sep 17, 2012 at 12:31 PM, Peter Cock wrote: > On Mon, Sep 17, 2012 at 6:22 PM, Nick Semenkovich wrote: >> Looks like it's dying at a line-wrapped location string: >> >> RN [16] >> RP 1-5181,6229-11775,13275-15420,18210-23250,29410-32271,34850-38580, >> RP 41454-41724 >> RX DOI; 10.1128/JB.185.4.1475-1477.2003. >> RX PUBMED; 12562822. >> RA Pedulla M.L., Ford M.E., Karthikeyan T., Houtz J.M., Hendrix R.W., >> RA Hatfull G.F., Poteete A.R., Gilcrease E.B., Winn-Stapley D.A., >> RA Casjens S.R.; >> RT "Corrected sequence of the bacteriophage p22 genome"; >> RL J. Bacteriol. 185(4):1475-1477(2003). >> >> >> This works if RP is just one line: >> RP 1-5181,6229-11775,13275-15420,18210-23250,29410-32271,34850-38580,41454-41724 > > Good detective work :) > > Can you try with this fix? > https://github.com/biopython/biopython/commit/0da9d7e72a95fe788c7c32c9cbc2ac95d84bb7b7 > > If you installed from source, the easiest way would be to grab the latest > code from git and reinstall. > > If you installed from a package, perhaps you might prefer to manually > hack the file to make the one line change by hand? Back it up first ;) > /usr/lib/pymodules/python2.7/Bio/GenBank/Scanner.py > > Peter -- Nick Semenkovich Laboratory of Dr. Jeffrey I. Gordon Medical Scientist Training Program School of Medicine Washington University in St. Louis 314.362.3963 (Lab) http://web.mit.edu/semenko/ From p.j.a.cock at googlemail.com Tue Sep 18 05:52:13 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 18 Sep 2012 10:52:13 +0100 Subject: [Biopython] Removing HotRand.py? In-Reply-To: References: Message-ID: On Tue, Sep 11, 2012 at 2:18 AM, Nick Semenkovich wrote: > I've submitted a pull request to deprecate Bio/HotRand.py > > Is anyone still using the HotRandom functions? 
> > > It looks like the module is pretty old and there are some better alternatives: > http://pypi.python.org/pypi/randomdotorg/ > > > Pull Request: https://github.com/biopython/biopython/pull/69 > > Best, > Nick Semenkovich Since no one has objected, I will commit this. Our gradual deprecation process means there is still time to reverse this and/or delay the removal of Bio.HotRand if needed. Thanks, Peter From cfriedline at vcu.edu Tue Sep 18 17:34:11 2012 From: cfriedline at vcu.edu (Chris Friedline) Date: Tue, 18 Sep 2012 17:34:11 -0400 Subject: [Biopython] multiprocessing and SeqIO.index_db() Message-ID: Hi, I ran into this today, and wondering if there is a work around. If I attempt to index multiple files with multiprocessing using SeqIO.index_db(), I can create the databases, but I'm unable to open them after they come back from the async process. Instead, I get this when trying to (say) print the dictionary: File "/Users/chris/.virtualenvs/default/lib/python2.7/site-packages/Bio/SeqIO/_index.py", line 112, in __str__ return "{%s : SeqRecord(...), ...}" % repr(self.keys()[0]) File "/Users/chris/.virtualenvs/default/lib/python2.7/site-packages/Bio/SeqIO/_index.py", line 416, in keys self._con.execute("SELECT key FROM offset_data;").fetchall()] sqlite3.ProgrammingError: Base Connection.__init__ not called. As a workaround, I'm just calling index_db again. From the source code, it appears that the index is not rebuilt by doing this, and it seems to work OK. Is this just a multiprocessing/pickling issue? Thanks, Chris From eric.talevich at gmail.com Wed Sep 19 14:10:31 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 19 Sep 2012 14:10:31 -0400 Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? 
In-Reply-To: References: <118F034CF4C3EF48A96F86CE585B94BF33BAEDEB@CHIMBX5.ad.uillinois.edu> <1347676992.27319.YahooMailClassic@web164005.mail.gq1.yahoo.com> Message-ID: On Sat, Sep 15, 2012 at 9:22 AM, Wibowo Arindrarto wrote: > Hi guys, > > > > 2) If we add a function to Biopython that generates Blast plain-text > > > output (or something close to it) from Blast XML output, then a user > can > > > generate the Blast output in XML format, parse it with Biopython, > > > optionally > > > filter it, and then generate the corresponding plain-text output; > > > > The new 'SearchIO' results objects str/repr should be familiar to > > anyone who has looked at the plain text BLAST output - but > > not identical. We could apply some of these improvements > > to the current BLAST parsers, but I favour aiming to simply > > deprecate them in favour of 'SearchIO' (namespace to be > > decided). > > > > However, we certainly could try and offer a plain-text BLAST > > output format from 'SearchIO', although IIRC Bow has not tried > > that yet. It shouldn't be too complicated - unless you aim for > > 100% agreement with the latest BLAST output (moving target). > > Yes, this has not been attempted ~ mostly because I feel that the > BLAST plain text is indeed a moving target. But, if we are in favor of > choosing one format from one BLAST version and always stick to it, it > sounds more reasonable. > Since NCBI is not planning to make any more changes to "legacy" blastall, this could be an opportunity to settle on one stable plain-text BLAST output style to parse in Bio.Search(IO), and admit that we're not going to bother keeping up with BLAST+ plain-text reports. (I imagine there's a certain degree of overlap between users stuck with legacy BLAST installations and those stuck with plain-text BLAST reports.) > > There are one missing detail that is only present in the plain text > format, though: the hit-level e-values.
If we do decide to write a > plain text writer, we either have to demand the user supply these > values, or we omit the entire hit-level e-value table, or we fill it > with something else. > But the Hsp-level scores or bitscores are included, right? The database size, query length and Karlin-Altschul kappa and lambda values are included in the BLAST XML output, so it's possible (and not difficult) to recalculate the e-values. http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html#head3 Note that BLAST tweaks the raw alignment score with its own heuristics, so it's not easy to get the raw score from the alignment in the XML. But once you have the raw score, the rest is straightforward. Cheers, Eric From p.j.a.cock at googlemail.com Fri Sep 21 09:22:48 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 21 Sep 2012 14:22:48 +0100 Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? In-Reply-To: References: <118F034CF4C3EF48A96F86CE585B94BF33BAEDEB@CHIMBX5.ad.uillinois.edu> <1347676992.27319.YahooMailClassic@web164005.mail.gq1.yahoo.com> Message-ID: On Sat, Sep 15, 2012 at 2:22 PM, Wibowo Arindrarto wrote: > Hi guys, > >> > 2) If we add a function to Biopython that generates Blast plain-text >> > output (or something close to it) from Blast XML output, then a user can >> > generate the Blast output in XML format, parse it with Biopython, >> > optionally >> > filter it, and then generate the corresponding plain-text output; >> >> The new 'SearchIO' results objects str/repr should be familiar to >> anyone who has looked at the plain text BLAST output - but >> not identical. We could apply some of these improvements >> to the current BLAST parsers, but I favour aiming to simply >> deprecate them in favour of 'SearchIO' (namespace to be >> decided). >> >> However, we certainly could try and offer a plain-text BLAST >> output format from 'SearchIO', although IIRC Bow has not tried >> that yet.
It shouldn't be too complicated - unless you aim for >> 100% agreement with the latest BLAST output (moving target). > > Yes, this has not been attempted ~ mostly because I feel that the > BLAST plain text is indeed a moving target. But, if we are in favor of > choosing one format from one BLAST version and always stick to it, it > sounds more reasonable. > > There are one missing detail that is only present in the plain text > format, though: the hit-level e-values. If we do decide to write a > plain text writer, we either have to demand the user supply these > values, or we omit the entire hit-level e-value table, or we fill it > with something else. Bow and I have just been over the BLAST+ source code, and confirmed the 'hit level e-value' shown in the plain text description table before the alignments is in fact just the e-value of the best HSP, i.e. the minimum e-value. So that isn't a problem after all. Peter From w.arindrarto at gmail.com Fri Sep 21 19:03:10 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Sat, 22 Sep 2012 01:03:10 +0200 Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? In-Reply-To: References: <118F034CF4C3EF48A96F86CE585B94BF33BAEDEB@CHIMBX5.ad.uillinois.edu> <1347676992.27319.YahooMailClassic@web164005.mail.gq1.yahoo.com>

Message-ID: Hi guys, On Fri, Sep 21, 2012 at 3:22 PM, Peter Cock wrote: > On Sat, Sep 15, 2012 at 2:22 PM, Wibowo Arindrarto > wrote: >> Hi guys, >> >>> > 2) If we add a function to Biopython that generates Blast plain-text >>> > output (or something close to it) from Blast XML output, then a user can >>> > generate the Blast output in XML format, parse it with Biopython, >>> > optionally >>> > filter it, and then generate the corresponding plain-text output; >>> >>> The new 'SearchIO' results objects str/repr should be familiar to >>> anyone who has looked at the plain text BLAST output - but >>> not identical. We could apply some of these improvements >>> to the current BLAST parsers, but I favour aiming to simply >>> deprecate them in favour of 'SearchIO' (namespace to be >>> decided). >>> >>> However, we certainly could try and offer a plain-text BLAST >>> output format from 'SearchIO', although IIRC Bow has not tried >>> that yet. It shouldn't be too complicated - unless you aim for >>> 100% agreement with the latest BLAST output (moving target). >> >> Yes, this has not been attempted ~ mostly because I feel that the >> BLAST plain text is indeed a moving target. But, if we are in favor of >> choosing one format from one BLAST version and always stick to it, it >> sounds more reasonable. >> >> There are one missing detail that is only present in the plain text >> format, though: the hit-level e-values. If we do decide to write a >> plain text writer, we either have to demand the user supply these >> values, or we omit the entire hit-level e-value table, or we fill it >> with something else. > > Bow and I have just been over the BLAST+ source code, > and confirmed the 'hit level e-value' shown in the plain text > description table before the alignments is in fact just the > e-value of the best HSP. i.e. The minimum e-value. > > So that isn't a problem afterall. > > Peter Yes, I should've checked first how that e-value gets there. 
A little peeking into the source code and it was apparent that it's the lowest HSP-level e-value in the hit. So we don't have to worry about calculating new values. For the writing support, I agree with Eric ~ we could use the latest BLAST legacy output as our target plain text format. For parsing, I'm still not sure. Unless there's a massive speed-up, I prefer to keep the current parser as the base given its versatility. Perhaps I can do a bit more 'trimming' so that the parser directly creates SearchIO objects. This won't be a major change to the logic, though. regards, Bow From mmokrejs at fold.natur.cuni.cz Mon Sep 24 21:28:36 2012 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Tue, 25 Sep 2012 03:28:36 +0200 Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? In-Reply-To: References: <5051F9AD.104@fold.natur.cuni.cz> Message-ID: <506108C4.7010102@fold.natur.cuni.cz> Hi Wibowo, will you also add the cElementTree calls to NCBIXML (replacing SAX parser)? I would have to lookup how the record attributes changed(=renamed) from those specific for blast to those generalized and used(=promoted) by SearchIO. Do you have a list of sed regexps? ;-) Or shall I just replace NCBIXML.parse() with SearchIO.parse()? ;-) Both would be certainly helpful at 3 a.m. :(( I am leaving the profiler running overnight and will at least be able to lookup where the bottleneck in the current NCBIXML is. The rest ... next time. ;-) I see, hsp.sbjct_start is renamed to hsp.hit_start ... How about exposing *additionally* the data through "old" names? So that "SearchIO" would expose both hsp.hit_start and also hsp.sbjct_start ... and maybe even more if blat or hmmer parsers expose so far other attributes as well. I know it is ugly but makes the transition smoother. ;-) Those are just references. 
;-) I would like to do something like: try: # latest and greatest biopython version installed from Bio import SearchIO except ImportError: # some old installation from Bio import SeqIO but have the rest of my code unchanged. Umm, I use NCBIXML.parse() so I won't need even the above. You just change it in your git branch and I won't have to touch my code. That's fair, isn't it? ;-) Did you profile biopython or SearchIO yourself? Best, Martin Wibowo Arindrarto wrote: > Hi Martin, > > There is actually already a faster BLAST XML parser written using > cElementTree in Biopython :) (although it's yet to be included in the > main branch). It's part of Biopython's SearchIO module that I recently > wrote (the name SearchIO might change in the future). And indeed, my > early benchmarks has shown that it does perform faster. > > This branch is available here: > https://github.com/bow/biopython/tree/searchio. I've also written a > draft tutorial on how to use it here: > http://bow.web.id/biopython/Tutorial.html#htoc96. > > However, as it's not yet in the current branch, you need to do a > little bit of command line work to set it up: > > 1. Set up a new virtualenv environment (so that it doesn't clash with > your other Biopython installation) and activate it. > 2. Clone the repository: `git clone > https://github.com/bow/biopython.git`, checkout the 'searchio' branch > 3. Run `python setup.py develop`. This will keep the > installation in-sync with any future `git pull` you might perform on > the branch. > > Hope this helps :), > Bow > > > On Thu, Sep 13, 2012 at 5:20 PM, Martin Mokrejs > wrote: >> Hi, >> I am using "blastall -p blastn ... 
-m 7" to yield about 100GB large XML files >> which are then parsed by >> >> from Bio.Blast import NCBIXML >> _blastn_fileh = open(blast_out_xml_filename) >> _blastn_iterator = NCBIXML.parse(_blastn_fileh) >> _record = _blastn_iterator.next() # fetch the very first BLAST result from generator >> >> In my case the blastn searches seem to take longer than takes the XML parsing. :( >> I do not have timing numbers here but wonder why is cElementTree used only in Uniprot >> biopython modules and not in SeqIO. What XML parsing library is my biopython-1.59 using? >> Isn't there any argument when setup.py is called to discern between elementtree, cElementTree >> which I think use expat ...? I am writing this a bit from top of my head hoping Peter ;-) >> or somebody else will know right away where to look for a performance bottleneck >> and where to change code to use cElementTree which always seemed the fastest to me. >> Thank you for some initial advice. >> Martin >> P.S.: And yes, I would love to parse blastn plaintext output or some other more compact one, >> the XML is really an overkill. >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > > From p.j.a.cock at googlemail.com Tue Sep 25 04:09:26 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 25 Sep 2012 09:09:26 +0100 Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? In-Reply-To: <506108C4.7010102@fold.natur.cuni.cz> References: <5051F9AD.104@fold.natur.cuni.cz> <506108C4.7010102@fold.natur.cuni.cz> Message-ID: On Tue, Sep 25, 2012 at 2:28 AM, Martin Mokrejs wrote: > Hi Wibowo, > will you also add the cElementTree calls to NCBIXML (replacing SAX parser)? > I would have to lookup how the record attributes changed(=renamed) > from those specific for blast to those generalized and used(=promoted) > by SearchIO. 
> Do you have a list of sed regexps? ;-)
>
> Or shall I just replace NCBIXML.parse() with SearchIO.parse()? ;-)
> Both would be certainly helpful at 3 a.m. :(( I am leaving the profiler
> running overnight and will at least be able to lookup where the bottleneck
> in the current NCBIXML is. The rest ... next time. ;-)

We did discuss updating the internals of the old NCBIXML parser
to use ElementTree / cElementTree, but currently the plan is to
simply deprecate the old parser, so this seems a wasted effort.

> I see, hsp.sbjct_start is renamed to hsp.hit_start ... How about exposing
> *additionally* the data through "old" names? So that "SearchIO" would expose
> both hsp.hit_start and also hsp.sbjct_start ... and maybe even more if blat
> or hmmer parsers expose so far other attributes as well. I know it is ugly but
> makes the transition smoother. ;-) Those are just references. ;-) I would
> like to do something like:
>
> try:
>     # latest and greatest biopython version installed
>     from Bio import SearchIO
> except ImportError:
>     # some old installation
>     from Bio import SeqIO
>
>
> but have the rest of my code unchanged. Umm, I use NCBIXML.parse()
> so I won't need even the above. You just change it in your git branch
> and I won't have to touch my code. That's fair, isn't it? ;-)

The plan is to reward people for updating their code by giving them
faster BLAST XML parsing (and an easy way to try out other input
file formats in future).

Note that Bio.SearchIO is the working name and current namespace
used on the branch, but is unlikely to be the final name.

And I'm not keen on adding backwards compatible aliases for the old
BLAST parser names - even if they did come with deprecation warnings.
In fact I suspect even that wouldn't give you the drop-in replacement
you are hoping for; the object hierarchy has changed too.

However, if there are some specific cases where you think the old
name is still sensible given the broader scope of the new parser
covering many other formats as well as BLAST, then some minor
renames seem more reasonable.

> Did you profile biopython or SearchIO yourself?
> Best,
> Martin

Bow did some profiling of the old NCBIXML parser against his
SearchIO work.

Peter

From w.arindrarto at gmail.com Tue Sep 25 05:53:08 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Tue, 25 Sep 2012 11:53:08 +0200
Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
In-Reply-To:
References: <5051F9AD.104@fold.natur.cuni.cz> <506108C4.7010102@fold.natur.cuni.cz>
Message-ID:

Hi Martin, Peter,

I agree with Peter. The new object model is a bit different from the
old one in Bio.Blast, so a simple search & replace might not do the
trick.

The same goes for the attribute names. I suppose I could add one table
in the draft tutorial to list the new attribute names, but I prefer
not to have any Bio.Blast-compatible names in the code.

As for the profiling, I did some quick benchmarks but they weren't
really thorough. I only compared the parsing times of
Bio.Blast.NCBIXML and the new BLAST XML parser in SearchIO. Using a
test file containing 1000 BLAST queries (286 MB total), the results
were as follows:

on SearchIO:
    97.11
    93.66
    94.13
    91.35
    90.90
    Total time : 467.15
    Average    : 93.43

on Bio.Blast:
    441.45
    412.57
    471.31
    434.22
    429.35
    Total time : 2188.90
    Average    : 437.78

The speed-up was almost 5x. I didn't check for any optimizable
bottlenecks, though.

Hope that helps,
Bow

On Tue, Sep 25, 2012 at 10:09 AM, Peter Cock wrote:
> On Tue, Sep 25, 2012 at 2:28 AM, Martin Mokrejs
> wrote:
>> Hi Wibowo,
>> will you also add the cElementTree calls to NCBIXML (replacing SAX parser)?
>> I would have to lookup how the record attributes changed(=renamed)
>> from those specific for blast to those generalized and used(=promoted)
>> by SearchIO.
Do you have a list of sed regexps? ;-) >> >> Or shall I just replace NCBIXML.parse() with SearchIO.parse()? ;-) >> Both would be certainly helpful at 3 a.m. :(( I am leaving the profiler >> running overnight and will at least be able to lookup where the bottleneck >> in the current NCBIXML is. The rest ... next time. ;-) > > We did discuss updating the internals of the old NCBIXML parser > to use ElementTree / cElementTree, but currently the plan is to > simply deprecate the old parser, so this seems a wasted effort. > >> I see, hsp.sbjct_start is renamed to hsp.hit_start ... How about exposing >> *additionally* the data through "old" names? So that "SearchIO" would expose >> both hsp.hit_start and also hsp.sbjct_start ... and maybe even more if blat >> or hmmer parsers expose so far other attributes as well. I know it is ugly but >> makes the transition smoother. ;-) Those are just references. ;-) I would >> like to do something like: >> >> try: >> # latest and greatest biopython version installed >> from Bio import SearchIO >> except ImportError: >> # some old installation >> from Bio import SeqIO >> >> >> but have the rest of my code unchanged. Umm, I use NCBIXML.parse() >> so I won't need even the above. You just change it in your git branch >> and I won't have to touch my code. That's fair, isn't it? ;-) > > The plan is to reward people for updating their code by giving them > faster BLAST XML parsing (and an easy way to try out other input > file formats in future). > > Note that Bio.SearchIO is the working name and current namespace > used on the branch, but is unlikely to be the final name. > > And I'm not keen on adding backwards compatible aliases for the old > BLAST parser names - even if they did come with deprecation warnings. > In fact I suspect even that wouldn't give you the drop in replacement > you are hoping for, the object heirachy has changed too. 
>
> However, if there are some specific cases where you think the old
> name is still sensible given the broader scope of the new parser
> covering many other formats as well as BLAST, then some minor
> renames seems more reasonable.
>
>> Did you profile biopython or SearchIO yourself?
>> Best,
>> Martin
>
> Bow did some profiling of the old NCBIXML parser against his
> SearchIO work.
>
> Peter

From mjldehoon at yahoo.com Tue Sep 25 06:34:52 2012
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 25 Sep 2012 03:34:52 -0700 (PDT)
Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
In-Reply-To:
Message-ID: <1348569292.2538.YahooMailClassic@web164006.mail.gq1.yahoo.com>

> The same goes with the attribute names. I suppose I
> could add one table in the draft tutorial to list the
> new attribute names, but I prefer not to have any
> Bio.Blast-compatible names in the code.

>> I would have to lookup how the record attributes
>> changed(=renamed) from those specific for blast to
>> those generalized and used(=promoted)
>> by SearchIO.

>> I see, hsp.sbjct_start is renamed to hsp.hit_start ...

I would suggest using the same names as in the XML source file. Then
we are consistent with NCBI, we don't have to come up with our own
names, and we won't have to provide a list of biopython-defined record
attributes. Dropping the "Hsp" prefix from the XML tag name, that
would be "hit-from".

Best,
-Michiel.

From p.j.a.cock at googlemail.com Tue Sep 25 07:03:15 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 25 Sep 2012 12:03:15 +0100
Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
In-Reply-To: <1348569292.2538.YahooMailClassic@web164006.mail.gq1.yahoo.com>
References: <1348569292.2538.YahooMailClassic@web164006.mail.gq1.yahoo.com>
Message-ID:

On Tue, Sep 25, 2012 at 11:34 AM, Michiel de Hoon wrote:
>> The same goes with the attribute names. I suppose I
>> could add one table in the draft tutorial to list the
>> new attribute names, but I prefer not to have any
>> Bio.Blast-compatible names in the code.
>
>>> I would have to lookup how the record attributes
>>> changed(=renamed) from those specific for blast to
>>> those generalized and used(=promoted)
>>> by SearchIO.
>
>>> I see, hsp.sbjct_start is renamed to hsp.hit_start ...
>
> I would suggest using the same names as in the XML
> source file. Then we are consistent with NCBI, we don't
> have to come up with our own names, and we won't
> have to provide a list of biopython-defined record
> attributes. Dropping the "Hsp" prefix from the XML tag
> name, that would be "hit-from".

We can't be fully consistent with the NCBI since they
have more than one naming convention ;)

Personally I find the NCBI's human readable column
names used in the tabular output far nicer than the
verbose terms in the XML, which is not really human
readable, e.g.

slen means Subject sequence length
qstart means Start of alignment in query
qend means End of alignment in query
sstart means Start of alignment in subject
send means End of alignment in subject
qseq means Aligned part of query sequence
sseq means Aligned part of subject sequence

The term 'subject' for the hit sequence is quite BLAST
specific, but otherwise these terms are reasonably broad
and could make sense in SearchIO beyond BLAST
(assuming you don't find shortening the subject/query
prefix to a single letter confusing).

Currently the HSP object in SearchIO uses hit_start,
hit_end, query_start and query_end - but also note
that we're using Python counting.

Peter

From mmokrejs at fold.natur.cuni.cz Tue Sep 25 07:15:19 2012
From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs)
Date: Tue, 25 Sep 2012 13:15:19 +0200
Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
In-Reply-To: References: <5051F9AD.104@fold.natur.cuni.cz> <506108C4.7010102@fold.natur.cuni.cz> Message-ID: <50619247.2080906@fold.natur.cuni.cz> Peter Cock wrote: > On Tue, Sep 25, 2012 at 2:28 AM, Martin Mokrejs > wrote: >> Hi Wibowo, >> will you also add the cElementTree calls to NCBIXML (replacing SAX parser)? >> I would have to lookup how the record attributes changed(=renamed) >> from those specific for blast to those generalized and used(=promoted) >> by SearchIO. Do you have a list of sed regexps? ;-) >> >> Or shall I just replace NCBIXML.parse() with SearchIO.parse()? ;-) >> Both would be certainly helpful at 3 a.m. :(( I am leaving the profiler >> running overnight and will at least be able to lookup where the bottleneck >> in the current NCBIXML is. The rest ... next time. ;-) > > We did discuss updating the internals of the old NCBIXML parser > to use ElementTree / cElementTree, but currently the plan is to > simply deprecate the old parser, so this seems a wasted effort. Then it means for me that some parts of my code will exist twice. As you said below the structuring of object in searchio vs. NCBIXML is different so I will really need two routines. :( One for newer installation and one for (most) older biopython versions. I would really suggest to spend some effort on optimizing the coding style of the old parser. The gain might be quite substantial and easy to gain for you and at no cost for end users. > >> I see, hsp.sbjct_start is renamed to hsp.hit_start ... How about exposing >> *additionally* the data through "old" names? So that "SearchIO" would expose >> both hsp.hit_start and also hsp.sbjct_start ... and maybe even more if blat >> or hmmer parsers expose so far other attributes as well. I know it is ugly but >> makes the transition smoother. ;-) Those are just references. 
;-) I would >> like to do something like: >> >> try: >> # latest and greatest biopython version installed >> from Bio import SearchIO >> except ImportError: >> # some old installation >> from Bio import SeqIO >> >> >> but have the rest of my code unchanged. Umm, I use NCBIXML.parse() >> so I won't need even the above. You just change it in your git branch >> and I won't have to touch my code. That's fair, isn't it? ;-) > > The plan is to reward people for updating their code by giving them > faster BLAST XML parsing (and an easy way to try out other input > file formats in future). That will take a long while for people to switch over. Fix all HOWTOs and other docs all over the websites in the world ... That's a long shot. I would really try to provide a mapping interface so that people can just do the above try/except trick during module import. > > Note that Bio.SearchIO is the working name and current namespace > used on the branch, but is unlikely to be the final name. That's no problem for me. > > And I'm not keen on adding backwards compatible aliases for the old > BLAST parser names - even if they did come with deprecation warnings. > In fact I suspect even that wouldn't give you the drop in replacement > you are hoping for, the object heirachy has changed too. I understand you reasoning but maintaining two copies of functionally same code is boring for users as well. ;-) I can adjust for that myself, sure. > > However, if there are some specific cases where you think the old > name is still sensible given the broader scope of the new parser > covering many other formats as well as BLAST, then some minor > renames seems more reasonable. > >> Did you profile biopython or SearchIO yourself? >> Best, >> Martin > > Bow did some profiling of the old NCBIXML parser against his > SearchIO work. 
> > Peter > > From mmokrejs at fold.natur.cuni.cz Tue Sep 25 07:26:46 2012 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Tue, 25 Sep 2012 13:26:46 +0200 Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? In-Reply-To: References: <1348569292.2538.YahooMailClassic@web164006.mail.gq1.yahoo.com> Message-ID: <506194F6.9000103@fold.natur.cuni.cz> Peter Cock wrote: > Currently the HSP object in SearchIO uses hit_start, > hit_end, query_start and query_end - but also note > that we're using Python counting. Ah, thanks for the reminder. Yes, this is exactly why I wasn't very happy to re-implement my code right now to use searchio but forgot to say that. I already did fix all the off-by-one tweaks in my code to use somewhere the zero-based counting and somewhere to rather use 1-based (where human is reading the output text files/tables). And these are scattered through the program (I think) and this will be probably the major stopper for me. ;) Things might break for me all over the places. I am not saying this is good idea but really, providing cElementTree calls from within NCBIXML would be more appealing to me (instead of current python-based expat parser calls). From p.j.a.cock at googlemail.com Tue Sep 25 08:26:48 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 25 Sep 2012 13:26:48 +0100 Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? In-Reply-To: <506194F6.9000103@fold.natur.cuni.cz> References: <1348569292.2538.YahooMailClassic@web164006.mail.gq1.yahoo.com> <506194F6.9000103@fold.natur.cuni.cz> Message-ID: On Tue, Sep 25, 2012 at 12:26 PM, Martin Mokrejs wrote: > Peter Cock wrote: > >> Currently the HSP object in SearchIO uses hit_start, >> hit_end, query_start and query_end - but also note >> that we're using Python counting. > > Ah, thanks for the reminder. 
> Yes, this is exactly why I wasn't very happy to re-implement
> my code right now to use searchio but forgot to say that. I already did fix all the
> off-by-one tweaks in my code to use somewhere the zero-based counting and somewhere to
> rather use 1-based (where human is reading the output text files/tables). And these are
> scattered through the program (I think) and this will be probably the major stopper for me.
> ;) Things might break for me all over the places.

Sadly whenever we are dealing with position input/output there
will be off-by-one adjustments required. I think it is wise to use
just one standard internally to a tool, and for Python that
means zero-based counting.

> I am not saying this is good idea but really, providing cElementTree calls from within
> NCBIXML would be more appealing to me (instead of current python-based expat parser
> calls).

OK - so there is at least one person making heavy use of the
NCBIXML so we shouldn't rush to deprecate it after merging
SearchIO, and there *is* some benefit from making it faster
(but with the same API).

In principle NCBIXML would be rewritten to use cElementTree
/ElementTree and preserve the API - if you or anyone else want
to do that (and the unit tests still pass), then I'm happy to review
such changes. Likewise for less dramatic optimisations.

Regards,

Peter

From p.j.a.cock at googlemail.com Tue Sep 25 09:32:07 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 25 Sep 2012 14:32:07 +0100
Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
In-Reply-To:
References: <1348569292.2538.YahooMailClassic@web164006.mail.gq1.yahoo.com> <506194F6.9000103@fold.natur.cuni.cz>
Message-ID:

On Tue, Sep 25, 2012 at 1:26 PM, Peter Cock wrote:
>
> OK - so there is at least one person making heavy use of the
> NCBIXML so we shouldn't rush to deprecate it after merging
> SearchIO, and there *is* some benefit from making it faster
> (but with the same API).
>
> In principle NCBIXML would be rewritten to use cElementTree
> /ElementTree and preserve the API - if you or anyone else want
> to do that (and the unit tests still pass), then I'm happy to review
> such changes. Likewise for less dramatic optimisations.

Martin emailed me to ask about this bit of the code, and it
can be sped up - this shows about a 5% reduction:
https://github.com/biopython/biopython/commit/970364761982bf331c221b6f007e8b8e52fa9600

Summary parsing a 286MB XML file from BLASTX 2.2.26+
for 1000 genes against the NR database.

NCBIXML before change: About 162s
NCBIXML after change: About 154s
NCBIXML removing debug: About 152s
Using SearchIO: About 79s

This is probably the same test file Bow gave numbers for earlier,
although it seems SearchIO has less of an advantage on my
machine (about x2) compared to Bow's machine (almost x5).

(We should check memory usage too...)

Peter

---------------------------------------------

The full details,

Before this change:

$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 161.8s

real 2m41.894s
user 2m41.208s
sys 0m0.675s

$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 161.8s

real 2m41.984s
user 2m41.296s
sys 0m0.677s

$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 162.6s

real 2m42.771s
user 2m41.995s
sys 0m0.763s


With this change:

$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 152.4s

real 2m32.582s
user 2m31.910s
sys 0m0.663s

$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 153.5s

real 2m33.680s
user 2m32.977s
sys 0m0.695s

$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 153.8s

real 2m33.931s
user 2m33.258s
sys 0m0.661s

And if we go further and remove _debug_ignore_list and
this bit of debug code the saving is marginal:

$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 151.5s

real 2m31.611s
user 2m30.934s
sys 0m0.665s

$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 151.2s

real 2m31.348s
user 2m30.664s
sys 0m0.674s

$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 152.9s

real 2m32.994s
user 2m32.314s
sys 0m0.669s

This is the timing script I used,

$ more /tmp/time_ncbixml.py
import sys
import time
from Bio.Blast import NCBIXML
for f in sys.argv[1:]:
    start = time.time()
    count = 0
    handle = open(f)
    for record in NCBIXML.parse(handle):
        count += 1
    handle.close()
    print "%i records in %s in %0.1fs" % (count, f, time.time() - start)
#End of file

For comparison, here is the timing on the same setup but using
SearchIO from Bow's current branch:

$ time python time_searchio.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 79.1s

real 1m19.259s
user 1m18.397s
sys 0m0.799s

$ time python time_searchio.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 78.7s

real 1m18.878s
user 1m18.149s
sys 0m0.719s

$ time python time_searchio.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 79.5s

real 1m19.611s
user 1m18.683s
sys 0m0.918s

And the script:

$ more /tmp/time_searchio.py
import sys
import time
from Bio import SearchIO
for f in sys.argv[1:]:
    start = time.time()
    count = 0
    handle = open(f)
    for record in SearchIO.parse(handle, "blast-xml"):
        count += 1
    handle.close()
    print "%i records in %s in %0.1fs" % (count, f, time.time() - start)
#End of file

From golubchi at stats.ox.ac.uk Tue Sep 25 10:39:11 2012
From: golubchi at stats.ox.ac.uk (Tanya Golubchik)
Date: Tue, 25 Sep 2012 15:39:11 +0100
Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
In-Reply-To:
References: <1348569292.2538.YahooMailClassic@web164006.mail.gq1.yahoo.com> <506194F6.9000103@fold.natur.cuni.cz>

Message-ID: <5061C20F.7040209@stats.ox.ac.uk> Hello, Apologies for not having followed the entire discussion, but just wanted to say that we're also using NCBIXML here and are likely to be incorporating it in a new piece of software soon, so it would be really unfortunate if some tags disappeared, were renamed or (even worse) changed meaning in future releases. I'm a bit late coming in here so maybe this has been answered, but is there a better parser that should be used at the moment? I was under the impression that NCBIXML is the only one. Thanks, Tanya On 25/09/12 14:32, Peter Cock wrote: > On Tue, Sep 25, 2012 at 1:26 PM, Peter Cock wrote: >> >> OK - so there is at least one person making heaving use of the >> NCBIXML so we shouldn't rush to deprecate it after merging >> SearchIO, and there *is* some benefit from making it faster >> (but with the same API). >> >> In principle NCBIXML would be rewritten to use cElementTree >> /ElementTree and preserve the API - if you or anyone else want >> to do that (and the unit tests still pass), then I'm happy to review >> such changes. Likewise for less dramatic optimisations. > > Martin emailed me to ask about this bit of the code, and it > can be sped up - this shows about a 5% reduction: > https://github.com/biopython/biopython/commit/970364761982bf331c221b6f007e8b8e52fa9600 > > Summary parsing a 286MB XML file from BLASTX 2.2.26+ > for 1000 genes against the NR database. > > NCBIXML before change: About 162s > NCBIXML after change: About 154s > NCBIXML removing debug: About 152s > Using SearchIO: About 79s > > This is probably the same test file Bow gave numbers for earlier, > although it seems SearchIO has less of an advantage on my > machine (about x2) compared to Bow's machine (almost x5). > > (We should check memory usage too...) 
> > Peter > > --------------------------------------------- > > The full details, > > Before this change: > > $ time python time_ncbixml.py thousand_blastx_nr.xml > 1000 records in thousand_blastx_nr.xml in 161.8s > > real 2m41.894s > user 2m41.208s > sys 0m0.675s > > $ time python time_ncbixml.py thousand_blastx_nr.xml > 1000 records in thousand_blastx_nr.xml in 161.8s > > real 2m41.984s > user 2m41.296s > sys 0m0.677s > > $ time python time_ncbixml.py thousand_blastx_nr.xml > 1000 records in thousand_blastx_nr.xml in 162.6s > > real 2m42.771s > user 2m41.995s > sys 0m0.763s > > > With this change: > > $ time python time_ncbixml.py thousand_blastx_nr.xml > 1000 records in thousand_blastx_nr.xml in 152.4s > > real 2m32.582s > user 2m31.910s > sys 0m0.663s > > $ time python time_ncbixml.py thousand_blastx_nr.xml > 1000 records in thousand_blastx_nr.xml in 153.5s > > real 2m33.680s > user 2m32.977s > sys 0m0.695s > > $ time python time_ncbixml.py thousand_blastx_nr.xml > 1000 records in thousand_blastx_nr.xml in 153.8s > > real 2m33.931s > user 2m33.258s > sys 0m0.661s > > And if we go further and remove _debug_ignore_list and > this bit of debug code the saving is marginal: > > $ time python time_ncbixml.py thousand_blastx_nr.xml > 1000 records in thousand_blastx_nr.xml in 151.5s > > real 2m31.611s > user 2m30.934s > sys 0m0.665s > > $ time python time_ncbixml.py thousand_blastx_nr.xml > 1000 records in thousand_blastx_nr.xml in 151.2s > > real 2m31.348s > user 2m30.664s > sys 0m0.674s > > $ time python time_ncbixml.py thousand_blastx_nr.xml > 1000 records in thousand_blastx_nr.xml in 152.9s > > real 2m32.994s > user 2m32.314s > sys 0m0.669s > > This is the timing script I used, > > $ more /tmp/time_ncbixml.py > import sys > import time > from Bio.Blast import NCBIXML > for f in sys.argv[1:]: > start = time.time() > count = 0 > handle = open(f) > for record in NCBIXML.parse(handle): > count += 1 > handle.close() > print "%i records in %s in %0.1fs" % (count, f, 
time.time() - start) > #End of file > > For comparison, here is the timing on the same setup but using > SearchIO from Bow's current branch: > > $ time python time_searchio.py thousand_blastx_nr.xml > 1000 records in thousand_blastx_nr.xml in 79.1s > > real 1m19.259s > user 1m18.397s > sys 0m0.799s > > $ time python time_searchio.py thousand_blastx_nr.xml > 1000 records in thousand_blastx_nr.xml in 78.7s > > real 1m18.878s > user 1m18.149s > sys 0m0.719s > > $ time python time_searchio.py thousand_blastx_nr.xml > 1000 records in thousand_blastx_nr.xml in 79.5s > > real 1m19.611s > user 1m18.683s > sys 0m0.918s > > And the script: > > $ more /tmp/time_searchio.py > import sys > import time > from Bio import SearchIO > for f in sys.argv[1:]: > start = time.time() > count = 0 > handle = open(f) > for record in SearchIO.parse(handle, "blast-xml"): > count += 1 > handle.close() > print "%i records in %s in %0.1fs" % (count, f, time.time() - start) > #End of file > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Tue Sep 25 12:00:45 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 25 Sep 2012 17:00:45 +0100 Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? In-Reply-To: <5061C20F.7040209@stats.ox.ac.uk> References: <1348569292.2538.YahooMailClassic@web164006.mail.gq1.yahoo.com> <506194F6.9000103@fold.natur.cuni.cz>

<5061C20F.7040209@stats.ox.ac.uk>
Message-ID:

On Tue, Sep 25, 2012 at 3:39 PM, Tanya Golubchik wrote:
> Hello,
>
> Apologies for not having followed the entire discussion, but just wanted
> to say that we're also using NCBIXML here and are likely to be
> incorporating it in a new piece of software soon, so it would be really
> unfortunate if some tags disappeared, were renamed or (even worse)
> changed meaning in future releases.
>
> I'm a bit late coming in here so maybe this has been answered, but is
> there a better parser that should be used at the moment? I was under the
> impression that NCBIXML is the only one.
>
> Thanks,
> Tanya

Hi Tanya,

I hope I can reassure you there is nothing to worry about :)

Right now there is only the NCBIXML parser, and we're not going
to change it (except possibly to make it a little faster if people
want to work on that).

We're planning to add a new module based on Bow's GSoC
code, under the working name SearchIO, which would cover
BLAST, BLAT, HMMER, etc. This would have a different API
and in the long term would probably replace all of Bio.Blast.
http://biopython.org/wiki/SearchIO

The discussion about possible changes has been (I think)
only about this new code (and would have been better off
on the development mailing list but this thread went off on
a slight tangent).

Once 'SearchIO' is released, we'd want to encourage
people to use that instead of NCBIXML, with a view to
deprecating and eventually removing NCBIXML. See:
http://biopython.org/wiki/Deprecation_policy

Regards,

Peter

From golubchi at stats.ox.ac.uk Thu Sep 27 07:35:35 2012
From: golubchi at stats.ox.ac.uk (Tanya Golubchik)
Date: Thu, 27 Sep 2012 12:35:35 +0100
Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
In-Reply-To:
References: <1348569292.2538.YahooMailClassic@web164006.mail.gq1.yahoo.com> <506194F6.9000103@fold.natur.cuni.cz>

<5061C20F.7040209@stats.ox.ac.uk> Message-ID: <50643A07.5080504@stats.ox.ac.uk> Thanks, Peter, that's good to know. Cheers, Tanya On 25/09/12 17:00, Peter Cock wrote: > On Tue, Sep 25, 2012 at 3:39 PM, Tanya Golubchik > wrote: >> Hello, >> >> Apologies for not having followed the entire discussion, but just wanted >> to say that we're also using NCBIXML here and are likely to be >> incorporating it in a new piece of software soon, so it would be really >> unfortunate if some tags disappeared, were renamed or (even worse) >> changed meaning in future releases. >> >> I'm a bit late coming in here so maybe this has been answered, but is >> there a better parser that should be used at the moment? I was under the >> impression that NCBIXML is the only one. >> >> Thanks, >> Tanya > > Hi Tanya, > > I hope I can reassure you there is nothing to worry about :) > > Right now there is only the NCBIXML parser, and we're not going > to change it (except possibly to make it a little faster if people > want to work on that). > > We're planning to a add new module based on Bow's GSoC > code, under the working name SearchIO, which would cover > BLAST, BLAT, HMMER, etc. This would have a different API > and in the long term would probably replace all of Bio.Blast. > http://biopython.org/wiki/SearchIO > > The discussion about possible changes has been (I think) > only about this new code (and would have been better off > on the development mailing list but this thread went off on > a slight tangent). > > Once 'SearchIO' is released, we'd want to encourage > people to use that instead of NCBIXML, with a view to > deprecating and eventually removing NCBIXML. See: > http://biopython.org/wiki/Deprecation_policy > > Regards, > > Peter From idoerg at gmail.com Fri Sep 28 11:34:03 2012 From: idoerg at gmail.com (Iddo Friedberg) Date: Fri, 28 Sep 2012 11:34:03 -0400 Subject: [Biopython] OBO parser? 
Message-ID: There has been some talk about a Biopython OBO parser recently: http://lists.open-bio.org/pipermail/biopython-dev/2012-August/009874.html I am not sure exactly where this stands... would be very interested to see one. In any case, I have the following question, as I need something now: is there an OBO or an OWL parser in Python out there? Thanks, Iddo -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. From mike.thon at gmail.com Sun Sep 30 01:05:36 2012 From: mike.thon at gmail.com (Michael Thon) Date: Sun, 30 Sep 2012 07:05:36 +0200 Subject: [Biopython] support for NCBI's .tbl format Message-ID: <11395A73-2BFE-4064-BA6D-5F90C9235027@gmail.com> I wonder if there is any support in BioPython for outputting sequence features in NCBI's .tbl format which is needed for running tbl2asn. If not, where would it belong? In Bio.SeqIO? From p.j.a.cock at googlemail.com Sun Sep 30 05:42:15 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 30 Sep 2012 10:42:15 +0100 Subject: [Biopython] support for NCBI's .tbl format In-Reply-To: <11395A73-2BFE-4064-BA6D-5F90C9235027@gmail.com> References: <11395A73-2BFE-4064-BA6D-5F90C9235027@gmail.com> Message-ID: On Sun, Sep 30, 2012 at 6:05 AM, Michael Thon wrote: > I wonder if there is any support in BioPython for outputting > sequence features in NCBI's .tbl format which is needed for > running tbl2asn. If not, where would it belong? In Bio.SeqIO? No there isn't, but it sounds like it could go in Bio.SeqIO if you wrote it to operate the SeqRecord level. Internally you could have a lower level SeqFeature based API... 
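A rough sketch of what such a writer might look like, assuming the five-column tab-delimited feature table layout that tbl2asn reads: a ">Feature" header line, then start/stop/key lines, then qualifier lines indented by three tabs. The function name format_tbl and the plain-tuple feature representation are made up for illustration and are not part of Biopython. Partial locations and multi-interval features are deliberately ignored.

```python
def format_tbl(seq_id, features):
    """Render features as a simplified NCBI feature table string.

    `features` is an iterable of (start, end, strand, key, qualifiers)
    tuples. Coordinates are assumed to be 0-based and end-exclusive, the
    convention SeqFeature locations use. `qualifiers` is a list of
    (name, value) pairs. This is a hypothetical helper, not a Biopython API.
    """
    lines = [">Feature %s" % seq_id]
    for start, end, strand, key, qualifiers in features:
        # The table uses 1-based inclusive coordinates.
        first, last = start + 1, end
        if strand == -1:
            # Minus-strand features are written with start greater than stop.
            first, last = last, first
        lines.append("%d\t%d\t%s" % (first, last, key))
        for name, value in qualifiers:
            # Qualifier lines leave the first three columns empty.
            lines.append("\t\t\t%s\t%s" % (name, value))
    return "\n".join(lines) + "\n"


example = format_tbl("seq1", [(0, 1050, 1, "gene", [("gene", "ATH1")])])
print(example)
```

A SeqRecord-level wrapper would then only need to walk record.features and translate each location and qualifier dictionary into these tuples.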
Peter From jttkim at googlemail.com Mon Sep 3 09:31:23 2012 From: jttkim at googlemail.com (Jan T Kim) Date: Mon, 3 Sep 2012 10:31:23 +0100 Subject: [Biopython] Start positions for local pairwise alignments? Message-ID: <20120903093121.GA4129@paxarchia.galaxy.uni> Dear All, after reading a pairwise alignment computed using the EMBOSS water program, is it possible to find out the indices of the sequences in the local alignment within the input sequences? As an illustration, the sequences "tttagagccc" and "ccagagc" align to s1 4 agagc 8 ||||| s2 3 agagc 7 This local alignment doesn't contain the prefixes "ttt" and "cc", respectively. In the water output above, that's reflected by the start indices 4 and 3, respectively. However, after reading that result with import Bio.AlignIO aStream = Bio.AlignIO.parse('s1s2_align.txt', 'emboss') a = aStream.next() print a print a.__dict__ print a[0] print a[0].__dict__ I can't seem to find that information anywhere either in the resulting Bio.Align.MultipleSeqAlignment object, or in the SeqRecord objects that it contains. So, am I looking at the wrong place? Best regards, Jan P.S.: For a while I was convinced that I had seen these indices but it's now occurred to me that that was actually in the pysam.AlignedRead class, which contains the indices of the read in the reference sequence, in the positions instance variable... -- +- Jan T. Kim -------------------------------------------------------+ | email: jttkim at gmail.com | | WWW: http://www.jtkim.dreamhosters.com/ | *-----=< hierarchical systems are for files, not for humans >=-----* From p.j.a.cock at googlemail.com Thu Sep 6 00:01:46 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Sep 2012 01:01:46 +0100 Subject: [Biopython] Start positions for local pairwise alignments? 
In-Reply-To: <20120903093121.GA4129@paxarchia.galaxy.uni> References: <20120903093121.GA4129@paxarchia.galaxy.uni> Message-ID: On Mon, Sep 3, 2012 at 10:31 AM, Jan T Kim wrote: > Dear All, > > after reading a pairwise alignment computed using the EMBOSS water > program, is it possible to find out the indices of the sequences in > the local alignment within the input sequences? > > ... > > I can't seem to find that information anywhere either in the resulting > Bio.Align.MultipleSeqAlignment object, or in the SeqRecord objects > that it contains. > > So, am I looking at the wrong place? No, these numbers are not currently being parsed. This applies to some of the other file formats in AlignIO too, because we (still) don't have an agreed way to store this in our object model. Last time I used this parser, I was probably using needle rather than water, where these are global alignments so you don't need the start/end values. Peter From jocelyne at gmail.com Thu Sep 6 20:31:06 2012 From: jocelyne at gmail.com (Jocelyne) Date: Thu, 6 Sep 2012 13:31:06 -0700 Subject: [Biopython] bug with pairwise2 local alignments? In-Reply-To: References: Message-ID: Hello: First, I'd like to say that I really appreciate the effort of the community to provide us with such a nice package. I found some odd scoring behavior with the pairwise2 local alignment (see 5 below). I think these 2 alignments should have the same score.
First required details: 1) Which operating system and hardware (32 bit or 64 bit) you are using Linux jocelyne-VirtualBox 3.2.0-29-generic #46-Ubuntu SMP Fri Jul 27 17:03:23 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux (basically, latest Ubuntu 64 bit through virtual box) 2) Python version Python 2.7.3 (default, Aug 1 2012, 05:14:39) [GCC 4.6.3] on linux2 3) Biopython version (or git version/date) (installed through repository) python-biopython/precise uptodate 1.58-1 4) Traceback that occurs (the full error message) None 5) A data file that causes the problem None 6) Example code that breaks I feel the two local alignments should both score 4. I think it has to do with how the top row and left columns are filled in the score matrix. ================================================================================ >>> for a in pairwise2.align.localms("ACTGAGT", "TGC", 2, -1, -100, -100, force_generic = True): ... print a ... print pairwise2.format_alignment(*a)... ('ACTGAGT', '--TGC--', 4, 2, 4) ACTGAGT || --TGC-- Score=4 >>> for a in pairwise2.align.localms("ACTGAGT", "CGA", 2, -1, -100, -100, force_generic = True): ... print a ... print pairwise2.format_alignment(*a)... 
('ACTGAGT', '--CGA--', 3, 3, 5) ACTGAGT || --CGA-- Score=3
================================================================================

I printed the score matrices:

================================================================================
>>> score_matrix, trace_matrix = pairwise2._make_score_matrix_generic("ACTGAGT", "TGC", pairwise2.identity_match(2, -1), pairwise2.affine_penalty(-100, -100), pairwise2.affine_penalty(-100, -100), False, False, False, False)
>>> pairwise2.print_matrix(score_matrix)
-1 -1 -1
-1  0  1
 2  0  0
-1  4  0
-1  0  3
-1  1  0
 2  0  0
>>> score_matrix, trace_matrix = pairwise2._make_score_matrix_generic("ACTGAGT", "CGA", pairwise2.identity_match(2, -1), pairwise2.affine_penalty(-100, -100), pairwise2.affine_penalty(-100, -100), False, False, False, False)
>>> pairwise2.print_matrix(score_matrix)
-1 -1  2
 2  0  0
-1  1  0
-1  1  0
-1  0  3
-1  1  0
-1  0  0

Let me know if there is a quick fix I can do on my side. Thanks! Jocelyne

From p.j.a.cock at googlemail.com Fri Sep 7 03:59:57 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 7 Sep 2012 04:59:57 +0100 Subject: [Biopython] bug with pairwise2 local alignments? In-Reply-To: References: Message-ID:

On Thu, Sep 6, 2012 at 9:31 PM, Jocelyne wrote: > Hello: > First, I'd like to say that I really appreciate the effort of the community > to provide us with such a nice package. > I found some odd scoring behavior with the pairwise2 local alignment (see 5 > below). I think these 2 alignments should have the same score.

Hmm. I'm not overly familiar with this bit of the code, but it did occur to me it might be something related to this open issue: https://redmine.open-bio.org/issues/2776

I was able to repeat your pairwise2.align.localms example and the score matrix example on a Mac using the latest code from github, and got the same answers. So (as I suspected) this does not seem to be a platform specific issue.
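For reference, the expected score of 4 for both pairs can be checked with a small textbook Smith-Waterman scorer. This is only a sketch and is not the pairwise2 implementation: the function name local_score is made up, and it assumes a single linear gap penalty per gapped position instead of pairwise2's separate open/extend penalties. The key property is that no cell, including those in the first row and column, is allowed to drop below zero.

```python
def local_score(seq_a, seq_b, match=2, mismatch=-1, gap=-100):
    """Return the best local alignment score (textbook Smith-Waterman).

    Hypothetical helper for illustration; uses a linear gap penalty.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # Row 0 and column 0 stay at zero: a local alignment may start anywhere.
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            diag = score[i - 1][j - 1] + pair
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            # Clamping at zero is what makes the alignment local.
            score[i][j] = max(0, diag, up, left)
            best = max(best, score[i][j])
    return best


print(local_score("ACTGAGT", "TGC"))  # 4, from the matching "TG"
print(local_score("ACTGAGT", "CGA"))  # 4, from the matching "GA"
```

With the clamp in place, the mismatch that precedes "GA" no longer carries a -1 into the matching run, which is the difference between the reported score of 3 and the expected 4.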
Unfortunately the original author of this code (Jeff Chang) isn't active with Biopython anymore - we can try emailing him directly, but if you're willing to look into this in more detail and can propose a fix, I'm happy to take a look at merging it. Peter From jocelyne at gmail.com Fri Sep 7 06:15:59 2012 From: jocelyne at gmail.com (Jocelyne) Date: Thu, 6 Sep 2012 23:15:59 -0700 Subject: [Biopython] bug with pairwise2 local alignments? In-Reply-To: References:

Message-ID: Hi Peter: I added 4 lines of code in each snippet below (there are copies of the same code). I'm pretty sure it should fix it (there are copies of line 438-439, with the indexes changed). Basically, the previous code allowed for negative scores in the first row and column of the matrix, even in the case of local alignments (in which case scores shouldn't go negative). I didn't test it, so please make sure it works before merging. Also, it seems that it imports _make_score_matrix_fasta from a C library (line 851), which overloads the corresponding python function, so that would have to be fixed too. Thanks! Jocelyne 378 # The top and left borders of the matrices are special cases 379 # because there are no previously aligned characters. To simplify 380 # the main loop, handle these separately. 381 for i in range(lenA): 382 # Align the first residue in sequenceB to the ith residue in 383 # sequence A. This is like opening up i gaps at the beginning 384 # of sequence B. 385 score = match_fn(sequenceA[i], sequenceB[0]) 386 if penalize_end_gaps: 387 score += gap_B_fn(0, i) 388 score_matrix[i][0] = score +++ if not align_globally and score_matrix[i][0] < 0: +++ score_matrix[i][0] = 0 389 for i in range(1, lenB): 390 score = match_fn(sequenceA[0], sequenceB[i]) 391 if penalize_end_gaps: 392 score += gap_A_fn(0, i) 393 score_matrix[0][i] = score +++ if not align_globally and score_matrix[0][i] < 0: +++ score_matrix[0][i] = 0 461 # The top and left borders of the matrices are special cases 462 # because there are no previously aligned characters. To simplify 463 # the main loop, handle these separately. 464 for i in range(lenA): 465 # Align the first residue in sequenceB to the ith residue in 466 # sequence A. This is like opening up i gaps at the beginning 467 # of sequence B.
468 score = match_fn(sequenceA[i], sequenceB[0]) 469 if penalize_end_gaps: 470 score += calc_affine_penalty( 471 i, open_B, extend_B, penalize_extend_when_opening) 472 score_matrix[i][0] = score +++ if not align_globally and score_matrix[i][0] < 0: +++ score_matrix[i][0] = 0 473 for i in range(1, lenB): 474 score = match_fn(sequenceA[0], sequenceB[i]) 475 if penalize_end_gaps: 476 score += calc_affine_penalty( 477 i, open_A, extend_A, penalize_extend_when_opening) 478 score_matrix[0][i] = score +++ if not align_globally and score_matrix[0][i] < 0: +++ score_matrix[0][i] = 0 On Thu, Sep 6, 2012 at 8:59 PM, Peter Cock wrote: > On Thu, Sep 6, 2012 at 9:31 PM, Jocelyne wrote: > > Hello: > > First, I'd like to say that I really appreciate the effort of the > community > > to provide us with such a nice package. > > I found some odd scoring behavior with the pairwise2 local alignment > (see 5 > > below). I think these 2 alignments should have the same score. > > Hmm. I'm not overly familiar with this bit of the code, but did > occur to me it might be something related to this open issue: > > https://redmine.open-bio.org/issues/2776 > > I was able to repeat your pairwise2.align.localms example > and the score matrix example a Mac using the latest code > from github, and got the same answers. So (as I suspected) > this does not seem to be a platform specific issue. > > Unfortunately the original author of this code (Jeff Chang) > isn't active with Biopython anymore - we can try emailing > him directly, but if you're willing to look into this in more > detail and can propose a fix, I'm happy to take a look at > merging it. > > Peter > From jttkim at googlemail.com Fri Sep 7 08:51:36 2012 From: jttkim at googlemail.com (Jan T Kim) Date: Fri, 7 Sep 2012 09:51:36 +0100 Subject: [Biopython] Start positions for local pairwise alignments? 
In-Reply-To: References: <20120903093121.GA4129@paxarchia.galaxy.uni> Message-ID: <20120907085134.GA4094@paxarchia.galaxy.uni> On Thu, Sep 06, 2012 at 01:01:46AM +0100, Peter Cock wrote: > On Mon, Sep 3, 2012 at 10:31 AM, Jan T Kim wrote: > > Dear All, > > > > after reading a pairwise alignment computed using the EMBOSS water > > program, is it possible to find out the indices of the sequences in > > the local alignment within the input sequences? > > > > ... > > > > I can't seem to find that information anywhere either in the resulting > > Bio.Align.MultipleSeqAlignment object, or in the SeqRecord objects > > that it contains. > > > > So, am I looking at the wrong place? > > No, these number are not currently being parsed. This applies to > some of the other file formats in AlignIO too, because we (still) > don't have an agreed way to store this in our object model. Ok, thanks for clarifying. I think I understand, I wasn't sure whether to expect that information in the Seq, the SeqRecord or the MultipleAlignment objects. For what it's worth, it currently would seem most adequate to me if a (say) AlignedSeq subclass of Seq could provide a couple of optional additional instance variables, such as the start index of the aligned sequence within the input sequence. I'd envision this information to be optional in the sense that the instance variable would be None if the start position is not available, which would obviously be the case for some alignment formats (for most multiple alignments, in fact). > Last time I used this parser, I was probably using needle rather > than water, where these are global alignments so you don't need > the start/end values. Incidentally I initially used needle as well, but then got additional data which contained elevated levels of "junk", which required a switch to local alignments. 
In this case there was a region of interest with a subsequence that was unique, so I could figure out whether the region of interest was aligned or not, but that approach can be unreliable when repetitive regions are involved and / or definitions of the "region of interest" are subject to shifts. So I'd think having the start index where available would be useful in the long run. Best regards & have a nice weekend all, Jan -- +- Jan T. Kim -------------------------------------------------------+ | email: jttkim at gmail.com | | WWW: http://www.jtkim.dreamhosters.com/ | *-----=< hierarchical systems are for files, not for humans >=-----* From p.j.a.cock at googlemail.com Fri Sep 7 14:47:00 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 7 Sep 2012 15:47:00 +0100 Subject: [Biopython] Start positions for local pairwise alignments? In-Reply-To: <20120907085134.GA4094@paxarchia.galaxy.uni> References: <20120903093121.GA4129@paxarchia.galaxy.uni> <20120907085134.GA4094@paxarchia.galaxy.uni> Message-ID: On Fri, Sep 7, 2012 at 9:51 AM, Jan T Kim wrote: > > Ok, thanks for clarifying. I think I understand, I wasn't sure whether to > expect that information in the Seq, the SeqRecord or the MultipleAlignment > objects. > > For what it's worth, it currently would seem most adequate to me > if a (say) AlignedSeq subclass of Seq could provide a couple of > optional additional instance variables, such as the start index > of the aligned sequence within the input sequence. > > I'd envision this information to be optional in the sense that the > instance variable would be None if the start position is not > available, which would obviously be the case for some alignment > formats (for most multiple alignments, in fact). 
> That's exactly what we hope to have in the next release, see: http://lists.open-bio.org/pipermail/biopython-dev/2012-September/009930.html Regards, Peter From semenko at alum.mit.edu Tue Sep 11 01:18:25 2012 From: semenko at alum.mit.edu (Nick Semenkovich) Date: Mon, 10 Sep 2012 20:18:25 -0500 Subject: [Biopython] Removing HotRand.py? Message-ID: I've submitted a pull request to deprecate Bio/HotRand.py Is anyone still using the HotRandom functions? It looks like the module is pretty old and there are some better alternatives: http://pypi.python.org/pypi/randomdotorg/ Pull Request: https://github.com/biopython/biopython/pull/69 Best, Nick Semenkovich -- Nick Semenkovich Laboratory of Dr. Jeffrey I. Gordon Medical Scientist Training Program School of Medicine Washington University in St. Louis 314.362.3963 (Lab) http://web.mit.edu/semenko/ From hernan.morales at gmail.com Tue Sep 11 13:38:12 2012 From: hernan.morales at gmail.com (=?UTF-8?Q?Hern=C3=A1n_Morales_Durand?=) Date: Tue, 11 Sep 2012 15:38:12 +0200 Subject: [Biopython] when is a SeqRecord not a SeqRecord In-Reply-To: <87y5mhkxtc.fsf@fastmail.fm> References: <87y5mhkxtc.fsf@fastmail.fm> Message-ID: 2012/7/18 Brad Chapman > > Dilara; > > > I'm trying to understand what is why when I print filtered_rec I get a > > SeqRecord but if I try to access any particular attribute of a SeqRecord > > such as letter_annotations I sometimes get an attribute error -- > > AttributeError: 'NoneType' object has no attribute > > 'letter_annotations.' 
> > > def check_meanQ(record, q_threshold): > > seqlen=len(record) > > quality_scores=array(record.letter_annotations["phred_quality"]) > > if round(quality_scores.mean()) <= q_threshold: > > print "Discarded ", record.id, "because mean Q was", > > round(quality_scores.mean()) > > elif round(quality_scores.mean()) > q_threshold: > > return record > > This function returns different results based on the comparison of > mean quality scores to your threshold: > > - When it is below the threshold, it returns None (since you do not > define an explicit return value) > - When it is above the threshold, it returns a SeqRecord. > > And of course, you may implement a Null Object Pattern here, like a NullSeqRecord. Cheers, Hernán From p.j.a.cock at googlemail.com Tue Sep 11 16:03:21 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 11 Sep 2012 17:03:21 +0100 Subject: [Biopython] bug with pairwise2 local alignments? In-Reply-To: References:

Message-ID: Hi Jocelyne, The reason for the C code is speed. The pure Python code is a fall back for systems where you can't use this - for example PyPy or Jython. To test your (pure Python) fix you'd have to comment out the C library import line. Ideally we'd prefer a combined fix which also updates the C implementation to match. Are you on Windows? That does complicate this - whereas for the Mac or Linux (re)compiling Biopython from source should be quite easy. Peter On Fri, Sep 7, 2012 at 7:15 AM, Jocelyne wrote: > Hi Peter: > I added 4 lines of code in each snippet below (there are copies of the same > code). I'm pretty sure it should fix it (there are copies of line 438-439, > with the indexes changed). Basically, the previous code allowed for negative > scores in the first row and column of the matrix, even in the case of local > alignments (in which case scores shouldn't go negative). I didn't test it, > so please make sure it works before merging. > > Also, it seems that it imports _make_score_matrix_fasta from a C library > (line 851), which overload the corresponding python function, so that would > have to be fixed too. > > Thanks! > Jocelyne > > > > 378 # The top and left borders of the matrices are special cases > 379 # because there are no previously aligned characters. To simplify > 380 # the main loop, handle these separately. > 381 for i in range(lenA): > 382 # Align the first residue in sequenceB to the ith residue in > 383 # sequence A. This is like opening up i gaps at the beginning > 384 # of sequence B. 
> 385 score = match_fn(sequenceA[i], sequenceB[0]) > 386 if penalize_end_gaps: > 387 score += gap_B_fn(0, i) > 388 score_matrix[i][0] = score > +++ if not align_globally and score_matrix[0][i] < 0: > +++ score_matrix[i][0] = 0 > 389 for i in range(1, lenB): > 390 score = match_fn(sequenceA[0], sequenceB[i]) > 391 if penalize_end_gaps: > 392 score += gap_A_fn(0, i) > 393 score_matrix[0][i] = score > +++ if not align_globally and score_matrix[0][i] < 0: > +++ score_matrix[0][i] = 0 > > > 461 # The top and left borders of the matrices are special cases > 462 # because there are no previously aligned characters. To simplify > 463 # the main loop, handle these separately. > 464 for i in range(lenA): > 465 # Align the first residue in sequenceB to the ith residue in > 466 # sequence A. This is like opening up i gaps at the beginning > 467 # of sequence B. > 468 score = match_fn(sequenceA[i], sequenceB[0]) > 469 if penalize_end_gaps: > 470 score += calc_affine_penalty( > 471 i, open_B, extend_B, penalize_extend_when_opening) > 472 score_matrix[i][0] = score > +++ if not align_globally and score_matrix[i][0] < 0: > +++ score_matrix[i][0] = 0 > 473 for i in range(1, lenB): > 474 score = match_fn(sequenceA[0], sequenceB[i]) > 475 if penalize_end_gaps: > 476 score += calc_affine_penalty( > 477 i, open_A, extend_A, penalize_extend_when_opening) > 478 score_matrix[0][i] = score > +++ if not align_globally and score_matrix[0][i] < 0: > +++ score_matrix[0][i] = 0 > > > > On Thu, Sep 6, 2012 at 8:59 PM, Peter Cock > wrote: >> >> On Thu, Sep 6, 2012 at 9:31 PM, Jocelyne wrote: >> > Hello: >> > First, I'd like to say that I really appreciate the effort of the >> > community >> > to provide us with such a nice package. >> > I found some odd scoring behavior with the pairwise2 local alignment >> > (see 5 >> > below). I think these 2 alignments should have the same score. >> >> Hmm. 
I'm not overly familiar with this bit of the code, but did >> occur to me it might be something related to this open issue: >> >> https://redmine.open-bio.org/issues/2776 >> >> I was able to repeat your pairwise2.align.localms example >> and the score matrix example a Mac using the latest code >> from github, and got the same answers. So (as I suspected) >> this does not seem to be a platform specific issue. >> >> Unfortunately the original author of this code (Jeff Chang) >> isn't active with Biopython anymore - we can try emailing >> him directly, but if you're willing to look into this in more >> detail and can propose a fix, I'm happy to take a look at >> merging it. >> >> Peter > > From jocelyne at gmail.com Wed Sep 12 01:41:31 2012 From: jocelyne at gmail.com (Jocelyne) Date: Tue, 11 Sep 2012 18:41:31 -0700 Subject: [Biopython] bug with pairwise2 local alignments? In-Reply-To: References:

Message-ID: Hi Peter: I understand the C code would need to be fixed too, and it should be a fairly quick fix, but I unfortunately don't have much time on my hands at the moment. When I have more time, I'll see about fixing the bug in both python and C, recompiling and testing. I thought it would be good for the community to at least be aware of this bug. Jocelyne On Tue, Sep 11, 2012 at 9:03 AM, Peter Cock wrote: > Hi Jocelyne, > > The reason for the C code is speed. The pure Python code is a fall > back for systems where you can't use this - for example PyPy or Jython. > > To test your (pure Python) fix you'd have to comment out the C library > import line. Ideally we'd prefer a combined fix which also updates the > C implementation to match. Are you on Windows? That does complicate > this - whereas for the Mac or Linux (re)compiling Biopython from source > should be quite easy. > > Peter > > On Fri, Sep 7, 2012 at 7:15 AM, Jocelyne wrote: > > Hi Peter: > > I added 4 lines of code in each snippet below (there are copies of the > same > > code). I'm pretty sure it should fix it (there are copies of line > 438-439, > > with the indexes changed). Basically, the previous code allowed for > negative > > scores in the first row and column of the matrix, even in the case of > local > > alignments (in which case scores shouldn't go negative). I didn't test > it, > > so please make sure it works before merging. > > > > Also, it seems that it imports _make_score_matrix_fasta from a C library > > (line 851), which overload the corresponding python function, so that > would > > have to be fixed too. > > > > Thanks! > > Jocelyne > > > > > > > > 378 # The top and left borders of the matrices are special cases > > 379 # because there are no previously aligned characters. To > simplify > > 380 # the main loop, handle these separately. > > 381 for i in range(lenA): > > 382 # Align the first residue in sequenceB to the ith residue in > > 383 # sequence A. 
This is like opening up i gaps at the > beginning > > 384 # of sequence B. > > 385 score = match_fn(sequenceA[i], sequenceB[0]) > > 386 if penalize_end_gaps: > > 387 score += gap_B_fn(0, i) > > 388 score_matrix[i][0] = score > > +++ if not align_globally and score_matrix[0][i] < 0: > > +++ score_matrix[i][0] = 0 > > 389 for i in range(1, lenB): > > 390 score = match_fn(sequenceA[0], sequenceB[i]) > > 391 if penalize_end_gaps: > > 392 score += gap_A_fn(0, i) > > 393 score_matrix[0][i] = score > > +++ if not align_globally and score_matrix[0][i] < 0: > > +++ score_matrix[0][i] = 0 > > > > > > 461 # The top and left borders of the matrices are special cases > > 462 # because there are no previously aligned characters. To > simplify > > 463 # the main loop, handle these separately. > > 464 for i in range(lenA): > > 465 # Align the first residue in sequenceB to the ith residue in > > 466 # sequence A. This is like opening up i gaps at the > beginning > > 467 # of sequence B. > > 468 score = match_fn(sequenceA[i], sequenceB[0]) > > 469 if penalize_end_gaps: > > 470 score += calc_affine_penalty( > > 471 i, open_B, extend_B, penalize_extend_when_opening) > > 472 score_matrix[i][0] = score > > +++ if not align_globally and score_matrix[i][0] < 0: > > +++ score_matrix[i][0] = 0 > > 473 for i in range(1, lenB): > > 474 score = match_fn(sequenceA[0], sequenceB[i]) > > 475 if penalize_end_gaps: > > 476 score += calc_affine_penalty( > > 477 i, open_A, extend_A, penalize_extend_when_opening) > > 478 score_matrix[0][i] = score > > +++ if not align_globally and score_matrix[0][i] < 0: > > +++ score_matrix[0][i] = 0 > > > > > > > > On Thu, Sep 6, 2012 at 8:59 PM, Peter Cock > > wrote: > >> > >> On Thu, Sep 6, 2012 at 9:31 PM, Jocelyne wrote: > >> > Hello: > >> > First, I'd like to say that I really appreciate the effort of the > >> > community > >> > to provide us with such a nice package. 
> >> > I found some odd scoring behavior with the pairwise2 local alignment > >> > (see 5 > >> > below). I think these 2 alignments should have the same score. > >> > >> Hmm. I'm not overly familiar with this bit of the code, but did > >> occur to me it might be something related to this open issue: > >> > >> https://redmine.open-bio.org/issues/2776 > >> > >> I was able to repeat your pairwise2.align.localms example > >> and the score matrix example a Mac using the latest code > >> from github, and got the same answers. So (as I suspected) > >> this does not seem to be a platform specific issue. > >> > >> Unfortunately the original author of this code (Jeff Chang) > >> isn't active with Biopython anymore - we can try emailing > >> him directly, but if you're willing to look into this in more > >> detail and can propose a fix, I'm happy to take a look at > >> merging it. > >> > >> Peter > > > > > From mmokrejs at fold.natur.cuni.cz Thu Sep 13 15:20:13 2012 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Thu, 13 Sep 2012 17:20:13 +0200 Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? Message-ID: <5051F9AD.104@fold.natur.cuni.cz> Hi, I am using "blastall -p blastn ... -m 7" to yield about 100GB large XML files which are then parsed by from Bio.Blast import NCBIXML _blastn_fileh = open(blast_out_xml_filename) _blastn_iterator = NCBIXML.parse(_blastn_fileh) _record = _blastn_iterator.next() # fetch the very first BLAST result from generator In my case the blastn searches seem to take longer than takes the XML parsing. :( I do not have timing numbers here but wonder why is cElementTree used only in Uniprot biopython modules and not in SeqIO. What XML parsing library is my biopython-1.59 using? Isn't there any argument when setup.py is called to discern between elementtree, cElementTree which I think use expat ...? 
I am writing this a bit off the top of my head hoping Peter ;-) or somebody else will know right away where to look for a performance bottleneck and where to change code to use cElementTree which always seemed the fastest to me. Thank you for some initial advice. Martin P.S.: And yes, I would love to parse blastn plaintext output or some other more compact one, the XML is really an overkill. From w.arindrarto at gmail.com Thu Sep 13 15:40:41 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Thu, 13 Sep 2012 17:40:41 +0200 Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? In-Reply-To: <5051F9AD.104@fold.natur.cuni.cz> References: <5051F9AD.104@fold.natur.cuni.cz> Message-ID: Hi Martin, There is actually already a faster BLAST XML parser written using cElementTree in Biopython :) (although it's yet to be included in the main branch). It's part of Biopython's SearchIO module that I recently wrote (the name SearchIO might change in the future). And indeed, my early benchmarks have shown that it performs faster. This branch is available here: https://github.com/bow/biopython/tree/searchio. I've also written a draft tutorial on how to use it here: http://bow.web.id/biopython/Tutorial.html#htoc96. However, as it's not yet in the current branch, you need to do a little bit of command line work to set it up: 1. Set up a new virtualenv environment (so that it doesn't clash with your other Biopython installation) and activate it. 2. Clone the repository: `git clone https://github.com/bow/biopython.git`, checkout the 'searchio' branch 3. Run `python setup.py develop`. This will keep the installation in-sync with any future `git pull` you might perform on the branch. Hope this helps :), Bow On Thu, Sep 13, 2012 at 5:20 PM, Martin Mokrejs wrote: > Hi, > I am using "blastall -p blastn ...
-m 7" to yield about 100GB large XML files > which are then parsed by > > from Bio.Blast import NCBIXML > _blastn_fileh = open(blast_out_xml_filename) > _blastn_iterator = NCBIXML.parse(_blastn_fileh) > _record = _blastn_iterator.next() # fetch the very first BLAST result from generator > > In my case the blastn searches seem to take longer than takes the XML parsing. :( > I do not have timing numbers here but wonder why is cElementTree used only in Uniprot > biopython modules and not in SeqIO. What XML parsing library is my biopython-1.59 using? > Isn't there any argument when setup.py is called to discern between elementtree, cElementTree > which I think use expat ...? I am writing this a bit from top of my head hoping Peter ;-) > or somebody else will know right away where to look for a performance bottleneck > and where to change code to use cElementTree which always seemed the fastest to me. > Thank you for some initial advice. > Martin > P.S.: And yes, I would love to parse blastn plaintext output or some other more compact one, > the XML is really an overkill. > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From mjldehoon at yahoo.com Fri Sep 14 00:37:08 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 13 Sep 2012 17:37:08 -0700 (PDT) Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? In-Reply-To: <5051F9AD.104@fold.natur.cuni.cz> Message-ID: <1347583028.33691.YahooMailClassic@web164005.mail.gq1.yahoo.com> --- On Thu, 9/13/12, Martin Mokrejs wrote: > P.S.: And yes, I would love to parse blastn plaintext output > or some other more compact one, the XML is really an overkill. What exactly is the advantage of plain text parsing compared to XML? File size? Best, -Michiel. 
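For readers following the parser question in this thread, here is a minimal sketch of streaming a BLAST XML file with the standard library's `xml.etree.ElementTree.iterparse`. It assumes the element names of the BLAST XML output format (`Iteration`, `Hit`, `Hsp` and their children) and that every HSP carries the fields read below. It is not how `Bio.Blast.NCBIXML` or the SearchIO branch works internally; the function name `iter_hsps` and the choice of fields are purely illustrative.

```python
import xml.etree.ElementTree as ET


def iter_hsps(handle):
    """Yield one dict per HSP from a BLAST XML handle (hypothetical helper).

    The file is processed one <Iteration> (one query) at a time, so the
    whole document never has to be held in memory at once.
    """
    for _event, elem in ET.iterparse(handle, events=("end",)):
        # Only act once a complete <Iteration> subtree has been read.
        if elem.tag != "Iteration":
            continue
        query = elem.findtext("Iteration_query-def")
        for hit in elem.iter("Hit"):
            hit_id = hit.findtext("Hit_id")
            for hsp in hit.iter("Hsp"):
                yield {
                    "query": query,
                    "hit": hit_id,
                    "evalue": float(hsp.findtext("Hsp_evalue")),
                    "query_from": int(hsp.findtext("Hsp_query-from")),
                    "query_to": int(hsp.findtext("Hsp_query-to")),
                    "hit_from": int(hsp.findtext("Hsp_hit-from")),
                    "hit_to": int(hsp.findtext("Hsp_hit-to")),
                    "identity": int(hsp.findtext("Hsp_identity")),
                }
        # Discard the children of the finished <Iteration> so that memory
        # use stays roughly flat across a very large file.
        elem.clear()
```

Something like `for row in iter_hsps(open(blast_out_xml_filename, "rb")): ...` would then hand out coordinates, e-values and identities without building full record objects, which may be enough when the goal is only a CSV summary. Whether this is actually faster than the existing parser on a given data set would need to be measured.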
From cjfields at illinois.edu Fri Sep 14 01:32:19 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Fri, 14 Sep 2012 01:32:19 +0000 Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? In-Reply-To: <1347583028.33691.YahooMailClassic@web164005.mail.gq1.yahoo.com> References: <1347583028.33691.YahooMailClassic@web164005.mail.gq1.yahoo.com> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF33BAEDEB@CHIMBX5.ad.uillinois.edu> On Sep 13, 2012, at 7:37 PM, Michiel de Hoon wrote: > --- On Thu, 9/13/12, Martin Mokrejs wrote: >> P.S.: And yes, I would love to parse blastn plaintext output >> or some other more compact one, the XML is really an overkill. > > What exactly is the advantage of plain text parsing compared to XML? File size? > > Best, > -Michiel. There isn't any. In fact, NCBI has consistently stated that one should never rely on parsing BLAST text output, primarily b/c they reserve the right to make changes to the output at any given point, whereas XML output should remain stable. As someone who has taken care of legacy BLAST code for a number of years (BioPerl), I can state that is fairly close to the truth (the caveat being they have made changes that break some XML parsing, but they do try to fix them). BLAST XML has simply been much easier to deal with in terms of fixing issues than text. chris From mmokrejs at fold.natur.cuni.cz Fri Sep 14 08:12:10 2012 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Fri, 14 Sep 2012 10:12:10 +0200 Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? 
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF33BAEDEB@CHIMBX5.ad.uillinois.edu> References: <1347583028.33691.YahooMailClassic@web164005.mail.gq1.yahoo.com> <118F034CF4C3EF48A96F86CE585B94BF33BAEDEB@CHIMBX5.ad.uillinois.edu> Message-ID: <5052E6DA.7080703@fold.natur.cuni.cz> Hi all, as a long-term subscriber to this list and bioperl in the past as well I do know that the plaintext output is being changed silently and that it is a hassle to maintainers. On the other hand, the XML tags and syntax is way too verbose. That in turn means lots of disc&memory IO, long parsing times and of course file size. At least if the XML tags would be scrambled to be shorter strings. ;-) Umm, I also hit a bug in legacy blastn XML output, still no answer from NCBI: https://redmine.open-bio.org/issues/3354 A real case. An SFF file has 288MB in size. An extracted FASTA file with 180271 sequences takes 60MB. Low-complexity masking takes 6 minutes. Legacy blastn search using 59 queries through dataset that takes 17 minutes and yields XML with 3957MB in size. Parsing the XML file through biopython takes 56 minutes to convert the results into my own CSV file (some overhead could be my program, sure). Doing a full Smith-Waterman search using 8 queries takes just 126 minutes. The times are from filestamps so it is a wall-clock time. I will try to find some time in a week or so and do run profiling using runsnake (http://www.vrplumber.com/programming/runsnakerun/). And test the new parser from Wibowo and report back. ;-) With plaintext I actually meant more some tabular output format which would be enough for my purposes (match and query coordinates, scores, gaps, identities). Martin Fields, Christopher J wrote: > On Sep 13, 2012, at 7:37 PM, Michiel de Hoon > wrote: > >> --- On Thu, 9/13/12, Martin Mokrejs wrote: >>> P.S.: And yes, I would love to parse blastn plaintext output >>> or some other more compact one, the XML is really an overkill. 
>> >> What exactly is the advantage of plain text parsing compared to XML? File size? >> >> Best, >> -Michiel. > > There isn't any. In fact, NCBI has consistently stated that one should never rely on parsing BLAST text output, primarily b/c they reserve the right to make changes to the output at any given point, whereas XML output should remain stable. As someone who has taken care of legacy BLAST code for a number of years (BioPerl), I can state that is fairly close to the truth (the caveat being they have made changes that break some XML parsing, but they do try to fix them). BLAST XML has simply been much easier to deal with in terms of fixing issues than text. > > chris > > From p.j.a.cock at googlemail.com Fri Sep 14 08:31:31 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 14 Sep 2012 09:31:31 +0100 Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? In-Reply-To: <5052E6DA.7080703@fold.natur.cuni.cz> References: <1347583028.33691.YahooMailClassic@web164005.mail.gq1.yahoo.com> <118F034CF4C3EF48A96F86CE585B94BF33BAEDEB@CHIMBX5.ad.uillinois.edu> <5052E6DA.7080703@fold.natur.cuni.cz> Message-ID: On Fri, Sep 14, 2012 at 9:12 AM, Martin Mokrejs wrote: > Hi all, > as a long-term subscriber to this list and bioperl in the past as well I do know > that the plaintext output is being changed silently and that it is a hassle to > maintainers. On the other hand, the XML tags and syntax is way too verbose. > That in turn means lots of disc&memory IO, long parsing times and of course file size. > At least if the XML tags would be scrambled to be shorter strings. ;-) > Umm, I also hit a bug in legacy blastn XML output, still no answer from NCBI: > https://redmine.open-bio.org/issues/3354 Earlier this week the NCBI released BLAST 2.2.27+ which might fix this... > A real case. An SFF file has 288MB in size. An extracted FASTA file with 180271 > sequences takes 60MB. Low-complexity masking takes 6 minutes. 
Legacy blastn search > using 59 queries through dataset that takes 17 minutes and yields XML with 3957MB > in size. Parsing the XML file through biopython takes 56 minutes to convert the > results into my own CSV file (some overhead could be my program, sure). Doing > a full Smith-Waterman search using 8 queries takes just 126 minutes. The times > are from filestamps so it is a wall-clock time. I will try to find some time in > a week or so and do run profiling using runsnake > (http://www.vrplumber.com/programming/runsnakerun/). > And test the new parser from Wibowo and report back. ;-) Great :) > With plaintext I actually meant more some tabular output format which would > be enough for my purposes (match and query coordinates, scores, gaps, identities). > I find the BLAST+ tabular output very useful - you can control which columns you get if the default 12 are not enough - and trivial to parse. This is also supported in Bow's SearchIO branch. Peter From mjldehoon at yahoo.com Fri Sep 14 09:27:50 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 14 Sep 2012 02:27:50 -0700 (PDT) Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? In-Reply-To: <5052E6DA.7080703@fold.natur.cuni.cz> Message-ID: <1347614870.37430.YahooMailClassic@web164006.mail.gq1.yahoo.com> Hi Martin, --- On Fri, 9/14/12, Martin Mokrejs wrote: > Legacy blastn search using 59 queries through dataset > that takes 17 minutes and yields XML with 3957MB > in size. Parsing the XML file through biopython takes 56 > minutes to convert the results into my own CSV file How does this compare to parsing human-readable plain text output? Is it significantly faster than the XML parser? > With plaintext I actually meant more some tabular > output format which would be enough for my purposes > (match and query coordinates, scores, gaps, identities). 
Maintaining the tabular Blast output parser has not been a problem, and I expect that it will continue to be supported in Biopython. On the other hand, maintaining the human-readable plain text parser has been a recurring headache. If Biopython can parse tabular Blast output, then do you still need the human-readable plain text parser? Best, -Michiel. From mmokrejs at fold.natur.cuni.cz Fri Sep 14 09:47:48 2012 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Fri, 14 Sep 2012 11:47:48 +0200 Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? In-Reply-To: <1347614870.37430.YahooMailClassic@web164006.mail.gq1.yahoo.com> References: <1347614870.37430.YahooMailClassic@web164006.mail.gq1.yahoo.com> Message-ID: <5052FD44.3040302@fold.natur.cuni.cz> Hi Michiel, Michiel de Hoon wrote: > Hi Martin, > > --- On Fri, 9/14/12, Martin Mokrejs wrote: >> Legacy blastn search using 59 queries through dataset >> that takes 17 minutes and yields XML with 3957MB >> in size. Parsing the XML file through biopython takes 56 >> minutes to convert the results into my own CSV file > > How does this compare to parsing human-readable plain text output? Is > it significantly faster than the XML parser? I don't have numbers but say mdust program (compiled from C) parsed the FASTA file in 6 minutes so I would be happy with roughly same time needed for parsing a CSV file having at about 1/5 of the lines in the FASTA file. Biopython is using generators and I do that as well in my program so the main overhead in my program is string slicing, string to int/float/list conversion. > >> With plaintext I actually meant more some tabular >> output format which would be enough for my purposes >> (match and query coordinates, scores, gaps, identities). > > Maintaining the tabular Blast output parser has not been a problem, > and I expect that it will continue to be supported in Biopython. 
On > the other hand, maintaining the human-readable plain text parser has > been a recurring headache. If Biopython can parse tabular Blast > output, then do you still need the human-readable plain text parser? Sometimes I parsed the alignment to have in hand the number of matches and mismatches (the pipes, minuses, dots), but not at this very moment. Their distribution along the alignment is important and sometimes helpful. BTW, I hate that blastn is changing letter-casing of the sequence in its output. ;-) Martin From mmokrejs at fold.natur.cuni.cz Fri Sep 14 09:52:24 2012 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Fri, 14 Sep 2012 11:52:24 +0200 Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? In-Reply-To: References: <1347583028.33691.YahooMailClassic@web164005.mail.gq1.yahoo.com> <118F034CF4C3EF48A96F86CE585B94BF33BAEDEB@CHIMBX5.ad.uillinois.edu> <5052E6DA.7080703@fold.natur.cuni.cz> Message-ID: <5052FE58.8060000@fold.natur.cuni.cz> Hi Peter, Peter Cock wrote: > On Fri, Sep 14, 2012 at 9:12 AM, Martin Mokrejs > wrote: >> Hi all, >> as a long-term subscriber to this list and bioperl in the past as well I do know >> that the plaintext output is being changed silently and that it is a hassle to >> maintainers. On the other hand, the XML tags and syntax is way too verbose. >> That in turn means lots of disc&memory IO, long parsing times and of course file size. >> At least if the XML tags would be scrambled to be shorter strings. ;-) >> Umm, I also hit a bug in legacy blastn XML output, still no answer from NCBI: >> https://redmine.open-bio.org/issues/3354 > > Earlier this week the NCBI released BLAST 2.2.27+ which might > fix this... > ... > > I find the BLAST+ tabular output very useful - you can control which > columns you get if the default 12 are not enough - and trivial to parse. > This is also supported in Bow's SearchIO branch. Based on the 2.2.27 number you seem to talk about old/legacy blast ... 
but the plus means the new blast from NCBI? I don't like the new blast, it just gives different=bad results and I just don't have time to make up a good bug report with testcases. :(( Will see what Wibowo's code. Well, the XML result is same I think from both programs. Martin From p.j.a.cock at googlemail.com Fri Sep 14 10:00:33 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 14 Sep 2012 11:00:33 +0100 Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? In-Reply-To: <5052FE58.8060000@fold.natur.cuni.cz> References: <1347583028.33691.YahooMailClassic@web164005.mail.gq1.yahoo.com> <118F034CF4C3EF48A96F86CE585B94BF33BAEDEB@CHIMBX5.ad.uillinois.edu> <5052E6DA.7080703@fold.natur.cuni.cz> <5052FE58.8060000@fold.natur.cuni.cz> Message-ID: On Fri, Sep 14, 2012 at 10:52 AM, Martin Mokrejs wrote: > Hi Peter, > > Peter Cock wrote: >> On Fri, Sep 14, 2012 at 9:12 AM, Martin Mokrejs >> wrote: >>> Hi all, >>> as a long-term subscriber to this list and bioperl in the past as well I do know >>> that the plaintext output is being changed silently and that it is a hassle to >>> maintainers. On the other hand, the XML tags and syntax is way too verbose. >>> That in turn means lots of disc&memory IO, long parsing times and of course file size. >>> At least if the XML tags would be scrambled to be shorter strings. ;-) >>> Umm, I also hit a bug in legacy blastn XML output, still no answer from NCBI: >>> https://redmine.open-bio.org/issues/3354 >> >> Earlier this week the NCBI released BLAST 2.2.27+ which might >> fix this... >> > ... >> >> I find the BLAST+ tabular output very useful - you can control which >> columns you get if the default 12 are not enough - and trivial to parse. >> This is also supported in Bow's SearchIO branch. > > Based on the 2.2.27 number you seem to talk about old/legacy blast ... > but the plus means the new blast from NCBI? 
The NCBI call version "2.2.27" of the new C++ rewrite "BLAST v2.2.27+" (while personally I'd have called it BLAST+ v2.2.27 instead). The NCBI have now stopped updating legacy BLAST. > I don't like the new blast, it just gives different=bad results and I > just don't have time to make up a good bug report with testcases. :(( You are not alone in having problems/regressions with BLAST+ compared to legacy BLAST. I can think of several people still using 'blastall' for this reason. > Will see what Wibowo's code. Well, the XML result is same I > think from both programs. I think it is practically the same. Peter From mjldehoon at yahoo.com Sat Sep 15 02:43:12 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 14 Sep 2012 19:43:12 -0700 (PDT) Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF33BAEDEB@CHIMBX5.ad.uillinois.edu> Message-ID: <1347676992.27319.YahooMailClassic@web164005.mail.gq1.yahoo.com> Last weekend I also talked with Peter during his visit to Tokyo about the Blast (human-readable) plain-text parser. We could see three scenarios in which the plain-text parser has an advantage over the XML parser (Peter please correct me if I am missing something from our discussion): 1) The file size of Blast plain-text output may be smaller than that of Blast XML output; 2) Users may want to look at the Blast output by eye in addition to parsing it with Biopython; 3) Users may have stacks of old Blast output files in plain-text format that they still want to use. 
Each of these points can be addressed without a Blast plain-text parser: 1) After zipping, we expect little difference in file size between plain-text output and XML output; 2) If we add a function to Biopython that generates Blast plain-text output (or something close to it) from Blast XML output, then a user can generate the Blast output in XML format, parse it with Biopython, optionally filter it, and then generate the corresponding plain-text output; 3) If this is really an issue, then we could create some standalone scripts (available from the Biopython website) that parse plain-text Blast output and generate the corresponding XML output. These scripts will be much easier than the current plain-text parser in Biopython, because we can create such a script for each version of Blast separately (of course this is only done if the need actually arises). The XML output can then be parsed by Biopython. Are there any other cases in which the plain-text parser is needed? Or where our proposed solutions to the three points above are not sufficient? If not, then I suggest we implement the plain-text generator in (2), and upgrade the PendingDeprecationWarning in Bio.Blast.NCBIStandalone to a BiopythonDeprecationWarning. Best, -Michiel --- On Thu, 9/13/12, Fields, Christopher J wrote: > From: Fields, Christopher J > Subject: Re: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? > To: "Michiel de Hoon" > Cc: "BioPython Mailing List" , "Martin Mokrejs" > Date: Thursday, September 13, 2012, 9:32 PM > On Sep 13, 2012, at 7:37 PM, Michiel > de Hoon > wrote: > > > --- On Thu, 9/13/12, Martin Mokrejs > wrote: > >> P.S.: And yes, I would love to parse blastn > plaintext output > >> or some other more compact one, the XML is really > an overkill. > > > > What exactly is the advantage of plain text parsing > compared to XML? File size? > > > > Best, > > -Michiel. > > There isn't any. 
In fact, NCBI has consistently stated > that one should never rely on parsing BLAST text output, > primarily b/c they reserve the right to make changes to the > output at any given point, whereas XML output should remain > stable. As someone who has taken care of legacy BLAST > code for a number of years (BioPerl), I can state that is > fairly close to the truth (the caveat being they have made > changes that break some XML parsing, but they do try to fix > them). BLAST XML has simply been much easier to deal > with in terms of fixing issues than text. > > chris > > From p.j.a.cock at googlemail.com Sat Sep 15 10:37:50 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 15 Sep 2012 11:37:50 +0100 Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? In-Reply-To: <1347676992.27319.YahooMailClassic@web164005.mail.gq1.yahoo.com> References: <118F034CF4C3EF48A96F86CE585B94BF33BAEDEB@CHIMBX5.ad.uillinois.edu> <1347676992.27319.YahooMailClassic@web164005.mail.gq1.yahoo.com> Message-ID: On Sat, Sep 15, 2012 at 3:43 AM, Michiel de Hoon wrote: > Last weekend I also talked with Peter during his visit to Tokyo about the > Blast (human-readable) plain-text parser. We could see three scenarios in > which the plain-text parser has an advantage over the XML parser (Peter > please correct me if I am missing something from our discussion): > > 1) The file size of Blast plain-text output may be smaller than that of > Blast XML output; > 2) Users may want to look at the Blast output by eye in addition to > parsing it with Biopython; > 3) Users may have stacks of old Blast output files in plain-text format > that they still want to use. Maybe also (3a) The user may want plain-text BLAST output to input into another tool as well as Biopython? 
> > Each of these points can be addressed without a Blast plain-text parser: > 1) After zipping, we expect little difference in file size between > plain-text output and XML output; However there would be a speed penalty - compression, then decompression, and perhaps in XML versus text parsing. > 2) If we add a function to Biopython that generates Blast plain-text > output (or something close to it) from Blast XML output, then a user can > generate the Blast output in XML format, parse it with Biopython, optionally > filter it, and then generate the corresponding plain-text output; The new 'SearchIO' results objects str/repr should be familiar to anyone who has looked at the plain text BLAST output - but not identical. We could apply some of these improvements to the current BLAST parsers, but I favour aiming to simply deprecate them in favour of 'SearchIO' (namespace to be decided). However, we certainly could try and offer a plain-text BLAST output format from 'SearchIO', although IIRC Bow has not tried that yet. It shouldn't be too complicated - unless you aim for 100% agreement with the latest BLAST output (moving target). > 3) If this is really an issue, then we could create some standalone > scripts (available from the Biopython website) that parses plain-text Blast > output and generates the corresponding XML output. These scripts will be > much easier than the current plain-text parser in Biopython, because we can > create such a script for each version of Blast separately (of course this is > only done if the need actually arises). The XML output can then be parsed by > Biopython. I was not convinced that this would actually save any effort over continuing to tweak the current (complex but flexible) plain text parser. > Are there any other cases in which the plain-text parser is needed? > Or where our proposed solutions to the three points above are not > sufficient? 
Benchmarking of parsing (a) plain text, (b) XML, (c) gzipped XML, and (d) column-rich tabular output might be worthwhile. There may be a case for parsing plain-text on the basis of speed. > If not, then I suggest we implement the plain-text generator in (2), > I certainly think adding plain-text output to 'SearchIO' would be useful. > and upgrade the PendingDeprecationWarning in > Bio.Blast.NCBIStandalone to a BiopythonDeprecationWarning. Another idea we touched on was deprecating the current old, complex but flexible plain text parser while adding a new simpler plain text parser as part of 'SearchIO'. Here we could target only the recent BLAST+ output (and perhaps, if not so different, the final 'legacy' BLAST release), and not worry about all the variants the NCBI have produced over the years. I would hope this would also be faster [especially as currently 'SearchIO' supports parsing plain text BLAST on top of the existing old parser]. This boils down to a key question: How many people still want to use the plain-text output and why? I believe that for most use cases the tabular or XML output is better (covering simple needs, and full parsing of every detail respectively). For example, it sounds like the tabular output would be a perfect match for Martin's case. [Although, as I noted above, parsing the XML, especially if compressed, may not be as fast as parsing plain text?] While writing this email I was trying to recall when I last used the plain text output - and the only situation I could think of in the last year or so was in order to have something human readable to show a collaborator. Here XML to plain text BLAST would have been fine. Peter From p.j.a.cock at googlemail.com Sat Sep 15 10:49:59 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 15 Sep 2012 11:49:59 +0100 Subject: [Biopython] Finally deprecating the plain text BLAST parser? 
Message-ID: Hello all, I've retitled this from Martin's thread initially about the BLAST XML parser: http://lists.open-bio.org/pipermail/biopython/2012-September/008154.html ... http://lists.open-bio.org/pipermail/biopython/2012-September/008164.html http://lists.open-bio.org/pipermail/biopython/2012-September/008165.html The topic shifted and an important question raised was: Should we finally deprecate the 'obsolete' plain text BLAST parser? So - is anyone on the list still using this file format, and why? [ Speak now or forever hold your peace ;) ] Thanks, Peter On Sat, Sep 15, 2012 at 11:37 AM, Peter Cock wrote: > On Sat, Sep 15, 2012 at 3:43 AM, Michiel de Hoon wrote: >> Last weekend I also talked with Peter during his visit to Tokyo about the >> Blast (human-readable) plain-text parser. We could see three scenarios in >> which the plain-text parser has an advantage over the XML parser (Peter >> please correct me if I am missing something from our discussion): >> >> 1) The file size of Blast plain-text output may be smaller than that of >> Blast XML output; >> 2) Users may want to look at the Blast output by eye in addition to >> parsing it with Biopython; >> 3) Users may have stacks of old Blast output files in plain-text format >> that they still want to use. > > Maybe also (3a) The user may want plain-text BLAST output to > input into another tool as well as Biopython? > >> >> Each of these points can be addressed without a Blast plain-text parser: >> 1) After zipping, we expect little difference in file size between >> plain-text output and XML output; > > However there would be a speed penalty - compression, then > decompression, and perhaps in XML versus text parsing. 
> >> 2) If we add a function to Biopython that generates Blast plain-text >> output (or something close to it) from Blast XML output, then a user can >> generate the Blast output in XML format, parse it with Biopython, optionally >> filter it, and then generate the corresponding plain-text output; > > The new 'SearchIO' results objects str/repr should be familiar to > anyone who has looked at the plain text BLAST output - but > not identical. We could apply some of these improvements > to the current BLAST parsers, but I favour aiming to simply > deprecate them in favour of 'SearchIO' (namespace to be > decided). > > However, we certainly could try and offer a plain-text BLAST > output format from 'SearchIO', although IIRC Bow has not tried > that yet. It shouldn't be too complicated - unless you aim for > 100% agreement with the latest BLAST output (moving target). > >> 3) If this is really an issue, then we could create some standalone >> scripts (available from the Biopython website) that parses plain-text Blast >> output and generates the corresponding XML output. These scripts will be >> much easier than the current plain-text parser in Biopython, because we can >> create such a script for each version of Blast separately (of course this is >> only done if the need actually arises). The XML output can then be parsed by >> Biopython. > > I was not convinced that this would actually save any effort over > continuing to tweak the current (complex but flexible) plain text > parser. > >> Are there any other cases in which the plain-text parser is needed? >> Or where our proposed solutions to the three points above are not >> sufficient? > > Benchmarking of parsing (a) plain text, (b) XML, (c) gzipped XML, > and (d) column rich tabular output might be worthwhile. There may > be a case for parsing plain-text on the basis of speed. 
> >> If not, then I suggest we implement the plain-text generator in (2), >> > > I certainly this adding plain-text output to 'SearchIO' would be > useful. > >> and upgrade the PendingDeprecationWarning in >> Bio.Blast.NCBIStandalone to a BiopythonDeprecationWarning. > > Another idea we touched on was deprecating the current old, > complex but flexible plain text parser while adding a new simpler > plain text parser as part of 'SearchIO'. Here we could target only > the recent BLAST+ output (and perhaps if not so different the > final 'legacy' BLAST release), and not worry about all the variants > the NCBI have produced over the years. I would hope this would > also be faster [especially as currently 'SearchIO' supports parsing > plain text BLAST on top of the existing old parser]. > > This boils down to a key question: How many people still want > to use the plain-text output and why? I believe that for most > use cases the tabular or XML output is better (covering simple > needs, and full parsing of every detail respectively). > > e.g. It sounds like for Martin's example, the tabular output would > be a perfect match. > > [Although, as I noted above, parsing the XML, especially if > compressed, may not be as fast as parsing plain text?] > > While writing this email I was trying to recall when I last used > the plain text output - and the only situation I could think of > in the last year or so was in order to have something human > readable to show a collaborator. Here XML to plain text BLAST > would have been fine. > > Peter From w.arindrarto at gmail.com Sat Sep 15 13:22:48 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Sat, 15 Sep 2012 15:22:48 +0200 Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? 
In-Reply-To: References: <118F034CF4C3EF48A96F86CE585B94BF33BAEDEB@CHIMBX5.ad.uillinois.edu> <1347676992.27319.YahooMailClassic@web164005.mail.gq1.yahoo.com> Message-ID: Hi guys, > > 2) If we add a function to Biopython that generates Blast plain-text > > output (or something close to it) from Blast XML output, then a user can > > generate the Blast output in XML format, parse it with Biopython, > > optionally > > filter it, and then generate the corresponding plain-text output; > > The new 'SearchIO' results objects str/repr should be familiar to > anyone who has looked at the plain text BLAST output - but > not identical. We could apply some of these improvements > to the current BLAST parsers, but I favour aiming to simply > deprecate them in favour of 'SearchIO' (namespace to be > decided). > > However, we certainly could try and offer a plain-text BLAST > output format from 'SearchIO', although IIRC Bow has not tried > that yet. It shouldn't be too complicated - unless you aim for > 100% agreement with the latest BLAST output (moving target). Yes, this has not been attempted, mostly because I feel that the BLAST plain text is indeed a moving target. But, if we are in favor of choosing one format from one BLAST version and always sticking to it, it sounds more reasonable. There is one missing detail that is only present in the plain text format, though: the hit-level e-values. If we do decide to write a plain text writer, we either have to demand the user supply these values, or we omit the entire hit-level e-value table, or we fill it with something else. > Another idea we touched on was deprecating the current old, > complex but flexible plain text parser while adding a new simpler > plain text parser as part of 'SearchIO'. Here we could target only > the recent BLAST+ output (and perhaps if not so different the > final 'legacy' BLAST release), and not worry about all the variants > the NCBI have produced over the years. 
> I would hope this would
> also be faster [especially as currently 'SearchIO' supports parsing
> plain text BLAST on top of the existing old parser].

This wasn't attempted either, mostly because I feel that a lot of people still use legacy BLAST (we've had more legacy-BLAST related emails than BLAST+ ones in the past few months, I think). Also, the current parser wins on flexibility. I think the test cases include BLAST versions from 2002 (10 years ago!) up to BLAST 2.2.25+. So, as Peter mentioned, the current SearchIO BLAST plain text parser is actually a simple wrapper over Bio.Blast.NCBIStandalone.

We might be able to create a newer, speedier parser, but making it as flexible as our current one seems difficult.

regards,
Bow

From mjldehoon at yahoo.com Sun Sep 16 13:54:37 2012
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sun, 16 Sep 2012 06:54:37 -0700 (PDT)
Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
In-Reply-To:
Message-ID: <1347803677.78564.YahooMailClassic@web164006.mail.gq1.yahoo.com>

Hi Bow,

Is there some documentation somewhere for the SearchIO module? I have a hard time understanding what it does and how it relates to Blast.

Thanks,
-Michiel.

From w.arindrarto at gmail.com Sun Sep 16 14:21:52 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Sun, 16 Sep 2012 16:21:52 +0200
Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
In-Reply-To: <1347803677.78564.YahooMailClassic@web164006.mail.gq1.yahoo.com>
References: <1347803677.78564.YahooMailClassic@web164006.mail.gq1.yahoo.com>
Message-ID:

Hi Michiel,

We have a draft tutorial that I'm temporarily hosting here: http://bow.web.id/biopython/Tutorial.html#htoc96. The internal functions have also been documented with docstrings and quick examples (e.g. https://github.com/bow/biopython/blob/searchio/Bio/SearchIO/__init__.py).

At the moment, the SearchIO API is very similar to SeqIO and AlignIO, though in the future this is still subject to change.

Hope this helps :), otherwise let me know which part is specifically unclear for you.

regards,
Bow

From mjldehoon at yahoo.com Sun Sep 16 16:24:36 2012
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sun, 16 Sep 2012 09:24:36 -0700 (PDT)
Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
In-Reply-To:
Message-ID: <1347812676.94107.YahooMailClassic@web164003.mail.gq1.yahoo.com>

Hi Bow,

Thanks for the links! This is actually the first time I looked at the SearchIO module in detail.

I noticed that there is a large overlap between the functionality in Bio.Blast and the SearchIO module. We should definitely avoid having two sets of Blast parsers; as the recent discussion shows, one set of Blast parsers is hard enough already.

So I would strongly suggest to integrate the SearchIO module with Bio.Blast.
Here "integrate" could mean as little as using the Bio.Blast name space, and making sure we don't lose any functionality. (Or we could pick a better name than Bio.Blast, since SearchIO also includes blat, exonerate, etc.; but since Blast is the most important one perhaps using Bio.Blast for all of them is OK). The final outcome would then be that the parsers currently in SearchIO will replace the parsers currently in Bio.Blast.

Also I noticed that SearchIO (like Bio.Blast) uses attributes to store information. I would much rather see a dictionary-like interface. This has the advantage that we can keep the key name much closer to what is in the original file (for example, no need to replace '-' by '_'), and also users can call .keys() to find out what is stored in the object.

Best,
-Michiel.

From w.arindrarto at gmail.com Sun Sep 16 17:44:23 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Sun, 16 Sep 2012 19:44:23 +0200
Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
In-Reply-To: <1347812676.94107.YahooMailClassic@web164003.mail.gq1.yahoo.com>
References: <1347812676.94107.YahooMailClassic@web164003.mail.gq1.yahoo.com>
Message-ID:

Hi Michiel,

> Thanks for the links!
> This is actually the first time I looked at the SearchIO module in detail.

You're welcome :).

> I noticed that there is a large overlap between the functionality in Bio.Blast and the SearchIO module. We should definitely avoid having two sets of Blast parsers; as the recent discussion shows, one set of Blast parsers is hard enough already.
>
> So I would strongly suggest to integrate the SearchIO module with Bio.Blast. Here "integrate" could mean as little as using the Bio.Blast name space, and making sure we don't lose any functionality. (Or we could pick a better name than Bio.Blast, since SearchIO also includes blat, exonerate, etc.; but since Blast is the most important one perhaps using Bio.Blast for all of them is OK). The final outcome would then be that the parsers currently in SearchIO will replace the parsers currently in Bio.Blast.

The plan that Peter and I discussed was indeed to eventually deprecate Bio.Blast in favor of SearchIO. I prefer not to use Bio.Blast precisely for the reason you mentioned. I think we last discussed that we may use Bio.Seq.Search as the name (or bio.seq.search, after we settle on the namespace).

Also, the bio.seq.search (or whatever we will call it) module will have wrappers for sequence search command line and web tools. Of course, this won't be for BLAST only. In another branch, I've written a draft HMMER wrapper and a partial BLAT wrapper. For the web tool, the HMMER devs also have a web service for which we could create a wrapper.

> Also I noticed that SearchIO (like Bio.Blast) uses attributes to store information. I would much rather see a dictionary-like interface. This has the advantage that we can keep the key name much closer to what is in the original file (for example, no need to replace '-' by '_'), and also users can call .keys() to find out what is stored in the object.
>
> Best,
> -Michiel.

If you are talking about using the bracket notation to retrieve object attributes, that could be difficult for users. Most of the current SearchIO objects are themselves containers of other objects (the object model is nested). I could try implementing some hacks so that the attributes are stored in a dictionary, but I think this would confuse users when they use the bracket notation (am I retrieving an attribute or a nested SearchIO object?).

Maybe what you have in mind is a single dictionary stored as an object attribute as the interface? (For example, we could have object.attribs as the dictionary and use object.attribs['e-value'].) We do gain '-' instead of '_' and `.keys()` using this, but at the cost of brevity, so I have mixed feelings about this.

If users want to find out what the attributes are, they can use object.__dict__.keys(). I could try to create a common property (e.g. object.attrib_names) that returns a list of all available attribute names for a given object. But for now, this seems a little excessive to me (it could be done if more people want it, though).

Thanks for taking a look, by the way. Always appreciate a new set of fresh perspectives :).

regards,
Bow

From p.j.a.cock at googlemail.com Sun Sep 16 19:17:12 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 16 Sep 2012 20:17:12 +0100
Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
In-Reply-To: <1347812676.94107.YahooMailClassic@web164003.mail.gq1.yahoo.com>
References: <1347812676.94107.YahooMailClassic@web164003.mail.gq1.yahoo.com>
Message-ID:

On Sun, Sep 16, 2012 at 5:24 PM, Michiel de Hoon wrote:
>
> Also I noticed that SearchIO (like Bio.Blast) uses attributes
> to store information. I would much rather see a dictionary-like
> interface. This has the advantage that we can keep the key
> name much closer to what is in the original file (for example,
> no need to replace '-' by '_'), and also users can call .keys()
> to find out what is stored in the object.

I don't see a dictionary as being inherently easier to use. You can also use dir(obj) to see the attributes, which are more flexible as you can implement them as properties and have code behind them if needed.

Another key point is we can add docstrings to attributes/properties to give help text - and you can't do that with a dictionary key.

Also different file formats use different terms for what is really the same idea - I envisioned SearchIO as a unified parser, which means imposing a common naming convention for these key fields.

I also think that certain core bits of information common to BLAST, HMMER, etc should be exposed at the property level (including query match names and co-ordinates). Here we're going to standardise start/end values to integers using Python counting, consistent strand notation etc.

As in the SeqRecord and SeqFeature, a dictionary makes perfect sense for general 'free form' information. And this approach is used here too.

Regards,

Peter

From semenko at alum.mit.edu Mon Sep 17 17:01:00 2012
From: semenko at alum.mit.edu (Nick Semenkovich)
Date: Mon, 17 Sep 2012 12:01:00 -0500
Subject: [Biopython] Error parsing EMBL file
Message-ID:

I'm trying to extract the peptide sequences from a large collection of EMBL-formatted files (all phage & virus data from EBI).

EBI provides these as large, concatenated EMBL files, so I've been using SeqIO.parse to read & then write the 'translation' key from seq_feature.qualifiers.
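
A minimal sketch of the extraction described above, not the poster's actual script. It assumes each CDS feature carries its peptide under the 'translation' qualifier and that qualifier values are lists of strings, as Biopython stores them. The function names (iter_translations, write_faa, embl_to_faa) and the fallback identifier scheme are hypothetical, chosen only for illustration.

```python
def iter_translations(records):
    """Yield (name, peptide) for every CDS feature that has a translation."""
    for record in records:
        for index, feature in enumerate(record.features):
            if feature.type != "CDS":
                continue
            translations = feature.qualifiers.get("translation")
            if not translations:
                continue  # e.g. a CDS without a /translation qualifier
            protein_ids = feature.qualifiers.get("protein_id")
            # Assumption: fall back to a made-up name when no protein_id exists.
            if protein_ids:
                name = protein_ids[0]
            else:
                name = "%s_cds%d" % (record.id, index)
            yield name, translations[0]


def write_faa(pairs, handle, width=60):
    """Write (name, peptide) pairs as FASTA, wrapping sequence lines."""
    for name, peptide in pairs:
        handle.write(">%s\n" % name)
        for start in range(0, len(peptide), width):
            handle.write(peptide[start:start + width] + "\n")


def embl_to_faa(embl_path, faa_path):
    """Convert a (possibly concatenated) EMBL file to a peptide FASTA file."""
    # Imported here so the two helpers above can be used without Biopython.
    from Bio import SeqIO

    with open(embl_path) as source, open(faa_path, "w") as target:
        write_faa(iter_translations(SeqIO.parse(source, "embl")), target)
```

Because SeqIO.parse yields one record at a time, a large concatenated file is processed as a stream and never held in memory all at once.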

Unfortunately, it looks like the parser dies on one input file:

http://www.ebi.ac.uk/ena/data/view/BK000583&display=txt&expanded=true

Traceback (most recent call last):
  File "gbk_to_faa.py", line 7, in <module>
    for seq_record in SeqIO.parse(input_handle, "embl") :
  File "/usr/lib/pymodules/python2.7/Bio/SeqIO/__init__.py", line 541, in parse
    for r in i:
  File "/usr/lib/pymodules/python2.7/Bio/GenBank/Scanner.py", line 440, in parse_records
    record = self.parse(handle, do_features)
  File "/usr/lib/pymodules/python2.7/Bio/GenBank/Scanner.py", line 423, in parse
    if self.feed(handle, consumer, do_features):
  File "/usr/lib/pymodules/python2.7/Bio/GenBank/Scanner.py", line 391, in feed
    self._feed_header_lines(consumer, self.parse_header())
  File "/usr/lib/pymodules/python2.7/Bio/GenBank/Scanner.py", line 692, in _feed_header_lines
    consumer.reference_bases("(bases %s)" % "; ".join(parts))
  File "/usr/lib/pymodules/python2.7/Bio/GenBank/__init__.py", line 740, in reference_bases
    locations = self._split_reference_locations(ref_base_info)
  File "/usr/lib/pymodules/python2.7/Bio/GenBank/__init__.py", line 777, in _split_reference_locations
    start, end = base_info.split('to')
ValueError: need more than 1 value to unpack

* I might dig into this a bit more to patch, but does anyone more familiar with EMBL files know what's going on?

* Also, is there a more straightforward (or even non-Biopython) way to go from EMBL->FAA?

Best,
Nick

--
Nick Semenkovich
Laboratory of Dr. Jeffrey I. Gordon
Medical Scientist Training Program
School of Medicine
Washington University in St. Louis
314.362.3963 (Lab)
http://web.mit.edu/semenko/

From semenko at alum.mit.edu Mon Sep 17 17:22:26 2012
From: semenko at alum.mit.edu (Nick Semenkovich)
Date: Mon, 17 Sep 2012 12:22:26 -0500
Subject: [Biopython] Error parsing EMBL file
In-Reply-To:
References:
Message-ID:

Looks like it's dying at a line-wrapped location string:

RN   [16]
RP   1-5181,6229-11775,13275-15420,18210-23250,29410-32271,34850-38580,
RP   41454-41724
RX   DOI; 10.1128/JB.185.4.1475-1477.2003.
RX   PUBMED; 12562822.
RA   Pedulla M.L., Ford M.E., Karthikeyan T., Houtz J.M., Hendrix R.W.,
RA   Hatfull G.F., Poteete A.R., Gilcrease E.B., Winn-Stapley D.A.,
RA   Casjens S.R.;
RT   "Corrected sequence of the bacteriophage p22 genome";
RL   J. Bacteriol. 185(4):1475-1477(2003).

This works if RP is just one line:

RP   1-5181,6229-11775,13275-15420,18210-23250,29410-32271,34850-38580,41454-41724

--
Nick Semenkovich
Laboratory of Dr. Jeffrey I. Gordon
Medical Scientist Training Program
School of Medicine
Washington University in St. Louis
314.362.3963 (Lab)
http://web.mit.edu/semenko/

From p.j.a.cock at googlemail.com Mon Sep 17 17:31:38 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 17 Sep 2012 18:31:38 +0100
Subject: [Biopython] Error parsing EMBL file
In-Reply-To:
References:
Message-ID:

On Mon, Sep 17, 2012 at 6:22 PM, Nick Semenkovich wrote:
> Looks like it's dying at a line-wrapped location string:
>
> RN   [16]
> RP   1-5181,6229-11775,13275-15420,18210-23250,29410-32271,34850-38580,
> RP   41454-41724
> RX   DOI; 10.1128/JB.185.4.1475-1477.2003.
> RX   PUBMED; 12562822.
> RA   Pedulla M.L., Ford M.E., Karthikeyan T., Houtz J.M., Hendrix R.W.,
> RA   Hatfull G.F., Poteete A.R., Gilcrease E.B., Winn-Stapley D.A.,
> RA   Casjens S.R.;
> RT   "Corrected sequence of the bacteriophage p22 genome";
> RL   J. Bacteriol. 185(4):1475-1477(2003).
>
>
> This works if RP is just one line:
> RP   1-5181,6229-11775,13275-15420,18210-23250,29410-32271,34850-38580,41454-41724

Good detective work :)

Can you try with this fix?
https://github.com/biopython/biopython/commit/0da9d7e72a95fe788c7c32c9cbc2ac95d84bb7b7

If you installed from source, the easiest way would be to grab the latest code from git and reinstall.

If you installed from a package, perhaps you might prefer to manually hack the file to make the one line change by hand? Back it up first ;)
/usr/lib/pymodules/python2.7/Bio/GenBank/Scanner.py

Peter

From semenko at alum.mit.edu Mon Sep 17 17:36:16 2012
From: semenko at alum.mit.edu (Nick Semenkovich)
Date: Mon, 17 Sep 2012 12:36:16 -0500
Subject: [Biopython] Error parsing EMBL file
In-Reply-To:
References:

Message-ID:

Awesome -- works great! (virtualenv makes this so easy!)

Thanks for the quick patch!

- Nick

--
Nick Semenkovich
Laboratory of Dr. Jeffrey I. Gordon
Medical Scientist Training Program
School of Medicine
Washington University in St. Louis
314.362.3963 (Lab)
http://web.mit.edu/semenko/

From p.j.a.cock at googlemail.com Tue Sep 18 09:52:13 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 18 Sep 2012 10:52:13 +0100
Subject: [Biopython] Removing HotRand.py?
In-Reply-To:
References:
Message-ID:

On Tue, Sep 11, 2012 at 2:18 AM, Nick Semenkovich wrote:
> I've submitted a pull request to deprecate Bio/HotRand.py
>
> Is anyone still using the HotRandom functions?
>
>
> It looks like the module is pretty old and there are some better alternatives:
> http://pypi.python.org/pypi/randomdotorg/
>
>
> Pull Request: https://github.com/biopython/biopython/pull/69
>
> Best,
> Nick Semenkovich

Since no one has objected, I will commit this. Our gradual deprecation process means there is still time to reverse this and/or delay the removal of Bio.HotRand if needed.

Thanks,

Peter

From cfriedline at vcu.edu Tue Sep 18 21:34:11 2012
From: cfriedline at vcu.edu (Chris Friedline)
Date: Tue, 18 Sep 2012 17:34:11 -0400
Subject: [Biopython] multiprocessing and SeqIO.index_db()
Message-ID:

Hi,

I ran into this today, and am wondering if there is a workaround. If I attempt to index multiple files with multiprocessing using SeqIO.index_db(), I can create the databases, but I'm unable to open them after they come back from the async process. Instead, I get this when trying to (say) print the dictionary:

  File "/Users/chris/.virtualenvs/default/lib/python2.7/site-packages/Bio/SeqIO/_index.py", line 112, in __str__
    return "{%s : SeqRecord(...), ...}" % repr(self.keys()[0])
  File "/Users/chris/.virtualenvs/default/lib/python2.7/site-packages/Bio/SeqIO/_index.py", line 416, in keys
    self._con.execute("SELECT key FROM offset_data;").fetchall()]
sqlite3.ProgrammingError: Base Connection.__init__ not called.

As a workaround, I'm just calling index_db again. From the source code, it appears that the index is not rebuilt by doing this, and it seems to work OK. Is this just a multiprocessing/pickling issue?

Thanks,
Chris

From eric.talevich at gmail.com Wed Sep 19 18:10:31 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 19 Sep 2012 14:10:31 -0400
Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
In-Reply-To:
References: <118F034CF4C3EF48A96F86CE585B94BF33BAEDEB@CHIMBX5.ad.uillinois.edu> <1347676992.27319.YahooMailClassic@web164005.mail.gq1.yahoo.com>
Message-ID:

On Sat, Sep 15, 2012 at 9:22 AM, Wibowo Arindrarto wrote:

> Hi guys,
>
> > > 2) If we add a function to Biopython that generates Blast plain-text
> > > output (or something close to it) from Blast XML output, then a user can
> > > generate the Blast output in XML format, parse it with Biopython,
> > > optionally
> > > filter it, and then generate the corresponding plain-text output;
> >
> > The new 'SearchIO' results objects str/repr should be familiar to
> > anyone who has looked at the plain text BLAST output - but
> > not identical. We could apply some of these improvements
> > to the current BLAST parsers, but I favour aiming to simply
> > deprecate them in favour of 'SearchIO' (namespace to be
> > decided).
> >
> > However, we certainly could try and offer a plain-text BLAST
> > output format from 'SearchIO', although IIRC Bow has not tried
> > that yet. It shouldn't be too complicated - unless you aim for
> > 100% agreement with the latest BLAST output (moving target).
>
> Yes, this has not been attempted, mostly because I feel that the
> BLAST plain text is indeed a moving target. But, if we are in favor of
> choosing one format from one BLAST version and always sticking to it, it
> sounds more reasonable.
>

Since NCBI is not planning to make any more changes to "legacy" blastall, this could be an opportunity to settle on one stable plain-text BLAST output style to parse in Bio.Search(IO), and admit that we're not going to bother keeping up with BLAST+ plain-text reports. (I imagine there's a certain degree of overlap between users stuck with legacy BLAST installations and those stuck with plain-text BLAST reports.)

>
> There is one missing detail that is only present in the plain text
> format, though: the hit-level e-values. If we do decide to write a
> plain text writer, we either have to demand the user supply these
> values, or we omit the entire hit-level e-value table, or we fill it
> with something else.
>

But the Hsp-level scores or bitscores are included, right? The database size, query length and Karlin-Altschul kappa and lambda values are included in the BLAST XML output, so it's possible (and not difficult) to recalculate the e-values.

http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html#head3

Note that BLAST tweaks the raw alignment score with their own heuristics, so it's not easy to get the raw score from the alignment in the XML. But once you have the raw score, the rest is straightforward.

Cheers,
Eric

From p.j.a.cock at googlemail.com Fri Sep 21 13:22:48 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 21 Sep 2012 14:22:48 +0100
Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
In-Reply-To:
References: <118F034CF4C3EF48A96F86CE585B94BF33BAEDEB@CHIMBX5.ad.uillinois.edu> <1347676992.27319.YahooMailClassic@web164005.mail.gq1.yahoo.com>
Message-ID:

On Sat, Sep 15, 2012 at 2:22 PM, Wibowo Arindrarto wrote:
> There is one missing detail that is only present in the plain text
> format, though: the hit-level e-values. If we do decide to write a
> plain text writer, we either have to demand the user supply these
> values, or we omit the entire hit-level e-value table, or we fill it
> with something else.

Bow and I have just been over the BLAST+ source code, and confirmed the 'hit level e-value' shown in the plain text description table before the alignments is in fact just the e-value of the best HSP, i.e. the minimum e-value.

So that isn't a problem after all.

Peter

From w.arindrarto at gmail.com Fri Sep 21 23:03:10 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Sat, 22 Sep 2012 01:03:10 +0200
Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
In-Reply-To:
References: <118F034CF4C3EF48A96F86CE585B94BF33BAEDEB@CHIMBX5.ad.uillinois.edu> <1347676992.27319.YahooMailClassic@web164005.mail.gq1.yahoo.com>

Message-ID: Hi guys, On Fri, Sep 21, 2012 at 3:22 PM, Peter Cock wrote: > On Sat, Sep 15, 2012 at 2:22 PM, Wibowo Arindrarto > wrote: >> Hi guys, >> >>> > 2) If we add a function to Biopython that generates Blast plain-text >>> > output (or something close to it) from Blast XML output, then a user can >>> > generate the Blast output in XML format, parse it with Biopython, >>> > optionally >>> > filter it, and then generate the corresponding plain-text output; >>> >>> The new 'SearchIO' results objects str/repr should be familiar to >>> anyone who has looked at the plain text BLAST output - but >>> not identical. We could apply some of these improvements >>> to the current BLAST parsers, but I favour aiming to simply >>> deprecate them in favour of 'SearchIO' (namespace to be >>> decided). >>> >>> However, we certainly could try and offer a plain-text BLAST >>> output format from 'SearchIO', although IIRC Bow has not tried >>> that yet. It shouldn't be too complicated - unless you aim for >>> 100% agreement with the latest BLAST output (moving target). >> >> Yes, this has not been attempted ~ mostly because I feel that the >> BLAST plain text is indeed a moving target. But, if we are in favor of >> choosing one format from one BLAST version and always stick to it, it >> sounds more reasonable. >> >> There are one missing detail that is only present in the plain text >> format, though: the hit-level e-values. If we do decide to write a >> plain text writer, we either have to demand the user supply these >> values, or we omit the entire hit-level e-value table, or we fill it >> with something else. > > Bow and I have just been over the BLAST+ source code, > and confirmed the 'hit level e-value' shown in the plain text > description table before the alignments is in fact just the > e-value of the best HSP. i.e. The minimum e-value. > > So that isn't a problem afterall. > > Peter Yes, I should've checked first how that e-value gets there. 
A little peeking into the source code and it was apparent that it's the lowest HSP-level e-value in the hit. So we don't have to worry about calculating new values. For the writing support, I agree with Eric ~ we could use the latest BLAST legacy output as our target plain text format. For parsing, I'm still not sure. Unless there's a massive speed-up, I prefer to keep the current parser as the base given its versatility. Perhaps I can do a bit more 'trimming' so that the parser directly creates SearchIO objects. This won't be a major change to the logic, though. regards, Bow From mmokrejs at fold.natur.cuni.cz Tue Sep 25 01:28:36 2012 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Tue, 25 Sep 2012 03:28:36 +0200 Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? In-Reply-To: References: <5051F9AD.104@fold.natur.cuni.cz> Message-ID: <506108C4.7010102@fold.natur.cuni.cz> Hi Wibowo, will you also add the cElementTree calls to NCBIXML (replacing SAX parser)? I would have to lookup how the record attributes changed(=renamed) from those specific for blast to those generalized and used(=promoted) by SearchIO. Do you have a list of sed regexps? ;-) Or shall I just replace NCBIXML.parse() with SearchIO.parse()? ;-) Both would be certainly helpful at 3 a.m. :(( I am leaving the profiler running overnight and will at least be able to lookup where the bottleneck in the current NCBIXML is. The rest ... next time. ;-) I see, hsp.sbjct_start is renamed to hsp.hit_start ... How about exposing *additionally* the data through "old" names? So that "SearchIO" would expose both hsp.hit_start and also hsp.sbjct_start ... and maybe even more if blat or hmmer parsers expose so far other attributes as well. I know it is ugly but makes the transition smoother. ;-) Those are just references. 
;-) I would like to do something like: try: # latest and greatest biopython version installed from Bio import SearchIO except ImportError: # some old installation from Bio import SeqIO but have the rest of my code unchanged. Umm, I use NCBIXML.parse() so I won't need even the above. You just change it in your git branch and I won't have to touch my code. That's fair, isn't it? ;-) Did you profile biopython or SearchIO yourself? Best, Martin Wibowo Arindrarto wrote: > Hi Martin, > > There is actually already a faster BLAST XML parser written using > cElementTree in Biopython :) (although it's yet to be included in the > main branch). It's part of Biopython's SearchIO module that I recently > wrote (the name SearchIO might change in the future). And indeed, my > early benchmarks has shown that it does perform faster. > > This branch is available here: > https://github.com/bow/biopython/tree/searchio. I've also written a > draft tutorial on how to use it here: > http://bow.web.id/biopython/Tutorial.html#htoc96. > > However, as it's not yet in the current branch, you need to do a > little bit of command line work to set it up: > > 1. Set up a new virtualenv environment (so that it doesn't clash with > your other Biopython installation) and activate it. > 2. Clone the repository: `git clone > https://github.com/bow/biopython.git`, checkout the 'searchio' branch > 3. Run `python setup.py develop`. This will keep the > installation in-sync with any future `git pull` you might perform on > the branch. > > Hope this helps :), > Bow > > > On Thu, Sep 13, 2012 at 5:20 PM, Martin Mokrejs > wrote: >> Hi, >> I am using "blastall -p blastn ... 
-m 7" to yield about 100GB large XML files >> which are then parsed by >> >> from Bio.Blast import NCBIXML >> _blastn_fileh = open(blast_out_xml_filename) >> _blastn_iterator = NCBIXML.parse(_blastn_fileh) >> _record = _blastn_iterator.next() # fetch the very first BLAST result from generator >> >> In my case the blastn searches seem to take longer than takes the XML parsing. :( >> I do not have timing numbers here but wonder why is cElementTree used only in Uniprot >> biopython modules and not in SeqIO. What XML parsing library is my biopython-1.59 using? >> Isn't there any argument when setup.py is called to discern between elementtree, cElementTree >> which I think use expat ...? I am writing this a bit from top of my head hoping Peter ;-) >> or somebody else will know right away where to look for a performance bottleneck >> and where to change code to use cElementTree which always seemed the fastest to me. >> Thank you for some initial advice. >> Martin >> P.S.: And yes, I would love to parse blastn plaintext output or some other more compact one, >> the XML is really an overkill. >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > > From p.j.a.cock at googlemail.com Tue Sep 25 08:09:26 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 25 Sep 2012 09:09:26 +0100 Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? In-Reply-To: <506108C4.7010102@fold.natur.cuni.cz> References: <5051F9AD.104@fold.natur.cuni.cz> <506108C4.7010102@fold.natur.cuni.cz> Message-ID: On Tue, Sep 25, 2012 at 2:28 AM, Martin Mokrejs wrote: > Hi Wibowo, > will you also add the cElementTree calls to NCBIXML (replacing SAX parser)? > I would have to lookup how the record attributes changed(=renamed) > from those specific for blast to those generalized and used(=promoted) > by SearchIO. 
Do you have a list of sed regexps? ;-)
>
> Or shall I just replace NCBIXML.parse() with SearchIO.parse()? ;-)
> Both would be certainly helpful at 3 a.m. :(( I am leaving the profiler
> running overnight and will at least be able to lookup where the bottleneck
> in the current NCBIXML is. The rest ... next time. ;-)

We did discuss updating the internals of the old NCBIXML parser
to use ElementTree / cElementTree, but currently the plan is to
simply deprecate the old parser, so this seems a wasted effort.

> I see, hsp.sbjct_start is renamed to hsp.hit_start ... How about exposing
> *additionally* the data through "old" names? So that "SearchIO" would expose
> both hsp.hit_start and also hsp.sbjct_start ... and maybe even more if blat
> or hmmer parsers expose so far other attributes as well. I know it is ugly but
> makes the transition smoother. ;-) Those are just references. ;-) I would
> like to do something like:
>
> try:
>     # latest and greatest biopython version installed
>     from Bio import SearchIO
> except ImportError:
>     # some old installation
>     from Bio import SeqIO
>
>
> but have the rest of my code unchanged. Umm, I use NCBIXML.parse()
> so I won't need even the above. You just change it in your git branch
> and I won't have to touch my code. That's fair, isn't it? ;-)

The plan is to reward people for updating their code by giving them
faster BLAST XML parsing (and an easy way to try out other input
file formats in future).

Note that Bio.SearchIO is the working name and current namespace
used on the branch, but is unlikely to be the final name.

And I'm not keen on adding backwards-compatible aliases for the old
BLAST parser names - even if they did come with deprecation warnings.
In fact I suspect even that wouldn't give you the drop-in replacement
you are hoping for, since the object hierarchy has changed too.
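
For illustration only, a backwards-compatible alias of the kind discussed above could be a read-only property that forwards to the new attribute and emits a DeprecationWarning. This is a minimal sketch, not code from Biopython or the SearchIO branch: the HSP class below is a hypothetical stand-in, and only the attribute names sbjct_start and hit_start are taken from this thread.

```python
import warnings


class HSP(object):
    """Hypothetical stand-in for an HSP object; NOT the real SearchIO class."""

    def __init__(self, hit_start, hit_end):
        # New-style attribute names, as mentioned in this thread.
        self.hit_start = hit_start
        self.hit_end = hit_end

    @property
    def sbjct_start(self):
        # Old BLAST-style name, kept as a read-only alias of hit_start.
        warnings.warn("sbjct_start is deprecated, use hit_start instead",
                      DeprecationWarning, stacklevel=2)
        return self.hit_start


hsp = HSP(hit_start=3, hit_end=8)

# Record warnings so the deprecation notice can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_value = hsp.sbjct_start  # same value as hsp.hit_start, plus a warning
```

An alias like this only covers renamed attributes. It would not hide differences in how the result objects are nested, which is one reason it may still fall short of a drop-in replacement.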

However, if there are some specific cases where you think the old
name is still sensible given the broader scope of the new parser
covering many other formats as well as BLAST, then some minor
renames seem more reasonable.

> Did you profile biopython or SearchIO yourself?
> Best,
> Martin

Bow did some profiling of the old NCBIXML parser against his
SearchIO work.

Peter

From w.arindrarto at gmail.com Tue Sep 25 09:53:08 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Tue, 25 Sep 2012 11:53:08 +0200
Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
In-Reply-To:
References: <5051F9AD.104@fold.natur.cuni.cz> <506108C4.7010102@fold.natur.cuni.cz>
Message-ID:

Hi Martin, Peter,

I agree with Peter. The new object model is a bit different from the
old one in Bio.Blast, so a simple search & replace might not do the
trick.

The same goes with the attribute names. I suppose I could add one
table in the draft tutorial to list the new attribute names, but I
prefer not to have any Bio.Blast-compatible names in the code.

As for the profiling, I did some quick benchmarks but they weren't
really thorough. I only compared the parsing times of
Bio.Blast.NCBIXML and the new BLAST XML parser in SearchIO. Using a
test file containing 1000 BLAST queries (286 MB total), the results
were as follows:

on SearchIO:
  97.11  93.66  94.13  91.35  90.90
  Total time : 467.15
  Average    : 93.43

on Bio.Blast:
  441.45  412.57  471.31  434.22  429.35
  Total time : 2188.90
  Average    : 437.78

The speed-up was almost 5x. I didn't check for any optimizable
bottlenecks, though.

Hope that helps,
Bow

From mjldehoon at yahoo.com Tue Sep 25 10:34:52 2012
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 25 Sep 2012 03:34:52 -0700 (PDT)
Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
In-Reply-To:
Message-ID: <1348569292.2538.YahooMailClassic@web164006.mail.gq1.yahoo.com>

> The same goes with the attribute names. I suppose I
> could add one table in the draft tutorial to list the
> new attribute names, but I prefer not to have any
> Bio.Blast-compatible names in the code.

>> I would have to lookup how the record attributes
>> changed(=renamed) from those specific for blast to
>> those generalized and used(=promoted)
>> by SearchIO.

>> I see, hsp.sbjct_start is renamed to hsp.hit_start ...

I would suggest using the same names as in the XML
source file. Then we are consistent with NCBI, we don't
have to come up with our own names, and we won't
have to provide a list of biopython-defined record
attributes. Dropping the "Hsp" in <Hsp_hit-from>, that
would be "hit-from".

Best,
-Michiel.

From p.j.a.cock at googlemail.com Tue Sep 25 11:03:15 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 25 Sep 2012 12:03:15 +0100
Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
In-Reply-To: <1348569292.2538.YahooMailClassic@web164006.mail.gq1.yahoo.com>
References: <1348569292.2538.YahooMailClassic@web164006.mail.gq1.yahoo.com>
Message-ID:

On Tue, Sep 25, 2012 at 11:34 AM, Michiel de Hoon wrote:
>> The same goes with the attribute names. I suppose I
>> could add one table in the draft tutorial to list the
>> new attribute names, but I prefer not to have any
>> Bio.Blast-compatible names in the code.
>
>>> I would have to lookup how the record attributes
>>> changed(=renamed) from those specific for blast to
>>> those generalized and used(=promoted)
>>> by SearchIO.
>
>>> I see, hsp.sbjct_start is renamed to hsp.hit_start ...
>
> I would suggest using the same names as in the XML
> source file. Then we are consistent with NCBI, we don't
> have to come up with our own names, and we won't
> have to provide a list of biopython-defined record
> attributes. Dropping the "Hsp" in <Hsp_hit-from>, that
> would be "hit-from".

We can't be fully consistent with the NCBI since they have
more than one naming convention ;)

Personally I find the NCBI's human-readable column names
used in the tabular output far nicer than the verbose terms
in the XML, which is not really human readable, e.g.

slen means Subject sequence length
qstart means Start of alignment in query
qend means End of alignment in query
sstart means Start of alignment in subject
send means End of alignment in subject
qseq means Aligned part of query sequence
sseq means Aligned part of subject sequence

The term 'subject' for the hit sequence is quite BLAST
specific, but otherwise these terms are reasonably broad
and could make sense in SearchIO beyond BLAST (assuming
you don't find shortening the subject/query prefix to a
single letter confusing).

Currently the HSP object in SearchIO uses hit_start,
hit_end, query_start and query_end - but also note
that we're using Python counting (i.e. zero-based).

Peter

From mmokrejs at fold.natur.cuni.cz Tue Sep 25 11:15:19 2012
From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs)
Date: Tue, 25 Sep 2012 13:15:19 +0200
Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
In-Reply-To: References: <5051F9AD.104@fold.natur.cuni.cz> <506108C4.7010102@fold.natur.cuni.cz> Message-ID: <50619247.2080906@fold.natur.cuni.cz> Peter Cock wrote: > On Tue, Sep 25, 2012 at 2:28 AM, Martin Mokrejs > wrote: >> Hi Wibowo, >> will you also add the cElementTree calls to NCBIXML (replacing SAX parser)? >> I would have to lookup how the record attributes changed(=renamed) >> from those specific for blast to those generalized and used(=promoted) >> by SearchIO. Do you have a list of sed regexps? ;-) >> >> Or shall I just replace NCBIXML.parse() with SearchIO.parse()? ;-) >> Both would be certainly helpful at 3 a.m. :(( I am leaving the profiler >> running overnight and will at least be able to lookup where the bottleneck >> in the current NCBIXML is. The rest ... next time. ;-) > > We did discuss updating the internals of the old NCBIXML parser > to use ElementTree / cElementTree, but currently the plan is to > simply deprecate the old parser, so this seems a wasted effort. Then it means for me that some parts of my code will exist twice. As you said below the structuring of object in searchio vs. NCBIXML is different so I will really need two routines. :( One for newer installation and one for (most) older biopython versions. I would really suggest to spend some effort on optimizing the coding style of the old parser. The gain might be quite substantial and easy to gain for you and at no cost for end users. > >> I see, hsp.sbjct_start is renamed to hsp.hit_start ... How about exposing >> *additionally* the data through "old" names? So that "SearchIO" would expose >> both hsp.hit_start and also hsp.sbjct_start ... and maybe even more if blat >> or hmmer parsers expose so far other attributes as well. I know it is ugly but >> makes the transition smoother. ;-) Those are just references. 
;-) I would >> like to do something like: >> >> try: >> # latest and greatest biopython version installed >> from Bio import SearchIO >> except ImportError: >> # some old installation >> from Bio import SeqIO >> >> >> but have the rest of my code unchanged. Umm, I use NCBIXML.parse() >> so I won't need even the above. You just change it in your git branch >> and I won't have to touch my code. That's fair, isn't it? ;-) > > The plan is to reward people for updating their code by giving them > faster BLAST XML parsing (and an easy way to try out other input > file formats in future). That will take a long while for people to switch over. Fix all HOWTOs and other docs all over the websites in the world ... That's a long shot. I would really try to provide a mapping interface so that people can just do the above try/except trick during module import. > > Note that Bio.SearchIO is the working name and current namespace > used on the branch, but is unlikely to be the final name. That's no problem for me. > > And I'm not keen on adding backwards compatible aliases for the old > BLAST parser names - even if they did come with deprecation warnings. > In fact I suspect even that wouldn't give you the drop in replacement > you are hoping for, the object heirachy has changed too. I understand you reasoning but maintaining two copies of functionally same code is boring for users as well. ;-) I can adjust for that myself, sure. > > However, if there are some specific cases where you think the old > name is still sensible given the broader scope of the new parser > covering many other formats as well as BLAST, then some minor > renames seems more reasonable. > >> Did you profile biopython or SearchIO yourself? >> Best, >> Martin > > Bow did some profiling of the old NCBIXML parser against his > SearchIO work. 
>
> Peter
>
>

From mmokrejs at fold.natur.cuni.cz Tue Sep 25 11:26:46 2012
From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs)
Date: Tue, 25 Sep 2012 13:26:46 +0200
Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
In-Reply-To:
References: <1348569292.2538.YahooMailClassic@web164006.mail.gq1.yahoo.com>
Message-ID: <506194F6.9000103@fold.natur.cuni.cz>

Peter Cock wrote:

> Currently the HSP object in SearchIO uses hit_start,
> hit_end, query_start and query_end - but also note
> that we're using Python counting.

Ah, thanks for the reminder. Yes, this is exactly why I wasn't very happy to re-implement
my code right now to use searchio but forgot to say that. I have already sorted out all the
off-by-one adjustments in my code, using zero-based counting in some places and 1-based
counting in others (where a human reads the output text files/tables). These are
scattered through the program (I think), and this will probably be the major blocker for me.
;) Things might break for me all over the place.

I am not saying this is a good idea, but really, providing cElementTree calls from within
NCBIXML would be more appealing to me (instead of the current Python-based expat parser
calls).

From p.j.a.cock at googlemail.com Tue Sep 25 12:26:48 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 25 Sep 2012 13:26:48 +0100
Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
In-Reply-To: <506194F6.9000103@fold.natur.cuni.cz>
References: <1348569292.2538.YahooMailClassic@web164006.mail.gq1.yahoo.com> <506194F6.9000103@fold.natur.cuni.cz>
Message-ID:

On Tue, Sep 25, 2012 at 12:26 PM, Martin Mokrejs wrote:
> Peter Cock wrote:
>
>> Currently the HSP object in SearchIO uses hit_start,
>> hit_end, query_start and query_end - but also note
>> that we're using Python counting.
>
> Ah, thanks for the reminder. Yes, this is exactly why I wasn't very happy to re-implement
> my code right now to use searchio but forgot to say that. I have already sorted out all the
> off-by-one adjustments in my code, using zero-based counting in some places and 1-based
> counting in others (where a human reads the output text files/tables). These are
> scattered through the program (I think), and this will probably be the major blocker for me.
> ;) Things might break for me all over the place.

Sadly, whenever we are dealing with position input/output there
will be off-by-one adjustments required. I think it is wise to use
just one standard internally to a tool, and for Python that means
zero-based counting.

> I am not saying this is a good idea, but really, providing cElementTree calls from within
> NCBIXML would be more appealing to me (instead of the current Python-based expat parser
> calls).

OK - so there is at least one person making heavy use of
NCBIXML, so we shouldn't rush to deprecate it after merging
SearchIO, and there *is* some benefit from making it faster
(but with the same API).

In principle NCBIXML could be rewritten to use cElementTree
/ElementTree and preserve the API - if you or anyone else wants
to do that (and the unit tests still pass), then I'm happy to review
such changes. Likewise for less dramatic optimisations.

Regards,

Peter

From p.j.a.cock at googlemail.com Tue Sep 25 13:32:07 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 25 Sep 2012 14:32:07 +0100
Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
In-Reply-To:
References: <1348569292.2538.YahooMailClassic@web164006.mail.gq1.yahoo.com> <506194F6.9000103@fold.natur.cuni.cz>
Message-ID:

On Tue, Sep 25, 2012 at 1:26 PM, Peter Cock wrote:
>
> OK - so there is at least one person making heavy use of
> NCBIXML, so we shouldn't rush to deprecate it after merging
> SearchIO, and there *is* some benefit from making it faster
> (but with the same API).
>
> In principle NCBIXML would be rewritten to use cElementTree
> /ElementTree and preserve the API - if you or anyone else want
> to do that (and the unit tests still pass), then I'm happy to review
> such changes. Likewise for less dramatic optimisations.

Martin emailed me to ask about this bit of the code, and it
can be sped up - this shows about a 5% reduction:
https://github.com/biopython/biopython/commit/970364761982bf331c221b6f007e8b8e52fa9600

Summary parsing a 286MB XML file from BLASTX 2.2.26+
for 1000 genes against the NR database.

NCBIXML before change: About 162s
NCBIXML after change: About 154s
NCBIXML removing debug: About 152s
Using SearchIO: About 79s

This is probably the same test file Bow gave numbers for earlier,
although it seems SearchIO has less of an advantage on my
machine (about x2) compared to Bow's machine (almost x5).

(We should check memory usage too...)

Peter

---------------------------------------------

The full details,

Before this change:

$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 161.8s

real 2m41.894s
user 2m41.208s
sys 0m0.675s

$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 161.8s

real 2m41.984s
user 2m41.296s
sys 0m0.677s

$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 162.6s

real 2m42.771s
user 2m41.995s
sys 0m0.763s


With this change:

$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 152.4s

real 2m32.582s
user 2m31.910s
sys 0m0.663s

$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 153.5s

real 2m33.680s
user 2m32.977s
sys 0m0.695s

$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 153.8s

real 2m33.931s
user 2m33.258s
sys 0m0.661s

And if we go further and remove _debug_ignore_list and
this bit of debug code the saving is marginal:

$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 151.5s

real 2m31.611s
user 2m30.934s
sys 0m0.665s

$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 151.2s

real 2m31.348s
user 2m30.664s
sys 0m0.674s

$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 152.9s

real 2m32.994s
user 2m32.314s
sys 0m0.669s

This is the timing script I used,

$ more /tmp/time_ncbixml.py
import sys
import time
from Bio.Blast import NCBIXML
for f in sys.argv[1:]:
    start = time.time()
    count = 0
    handle = open(f)
    for record in NCBIXML.parse(handle):
        count += 1
    handle.close()
    print "%i records in %s in %0.1fs" % (count, f, time.time() - start)
#End of file

For comparison, here is the timing on the same setup but using
SearchIO from Bow's current branch:

$ time python time_searchio.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 79.1s

real 1m19.259s
user 1m18.397s
sys 0m0.799s

$ time python time_searchio.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 78.7s

real 1m18.878s
user 1m18.149s
sys 0m0.719s

$ time python time_searchio.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 79.5s

real 1m19.611s
user 1m18.683s
sys 0m0.918s

And the script:

$ more /tmp/time_searchio.py
import sys
import time
from Bio import SearchIO
for f in sys.argv[1:]:
    start = time.time()
    count = 0
    handle = open(f)
    for record in SearchIO.parse(handle, "blast-xml"):
        count += 1
    handle.close()
    print "%i records in %s in %0.1fs" % (count, f, time.time() - start)
#End of file

From golubchi at stats.ox.ac.uk Tue Sep 25 14:39:11 2012
From: golubchi at stats.ox.ac.uk (Tanya Golubchik)
Date: Tue, 25 Sep 2012 15:39:11 +0100
Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
In-Reply-To:
References: <1348569292.2538.YahooMailClassic@web164006.mail.gq1.yahoo.com> <506194F6.9000103@fold.natur.cuni.cz>

Message-ID: <5061C20F.7040209@stats.ox.ac.uk>

Hello,

Apologies for not having followed the entire discussion, but just wanted
to say that we're also using NCBIXML here and are likely to be
incorporating it in a new piece of software soon, so it would be really
unfortunate if some tags disappeared, were renamed or (even worse)
changed meaning in future releases.

I'm a bit late coming in here so maybe this has been answered, but is
there a better parser that should be used at the moment? I was under the
impression that NCBIXML is the only one.

Thanks,
Tanya

From p.j.a.cock at googlemail.com Tue Sep 25 16:00:45 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 25 Sep 2012 17:00:45 +0100
Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
In-Reply-To: <5061C20F.7040209@stats.ox.ac.uk>
References: <1348569292.2538.YahooMailClassic@web164006.mail.gq1.yahoo.com> <506194F6.9000103@fold.natur.cuni.cz>

<5061C20F.7040209@stats.ox.ac.uk>
Message-ID:

On Tue, Sep 25, 2012 at 3:39 PM, Tanya Golubchik wrote:
> Hello,
>
> Apologies for not having followed the entire discussion, but just wanted
> to say that we're also using NCBIXML here and are likely to be
> incorporating it in a new piece of software soon, so it would be really
> unfortunate if some tags disappeared, were renamed or (even worse)
> changed meaning in future releases.
>
> I'm a bit late coming in here so maybe this has been answered, but is
> there a better parser that should be used at the moment? I was under the
> impression that NCBIXML is the only one.
>
> Thanks,
> Tanya

Hi Tanya,

I hope I can reassure you there is nothing to worry about :)

Right now there is only the NCBIXML parser, and we're not going
to change it (except possibly to make it a little faster if people
want to work on that).

We're planning to add a new module based on Bow's GSoC code,
under the working name SearchIO, which would cover BLAST,
BLAT, HMMER, etc. This would have a different API and in
the long term would probably replace all of Bio.Blast.
http://biopython.org/wiki/SearchIO

The discussion about possible changes has been (I think) only
about this new code (and would have been better off on the
development mailing list but this thread went off on a slight
tangent).

Once 'SearchIO' is released, we'd want to encourage people to
use that instead of NCBIXML, with a view to deprecating and
eventually removing NCBIXML. See:
http://biopython.org/wiki/Deprecation_policy

Regards,

Peter

From golubchi at stats.ox.ac.uk Thu Sep 27 11:35:35 2012
From: golubchi at stats.ox.ac.uk (Tanya Golubchik)
Date: Thu, 27 Sep 2012 12:35:35 +0100
Subject: [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
In-Reply-To: