From mdehoon at c2b2.columbia.edu  Sat Oct  1 20:50:03 2005
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Sat Oct  1 20:49:55 2005
Subject: [Biopython-dev] Blast
Message-ID: <6CA15ADD82E5724F88CB53D50E61C9AE9ECD0D@cgcmail.cgc.cpmc.columbia.edu>

Thanks, Jeff. Currently, qblast in Bio.Blast.NCBIWWW can already return text
output via the format_type argument. Unfortunately, the standalone blast and
www-blast return slightly different text output, so we'd have to fix the
parser in Bio.Blast.NCBIStandalone for it to handle www-blast text output.

I found out that both standalone blast and www-blast can also return XML
output, which is identical (as far as I can tell) in both cases. I would
think that a parser that can read this XML output is most stable.
So I propose the following:

1) Let qblast return XML output by default; text and html output can be
returned by setting the format_type argument to qblast appropriately.
2) Write an XML parser that can read blast output from standalone and www
blast.
3) In a few versions, deprecate the text parser in NCBIStandalone and the
html parser in NCBIWWW. (This will only affect users of the text parser in
NCBIStandalone, since the html parser in NCBIWWW is already behind and cannot
parse blast output as it is).
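
From the user's side, points 1 and 2 could look roughly like the sketch below. It assumes a Biopython release that ships Bio.Blast.NCBIWWW and Bio.Blast.NCBIXML, plus network access to NCBI. The helper name top_hits and the blastn/nt defaults are illustrative assumptions, not part of the proposal.

```python
def top_hits(sequence, program="blastn", database="nt", max_hits=5):
    """Run a web BLAST search and return (title, e-value) for the best hits.

    Hypothetical helper: needs Biopython and network access to NCBI.
    """
    # Imported lazily so that defining the helper does not require Biopython.
    from Bio.Blast import NCBIWWW, NCBIXML

    # Ask explicitly for XML, the format proposed as the default above.
    handle = NCBIWWW.qblast(program, database, sequence, format_type="XML")
    try:
        record = NCBIXML.read(handle)  # one query, so exactly one record
    finally:
        handle.close()

    hits = []
    for alignment in record.alignments[:max_hits]:
        best_hsp = alignment.hsps[0]  # first HSP reported for this hit
        hits.append((alignment.title, best_hsp.expect))
    return hits
```

The same NCBIXML.read call should work on XML written by standalone blast, which is the point of having a single parser for both sources.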

Any objections, anybody?

--Michiel.

Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032


-----Original Message-----
From: Jeffrey Chang [mailto:jeffrey.chang@duke.edu]
Sent: Thu 9/29/2005 10:16 PM
To: Michiel De Hoon
Cc: biopython-dev@biopython.org
Subject: Re: [Biopython-dev] Blast
 
On Sep 29, 2005, at 1:46 PM, Michiel De Hoon wrote:

> To my surprise, the parser in Blast.NCBIWWW tries to parse HTML output
> instead of text output. My guess is that the HTML output changes more
> often and is more difficult to parse than text output. So isn't it
> possible to make NCBIWWW.qblast return text output instead of HTML and
> parse that instead?
> So my question is, why was the choice made to parse HTML instead of
> text? Is it simply because blast-on-the-web couldn't return text output
> in the past?

You are right.  It was done that way in the past when the only way to  
use NCBI's BLAST was to use the HTML output.  (Actually, there was a  
version that you could access through a proprietary non-HTTP  
protocol, but the databases were not updated as frequently.)  Now  
that we can get text, perhaps it is time to encourage users to use  
the text one.  I believe the HTML parser is a few versions behind  
now, and unable to parse current BLAST output anymore.

Jeff


From mdehoon at c2b2.columbia.edu  Wed Oct  5 13:37:22 2005
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Wed Oct  5 13:41:29 2005
Subject: [Biopython-dev] Blast
Message-ID: <6CA15ADD82E5724F88CB53D50E61C9AE9ECD23@cgcmail.cgc.cpmc.columbia.edu>

Hi everybody,

Fixing the Blast problem turned out to be easier than I thought, as there was
already a parser (written by Bertrand Frottier) in Biopython that parses
Blast XML output. This Biopython project keeps amazing me.
So I just made XML output the default for qblast, and updated the
Tutorial/Cookbook chapter on Blast. Feel free to test it, and let me know if
there are any problems.

--Michiel.

Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032


_______________________________________________
Biopython-dev mailing list
Biopython-dev@biopython.org
http://biopython.org/mailman/listinfo/biopython-dev


From sbassi at gmail.com  Wed Oct  5 13:48:19 2005
From: sbassi at gmail.com (Sebastian Bassi)
Date: Wed Oct  5 15:27:30 2005
Subject: [Biopython-dev] Blast
In-Reply-To: <6CA15ADD82E5724F88CB53D50E61C9AE9ECD23@cgcmail.cgc.cpmc.columbia.edu>
References: <6CA15ADD82E5724F88CB53D50E61C9AE9ECD23@cgcmail.cgc.cpmc.columbia.edu>
Message-ID: <b43bf2080510051048l3e4939baid7d53c8cb952fdaa@mail.gmail.com>

On 10/5/05, Michiel De Hoon <mdehoon@c2b2.columbia.edu> wrote:
> Fixing the Blast problem turned out to be easier than I thought, as there was
> already a parser (written by Bertrand Frottier) in Biopython that parses
> Blast XML output. This Biopython project keeps amazing me.
> So I just made XML output the default for qblast, and updated the
> Tutorial/Cookbook chapter on Blast. Feel free to test it, and let me know if
> there are any problems.

Is the updated document online?

Best regards,
SB.


From mdehoon at c2b2.columbia.edu  Wed Oct  5 15:51:46 2005
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Wed Oct  5 15:51:49 2005
Subject: [Biopython-dev] Blast
Message-ID: <6CA15ADD82E5724F88CB53D50E61C9AE9ECD24@cgcmail.cgc.cpmc.columbia.edu>

> Is the updated document online?

Yes it is, see:
http://www.biopython.org/docs/tutorial/Tutorial004.html#toc10

--Michiel.

Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032


From bugzilla-daemon at portal.open-bio.org  Fri Oct  7 13:28:31 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Fri Oct  7 13:30:57 2005
Subject: [Biopython-dev] [Bug 1876] Bio.pairwise2 generates incorrect
	Needleman-Wunsch score_matrix
Message-ID: <200510071728.j97HSVvj006722@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1876


------- Comment #1 from bill@barnard-engineering.com  2005-10-07 13:28 -------
Created an attachment (id=239)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=239&action=view)
Generates the Durbin Fig 2.5 matrix via simple method


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org  Fri Oct  7 13:30:00 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Fri Oct  7 13:30:58 2005
Subject: [Biopython-dev] [Bug 1876] Bio.pairwise2 generates incorrect
	Needleman-Wunsch score_matrix
Message-ID: <200510071730.j97HU0ox006746@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1876


------- Comment #2 from bill@barnard-engineering.com  2005-10-07 13:30 -------
Created an attachment (id=240)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=240&action=view)
Generates the Durbin Fig 2.5 matrix using slightly modified pairwise2.py


From bugzilla-daemon at portal.open-bio.org  Fri Oct  7 13:26:00 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Fri Oct  7 13:30:59 2005
Subject: [Biopython-dev] [Bug 1876] New: Bio.pairwise2 generates incorrect
	Needleman-Wunsch score_matrix
Message-ID: <200510071726.j97HQ0UE006661@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1876

           Summary: Bio.pairwise2 generates incorrect Needleman-Wunsch
                    score_matrix
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev@biopython.org
        ReportedBy: bill@barnard-engineering.com


Investigation of Bio.pairwise2 to duplicate the alignment example from the text
"Biological sequence analysis" by R. Durbin et al. reveals that although the
alignments returned for the example x = 'HEAGAWGHEE', y = 'PAWHEAE' are
correct, the underlying scoring matrix is not correct.

The Biopython version I'm using is from CVS, up-to-date as of 7 Oct 2005.

My analysis shows that the scoring matrix entries are correct for each entry
F(i,j) where one of the traceback vectors points to F(i-1,j-1). If the
traceback vectors do not contain a pointer to the diagonally previous entry,
then the F(i,j) entry is calculated incorrectly.

For this initial bug report I will show output of two programs that generate
the scoring matrix for this example by two methods. I will attach some
supporting files to this bug report following the initial commit. These files
will make it easy to reproduce the bug.

The output from my simple program that duplicates the example in Durbin (there
is one entry in the Durbin text that is in error) is:

Score matrix for Figure 2.5 example in Durbin text
     x:    H    E    A    G    A    W    G    H    E    E 
y:    0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80 
 P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73 
 A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60 
 W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37 
 H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19 
 E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5 
 A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2 
 E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1 
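
For readers who want to reproduce the matrix above, here is a minimal sketch of the recurrence F(i,j) = max(F(i-1,j-1) + s, F(i-1,j) - d, F(i,j-1) - d) with a linear gap penalty d = 8. The function name and the hand-entered substitution table are assumptions of this sketch: the scores are meant to be the BLOSUM50 entries for the residues involved and should be checked against a full table. Rows are indexed by y and columns by x, as in the printout.

```python
# Substitution scores for the residue pairs in this example, entered by hand
# (assumed to be the BLOSUM50 values; verify against a full BLOSUM50 table).
ROW_SCORES = {  # residue from y -> score against each residue from x
    "P": {"H": -2, "E": -1, "A": -1, "G": -2, "W": -4},
    "A": {"H": -2, "E": -1, "A": 5, "G": 0, "W": -3},
    "W": {"H": -3, "E": -3, "A": -3, "G": -3, "W": 15},
    "H": {"H": 10, "E": 0, "A": -2, "G": -2, "W": -3},
    "E": {"H": 0, "E": 6, "A": -1, "G": -3, "W": -3},
}


def durbin_score_matrix(x, y, gap=8):
    """Fill F using only the diagonal, upper and left neighbours of a cell."""
    rows, cols = len(y) + 1, len(x) + 1
    F = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        F[0][j] = -gap * j          # leading gap in y
    for i in range(1, rows):
        F[i][0] = -gap * i          # leading gap in x
        for j in range(1, cols):
            s = ROW_SCORES[y[i - 1]][x[j - 1]]
            F[i][j] = max(F[i - 1][j - 1] + s,   # align y[i-1] with x[j-1]
                          F[i - 1][j] - gap,     # gap in x
                          F[i][j - 1] - gap)     # gap in y
    return F


F = durbin_score_matrix("HEAGAWGHEE", "PAWHEAE")
for residue, row in zip(" PAWHEAE", F):
    print(residue, " ".join(f"{value:4d}" for value in row))
```

The bottom-right cell is the global alignment score; it is the one entry that both implementations discussed in this report have to agree on.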

The output from the (slightly) modified pairwise2.py code is:

Global alignment:
HEAGAWGHE-E
-P--AW-HEAE
score: 1
alignment: begin = 0, end = 11

pairwise2 Score matrix for Figure 2.5 example in Durbin text
    x:    H    E    A    G    A    W    G    H    E    E 
y:   x    x    x    x    x    x    x    x    x    x    x 
 P   x   -2   -9  -17  -26  -33  -44  -50  -58  -65  -73 
 A   x  -10   -3   -4  -17  -20  -36  -41  -51  -58  -66 
 W   x  -19  -13   -6   -7  -15   -5  -31  -39  -47  -55 
 H   x  -14  -18  -13   -8   -9  -18   -7   -3  -21  -29 
 E   x  -32   -8  -19  -16   -9  -12  -16   -7    3   -5 
 A   x  -42  -23   -3  -16  -11  -12  -12  -17   -8    2 
 E   x  -48  -24  -17   -6  -12  -14  -15  -12   -9    1 

My supporting attachments will include the driver programs that generated these
outputs, along with the patch to modify pairwise2.py so it returns the
score_matrix.


From bugzilla-daemon at portal.open-bio.org  Fri Oct  7 13:31:38 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Fri Oct  7 14:31:15 2005
Subject: [Biopython-dev] [Bug 1876] Bio.pairwise2 generates incorrect
	Needleman-Wunsch score_matrix
Message-ID: <200510071731.j97HVcsM006814@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1876


------- Comment #3 from bill@barnard-engineering.com  2005-10-07 13:31 -------
Created an attachment (id=241)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=241&action=view)
patch pairwise2.py so it returns the score_matrix

Run against local copy of pairwise2.py as shown:

% patch pairwise2.py pairwise2.patch


From bugzilla-daemon at portal.open-bio.org  Fri Oct  7 13:34:55 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Fri Oct  7 14:31:17 2005
Subject: [Biopython-dev] [Bug 1876] Bio.pairwise2 generates incorrect
	Needleman-Wunsch score_matrix
Message-ID: <200510071734.j97HYtqt006889@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1876


------- Comment #4 from bill@barnard-engineering.com  2005-10-07 13:34 -------
Created an attachment (id=242)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=242&action=view)
Spreadsheet generated comparison of Durbin vs Biopython score_matrix

This pdf shows graphically where the differences lie between the two methods.
It demonstrates that the erroneous entries all lie in those cells whose
traceback pointers point to either F(i-1,j) or F(i,j-1). All entries whose
traceback pointers include a pointer to F(i-1,j-1) are correct.


From bugzilla-daemon at portal.open-bio.org  Fri Oct  7 15:58:06 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Fri Oct  7 16:34:02 2005
Subject: [Biopython-dev] [Bug 1876] Bio.pairwise2 generates incorrect
	Needleman-Wunsch score_matrix
Message-ID: <200510071958.j97Jw65J010495@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1876


jchang@biopython.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED


------- Comment #5 from jchang@biopython.org  2005-10-07 15:58 -------
The two implementations use different algorithms, and the bookkeeping in the
score matrix is done differently.  For an illustration of the algorithm used in
Biopython, see:
http://www.maths.tcd.ie/~lily/pres2/sld006.htm

The score matrices are different and not comparable.  However, the final
alignment, score, and traceback should be the same.  Please let me know if they
are not.

I chose this algorithm because it was simpler for me to generate a correct
traceback.


From bugzilla-daemon at portal.open-bio.org  Fri Oct  7 18:38:57 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Fri Oct  7 19:31:14 2005
Subject: [Biopython-dev] [Bug 1876] Bio.pairwise2 generates incorrect
	Needleman-Wunsch score_matrix
Message-ID: <200510072238.j97McvtB012857@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1876


------- Comment #6 from bill@barnard-engineering.com  2005-10-07 18:38 -------
Since this (http://www.maths.tcd.ie/~lily/pres2/sld003.htm) algorithm is the
first entry found by googling "needleman wunsch", I should have read it more
carefully. It is now clear to me why the two score matrices are different. (In
this case they are tantalizingly similar...)

It might make a useful test for the module to create alignment tuples with the
Durbin algorithm and compare them with those produced by the "McLysaght"
algorithm. I'll consider creating some tests like that and contributing them.

I have not yet convinced myself the two algorithms are equivalent. Probably a
literature search in my local UC library would resolve that question. Do you
have a reference or web pointer to appropriate papers?

I would probably change this bug's status to "INVALID", which I presume is the
one for "Not a bug".


From bugzilla-daemon at portal.open-bio.org  Fri Oct  7 20:51:21 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Fri Oct  7 21:31:17 2005
Subject: [Biopython-dev] [Bug 1876] Bio.pairwise2 generates incorrect
	Needleman-Wunsch score_matrix
Message-ID: <200510080051.j980pLw1014842@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1876


jchang@biopython.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |INVALID


------- Comment #7 from jchang@biopython.org  2005-10-07 20:51 -------
The algorithms produce equivalent scores and alignments when the gap penalties
are linear.  However, the algorithm implemented in Biopython is more general
and can handle more exotic non-linear models of gap penalties.

It's been many years since I've looked at this, but IIRC the original
Needleman-Wunsch paper described the algorithm implemented in Biopython, and
the algorithm in Durbin is a refinement made later to increase its speed.  The
refinement is much faster [ O(NM) vs O(NNM) ].  In biopython, for the case of
affine gap penalties, the alignment algorithm in _make_score_matrix_fast is a
hybrid of the two approaches, that has O(NM) running time, while also being
easier (for me) to understand and debug.
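
To make the equivalence claim concrete, the sketch below compares a three-neighbour recurrence with a textbook general-gap recurrence that scans back over all gap lengths. This is not the pairwise2 code: the toy match/mismatch score, the function names and the gap-cost callables are assumptions chosen for illustration. The general version runs in cubic time, which is the price of accepting an arbitrary gap_cost(k).

```python
def score(a, b):
    """Toy substitution score (assumption): +2 for a match, -1 otherwise."""
    return 2 if a == b else -1


def linear_gap_score(x, y, d=3):
    """Quadratic-time recurrence; every gap position costs d."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -d * i
    for j in range(1, m + 1):
        F[0][j] = -d * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    return F[n][m]


def general_gap_score(x, y, gap_cost):
    """Cubic-time recurrence; a gap of length k costs gap_cost(k)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -gap_cost(i)
    for j in range(1, m + 1):
        F[0][j] = -gap_cost(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = F[i - 1][j - 1] + score(x[i - 1], y[j - 1])
            for k in range(1, i + 1):      # gap of length k opposite x
                best = max(best, F[i - k][j] - gap_cost(k))
            for k in range(1, j + 1):      # gap of length k opposite y
                best = max(best, F[i][j - k] - gap_cost(k))
            F[i][j] = best
    return F[n][m]


x, y = "HEAGAWGHEE", "PAWHEAE"
print(linear_gap_score(x, y, 3))
print(general_gap_score(x, y, lambda k: 3 * k))          # same model
print(general_gap_score(x, y, lambda k: 3 + (k - 1)))    # cheaper long gaps
```

With gap_cost(k) = 3k the two functions describe the same scoring model, so their final scores agree even though intermediate bookkeeping may differ between implementations.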


From bugzilla-daemon at portal.open-bio.org  Sun Oct  9 14:35:44 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Sun Oct  9 15:31:56 2005
Subject: [Biopython-dev] [Bug 1735] Bio.Blast.NCBIStandalone.BlastParser
	crashs with unusual alignment fragments
Message-ID: <200510091835.j99IZiwk007347@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1735


------- Comment #2 from mdehoon@ims.u-tokyo.ac.jp  2005-10-09 14:35 -------
Does this error also appear with the XML-based parser in NCBIXML? If not, it's
not worth fixing the text-based parser in NCBIStandalone.


From bugzilla-daemon at portal.open-bio.org  Wed Oct 12 13:40:25 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Wed Oct 12 14:31:41 2005
Subject: [Biopython-dev] [Bug 1876] Bio.pairwise2 generates incorrect
	Needleman-Wunsch score_matrix
Message-ID: <200510121740.j9CHePW7015206@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1876


------- Comment #8 from bill@barnard-engineering.com  2005-10-12 13:40 -------
FWIW your traceback algorithm works perfectly as is for the Gotoh/Durbin/NW
algorithm. The internal trace_matrix is somewhat different since each pointer
may only point to three possible cells rather than to the previous max score
cell (IIRC...).

Anyway I used your algorithm in my program. (I want to be sure I understand all
the basics before I move to something more challenging.) I tried both
extracting the portions I needed, and simply importing it directly; both work
perfectly for my example alignment.

I fooled around a bit with making test routines to compare the output of the
Gotoh algorithm to the NW algorithm. I learned a bit, but I don't think that
adding these tests to the test_pairwise2 module would really add anything
useful.

Thanks for making all this publicly available. I really like the way your
program uses the __call__ ==> decode methods to enable flexible use of the
alignment programs. It's opened my eyes to Python's capabilities.
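
The pattern being praised here can be sketched in a few lines: an object whose attribute names are decoded into a configured callable. The class names and the returned string are hypothetical, and the letter codes follow the pairwise2 naming scheme as I understand it (x/m/d/c for matches, x/s/d/c for gaps), so treat them as an assumption rather than the module's source.

```python
class _AlignmentFunction:
    """Callable configured by decoding its own name, e.g. 'globalms'."""

    MATCH_CODES = {"x": "identity", "m": "match/mismatch",
                   "d": "dictionary", "c": "callback"}
    GAP_CODES = {"x": "no penalty", "s": "same for both sequences",
                 "d": "different per sequence", "c": "callback"}

    def __init__(self, name):
        self.name = name
        self.mode, self.match, self.gap = self.decode(name)

    @classmethod
    def decode(cls, name):
        for mode in ("global", "local"):
            if name.startswith(mode) and len(name) == len(mode) + 2:
                match_code, gap_code = name[len(mode):]
                if match_code in cls.MATCH_CODES and gap_code in cls.GAP_CODES:
                    return (mode, cls.MATCH_CODES[match_code],
                            cls.GAP_CODES[gap_code])
        raise AttributeError(f"unknown alignment function {name!r}")

    def __call__(self, seq_a, seq_b):
        # A real implementation would align here; this sketch only reports
        # the options that were decoded from the method name.
        return (f"{self.mode} alignment of {seq_a} and {seq_b} "
                f"(match: {self.match}; gaps: {self.gap})")


class Align:
    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so any name such
        # as 'globalxx' or 'localms' becomes a configured callable.
        return _AlignmentFunction(name)


align = Align()
print(align.globalxx("ACGT", "ACT"))
```

Raising AttributeError for names that do not decode keeps hasattr() and similar introspection working as usual.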


From bugzilla-daemon at portal.open-bio.org  Wed Oct 12 16:43:14 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Wed Oct 12 17:31:43 2005
Subject: [Biopython-dev] [Bug 1876] Bio.pairwise2 generates incorrect
	Needleman-Wunsch score_matrix
Message-ID: <200510122043.j9CKhEPl020539@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1876


------- Comment #9 from jchang@biopython.org  2005-10-12 16:43 -------
That's good news.  I'm glad the code is useful for you!

Jeff


From Mark.Hoebeke at jouy.inra.fr  Mon Oct 17 10:07:13 2005
From: Mark.Hoebeke at jouy.inra.fr (Mark Hoebeke)
Date: Mon Oct 17 10:36:19 2005
Subject: [Biopython-dev] NCBIStandalone Blast HSP parsing
Message-ID: <4353B011.5030004@jouy.inra.fr>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi all,

I wanted a quick and easy way to determine the endpoints of HSPs extracted from
Blast reports parsed with NCBIStandalone. Unfortunately the HSP class lacks the
query_end and sbjct_end attributes. Googling around led me to a recipe
describing how to compute the endpoint using the total length, gap length and
other niceties. Not exactly intuitive to me.

Hence I dove into the NCBIStandalone and HSP modules and made some slight
modifications. Basically I added the two attributes to HSP and the following
snippets to NCBIStandalone (release 1.4b):

972c972
<     _query_re = re.compile(r"Query: (\d+)\s*(.+) (\d+)")
- ---
>     _query_re = re.compile(r"Query: (\d+)\s*(.+) \d")
977,978c977
<         start, seq, end = m.groups()
<       self._hsp.query_end=string.atoi(end);
- ---
>         start, seq = m.groups()
997,998c996,997
<         start, seq, end = _re_search(
<             r"Sbjct: (\d+)\s*(.+) (\d+)", line,
- ---
>         start, seq = _re_search(
>             r"Sbjct: (\d+)\s*(.+) \d", line,
1014c1013
<       self._hsp.sbjct_end=string.atoi(end)
- ---
>

Looks too easy to be true, I thought. Now sorry if I'm missing some important
issues here (I'm quite new to BioPython), but is there a reason no one has made
this patch yet?

Thanks for any comments (flames and others.)

Cheers,

Mark


- --
- ----------------------------Mark.Hoebeke@jouy.inra.fr-----------------------
Unité Statistique & Génome    _/_/_/    _/_/_/  http://stat.genopole.cnrs.fr
Tél : +33 (0)1 60 87 38 03  _/        _/          Fax : +33 (0)1 60 87 38 09
Tour Evry 2,                 _/_/    _/  _/_/         523, pl. des Terrasses
F-91000,                        _/  _/    _/                            Evry
PGP : A2AD52E3           _/_/_/      _/_/_/


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFDU7ARa3nTV6KtUuMRArBqAKC/m4i+VpVaU3clvOkMuYkfRrZQ+QCfbRKg
gBBW5wNKS3sb/Uqr31eumx8=
=vSWV
-----END PGP SIGNATURE-----
From mdehoon at c2b2.columbia.edu  Mon Oct 17 11:27:28 2005
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Mon Oct 17 11:33:34 2005
Subject: [Biopython-dev] NCBIStandalone Blast HSP parsing
Message-ID: <6CA15ADD82E5724F88CB53D50E61C9AE9ECD50@cgcmail.cgc.cpmc.columbia.edu>

Just to make sure I understand what you're doing:

Are the query_end and sbjct_end attributes found in the Blast output, or do
you calculate them from the other attributes in the Blast output? If they're
in the Blast output,
1) Do they always appear in the Blast output, or does it depend on the query?
In the latter case, does the modified Blast parser choke on Blast output that
does not contain these attributes?
2) Do these attributes also appear in Blast XML output? The XML parser is
easier to maintain than the text-based parser in NCBIStandalone, and may
therefore become the main Blast parser in Biopython in the long run.

--Michiel. 


Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032



From mdehoon at c2b2.columbia.edu  Mon Oct 17 13:51:21 2005
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Mon Oct 17 13:52:22 2005
Subject: [Biopython-dev] NCBIStandalone Blast HSP parsing
Message-ID: <6CA15ADD82E5724F88CB53D50E61C9AE9ECD54@cgcmail.cgc.cpmc.columbia.edu>

The current patch breaks the parser if the Blast output does not contain
query_end and sbjct_end. The problem seems to be in the line:
        start, seq, end = m.groups()
(traceback ends with
  File "/usr/local/lib/python2.4/site-packages/Bio/Blast/NCBIStandalone.py",
line 995, in query
    start, seq, end = m.groups()
ValueError: need more than 2 values to unpack).
But this should be easy to fix.
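
Without knowing which pattern produced the two-group match in the traceback, one defensive option is to make the end coordinate optional and return None when it is absent. The helper name parse_alignment_line, the combined Query/Sbjct pattern and the sample lines are assumptions of this sketch, not code from NCBIStandalone.

```python
import re

# The end coordinate is optional: the lazy (.+?) lets a trailing number be
# captured when present, and the last group is None when it is missing.
_ALIGN_LINE = re.compile(r"(Query|Sbjct): (\d+)\s*(.+?)(?:\s+(\d+))?\s*$")


def parse_alignment_line(line):
    """Return (label, start, sequence, end); end is None if not reported."""
    m = _ALIGN_LINE.match(line)
    if m is None:
        raise ValueError(f"not an alignment line: {line!r}")
    label, start, seq, end = m.groups()
    return label, int(start), seq, int(end) if end is not None else None


print(parse_alignment_line("Query: 1    ATGCGT-ACGT 11"))
print(parse_alignment_line("Sbjct: 7    ATGCGTTACGT"))
```

Because the pattern always has four groups, the unpacking can no longer fail with a ValueError; callers only need to check whether the end value is None.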

--Michiel.

Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032


-----Original Message-----
From: Mark Hoebeke [mailto:Mark.Hoebeke@jouy.inra.fr]
Sent: Mon 10/17/2005 1:05 PM
To: Michiel De Hoon
Cc: biopython-dev@biopython.org
Subject: Re: [Biopython-dev] NCBIStandalone Blast HSP parsing
 
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Michiel De Hoon wrote:
> Just to make sure I understand what you're doing:
> 
> Are the query_end and sbjct_end attributes found in the Blast output, or do
> you calculate them from the other attributes in the Blast output? 

I directly grab them from the Blast report.

> If they're in the Blast output,
> 1) Do they always appear in the Blast output, or does it depend on the
> query? In the latter case, does the modified Blast parser choke on Blast
> output that does not contain these attributes?

The patterns in the official release 1.4b module check for "a single
digit" following the string of sequence characters at the end of the
alignment lines.

All I did was to extend the patterns to "one or more digits" and to
capture them in order to store their contents in the HSP attributes. So
AFAIK, the patch does not change the way reports are currently parsed.
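
The difference between the two patterns can be seen on a single alignment line. The sample line below is made up, and using search() is an assumption, since the diff does not show which method the parser calls on the compiled pattern.

```python
import re

line = "Query: 1    ATGCGT-ACGT 11"  # made-up alignment line

# Release pattern: requires one digit after the sequence but does not keep it.
old_re = re.compile(r"Query: (\d+)\s*(.+) \d")
# Patched pattern: captures the whole trailing number as a third group.
new_re = re.compile(r"Query: (\d+)\s*(.+) (\d+)")

old_groups = old_re.search(line).groups()
new_groups = new_re.search(line).groups()
print(old_groups)
print(new_groups)
```

Both patterns accept the same lines; the patched one simply yields one more group, which is why code that unpacks the groups has to change along with the pattern.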

> 2) Do these attributes also appear in Blast XML output? The XML parser is
> easier to maintain than the text-based parser in NCBIStandalone, and may
> therefore become the main Blast parser in Biopython in the long run.

With the sequence set I'm currently working on (and with NCBI Blast
2.2.12), the XML output has indeed the following elements : Hsp_query-to
and Hsp_hit-to which seem to have the intended meaning.

I suppose I should be able to  adapt the XML parser while I'm on it, if
it is officially accepted.
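
As a rough illustration of what reading those elements involves, here is a standard-library sketch that pulls the HSP coordinates out of a BLAST-style XML fragment. The fragment is hand-written with made-up values, the function name is hypothetical, and the Hsp_query-from / Hsp_hit-from names are assumed by analogy with the two elements mentioned above.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in the shape of BLAST XML output; values are made up.
SAMPLE = """\
<Hit_hsps>
  <Hsp>
    <Hsp_num>1</Hsp_num>
    <Hsp_query-from>3</Hsp_query-from>
    <Hsp_query-to>42</Hsp_query-to>
    <Hsp_hit-from>101</Hsp_hit-from>
    <Hsp_hit-to>140</Hsp_hit-to>
  </Hsp>
</Hit_hsps>
"""

COORD_TAGS = ("Hsp_query-from", "Hsp_query-to", "Hsp_hit-from", "Hsp_hit-to")


def hsp_ranges(xml_text):
    """Yield (query_start, query_end, hit_start, hit_end) for every Hsp."""
    root = ET.fromstring(xml_text)
    for hsp in root.iter("Hsp"):
        yield tuple(int(hsp.findtext(tag)) for tag in COORD_TAGS)


print(list(hsp_ranges(SAMPLE)))
```

An adapted XML parser would presumably store the two "-to" values in the same query_end and sbjct_end attributes that the text-parser patch adds.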

Mark

> 
> --Michiel. 
> 
> 
> 
> Michiel de Hoon
> Center for Computational Biology and Bioinformatics
> Columbia University
> 1150 St Nicholas Avenue
> New York, NY 10032
> 
> 
> 
> -----Original Message-----
> From: biopython-dev-bounces@portal.open-bio.org on behalf of Mark Hoebeke
> Sent: Mon 10/17/2005 10:07 AM
> To: biopython-dev@biopython.org
> Subject: [Biopython-dev] NCBIStandalone Blast HSP parsing
>  
> Hi all,
> 
> I wanted a quick and easy way to determine the endpoints of HSPs extraced
> from
> Blast reports parser with NCBIStandalone. Unfortunately the HSP class lacks
> the
> query_end and sbjct_end attributes. Googling around led me to a recipe
> describing how to compute the endpoint using the total length, gap length
and
> other niceties. Not exactly intuitive to me.
> 
> Hence I dove into the NCBIStandalone and HSP modules and made some slight
> modifications. Basically I added the two attributes to HSP and the
following
> snippets to NCBIStandalone (release 1.4b):
> 
> 972c972
> <     _query_re = re.compile(r"Query: (\d+)\s*(.+) (\d+)")
> ---
> 
>>>    _query_re = re.compile(r"Query: (\d+)\s*(.+) \d")
> 
> 977,978c977
> <         start, seq, end = m.groups()
> <       self._hsp.query_end=string.atoi(end);
> ---
> 
>>>        start, seq = m.groups()
> 
> 997,998c996,997
> <         start, seq, end = _re_search(
> <             r"Sbjct: (\d+)\s*(.+) (\d+)", line,
> ---
> 
>>>        start, seq = _re_search(
>>>            r"Sbjct: (\d+)\s*(.+) \d", line,
> 
> 1014c1013
> <       self._hsp.sbjct_end=string.atoi(end)
> ---
> 
> 
> Looks to easy to be true, I thought. Now sorry if I'm missing some
important
> issues here (I'm quite new to BioPython), but is there a reason no one has
> made
> this patch yet ?
> 
> Thanks for any comments (flames and others.)
> 
> Cheers,
> 
> Mark
> 
> 
> --
> ----------------------------Mark.Hoebeke@jouy.inra.fr-----------------------
> Unité Statistique & Génome    _/_/_/    _/_/_/  http://stat.genopole.cnrs.fr
> Tél : +33 (0)1 60 87 38 03  _/        _/          Fax : +33 (0)1 60 87 38 09
> Tour Evry 2,                 _/_/    _/  _/_/         523, pl. des Terrasses
> F-91000,                        _/  _/    _/                            Evry
> PGP : A2AD52E3           _/_/_/      _/_/_/
> 
> 
> 
> 
_______________________________________________
Biopython-dev mailing list
Biopython-dev@biopython.org
http://biopython.org/mailman/listinfo/biopython-dev

--
-------------------------Mark.Hoebeke@jouy.inra.fr---------------------
Unité Statistique & Génome                                    Unité MIG
+33 (0)1 60 87 38 03                   Tél.        +33 (0)1 34 65 28 85
+33 (0)1 60 87 38 09                   Fax.        +33 (0)1 34 65 29 01
Tour Evry 2, 523 pl. des Terrasses            INRA - Domaine de Vilvert
F - 91000 Evry                            F - 78352 Jouy-en-Josas CEDEX
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFDU9nxa3nTV6KtUuMRApqXAJ9a9z7J0bvigZ1NiZZxmTUziMocIgCdE0O9
EvX5Bm6f7dMcAUFGfNIO8tk=
=mWo3
-----END PGP SIGNATURE-----


From bugzilla-daemon at portal.open-bio.org  Mon Oct 17 13:53:07 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Mon Oct 17 14:31:57 2005
Subject: [Biopython-dev] [Bug 1715] Bio.Blast.NCBIStandalone does not
	support standalone NCBI RPS-Blast (rpsblast) output
Message-ID: <200510171753.j9HHr78k000803@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1715


------- Comment #10 from mdehoon@ims.u-tokyo.ac.jp  2005-10-17 13:53 -------
Is this modification of NCBIStandalone.py still relevant for Biopython 1.40b?
If so, could you submit a patch (instead of an edited version of
NCBIStandalone.py)? Also, the edited version contains many differences to the
CVS version that do not seem relevant for rpsblast (differences in tabs versus
spaces, for example) that make it difficult to assess this patch.


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From Mark.Hoebeke at jouy.inra.fr  Mon Oct 17 13:05:53 2005
From: Mark.Hoebeke at jouy.inra.fr (Mark Hoebeke)
Date: Mon Oct 17 14:52:26 2005
Subject: [Biopython-dev] NCBIStandalone Blast HSP parsing
In-Reply-To: <6CA15ADD82E5724F88CB53D50E61C9AE9ECD50@cgcmail.cgc.cpmc.columbia.edu>
References: <6CA15ADD82E5724F88CB53D50E61C9AE9ECD50@cgcmail.cgc.cpmc.columbia.edu>
Message-ID: <4353D9F1.4050607@jouy.inra.fr>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Michiel De Hoon wrote:
> Just to make sure I understand what you're doing:
> 
> Are the query_end and sbjct_end attributes found in the Blast output, or do
> you calculate them from the other attributes in the Blast output? 

I directly grab them from the Blast report.

>If they're
> in the Blast output,
> 1) Do they always appear in the Blast output, or does it depend on the query?
> In the latter case, does the modified Blast parser choke on Blast output that
> do not contain these attributes?

The patterns in the official release 1.4b module check for "a single
digit" following the string of sequence characters at the end of the
alignment lines.

All I did was to extend the patterns to "one or more digits" and to
capture them in order to store their contents in the HSP attributes. So
AFAIK, the patch does not change the way reports are currently parsed.
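
To make the difference between the two patterns concrete, here is a minimal sketch using only Python's re module. The two regular expressions are the ones from the diff quoted below; the alignment line is invented for illustration, and the variable names are not those used in NCBIStandalone. The last statement cross-checks the captured end position against a start-plus-ungapped-length calculation, which is presumably the kind of computation the recipe mentioned in the original message performs.

```python
import re

# Pattern as in release 1.4b: only checks for one digit after the sequence.
old_query_re = re.compile(r"Query: (\d+)\s*(.+) \d")
# Patched pattern: captures the whole trailing number as the end position.
new_query_re = re.compile(r"Query: (\d+)\s*(.+) (\d+)")

# Hypothetical alignment line in the style of a BLAST text report.
line = "Query: 12   MKVLAAGIVG-LLLA 25"

old_match = old_query_re.search(line)
new_match = new_query_re.search(line)

# Both patterns yield the same start and sequence; the new one adds the end.
start, seq, end = new_match.groups()
query_start, query_end = int(start), int(end)

# Cross-check: start + number of residues (gaps excluded) - 1.
computed_end = query_start + len(seq.replace("-", "")) - 1
```

Both patterns match the same lines and extract the same start and sequence, so the capture only adds information. This is consistent with the claim above that existing reports keep parsing as before.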

> 2) Do these attributes also appear in Blast XML output? The XML parser is
> easier to maintain than the text-based parser in NCBIStandalone, and may
> therefore become the main Blast parser in Biopython in the long run.

With the sequence set I'm currently working on (and with NCBI Blast
2.2.12), the XML output has indeed the following elements : Hsp_query-to
and Hsp_hit-to which seem to have the intended meaning.
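
As a rough illustration of reading those two elements, the sketch below uses Python's standard xml.etree.ElementTree on a hand-written fragment. Only the element names Hsp_query-to and Hsp_hit-to come from the report described above. The enclosing Hit_hsps and Hsp elements are assumed from the NCBI BLAST XML layout, the numbers are invented, and this is not the NCBIXML parser itself.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration; a real report has many more elements.
xml_fragment = """
<Hit_hsps>
  <Hsp>
    <Hsp_num>1</Hsp_num>
    <Hsp_query-to>25</Hsp_query-to>
    <Hsp_hit-to>114</Hsp_hit-to>
  </Hsp>
</Hit_hsps>
"""

root = ET.fromstring(xml_fragment)

# Collect (query_end, sbjct_end) for every HSP in the fragment.
endpoints = []
for hsp in root.iter("Hsp"):
    query_end = int(hsp.findtext("Hsp_query-to"))
    sbjct_end = int(hsp.findtext("Hsp_hit-to"))
    endpoints.append((query_end, sbjct_end))
```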

I suppose I should be able to  adapt the XML parser while I'm on it, if
it is officially accepted.

Mark

> 
> --Michiel. 
> 
> 
> 
> Michiel de Hoon
> Center for Computational Biology and Bioinformatics
> Columbia University
> 1150 St Nicholas Avenue
> New York, NY 10032
> 
> 
> 
> -----Original Message-----
> From: biopython-dev-bounces@portal.open-bio.org on behalf of Mark Hoebeke
> Sent: Mon 10/17/2005 10:07 AM
> To: biopython-dev@biopython.org
> Subject: [Biopython-dev] NCBIStandalone Blast HSP parsing
>  
> Hi all,
> 
> I wanted a quick and easy way to determine the endpoints of HSPs extracted
> from Blast reports parsed with NCBIStandalone. Unfortunately the HSP class
> lacks the query_end and sbjct_end attributes. Googling around led me to a
> recipe describing how to compute the endpoint using the total length, gap
> length and other niceties. Not exactly intuitive to me.
> 
> Hence I dove into the NCBIStandalone and HSP modules and made some slight
> modifications. Basically I added the two attributes to HSP and the following
> snippets to NCBIStandalone (release 1.4b):
> 
> 972c972
> <     _query_re = re.compile(r"Query: (\d+)\s*(.+) (\d+)")
> ---
> >     _query_re = re.compile(r"Query: (\d+)\s*(.+) \d")
> 977,978c977
> <         start, seq, end = m.groups()
> <       self._hsp.query_end=string.atoi(end);
> ---
> >         start, seq = m.groups()
> 997,998c996,997
> <         start, seq, end = _re_search(
> <             r"Sbjct: (\d+)\s*(.+) (\d+)", line,
> ---
> >         start, seq = _re_search(
> >             r"Sbjct: (\d+)\s*(.+) \d", line,
> 1014c1013
> <       self._hsp.sbjct_end=string.atoi(end)
> ---
> 
> 
> Looks too easy to be true, I thought. Now sorry if I'm missing some important
> issues here (I'm quite new to BioPython), but is there a reason no one has
> made this patch yet?
> 
> Thanks for any comments (flames and others.)
> 
> Cheers,
> 
> Mark
> 
> 
> --
> ----------------------------Mark.Hoebeke@jouy.inra.fr-----------------------
> Unité Statistique & Génome    _/_/_/    _/_/_/  http://stat.genopole.cnrs.fr
> Tél : +33 (0)1 60 87 38 03  _/        _/          Fax : +33 (0)1 60 87 38 09
> Tour Evry 2,                 _/_/    _/  _/_/         523, pl. des Terrasses
> F-91000,                        _/  _/    _/                            Evry
> PGP : A2AD52E3           _/_/_/      _/_/_/

--
-------------------------Mark.Hoebeke@jouy.inra.fr---------------------
Unité Statistique & Génome                                    Unité MIG
+33 (0)1 60 87 38 03                   Tél.        +33 (0)1 34 65 28 85
+33 (0)1 60 87 38 09                   Fax.        +33 (0)1 34 65 29 01
Tour Evry 2, 523 pl. des Terrasses            INRA - Domaine de Vilvert
F - 91000 Evry                            F - 78352 Jouy-en-Josas CEDEX
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFDU9nxa3nTV6KtUuMRApqXAJ9a9z7J0bvigZ1NiZZxmTUziMocIgCdE0O9
EvX5Bm6f7dMcAUFGfNIO8tk=
=mWo3
-----END PGP SIGNATURE-----
From y.benita at wanadoo.nl  Mon Oct 17 19:45:47 2005
From: y.benita at wanadoo.nl (Yair Benita)
Date: Mon Oct 17 20:15:13 2005
Subject: [Biopython-dev] NCBIStandalone Blast HSP parsing
In-Reply-To: <BF79FC02.8FFE%y.benita@wanadoo.nl>
Message-ID: <BF7A044B.9000%y.benita@wanadoo.nl>

Hi Michael,
This issue has already been fixed. In the last review of NCBIstandalone I
made with Jeff Chang the query_end and sbjct_end were added.
Just grab the latest NCBIstandalone version from CVS.

Yair

> From: Mark Hoebeke <Mark.Hoebeke@jouy.inra.fr>
> Organization: INRA - MIA
> Date: Mon, 17 Oct 2005 16:07:13 +0200
> To: <biopython-dev@biopython.org>
> Subject: [Biopython-dev] NCBIStandalone Blast HSP parsing
> 
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Hi all,
> 
> I wanted a quick and easy way to determine the endpoints of HSPs extracted
> from Blast reports parsed with NCBIStandalone. Unfortunately the HSP class
> lacks the query_end and sbjct_end attributes. Googling around led me to a
> recipe describing how to compute the endpoint using the total length, gap
> length and other niceties. Not exactly intuitive to me.
> 
> Hence I dove into the NCBIStandalone and HSP modules and made some slight
> modifications. Basically I added the two attributes to HSP and the following
> snippets to NCBIStandalone (release 1.4b):
> 
> 972c972
> <     _query_re = re.compile(r"Query: (\d+)\s*(.+) (\d+)")
> ---
> >     _query_re = re.compile(r"Query: (\d+)\s*(.+) \d")
> 977,978c977
> <         start, seq, end = m.groups()
> <       self._hsp.query_end=string.atoi(end);
> ---
> >         start, seq = m.groups()
> 997,998c996,997
> <         start, seq, end = _re_search(
> <             r"Sbjct: (\d+)\s*(.+) (\d+)", line,
> ---
> >         start, seq = _re_search(
> >             r"Sbjct: (\d+)\s*(.+) \d", line,
> 1014c1013
> <       self._hsp.sbjct_end=string.atoi(end)
> ---
> >
> 
> Looks too easy to be true, I thought. Now sorry if I'm missing some important
> issues here (I'm quite new to BioPython), but is there a reason no one has
> made this patch yet?
> 
> Thanks for any comments (flames and others.)
> 
> Cheers,
> 
> Mark
> 
> 
> --
> ----------------------------Mark.Hoebeke@jouy.inra.fr-----------------------
> Unité Statistique & Génome    _/_/_/    _/_/_/  http://stat.genopole.cnrs.fr
> Tél : +33 (0)1 60 87 38 03  _/        _/          Fax : +33 (0)1 60 87 38 09
> Tour Evry 2,                 _/_/    _/  _/_/         523, pl. des Terrasses
> F-91000,                        _/  _/    _/                            Evry
> PGP : A2AD52E3           _/_/_/      _/_/_/
> 
> 
> 
> 
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2 (GNU/Linux)
> Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org
> 
> iD8DBQFDU7ARa3nTV6KtUuMRArBqAKC/m4i+VpVaU3clvOkMuYkfRrZQ+QCfbRKg
> gBBW5wNKS3sb/Uqr31eumx8=
> =vSWV
> -----END PGP SIGNATURE-----


From Mark.Hoebeke at jouy.inra.fr  Tue Oct 18 00:47:12 2005
From: Mark.Hoebeke at jouy.inra.fr (Mark Hoebeke)
Date: Tue Oct 18 00:47:32 2005
Subject: [Biopython-dev] NCBIStandalone Blast HSP parsing
In-Reply-To: <BF7A044B.9000%y.benita@wanadoo.nl>
References: <BF7A044B.9000%y.benita@wanadoo.nl>
Message-ID: <43547E50.6010600@jouy.inra.fr>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Yair Benita wrote:
> Hi Michael,
> This issue has already been fixed. In the last review of NCBIstandalone I
> made with Jeff Chang the query_end and sbjct_end were added.
> Just grab the latest NCBIstandalone version from CVS.
> 
> Yair
> 

The patch is indeed cleaner than my submission.

Sorry I didn't check beforehand.

Many thanks.

Mark


--
-------------------------Mark.Hoebeke@jouy.inra.fr---------------------
Unité Statistique & Génome                                    Unité MIG
+33 (0)1 60 87 38 03                   Tél.        +33 (0)1 34 65 28 85
+33 (0)1 60 87 38 09                   Fax.        +33 (0)1 34 65 29 01
Tour Evry 2, 523 pl. des Terrasses            INRA - Domaine de Vilvert
F - 91000 Evry                            F - 78352 Jouy-en-Josas CEDEX
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFDVH5Qa3nTV6KtUuMRAladAKCnPCTfc1ZVRTjSlcS04EvfYlRShACfQIF7
CFKgGBooaWKQnCWunjuespo=
=Z1fd
-----END PGP SIGNATURE-----
From bugzilla-daemon at portal.open-bio.org  Tue Oct 18 11:53:07 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Tue Oct 18 12:32:39 2005
Subject: [Biopython-dev] [Bug 1715] Bio.Blast.NCBIStandalone does not
	support standalone NCBI RPS-Blast (rpsblast) output
Message-ID: <200510181553.j9IFr7ud025687@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1715


------- Comment #11 from biopython-bugzilla@maubp.freeserve.co.uk  2005-10-18 11:53 -------
I have just had another look at this code of mine...

It turns out that the plain text output from RPS-BLAST 2.2.12 is slightly
different to that from 2.2.10 which I was using.

I'm wondering about supporting the RPS-BLAST XML output instead, as Michiel de
Hoon has indicated a preference to move to this in the future...

http://www.biopython.org/pipermail/biopython-dev/2005-October/002130.html


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From mdehoon at c2b2.columbia.edu  Tue Oct 18 20:26:00 2005
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Tue Oct 18 20:25:01 2005
Subject: [Biopython-dev] Bug 1741
Message-ID: <6CA15ADD82E5724F88CB53D50E61C9AE9ECD57@cgcmail.cgc.cpmc.columbia.edu>

Hi everybody,

Bug #1741 reports that the Fasta consumer example in the
tutorial (and the corresponding example in Doc/examples) no longer works. The
reason behind this is that the Fasta parser switched to Martel in revision
1.9 of Fasta/__init__.py, and therefore we no longer have a _Scanner class in
Fasta/__init__.py, which causes the example in the tutorial to fail. So my
question is, is Section 2.4.2 in the tutorial still relevant? If so, does
anybody understand Fasta well enough to be able to fix it? If not, can we get
rid of it?
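
For readers unfamiliar with the pattern the tutorial example relied on: a scanner walks through the file and calls event methods on a consumer, which decides what to keep. Below is a self-contained sketch of that idea in plain Python. The class names SimpleFastaScanner and TitleCollector and the event method names are assumptions made for illustration; this is neither the removed _Scanner class nor the Martel-based parser.

```python
import io


class TitleCollector:
    """Hypothetical consumer: records the title of every record it sees."""

    def __init__(self):
        self.titles = []

    def start_sequence(self):
        pass

    def title(self, line):
        # Drop the leading '>' and surrounding whitespace.
        self.titles.append(line[1:].strip())

    def sequence(self, line):
        pass

    def end_sequence(self):
        pass


class SimpleFastaScanner:
    """Hypothetical scanner: walks a handle and fires events on a consumer."""

    def feed(self, handle, consumer):
        in_record = False
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            if line.startswith(">"):
                # A new title line closes the previous record, if any.
                if in_record:
                    consumer.end_sequence()
                consumer.start_sequence()
                consumer.title(line)
                in_record = True
            elif in_record:
                consumer.sequence(line)
        if in_record:
            consumer.end_sequence()


handle = io.StringIO(">seq1 first example\nACGT\nTTGA\n>seq2\nGGCC\n")
collector = TitleCollector()
SimpleFastaScanner().feed(handle, collector)
```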

--Michiel.

Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032


From betainverse at gmail.com  Wed Oct 19 12:54:03 2005
From: betainverse at gmail.com (Katie Edmonds)
Date: Wed Oct 19 14:41:28 2005
Subject: [Biopython-dev] KEGG questions
Message-ID: <8e76d5310510190954y6d989857qcc367c8751e5a6c7@mail.gmail.com>

I've been working on trying to make the KEGG Compound module useable. 
Before I spend more time on it, I'd like to make sure there isn't a
more recent version than the one I see in cvs from 2001
(compound_format.py) and 2004 (__init__.py).

I'm also curious how nice a new version should be before it's
reasonable to submit it.  At this point, I've added fields to
compound_format.py that apparently didn't exist in the past, so that
the parser will at least not crash, but I don't really understand
Martel well enough to get it to parse any of the multiline fields as
one would like (the best I've done so far with ENZYME, for example,
misses all the enzymes that don't have a role listed after the enzyme
id).  Similarly, all I've successfully done so far to __init__.py is
to add support for compound mass.

 Would it be appropriate for me to submit my changes at this point? 
Or would it be best if I kept my changes to myself until I can make
the compound_format work in a more general and appropriate way?
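
To show the two problems described above (multiline fields, and enzyme ids with or without a role), here is a small sketch in plain Python rather than Martel. The record is invented and only loosely follows the style of a KEGG flat file, so the real field layout may differ. The function name parse_fields and the regular expression are illustrative assumptions, not part of Bio.KEGG.

```python
import re

# Hypothetical record; values are invented and the real layout may differ.
record = """\
ENTRY       C00000
NAME        Example compound
MASS        180.06
ENZYME      1.1.1.1 (R)     1.1.1.2
            2.7.1.1 (S)     3.1.3.9
///
"""


def parse_fields(text):
    """Group continuation lines (those starting with whitespace) under
    the most recent keyword; stop at the '///' record terminator."""
    fields = {}
    key = None
    for line in text.splitlines():
        if line.startswith("///"):
            break
        if not line.strip():
            continue
        if line[0].isspace():
            if key is not None:
                fields[key].append(line.strip())
        else:
            key, _, value = line.partition(" ")
            fields[key] = [value.strip()]
    return fields


# An enzyme id, optionally followed by a role in parentheses.
enzyme_re = re.compile(r"(\d+\.\d+\.\d+\.\d+)(?:\s+\((\w+)\))?")

fields = parse_fields(record)
enzymes = enzyme_re.findall(" ".join(fields["ENZYME"]))
mass = float(fields["MASS"][0])
```

Making the role group optional is what lets ids without a role through; for those entries the second item of the tuple is an empty string.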

Thanks,

Katie

From bugzilla-daemon at portal.open-bio.org  Wed Oct 19 14:58:47 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Wed Oct 19 15:32:01 2005
Subject: [Biopython-dev] [Bug 1745] Genbank parser and REGION fields
Message-ID: <200510191858.j9JIwlmD032684@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1745


mdehoon@ims.u-tokyo.ac.jp changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
          Component|Martel/Mindy                |Main Distribution
         OS/Version|SunOS                       |All
           Platform|Sun                         |All
         Resolution|                            |FIXED


------- Comment #4 from mdehoon@ims.u-tokyo.ac.jp  2005-10-19 14:58 -------
Fixed in CVS, please try it out to make sure it works.


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From mdehoon at c2b2.columbia.edu  Wed Oct 19 19:23:15 2005
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Wed Oct 19 19:22:16 2005
Subject: [Biopython-dev] Biopython new release coming up
Message-ID: <6CA15ADD82E5724F88CB53D50E61C9AE9ECD5A@cgcmail.cgc.cpmc.columbia.edu>

Hi everybody,

It's been a while since our latest Biopython release. I'm planning to put
together a new release (version 1.41, code-named "Manhattan") and hoping to
release it by the end of next week. So if you have some code sitting around
that is stable enough for a Biopython distribution, this would be a good time
to commit it to CVS.
Currently, the following tests fail:
  test_MEME ---> because we don't have the test_MEME output file for
comparison
  test_Nexus
  test_PDB
  test_Registry
  test_SCOP_Astral
So actually, we're in pretty good shape. If one of these modules is yours,
please have a look at them and (if possible) try to fix them.

Also, in Bugzilla, 14 bugs are still open, please have a look to see if there
is something we can do about them.

Finally, if a module was significantly changed since the previous release (18
February 2005), or if a module has been added since then, it would be nice to
have a summary for the web page.

Good luck everybody, and may the gods of software development be on our side.

--Michiel.


Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032


From biopython-dev at maubp.freeserve.co.uk  Thu Oct 20 13:40:19 2005
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Thu Oct 20 14:22:32 2005
Subject: [Biopython-dev] Biopython new release coming up
In-Reply-To: <6CA15ADD82E5724F88CB53D50E61C9AE9ECD5A@cgcmail.cgc.cpmc.columbia.edu>
References: <6CA15ADD82E5724F88CB53D50E61C9AE9ECD5A@cgcmail.cgc.cpmc.columbia.edu>
Message-ID: <4357D683.4070508@maubp.freeserve.co.uk>

Michiel De Hoon wrote:
> Hi everybody,
> 
> It's been a while since our latest Biopython release. I'm planning to put
> together a new release (version 1.41, code-named "Manhattan") and hoping to
> release it by the end of next week. So if you have some code sitting around
> that is stable enough for a Biopython distribution, this would be a good time
> to commit it to CVS.
> Currently, the following tests fail:

In BioPython 1.40b there were no tests for NCBIXML.py

Do any of you have a collection of XML blast output?

Ideally the test set should cover both the online and standalone 
versions of blast, the different programs (blastn, blastp etc), and a 
range of recent releases (e.g. 2.2.10 to 2.2.12 say).

I have prepared a set of XML from the current blast webserver, attached 
to this email:

'xbt001', # BLASTP 2.2.12, gi|49176427|ref|NP_418280.3|
'xbt002', # BLASTN 2.2.12, gi|1348916|gb|G26684.1|G26684
'xbt003', # BLASTX 2.2.12, gi|1347369|gb|G25137.1|G25137
'xbt004', # TBLASTN 2.2.12, gi|729325|sp|P39483|DHG2_BACME
'xbt005', # TBLASTX 2.2.12, gi|1348853|gb|G26621.1|G26621, BLOSUM80

[I assume the existing tests are bt### for blast test ###, so mine are 
xbt### for XML blast test ###]

I have also attempted to create a test_NCBIXML.py based on Jeffrey 
Chang's test_NCBIStandalone.py but most of it doesn't seem to apply 
(e.g. there is no _scanner, and as far as I can tell, 
rec.multiple_alignment does not apply to XML blast data).

Any comments?

Once this is pinned down, I want to try dealing with some 
RPS-BLAST XML output with NCBIXML.py...

Peter
-------------- next part --------------
A non-text attachment was scrubbed...
Name: xml_blast_test_files.zip
Type: application/x-zip-compressed
Size: 97083 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/biopython-dev/attachments/20051020/ecc4ff24/xml_blast_test_files-0001.bin
From mdehoon at c2b2.columbia.edu  Thu Oct 20 16:08:57 2005
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Thu Oct 20 16:10:03 2005
Subject: [Biopython-dev] Biopython new release coming up
Message-ID: <6CA15ADD82E5724F88CB53D50E61C9AE9ECD60@cgcmail.cgc.cpmc.columbia.edu>

Thanks, Peter!
I have added these files to CVS and written a simple test_NCBIXML.py for it
(also in CVS).

--Michiel.

Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032


-----Original Message-----
From: Peter [mailto:biopython-dev@maubp.freeserve.co.uk]
Sent: Thu 10/20/2005 1:40 PM
To: Michiel De Hoon
Cc: biopython-dev@biopython.org
Subject: Re: [Biopython-dev] Biopython new release coming up
 
Michiel De Hoon wrote:
> Hi everybody,
> 
> It's been a while since our latest Biopython release. I'm planning to put
> together a new release (version 1.41, code-named "Manhattan") and hoping to
> release it by the end of next week. So if you have some code sitting around
> that is stable enough for a Biopython distribution, this would be a good time
> to commit it to CVS.
> Currently, the following tests fail:

In BioPython 1.40b there were no tests for NCBIXML.py

Do any of you have a collection of XML blast output?

Ideally the test set should cover both the online and standalone 
versions of blast, the different programs (blastn, blastp etc), and a 
range of recent releases (e.g. 2.2.10 to 2.2.12 say).

I have prepared a set of XML from the current blast webserver, attached 
to this email:

'xbt001', # BLASTP 2.2.12, gi|49176427|ref|NP_418280.3|
'xbt002', # BLASTN 2.2.12, gi|1348916|gb|G26684.1|G26684
'xbt003', # BLASTX 2.2.12, gi|1347369|gb|G25137.1|G25137
'xbt004', # TBLASTN 2.2.12, gi|729325|sp|P39483|DHG2_BACME
'xbt005', # TBLASTX 2.2.12, gi|1348853|gb|G26621.1|G26621, BLOSUM80

[I assume the existing tests are bt### for blast test ###, so mine are 
xbt### for XML blast test ###]

I have also attempted to create a test_NCBIXML.py based on Jeffrey 
Chang's test_NCBIStandalone.py but most of it doesn't seem to apply 
(e.g. there is no _scanner, and as far as I can tell, 
rec.multiple_alignment does not apply to XML blast data).

Any comments?

Once this is pinned down, I want to try dealing with some 
RPS-BLAST XML output with NCBIXML.py...

Peter


From mdehoon at c2b2.columbia.edu  Thu Oct 20 18:17:00 2005
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Thu Oct 20 18:16:03 2005
Subject: [Biopython-dev] KEGG questions
Message-ID: <6CA15ADD82E5724F88CB53D50E61C9AE9ECD62@cgcmail.cgc.cpmc.columbia.edu>

> I've been working on trying to make the KEGG Compound module useable. 
Great! Thanks!

> Before I spend more time on it, I'd like to make sure there isn't a
> more recent version than the one I see in cvs from 2001
> (compound_format.py) and 2004 (__init__.py).

Are you familiar with the KEGG API?
http://www.genome.jp/kegg/soap/doc/keggapi_manual.html
This is what the Bioruby folks are using; it may be useful for Biopython
also.

> Would it be appropriate for me to submit my changes at this point? 
> Or would it be best if I kept my changes to myself until I can make
> the compound_format work in a more general and appropriate way?

If you don't have CVS access, please submit changes through the "Bugs" link
on the Biopython website. I'll look at them after the upcoming 1.41 release.
If you do have CVS access, please hold off committing to CVS for a week or so
until the new release is out (to make sure only the stable code gets into the
release).

--Michiel.


Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032


From sbassi at gmail.com  Thu Oct 20 17:34:49 2005
From: sbassi at gmail.com (Sebastian Bassi)
Date: Thu Oct 20 20:26:31 2005
Subject: [Biopython-dev] Biopython new release coming up
In-Reply-To: <6CA15ADD82E5724F88CB53D50E61C9AE9ECD5A@cgcmail.cgc.cpmc.columbia.edu>
References: <6CA15ADD82E5724F88CB53D50E61C9AE9ECD5A@cgcmail.cgc.cpmc.columbia.edu>
Message-ID: <b43bf2080510201434p37bc923eufe60eb382e699655@mail.gmail.com>

On 10/19/05, Michiel De Hoon <mdehoon@c2b2.columbia.edu> wrote:
> It's been a while since our latest Biopython release. I'm planning to put
> together a new release (version 1.41, code-named "Manhattan") and hoping to
> release it by the end of next week. So if you have some code sitting around
> that is stable enough for a Biopython distribution, this would be a good time
> to commit it to CVS.

Just to let you know that Jeronome (j.pansanel@pansanel.net) is
working on a VNTI file parser.

Best regards,
SB.


From bugzilla-daemon at portal.open-bio.org  Fri Oct 21 13:08:52 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Fri Oct 21 13:32:12 2005
Subject: [Biopython-dev] [Bug 1885] KEGG Compound db format changes
Message-ID: <200510211708.j9LH8q6L003537@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1885


------- Comment #1 from edmonds@fas.harvard.edu  2005-10-21 13:08 -------
Created an attachment (id=243)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=243&action=view)
New test cases for the KEGG compound format


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org  Fri Oct 21 13:11:40 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Fri Oct 21 13:32:13 2005
Subject: [Biopython-dev] [Bug 1885] KEGG Compound db format changes
Message-ID: <200510211711.j9LHBe1Q003650@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1885


------- Comment #2 from edmonds@fas.harvard.edu  2005-10-21 13:11 -------
Created an attachment (id=244)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=244&action=view)
patch to __init__.py and compound_format.py in Bio/KEGG/Compound


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org  Fri Oct 21 12:58:45 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Fri Oct 21 13:32:15 2005
Subject: [Biopython-dev] [Bug 1885]  New: KEGG Compound db format changes
Message-ID: <200510211658.j9LGwi71003328@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1885

           Summary: KEGG Compound db format changes
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev@biopython.org
        ReportedBy: edmonds@fas.harvard.edu


Several new fields have been added (apparently including mass, comment,
reference, remark, and glycan), structures are represented in a different
format, not all enzymes are listed with an enzyme role component, etc.

I'll post some new test cases and a starting point for a new compound_format.py
that at least won't choke on anything currently in the db.  I've also added a
mass field to the record scanner.


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From fkauff at duke.edu  Sat Oct 22 15:56:53 2005
From: fkauff at duke.edu (Frank Kauff)
Date: Sat Oct 22 17:03:17 2005
Subject: [Biopython-dev] Biopython new release coming up
In-Reply-To: <6CA15ADD82E5724F88CB53D50E61C9AE9ECD5A@cgcmail.cgc.cpmc.columbia.edu>
References: <6CA15ADD82E5724F88CB53D50E61C9AE9ECD5A@cgcmail.cgc.cpmc.columbia.edu>
Message-ID: <1130011013.4376.6.camel@osiris.biology.duke.edu>

Hi all,

On Wed, 2005-10-19 at 19:23 -0400, Michiel De Hoon wrote:
> Hi everybody,
> 
> It's been a while since our latest Biopython release. I'm planning to put
> together a new release (version 1.41, code-named "Manhattan") and hoping to
> release it by the end of next week. So if you have some code sitting around
> that is stable enough for a Biopython distribution, this would be a good time
> to commit it to CVS.
> Currently, the following tests fail:
>   test_MEME ---> because we don't have the test_MEME output file for
> comparison
>   test_Nexus

Fixed.

Frank


>   test_PDB
>   test_Registry
>   test_SCOP_Astral
> So actually, we're in pretty good shape. If one of these modules is yours,
> please have a look at them and (if possible) try to fix them.
> 
> Also, in Bugzilla, 14 bugs are still open, please have a look to see if there
> is something we can do about them.
> 
> Finally, if a module was significantly changed since the previous release (18
> February 2005), or if a module has been added since then, it would be nice to
> have a summary for the web page.
> 
> Good luck everybody, and may the gods of software development be on our side.
> 
> --Michiel.
> 
> 
> Michiel de Hoon
> Center for Computational Biology and Bioinformatics
> Columbia University
> 1150 St Nicholas Avenue
> New York, NY 10032
> 
> 
-- 
Frank Kauff
Dept. of Biology
Duke University
Box 90338
Durham, NC 27708
USA

Phone 919-660-7382
Fax 919-660-7293
Web http://www.lutzonilab.net

From biopython-dev at maubp.freeserve.co.uk  Fri Oct 28 08:33:38 2005
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Fri Oct 28 09:30:58 2005
Subject: [Biopython-dev] RPS-BLAST XML output - Was: Biopython new release
	coming up
In-Reply-To: <4357D683.4070508@maubp.freeserve.co.uk>
References: <6CA15ADD82E5724F88CB53D50E61C9AE9ECD5A@cgcmail.cgc.cpmc.columbia.edu>
	<4357D683.4070508@maubp.freeserve.co.uk>
Message-ID: <43621AA2.4010300@maubp.freeserve.co.uk>

Peter wrote:
> In BioPython 1.40b there were no tests for NCBIXML.py
> 
> ...

Thanks due to Michiel for adding these to BioPython with a test script.

> Once this is pinned down, I want to try dealing with some 
> RPS-BLAST XML output with NCBIXML.py...

See bug 1715 for parsing the rpsblast txt output:

http://bugzilla.open-bio.org/show_bug.cgi?id=1715

(Summary - I had cobbled together a working text parser for RPS-BLAST 
2.2.10, but it would need changing for RPS-BLAST 2.2.12)

Attached to this email are two simple examples using standalone 
RPS-BLAST 2.2.10 and 2.2.12, with both txt and XML output.

'xbt006.xml', # Standalone RPS-BLAST 2.2.10, gi|49176427|ref|NP_418280.3| - PFAM database
'xbt007.xml', # Standalone RPS-BLAST 2.2.12, gi|49176427|ref|NP_418280.3| - PFAM database
'xbt008.xml', # Standalone RPS-BLAST 2.2.10, gi|729325|sp|P39483|DHG2_BACME - CDD database

The NCBIXML.py seems to work fine with the standalone RPSBLAST XML 
output.  Note that the online RPS-BLAST does not seem to offer XML 
output at the moment...

One very odd thing I noticed, is that the XML files from RPS-BLAST seem 
to claim to have been produced by blastp:

   <BlastOutput_program>blastp</BlastOutput_program>
   <BlastOutput_version>blastp 2.2.10 [Oct-19-2004]</BlastOutput_version>

Or:

   <BlastOutput_program>blastp</BlastOutput_program>
   <BlastOutput_version>blastp 2.2.12 [Aug-07-2005]</BlastOutput_version>

The text output correctly states RPS-BLAST, so this looks like an 
RPS-BLAST bug which I plan to report...
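
A quick way to see what a given XML file claims about its origin is to read these two header elements directly. The sketch below uses Python's standard xml.etree.ElementTree. The two inner elements are copied from the 2.2.12 example above; the enclosing BlastOutput element is added here so that the fragment is well-formed, and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Header fragment based on the 2.2.12 example; the BlastOutput wrapper is
# added so that the fragment parses on its own.
xml_header = """
<BlastOutput>
  <BlastOutput_program>blastp</BlastOutput_program>
  <BlastOutput_version>blastp 2.2.12 [Aug-07-2005]</BlastOutput_version>
</BlastOutput>
"""

root = ET.fromstring(xml_header)
program = root.findtext("BlastOutput_program")
version = root.findtext("BlastOutput_version")

# The version string starts with the program name, then the release number.
reported_name, release = version.split()[:2]
```

With output like this, a test script cannot rely on BlastOutput_program to tell RPS-BLAST results apart from blastp results.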

Peter
-------------- next part --------------
A non-text attachment was scrubbed...
Name: xml_rps_blast_test_files.zip
Type: application/x-zip-compressed
Size: 17889 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/biopython-dev/attachments/20051028/ddc47ee3/xml_rps_blast_test_files-0001.bin
From mdehoon at c2b2.columbia.edu  Fri Oct 28 13:55:38 2005
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Fri Oct 28 14:01:29 2005
Subject: [Biopython-dev] CVS freeze for Manhattan release
Message-ID: <6CA15ADD82E5724F88CB53D50E61C9AE9ECD89@cgcmail.cgc.cpmc.columbia.edu>

Hi everybody,

With all biopython tests (*) now passing and the sun shining in Manhattan,
the time has come to put together the next Biopython release. To avoid any
confusion, I'd like to ask you all not to make any commits to CVS until this
release is done; I will post to the biopython mailing lists when it is.
Thanks!

--Michiel.

(*) Except for the SQL tests. I don't know how to run those. If anybody does,
let me know.


Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032


From mdehoon at c2b2.columbia.edu  Fri Oct 28 20:14:54 2005
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Fri Oct 28 20:13:38 2005
Subject: [Biopython-dev] Biopython release 1.41
Message-ID: <6CA15ADD82E5724F88CB53D50E61C9AE9ECD8B@cgcmail.cgc.cpmc.columbia.edu>

Dear biopythoneers,

We are pleased to announce the release of Biopython 1.41. Many improvements
were made in Biopython during the eight months since the previous release,
and the new release contains lots of bug fixes, new functionality, and
better documentation. To pick a few, there's the new
Bio.MEME module by Jason Hackney, updates to the Blast parser using Bertrand
Frottier's NCBIXML code, a BLAT parser by Yair Benita, numerous updates in
Bio.PDB, CompareACE support in AlignAce, and improved user-friendliness in
Bio.Seq.

Lots of people contributed to this release, in particular Frank Kauff
(Bio.Nexus), Jason Hackney (Bio.MEME), Thomas Hamelryck (Bio.PDB), Frédéric
Sohm (Bio.Restriction), James Casbon (Bio.SCOP) for bug fixes and updates,
Peter (Bio.Blast.NCBIXML test cases), and of course Jeff Chang, Brad Chapman,
Andrew Dalke, and Iddo Friedberg for Biopython and the fool-proof instructions
on how to roll a release, which made this a lot easier than I anticipated. My
apologies if I forgot to thank somebody.


--Michiel


Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032


From bugzilla-daemon at portal.open-bio.org  Sat Oct 29 16:29:41 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Tue Nov  1 15:57:42 2005
Subject: [Biopython-dev] [Bug 1885] KEGG Compound db format changes
Message-ID: <200510292029.j9TKTfI7011222@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1885


------- Comment #3 from mdehoon@ims.u-tokyo.ac.jp  2005-10-29 16:29 -------
How did you download the new test cases for KEGG compound? Are the existing
test cases in Tests/KEGG no longer valid? The submitted patch causes
test_KEGG.py to fail, but I'm not sure if that is due to a bug in the patch or
whether the existing test cases don't satisfy the current KEGG standard.


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.