From dalke at acm.org  Fri Sep  1 03:16:10 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] XSLT and Martel output
Message-ID: <39AF57BA.A43BF7CB@acm.org>

Hello,

  With some pointers from Brad I managed to get an XSLT converter for
the Martel SWISS-PROT output into FASTA.  I would have tried an XML
one, but wasn't sure which to use.

The input was the example output file I have at
  http://www.biopython.org/~dalke/Martel/BOSC2000.poster/sample.xml.txt
This has 8 records and is about 60K long.

The XSLT engine I used is 4XSLT from Fourthought.  BTW, it was
entirely too complicated to install, esp. since there aren't any
instructions and there seems to be a missing file from one of
the distributions (but which is in the other). :(

The actual XSLT text I used is below.
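(The stylesheet markup did not survive archiving; only stray fragments
such as the literal ">sp|" text, the disable-output-escaping attribute,
and the select="sequence_block/SQ_data_block/SQ_data/sequence" path
remain.  What follows is a reconstructed sketch rather than the original
text: those surviving fragments are used as given, while element names
such as swissprot38_record, ac_number, entry_name and description are
assumptions about the Martel format definition and would need to be
checked against the real sample.xml.txt.)

    <?xml version="1.0"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                    version="1.0">
      <xsl:output method="text"/>

      <!-- one FASTA entry per SWISS-PROT record -->
      <xsl:template match="swissprot38_record">
        <xsl:text disable-output-escaping="yes">&gt;sp|</xsl:text>
        <xsl:value-of select="normalize-space(ac_number)"/>
        <xsl:text disable-output-escaping="yes">|</xsl:text>
        <xsl:value-of select="entry_name"/>
        <xsl:text> </xsl:text>
        <xsl:value-of select="normalize-space(description)"/>
        <xsl:text>&#10;</xsl:text>
        <!-- each 60-character SWISS-PROT sequence line becomes one output line -->
        <xsl:for-each select="sequence_block/SQ_data_block/SQ_data/sequence">
          <xsl:value-of select="."/>
          <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
        <xsl:text>&#10;</xsl:text>
      </xsl:template>

      <!-- suppress the default copying of all other text -->
      <xsl:template match="text()"/>
    </xsl:stylesheet>

Any XSLT 1.0 processor should accept something along these lines once
the element names are matched to the actual Martel output.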
Example output looks like:
====
>sp|Q43495|108_LYCES PROTEIN 108 PRECURSOR.
MASVKSSSSSSSSSFISLLLLILLVIVLQSQVIECQPQQSCTASLTGLNVCAPFLVPGSP
TASTECCNAVQSINHDCMCNTMRIAAQIPAQCNLPPLSCSAN

>sp|P18646|10KD_VIGUN 10 KD PROTEIN PRECURSOR (CLONE PSAS10).
MEKKSIAGLCFLFLVLFVAQEVVVQSEAKTCENLVDTYRGPCFTTGSCDDHCKNKEHLLS
GRCRDDVRCWCTRNC
====

It took about 3.5 seconds to load the file into the DOM and about 1.5
seconds to process it.  Since there are 80,000 records in sprot38, it
would take nearly 14 hours to convert everything.  It would take about
20 minutes to translate it using a SAX-based converter, so the XSLT
route is roughly a factor of 40 slower.

Of course, it would also require that I have enough memory, since the
DOM I'm using (4DOM, also from Fourthought) keeps everything in RAM.

There are some performance things you need to learn using XSLT (or at
least tricks specific to this engine).  For example, pulling the
sequence out with the explicit path

  select="sequence_block/SQ_data_block/SQ_data/sequence"

is a lot faster (20-fold or so!) than the alternative (the slower
expression was lost when the markup was stripped).

It's a good thing that FASTA doesn't mandate that all sequence lines
(excepting the last) must be 65 characters long.  The SWISS-PROT
sequence lines are 60 characters long, and I can't figure out how to
wrap them to different lengths.

On the other hand, it *does* work, and the performance of the engines
should go up over time (eg, there is usually about a factor of 5-10 to
be gained by translation into C).  Plus, in theory you should be able
to make it work with other XSLT tools.  Anyone want to try it with XT,
or one of the browsers (does Mozilla or Opera support XSLT?).

Better yet, want to start playing around with the BLAST output from
Martel? :)

                    Andrew
                    dalke@acm.org


From bradmars at yahoo.com  Fri Sep  1 13:09:47 2000
From: bradmars at yahoo.com (Bradley Marshall)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Re: [BioXML-dev] XSLT and Martel output
Message-ID: <20000901170947.23563.qmail@web208.mail.yahoo.com>

It looks great, Andrew.

I haven't crunched any numbers, but my gut feeling is that xt (from
jclark.com) is probably 5-10 fold faster than 4XSLT.  Unfortunately,
4XSLT is the only Python XSLT processor that I know of.  It's good,
but slow.

On the plus side, xt works quite nicely in JPython.

Brad

--- Andrew Dalke <dalke@acm.org> wrote:
> With some pointers from Brad I managed to get an XSLT converter for
> the Martel SWISS-PROT output into FASTA.
> [...]


From katel at worldpath.net  Tue Sep  5 03:10:17 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Gobase
Message-ID: <004201c01708$59752300$010a0a0a@0q6vm>

I just committed a Gobase parser.

                                              Cayte


From bradmars at yahoo.com  Tue Sep  5 14:41:09 2000
From: bradmars at yahoo.com (Bradley Marshall)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Re: [BioXML-dev] XSLT and Martel output
Message-ID: <20000905184109.20724.qmail@web208.mail.yahoo.com>

So I went back and checked the Python XML-SIG mailing list, and
Fourthought claims that 4XSLT 0.9.2 is up to 100 times faster than
0.8.2.  However, it wasn't available from their web site.  There was a
link to the rpms, though, and there I found 4XSLT 0.9.2.
So, if anybody wants it, it's at:

ftp://fourthought.com/pub/mirrors/python4linux/redhat/i386/4XSLT-0.9.2-1.i386.rpm

Brad

--- Bradley Marshall <bradmars@yahoo.com> wrote:
> I haven't crunched any numbers, but my gut feeling is that xt (from
> jclark.com) is probably 5-10 fold faster than 4XSLT.
> [...]
From jchang at SMI.Stanford.EDU  Wed Sep  6 00:42:43 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Gobase
In-Reply-To: <004201c01708$59752300$010a0a0a@0q6vm>
Message-ID: 

Great!  I noticed that you created a suite of regression tests.  Could
you also commit a hand-verified file for Tests/output/test_gobase?

If nothing catastrophic happens, I'd like to put together a new build
tomorrow afternoon (PST).  If the file doesn't get in before then,
that's ok too.

Thanks,
Jeff

On Tue, 5 Sep 2000, Cayte wrote:
> I just committed a Gobase parser.
>
> Cayte


From jchang at SMI.Stanford.EDU  Wed Sep  6 18:32:51 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Biopython 0.90d03 released
Message-ID: 

Hello everybody,

Biopython 0.90d03 is now available at:
http://www.biopython.org/Download/

Changes from the previous version are:

Blast updates:
  - bug fixes in NCBIStandalone, NCBIWWW
  - some __str__ methods in Record.py implemented (incomplete)

Tests
  - new BLAST regression tests
  - prosite tests fixed

New parsers for Rebase, Gobase
Pure python implementation of C-based tools
Thomas Sicheritz-Ponten's xbbtools
Can now generate documentation from docstrings using HappyDoc

The tests for prodoc and rebase are not working yet, so if you run the
regression tests, those two should fail, but the other 14 should work.

Enjoy, and keep those bug reports, feature requests, patches, and new
modules coming in!

Jeff


From dagdigian at ComputeFarm.com  Wed Sep 13 13:21:50 2000
From: dagdigian at ComputeFarm.com (Chris Dagdigian)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] bio*.org server bandwidth upgrade on the horizon
Message-ID: <4.3.2.7.0.20000913131410.00ad1b20@fedayi.sonsorol.org>

I just received word that the net connection for the bio*.org server(s)
is going to be upgraded from a T1 to a T3 line.

Given the insane lead time for telecommunication orders, the best
timeframe I have at this time is that the work will be completed
sometime before the end of the year.

I'll provide more info as I get it, especially if it involves
significant downtime for us.

-Chris


From katel at worldpath.net  Sun Sep 17 02:35:52 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Martel
Message-ID: <005801c02071$8721b3a0$010a0a0a@cadence.com>

My impression of Martel is that it will require extensive testing,
because it has so many paths.  The tests cover the basic expressions,
but I'd be surprised if there are no weird interactions.  The code may
lose its context on complicated paths.  I could help with adding unit
tests.

In a few cases, I think the names need to be more descriptive.
Variables like p, s or av don't give a lot of information.  Also, the
name "pattern" is used for too many things that have different
meanings.
The regular expression is sometimes "source" and sometimes "s".  At
least I need all the help I can get navigating the recursion. :)

An example of a construct that confused me is:

  x = sre_parse.parse(s, pattern = MultigroupPattern())
  return convert_list(x.pattern, x)

Finding self-documenting names can be hard, but sometimes the effort
to find the right metaphor clarifies your thinking.

                                              Cayte


From dalke at acm.org  Sun Sep 17 14:39:34 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Martel
Message-ID: <003201c020d6$a1ce0ea0$359c343f@josiah>

Cayte:
> My impression of Martel is that it will require extensive testing,
> because it has so many paths.  The tests cover the basic expressions,
> but I'd be surprised if there are no weird interactions.  The code
> may lose its context on complicated paths.  I could help with adding
> unit tests.

One of the things I found during development was that it was almost
impossible to write a parser without testing each of the components
against real text.  What you are seeing is the support framework
needed for that.

Concerning the number of paths: I'm not sure which paths you're
talking about.  There are two I can think of.  One is the generation
of the state table for mxTextTools and the other is the evaluation of
the text through that state table.

The first is somewhat straightforward, very much like unoptimized code
generation from a parse tree.  It does need documentation so others
can verify my work.

The second is indeed more complicated, but it should be almost
identically complicated to hand-written parser code of equivalent
abilities.

Debugging, btw, is also somewhat complicated, because failures are
reported at the last character where something matched rather than at
the last character that was actually tested.  I need to take a look at
the mxTextTools code to see if there's a way to give better position
information.

> In a few cases, I think the names need to be more descriptive.
> Variables like p, s or av don't give a lot of information.  Also,
> the name "pattern" is used for too many things that have different
> meanings.

You're missing a few other naming clashes in my code.  I agree, it
needs a full cleanup before it is of good enough quality that I would
foist it off on most people.  The names are confusing because I was
confused myself when writing the code.  I was working with a couple of
toolsets (sre_parse and mxTextTools) which I hadn't used before, and I
was changing my idea of how things should be done based on what I
learned using them.  (Not an excuse, just history, and they do need to
get fixed.)

There are two major reasons why I haven't fixed things.  One is, alas,
the lack of time.  The other is that there are a few changes I need to
make to support certain formats and needs.  I've added a "named group
repeat" where a named group can be used as the repeat count for later
groups.  (This is needed for MDL's CT format, which gives the atom and
bond counts and then "atom_count" lines of atom records and
"bond_count" lines of bond records.)  I also need to redo how it
handles files so I can feed it a record at a time rather than the
whole data file, but without changing the SAX events.  (Jeff first
suggested this one.)
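(The record-at-a-time idea is easy to sketch outside of Martel itself.
The reader below is not Martel code, just an illustration of the
approach under the assumption that SWISS-PROT records are delimited by
a "//" line; each record could then be handed to the format expression
separately, so only one record's worth of text is in memory at a time.)

    def read_records(infile, terminator="//"):
        """Yield one SWISS-PROT record, as a string, at a time."""
        lines = []
        for line in infile:
            lines.append(line)
            # a record ends with a line containing only the terminator
            if line.rstrip() == terminator:
                yield "".join(lines)
                lines = []
        if lines:                  # trailing partial record, if any
            yield "".join(lines)

    # Hypothetical use: hand each record to a Martel-built parser.
    # for record in read_records(open("sprot38.dat")):
    #     parser.parseString(record)   # assumed record-level parse call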
So I'm still in the experimental phase to see what other changes are
needed, and I'm hoping to get feedback from others about it.  Thus, I
haven't wanted to go through the code cleaning it up until I know more
about what to change.

> Finding self-documenting names can be hard, but sometimes the effort
> to find the right metaphor clarifies your thinking.

Yep, and yep.

                    Andrew
                    dalke@acm.org


From katel at worldpath.net  Sun Sep 17 23:45:48 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Martel
References: <003201c020d6$a1ce0ea0$359c343f@josiah>
Message-ID: <001f01c02122$efa75720$010a0a0a@cadence.com>

When I ran _test in Generate.py, I received this message:

Traceback (innermost last):
  File "test_generate.py", line 3, in ?
    Generate._test()
  File "Generate.py", line 471, in _test
    exp = _generate(convert_re.make_expression(re_pat))
TypeError: not enough arguments; expected 2, got 1

convert_re.make_expression returns the results from convert_list.
convert_list returns an Expression.Seq object that is passed to
_generate.

                                              Cayte


From dalke at acm.org  Tue Sep 19 19:52:44 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Martel
Message-ID: <003201c02294$b6059fe0$399c343f@josiah>

Cayte:
> When I ran _test in Generate.py, I received this message:
>
> Traceback (innermost last):
>   File "test_generate.py", line 3, in ?
>     Generate._test()
>   File "Generate.py", line 471, in _test
>     exp = _generate(convert_re.make_expression(re_pat))
> TypeError: not enough arguments; expected 2, got 1

Oops!  Yeah, if I don't make all of the tests accessible from one spot
I forget to run them.  I changed the API after I wrote that test.  It
can be fixed by passing {} as the second parameter:

  exp = _generate(convert_re.make_expression(re_pat), {})

The second parameter is a dictionary of names needed for group
references, like the \1 in r"(?P<name>...)\1".  (It's a dict instead
of a list because I like the O(1) lookup performance, and because the
parameter is not exposed as part of the API.)

I'm cleaning up the code now, including changing the names to be more
consistent.  I'll include moving all of the tests to the test
directory instead of keeping them in the modules.

                    Andrew
                    dalke@acm.org


From katel at worldpath.net  Thu Sep 21 04:53:08 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Martel
References: <003201c02294$b6059fe0$399c343f@josiah>
Message-ID: <007401c023a9$5e112140$010a0a0a@cadence.com>

----- Original Message -----
From: "Andrew Dalke"
To:
Sent: Tuesday, September 19, 2000 4:52 PM
Subject: Re: [Biopython-dev] Martel

> Oops!  Yeah, if I don't make all of the tests accessible from one
> spot I forget to run them.  I changed the API after I wrote that
> test.  It can be fixed by passing {} as the second parameter:
>
>   exp = _generate(convert_re.make_expression(re_pat), {})

It works with the patch.  But when I pasted in some regexps from the
perl EMBL.pm, it rejected these constructs.  I can't send the perl
expressions in this message, because the email software interprets the
backslashes as the prefix of a url.

                                              Cayte


From dalke at acm.org  Thu Sep 21 05:27:33 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Martel
Message-ID: <002901c023ae$2ce10540$0680343f@josiah>

Cayte:
> In a few cases, I think the names need to be more descriptive.
> Variables like p, s or av don't give a lot of information.
I've been cleaning up the code over the last couple of days, and
adding docstrings and comments.  I haven't gone through all my code
yet, but I think the case you're seeing is the sre_parse.py code.  I
grabbed that module and sre_constants.py from the 1.6 distribution.
It was written by Fredrik Lundh and I don't have much control over it.

I have made some changes to sre_parse.py, but I've tried to minimize
those changes to make it easier to stay in synch with future changes.
On the other hand, the sre code is being tested by a lot of people, so
there shouldn't need to be many tests for it except for the changes
I've added.  (Those changes are marked with an 'APD'.)

> Also, the name "pattern" is used for too many things that have
> different meanings.  The regular expression is sometimes "source"
> and sometimes "s".

Again, that appears to be the sre_parse code.  I have cleaned up my
code to distinguish between a regular expression in the abstract and
its representation as a "pattern" string and an "expression" tree.  My
next project is to clean up Generate.py, which has the regexp
represented as a tag table.

> An example of a construct that confused me is:
>
>   x = sre_parse.parse(s, pattern = MultigroupPattern())
>   return convert_list(x.pattern, x)

I changed the 'MultigroupPattern' class name to 'GroupName'.  There's
nothing I can really do about the "pattern =" and the "x.pattern"
code, since that's the way sre_parse wants it.  I did add some
documentation beforehand saying to basically ignore the names :)

I'm giving a presentation to people tomorrow (oops! today!) about
Martel.  They are chemistry people and will want to see support for
MDL's file formats.  That wasn't in the previous release, so I've made
a 0.25 release which contains that format and the cleanups I've done
to date.  It's at
  http://www.biopython.org/~dalke/Martel/Martel-0.25.tar.gz .
Also, all of the regression tests are now runnable from
test/__init__.py.

Version 0.3 will be the release with the complete code cleanup
(excluding the two sre_*.py files).

                    Andrew
                    dalke@acm.org


From dalke at acm.org  Mon Sep 25 22:42:06 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] sre bug(s)
Message-ID: <00c601c02763$7f909040$2980343f@josiah>

Sigh.  Looks like I'll update the sre_parse.py in Martel once 2.0b2 is
released tomorrow.

                    Andrew

-----Original Message-----
From: Fredrik Lundh
Newsgroups: comp.lang.python
Date: Monday, September 25, 2000 4:12 PM
Subject: Re: You can never go down the drain...

>Phlip wrote:
>> C> I'd search the bugbase for this, to see if 1.6 had it, but I have
>> no idea how to search for things like [] and \1 and sub from a web
>> page.
>
>it's bug 114660
>
>> 4> Just for curiosity, any re workaround that's obvious to all you
>> expression regulars?  (Besides a loop statement?)
>
>here's the patch (the line numbers might differ slightly from
>your copy)
>
>Index: sre_parse.py
>===================================================================
>RCS file: /cvsroot/python/python/dist/src/Lib/sre_parse.py,v
>retrieving revision 1.33
>retrieving revision 1.34
>diff -C2 -r1.33 -r1.34
>*** sre_parse.py	2000/09/02 11:03:33	1.33
>--- sre_parse.py	2000/09/24 14:46:19	1.34
>***************
>*** 635,639 ****
>      group = _group(this, pattern.groups+1)
>      if group:
>!         if (not s.next or
>              not _group(this + s.next, pattern.groups+1)):
>              code = MARK, int(group)
>--- 635,639 ----
>      group = _group(this, pattern.groups+1)
>      if group:
>!         if (s.next not in DIGITS or
>              not _group(this + s.next, pattern.groups+1)):
>              code = MARK, int(group)
>
>


From roybryant at SEVENtwentyfour.com  Tue Sep 26 10:59:10 2000
From: roybryant at SEVENtwentyfour.com (Roy Bryant)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Broken link in www.biopython.org
Message-ID: 

There appears to be a problem on this page of your site.  On your page

  http://www.biopython.org/wiki/html/BioPython/BioCorba.html

when you click on your link to

  http://www.biopython.org/Download/

you get the error: Not found

As recommended by the Robot Guidelines, this email is to explain our
robot's activities and to let you know about one of the broken links
we encountered.  LinkWalker does not store or publish the content of
your pages, but rather uses the link information to update our map of
the World Wide Web.

Are these reports helpful?  I'd love some feedback.  If you prefer not
to receive these occasional error notices, please let me know.

Roy Bryant

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Roy Bryant, roybryant@seventwentyfour.com
President, SEVENtwentyfour Inc.  ("Always watching the Web")
http://www.seventwentyfour.com
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


From dalke at acm.org  Thu Sep 28 02:43:00 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] lost my Jitterbug password
Message-ID: <02a701c02917$5928b300$ec7f343f@josiah>

What should I do?

BTW, I fixed bug 11, "tranlate by name", and added a test for it in
test_translate.py .

                    Andrew


From thomas at cbs.dtu.dk  Thu Sep 28 15:34:21 2000
From: thomas at cbs.dtu.dk (thomas@cbs.dtu.dk)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Background process handling
Message-ID: <14803.40253.441744.109557@bb1.home>

Hej Biopythoners,

I need to include a pure python solution for background database
searches in the 'xbbtools' program ... but I am not sure how to
implement that.

I want to start one or several Blast searches from the graphical
sequence editor.  The individual results should be continuously
updated in different windows (one per blast search).
The different windows should signal (maybe by changing background
color) when the blast search is finished.  The user can stop a blast
search simply by destroying the associated window.  Of course, all
windows shall be updated continuously and the user should not feel any
lag in the main editor window.

How should I solve this?
a) fork and exec*
b) popen
c) write to temporary file, start blast into new file, continuously
   read new file
d) use an expect module
e) threads
f) a combination with a LOT of updates?
g) ???

Any suggestions?

thx
-thomas

Sicheritz Ponten Thomas E.  CBS, Department of Biotechnology
thomas@bioinformatics.org   The Technical University of Denmark
CBS:  +45 45 252485         Building 208, DK-2800 Lyngby
Fax:  +45 45 931585         http://www.cbs.dtu.dk/thomas/index.html

	De Chelonian Mobile ... The Turtle Moves ...


From antoine at egenetics.com  Fri Sep 29 04:34:48 2000
From: antoine at egenetics.com (Antoine van Gelder)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Background process handling
References: <14803.40253.441744.109557@bb1.home>
Message-ID: <39D45428.86E54ED2@egenetics.com>

thomas@cbs.dtu.dk wrote:
> How should I solve this?
> a) fork and exec*
> b) popen
> c) write to temporary file, start blast into new file, continuously
>    read new file
> d) use an expect module
> e) threads
> f) a combination with a LOT of updates?
> g) ???

In the Stackpack EST clustering pipeline I use a thread wrapped around
popen to fire off jobs that are expected to take some time.

Main program updates can be handled either through polling the thread
(not so good) or a callback from the thread (much better) :>

 - antoine


From dalke at acm.org  Fri Sep 29 06:20:35 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Background process handling
Message-ID: <044401c029fe$ed8dd060$ec7f343f@josiah>

Thomas Sicheritz-Ponten:
> How should I solve this?
> a) fork and exec*
> b) popen
> c) write to temporary file, start blast into new file, continuously
>    read new file
> d) use an expect module
> e) threads
> f) a combination with a LOT of updates?
> g) ???

There are two usual approaches: select based and thread based.

"select" is a mechanism to tell if something happened on a file
handle.  Under unix, nearly everything is a file handle (files,
network I/O, X).  Under Windows it only works with sockets.  See the
select module.

Using selects in a command line application works something like
this.  Have a central list of "jobs", each of which is a select'able
object.  (Warning: you are limited to the number of file descriptors
on a machine, which also includes stdin, stdout and stderr.  On some
machines this may be 64 or lower, though lower is quite rare these
days.)  The outermost loop of your program does a select on the task
list to see which had changes.  From this it maps the activity
information to an action, which is most likely a callback for that
object.  The function can read text from the descriptor, remove the
task from the list of tasks, or whatever.

With a GUI things become a bit more complicated.  Some GUIs want to be
the main event loop, but realize that other people use select based
multitasking, so they provide a way to register file descriptors and
callbacks.  Other GUIs act more like a library, and give you a way to
get a (possible list of) file descriptor for the GUI, which you use
for your event loop.  I believe Tk is of the first form, but I've
never really looked into it.  The GUI documentation should go into the
details.
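(A minimal sketch of the select based loop described above, for the
command line case.  It assumes the BLAST jobs are started with
subprocess, the modern stand-in for os.popen/popen2, and the names
Job, on_data and on_done are made up for illustration; a real xbbtools
integration would register the same descriptors with the GUI's event
loop instead of calling select directly.)

    import select
    import subprocess

    class Job:
        """One running BLAST search plus the callbacks to notify."""
        def __init__(self, cmdline, on_data, on_done):
            self.proc = subprocess.Popen(cmdline, stdout=subprocess.PIPE)
            self.on_data = on_data    # called with each chunk of output
            self.on_done = on_done    # called when the process finishes

    def run_jobs(jobs):
        # map each job's output stream to the job itself
        fd_map = {job.proc.stdout: job for job in jobs}
        while fd_map:
            # wait until at least one job has produced output (or exited)
            readable, _, _ = select.select(list(fd_map), [], [])
            for stream in readable:
                job = fd_map[stream]
                chunk = stream.read1(4096)
                if chunk:                        # partial results arrived
                    job.on_data(chunk)
                else:                            # EOF: the search finished
                    job.on_done(job.proc.wait())
                    del fd_map[stream]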
The select approach can be used with a) os.popen, b) fork/exec (see
the popen2 module for one way) and c) reading a file using the regular
open.  Actually, b) is used as the basis for both a) and the system
call you need for c).  I've never used d) so cannot comment.

If you really want to get into select based systems, take a look at
Sam Rushing's Medusa, part of which is included in Python as asyncore
and asynchat.  The Design Pattern for this approach is, I believe,
called the "Reactor."

The other usual approach, and the one often considered more modern, is
to use threads.  This is what you almost must do if you want to run
under MS Windows.  Threads are to select as preemptive multitasking is
to non-preemptive.

The mechanism for threads is conceptually simpler than select: "start
this function and let it do whatever it needs to do while I work on
other things."  Likely you will want to create a thread task object
which takes the BLAST input parameters and runs blast.  The thread
will use the same methods as select (os.popen, fork/exec, etc.) but
instead of using select to tell if the status changed, it just sits
there waiting for input.  It can do this since the thread library will
run other threads to prevent the program from completely halting.

The downside of threads used to be that most application code, its
libraries and even POSIX calls weren't all thread safe.  POSIX added
some new functions (the "*_r" ones) to fix the problems, and many
libraries are thread safe.  Still, some aren't, and so things like Tk
must be dealt with specially to keep all the Tk calls in a single
thread.

That doesn't prevent you from writing non-thread safe code, or using
libraries (like biopython?) which aren't thread safe.  You start
having to worry about how to serialize library calls so that you don't
trigger problems.  Hint: use the higher level primitives for
threading, like Queue.

Debugging becomes more complicated because if there are timing
problems, like non-thread safe libraries, you can't always get a good
reproducible test case.  I tend to write my threaded objects with a
very state-machine-like behaviour so that I can make good guarantees
about when and how they should be used.  (This is a good programming
style in general.)

Also, Python's core is only thread safe at the coarse grained level.
There is a single, global interpreter lock which prevents two pieces
of Python code from running at the same time.  The lock is released
every so often to allow multiple threads to work.  However, this is
not a problem for you, since you aren't interested in threads as a way
to increase compute performance.  It used to be that there were a lot
of timing problems because the thread libraries were buggy, but those
problems have mostly been worked out.

Given all of this, I suggest using threads.  It's an easier
programming model (even given the possible non-thread safe parts),
works on Unix and MS Windows, and there are now more people with
thread development experience than with select.  It looks like Antoine
is one to ask :)

Here's a sketch of one way to write your code using threads.  It
assumes all GUI events are serialized in one thread, which is the main
one.
class BlastWindow:
    def __init__(self, gui_change):
        self.gui_change = gui_change
        self.result = None

    def set_results(self, result):
        # using the caller's thread, not the GUI thread, so set the
        # data but don't do anything using the GUI until called later
        self._result = result
        self.gui_change.put(self)

    def do_change(self):
        # the BLAST run is finished, so get the result data and use it
        # to update the window
        self.result = self._result
        del self._result
        # change GUI ...

class BlastTask(threading.Thread):
    def __init__(self, blast_params, window):
        threading.Thread.__init__(self)
        self.window = window
        ...

    def run(self):
        # set up the tmpdir and files like .ncbirc, etc.
        os.system("cd tmpdir; blast -i ..")  # no error checking for now
        self.window.set_results(blast_parse(open("tmpdir/blast.output")))

gui_change = Queue(-1)   # used to serialize GUI updates
app = App(gui_change)
window = app.createBlastWindow()
 ...
blast = BlastTask(blast_params, window)
 ...
while 1:
    change = gui_change.get()
    if change is None:   # or however you define an "exit"
        break
    change.do_change()

                    Andrew
                    dalke@acm.org


From chapmanb at arches.uga.edu  Sat Sep 30 12:55:57 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Martel stuff
Message-ID: <200009301655.MAA46310@archa10.cc.uga.edu>

Hey all;
I have a few things relating to Martel:

1. I have a bit of a problem with something in the Clustalw parsing
that I just checked in (Bio/Clustalw/clustal_format.py).  The newer
clustalw formatted files have these annoying stars in the file (I call
them match_stars in the format file).  I just realized that these
stars aren't always there, so I made the following change to try and
make them optional:

--- clustal_format.py.orig	Thu Sep 28 19:49:37 2000
+++ clustal_format.py	Sat Sep 30 12:41:10 2000
@@ -59,7 +59,7 @@
 block_info = Martel.Group("block_info",
                           Martel.Rep(seq_line) +
-                          match_stars +
+                          Martel.MaxRepeat(match_stars, 0, 1) +
                           Martel.MaxRepeat(new_block, 0, 1))

I think this is right, but when I do this it makes the parse hang and
never finish.  Hmmm....  I'm not sure how to debug this, any ideas?

2. I just installed 2.0b2, and it looks like we'll need the PyXML
package :-<  Python 2.0 doesn't seem to come with saxlib, which we
need to implement handler classes for the XML produced by Martel.  The
standard xml library also doesn't have saxexts/sax2exts, and seems to
have some other differences from the PyXML package.  Once the next
version of PyXML (0.6.1, I think) which is supposed to work with b2
comes out, I guess I can see how well this works with what is in the
standard library.  Anyways, I think this is the situation with
python2.0.  I'm not sure what thoughts are about this...

3. What are people's thoughts about integrating Martel more tightly
with Biopython?  Do you think it would be worthwhile for me to try my
hand at implementing a Martel based Fasta parser that would work with
the code Jeff has already got in place?

Thanks for listening!
Brad


From dalke at acm.org  Sat Sep 30 13:50:44 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:52 2005
Subject: [Biopython-dev] Martel stuff
Message-ID: <071401c02b06$f737c7c0$ec7f343f@josiah>

Brad:
> I think this is right, but when I do this it makes the parse hang and
> never finish.  Hmmm....  I'm not sure how to debug this, any ideas?

The code looks correct, except you should use "Opt(expr)" as a
shorthand for "MaxRepeat(expr, 0, 1)".

The hang you are seeing is likely a problem with Martel.
Suppose it needs to match 0 or more times, and one of the matches can
be of size 0.  Then it will sit on that spot forever, continuously
eating groups of size 0.

The best way to work around the problem is to make sure that all
repeat groups are guaranteed to be able to consume a character.
Another workaround is to put an upper limit on the repeat count.

Once I get this next release out, I'll see about generating tag tables
which check the size of any match.  There will be quite a bit of
overhead in doing that, so I'm thinking of having a debug version
which would handle this and be better able to pinpoint error
positions.

> it looks like we'll need the PyXML package :-<  Python 2.0 doesn't
> seem to come with saxlib, which we need to implement handler classes
> for the XML produced by Martel.

What about xml.sax.handler?  I haven't sat down with the new Python
distro to see what's changed.  Again, that will wait until after I get
this 0.3 release out.

> 3. What are people's thoughts about integrating Martel more tightly
> with Biopython?

Jeff says that he's for it.  I just need to (again :) get this release
out so people can start testing it.

> Do you think it would be worthwhile for me to try my hand at
> implementing a Martel based Fasta parser that would work with the
> code Jeff has already got in place?

Yes, and no.  The biggest change for 0.3 is support for hybrid
parsers, which use a simple reader to grab a record at a time, then
pass that to Martel for in-depth parsing.  This reduces the amount of
memory needed to parse a file.

So the "yes" part means: go ahead and write a parser for FASTA which
produces Biopython data structures.  However, it will likely change in
the future.  In fact, for FASTA I would probably have the regexp
available, so it can be merged with other expressions, but have it
create a scanner which is pure Python generating the SAX events,
rather than going through mxTextTools.

                    Andrew


From chapmanb at arches.uga.edu  Sat Sep 30 15:26:55 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar 5 14:42:52 2005
Subject: [Biopython-dev] Martel stuff
In-Reply-To: <071401c02b06$f737c7c0$ec7f343f@josiah>
Message-ID: <200009301926.PAA121030@archa11.cc.uga.edu>

> The hang you are seeing is likely a problem with Martel.  Suppose it
> needs to match 0 or more times, and one of the matches can be of
> size 0.  Then it will sit on that spot forever, continuously eating
> groups of size 0.

Aha!  Thanks!  The solution was to use Rep1 where I want to be
guaranteed to get a match (instead of Rep everywhere like I was doing
previously), and this stopped the hanging.  Thanks for the pointer on
that.

[XML in python2.0]
> What about xml.sax.handler?

Doh!  You're right, we can use handler and get things to work
properly.  Thanks!  In addition, there is a small change that needs to
happen in Generate.py to make things fully work (instead of using
xml.sax.saxlib.SAXException, use xml.sax._exceptions.SAXException).
But after that things seem to work!  Snazzy!
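(A minimal sketch of the xml.sax.handler route settled on above, using
today's standard library.  The element name "sequence" comes from the
sequence_block/SQ_data_block/SQ_data/sequence path earlier in the
thread; whether a Martel parser accepts a handler through the usual
SAX setContentHandler/parse calls is an assumption here, so the sketch
simply parses a plain XML file of Martel output, such as the
sample.xml.txt from the first message, to stay self-contained.)

    import xml.sax
    from xml.sax.handler import ContentHandler

    class SequenceHandler(ContentHandler):
        """Collect the text of every <sequence> element."""
        def __init__(self):
            ContentHandler.__init__(self)
            self.sequences = []
            self._inside = False
            self._parts = []

        def startElement(self, name, attrs):
            if name == "sequence":
                self._inside = True
                self._parts = []

        def characters(self, text):
            if self._inside:
                self._parts.append(text)

        def endElement(self, name):
            if name == "sequence":
                self.sequences.append("".join(self._parts).strip())
                self._inside = False

    handler = SequenceHandler()
    xml.sax.parse("sample.xml.txt", handler)  # the sample output file
    print(len(handler.sequences), "sequence lines found")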