From katel at worldpath.net Sat Dec 1 22:39:02 2001 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:43:07 2005 Subject: [Biopython-dev] Align Message-ID: <000701c17ae2$e42349c0$010a0a0a@cadence.com> The sig in Generic.Alignment function, add_sequence, def add_sequence(self, descriptor, sequence, start = None, end = None, weight = 1.0): does not allow the caller to pass in the name of the sequence. I think the descriptor should have a default of the empty string and the name should be part of the signature. I need the Alignment class for the SAF parser, because the SAF format represents alignments rather than isolated sequences. Cayte From idoerg at cc.huji.ac.il Mon Dec 3 05:22:20 2001 From: idoerg at cc.huji.ac.il (Iddo Friedberg) Date: Sat Mar 5 14:43:07 2005 Subject: [Biopython-dev] Server request Message-ID: Hi all, I am not sure to whom this request should be addressed, but as this may be of general interest to most people on the list, I am putting it here. I have recently completed the first stage of what I call the PeCoP ("pea-cop") server. Briefly, the user enters a sequence, and receives an annotated output of conserved positions, as determined by multiple PSI-BLAST runs. Due to some recent manpower reshuffle, my faculty is not-that-equipped to handle mounting of CGI-script driven pages. So I'm bumming around. Any chance of getting this hosted on biopython.org? As PeCoP drives a modified version of standalone PSI-BLAST, it needs the following: 1) An installed standalone version of "my" PSI-BLAST (blastpgpI). Probably my binary, compiled on Linux 2.2.16-22 will work. If not, I can always recompile. 2) The biggie: NCBI datbase versions of sequence databases. Currently I use nrgb, whose size is in the 390MB region. Actually, that's an old nrgb. The latest version is probably a bit larger. For now, biopython is used in a very rudimentary fashion in PeCoP. Parsing fasta format, mainly. But the use of it will grow as I add features... Takers? 
Iddo -- Iddo Friedberg | Tel: +972-2-6757374 Dept. of Molecular Genetics and Biotechnology | Fax: +972-2-6757308 The Hebrew University - Hadassah Medical School | email: idoerg@cc.huji.ac.il POB 12272, Jerusalem 91120 | Israel | http://bioinfo.md.huji.ac.il/marg/people-home/iddo/ From chapmanb at arches.uga.edu Thu Dec 6 08:08:20 2001 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:43:07 2005 Subject: [Biopython-dev] Align In-Reply-To: <000701c17ae2$e42349c0$010a0a0a@cadence.com> References: <000701c17ae2$e42349c0$010a0a0a@cadence.com> Message-ID: <20011206080820.C45321@ci350185-a.athen1.ga.home.com> Hi Cayte; > The sig in Generic.Alignment function, add_sequence, > def add_sequence(self, descriptor, sequence, start = None, end = None, > weight = 1.0): > > does not allow the caller to pass in the name of the sequence. Hmmm... this is what the "descriptor" argument is supposed to be for. Do you need to pass in more than just this? Once thing we could do is add an additional function along the lines of: def add_seq_record(self, record, start = None, end = None, weight = 1.0): which would allow you to build up a SeqRecord object with names, annotations, features and whatever else, and then add it into an alignment. Would this solve your problem? Brad From idoerg at cc.huji.ac.il Thu Dec 6 10:55:42 2001 From: idoerg at cc.huji.ac.il (Iddo Friedberg) Date: Sat Mar 5 14:43:07 2005 Subject: [Biopython-dev] SubsMat CVS update In-Reply-To: <20011206080820.C45321@ci350185-a.athen1.ga.home.com> Message-ID: Hi, 1) Fixed a bug in SubsMat. Half-matrices are no longer generated automatically in the class constructor. 2) Fixed the "different-float-representations-on-different-platforms" bug(?). Now let us hope that all look at integers in the same fashion :) Iddo -- Iddo Friedberg | Tel: +972-2-6757374 Dept. 
of Molecular Genetics and Biotechnology | Fax: +972-2-6757308 The Hebrew University - Hadassah Medical School | email: idoerg@cc.huji.ac.il POB 12272, Jerusalem 91120 | Israel | http://bioinfo.md.huji.ac.il/marg/people-home/iddo/ From jchang at smi.stanford.edu Thu Dec 6 13:05:18 2001 From: jchang at smi.stanford.edu (Jeffrey Chang) Date: Sat Mar 5 14:43:07 2005 Subject: [Biopython-dev] SubsMat CVS update In-Reply-To: References: <20011206080820.C45321@ci350185-a.athen1.ga.home.com> Message-ID: <20011206100518.A374@krusty.stanford.edu> Thanks! These notes are helpful for people working with this code, and also helpful for me when I generate the release notes for the next release. Jeff On Thu, Dec 06, 2001 at 05:55:42PM +0200, Iddo Friedberg wrote: > Hi, > > 1) Fixed a bug in SubsMat. Half-matrices are no longer generated > automatically in the class constructor. > > 2) Fixed the "different-float-representations-on-different-platforms" > bug(?). Now let us hope that all look at integers in the same fashion :) > > Iddo > > -- > > Iddo Friedberg | Tel: +972-2-6757374 > Dept. of Molecular Genetics and Biotechnology | Fax: +972-2-6757308 > The Hebrew University - Hadassah Medical School | email: idoerg@cc.huji.ac.il > POB 12272, Jerusalem 91120 | > Israel | > http://bioinfo.md.huji.ac.il/marg/people-home/iddo/ > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev@biopython.org > http://biopython.org/mailman/listinfo/biopython-dev From gec at compbio.berkeley.edu Thu Dec 6 17:54:36 2001 From: gec at compbio.berkeley.edu (Gavin E. Crooks) Date: Sat Mar 5 14:43:07 2005 Subject: [Biopython-dev] Failed Tests Message-ID: <01120614593201.13517@sienna.berkeley.edu> I now have 3 tests failing. (Which is alot better than a month ago.) test_intelligenetics and test_metatool still fail, as does test_nbrf. 
Gavin Crooks
gec@compbio.berkeley.edu
http://threeplusone.com

======================================================================
ERROR: test_intelligenetics
----------------------------------------------------------------------
Traceback (most recent call last):
  File "run_tests.py", line 136, in runTest
    __import__(self.test_name)
  File "test_intelligenetics.py", line 29, in ?
    src_handle = open( datafile )
IOError: [Errno 2] No such file or directory: 'IntelliGenetics/TAT_mase_nuc.txt'
======================================================================
ERROR: test_metatool
----------------------------------------------------------------------
Traceback (most recent call last):
  File "run_tests.py", line 136, in runTest
    __import__(self.test_name)
  File "./test_metatool.py", line 29, in ?
    src_handle = open( datafile )
IOError: [Errno 2] No such file or directory: 'MetaTool/meta9.out'
======================================================================
ERROR: test_nbrf
----------------------------------------------------------------------
Traceback (most recent call last):
  File "run_tests.py", line 136, in runTest
    __import__(self.test_name)
  File "test_nbrf.py", line 6, in ?
    import Bio.NBRF
  File "/n/teal.berkeley.edu/home/gec/local/lib/python2.1/site-packages/Bio/NBRF/__init__.py", line 24, in ?
    import nbrf_format
  File "/n/teal.berkeley.edu/home/gec/local/lib/python2.1/site-packages/Bio/NBRF/nbrf_format.py", line 40, in ?
from Bio.NBRF.ValSeq import valid_sequence_dict ImportError: No module named ValSeq ---------------------------------------------------------------------- Ran 32 tests in 76.650s From chapmanb at arches.uga.edu Fri Dec 7 10:20:05 2001 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:43:07 2005 Subject: [Biopython-dev] Failed Tests In-Reply-To: <01120614593201.13517@sienna.berkeley.edu> References: <01120614593201.13517@sienna.berkeley.edu> Message-ID: <20011207102005.A47951@ci350185-a.athen1.ga.home.com> Hi Gavin; > I now have 3 tests failing. (Which is alot better than a month ago.) I'm glad it improved :-). I did some cross-version, cross-platform work on the tests, so it's good that I actually fixed some tests. > test_intelligenetics and test_metatool still fail, as does test_nbrf. Hmmm, these are all Cayte's tests, and it looks like all of the failures are due to non-committed files (NBRF was not in the setup.py file, which I fixed, but it still failts after that). Cayte, could you do a checkout of the CVS code in a fresh directory, and make sure that you've committed all of the files for these test and modules? It looks like all of the problems are missing files, which you probably have in your local working directory, but you haven't committed to the CVS repository. Thanks Gavin for the heads up and making us look at this! Brad From jchang at smi.stanford.edu Fri Dec 7 12:08:32 2001 From: jchang at smi.stanford.edu (Jeffrey Chang) Date: Sat Mar 5 14:43:07 2005 Subject: [Biopython-dev] Failed Tests In-Reply-To: <20011207102005.A47951@ci350185-a.athen1.ga.home.com> References: <01120614593201.13517@sienna.berkeley.edu> <20011207102005.A47951@ci350185-a.athen1.ga.home.com> Message-ID: <20011207090832.A644@krusty.stanford.edu> People, please check to make sure your regression tests are working before you check them in. Having regression tests that always fail is worse than having no regression tests and causes a lot of wasted time. 
Thanks to Gavin and Brad for checking into this!

Jeff

From katel at worldpath.net Fri Dec 7 18:36:31 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:43:08 2005
Subject: [Biopython-dev] Failed Tests
References: <01120614593201.13517@sienna.berkeley.edu> <20011207102005.A47951@ci350185-a.athen1.ga.home.com>
Message-ID: <000701c17f78$01453380$010a0a0a@cadence.com>

----- Original Message -----
From: "Brad Chapman"
To:
Sent: Friday, December 07, 2001 7:20 AM
Subject: Re: [Biopython-dev] Failed Tests

> Hi Gavin;
>
> > I now have 3 tests failing. (Which is a lot better than a month ago.)
>
> I'm glad it improved :-). I did some cross-version, cross-platform work
> on the tests, so it's good that I actually fixed some tests.
>
> > test_intelligenetics and test_metatool still fail, as does test_nbrf.
>
> Hmmm, these are all Cayte's tests, and it looks like all of the failures
> are due to non-committed files (NBRF was not in the setup.py file, which
> I fixed, but it still fails after that).
>
> Cayte, could you do a checkout of the CVS code in a fresh directory, and
> make sure that you've committed all of the files for these tests and
> modules? It looks like all of the problems are missing files, which you
> probably have in your local working directory, but you haven't committed
> to the CVS repository.
>

I plan to look into it. I was planning to fix it yesterday but my keyboard
conked out, requiring a run to CompUSA.
Cayte From katel at worldpath.net Fri Dec 7 18:54:21 2001 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] Failed Tests References: <01120614593201.13517@sienna.berkeley.edu> <20011207102005.A47951@ci350185-a.athen1.ga.home.com> <20011207090832.A644@krusty.stanford.edu> Message-ID: <001d01c17f7a$7ee8acc0$010a0a0a@cadence.com> ----- Original Message ----- From: "Jeffrey Chang" To: Sent: Friday, December 07, 2001 9:08 AM Subject: Re: [Biopython-dev] Failed Tests > People, please check to make sure your regression tests are working > before you check them in. Having regression tests that always fail is > worse than having no regression tests and causes a lot of wasted time. > > Thanks for Gavin and Brad for checking into this! > MetaTool worked on my system because its Windows/Dos which is not case sensitive. Ideally I should run these tests on the Unix system but I'm queasy about running it since I don't own the computer. ( At work we have a MITS department to fix crashed computers ). The tests worked on my local system. Also in the CVS docs I didn't see a tree listing command. This would help a lot in checking for missing uploads if someone knows what this is. I think Tarjei added ValSeq since the tests. Cayte Cayte / From tarjei at genome.wi.mit.edu Fri Dec 7 16:09:19 2001 From: tarjei at genome.wi.mit.edu (Tarjei Mikkelsen) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] Failed Tests In-Reply-To: <001d01c17f7a$7ee8acc0$010a0a0a@cadence.com> Message-ID: <000401c17f63$6fb35c80$67135512@mit.edu> >I think Tarjei added ValSeq since the tests. ValSeq? Nope, not me... 
- Tarjei From katel at worldpath.net Fri Dec 7 19:10:40 2001 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] Failed Tests References: <000401c17f63$6fb35c80$67135512@mit.edu> Message-ID: <003a01c17f7c$c677d960$010a0a0a@cadence.com> ----- Original Message ----- From: "Tarjei Mikkelsen" To: "'Cayte'" ; "'Jeffrey Chang'" ; Sent: Friday, December 07, 2001 1:09 PM Subject: RE: [Biopython-dev] Failed Tests > >I think Tarjei added ValSeq since the tests. > > ValSeq? Nope, not me... > > > - Tarjei I'll check it in anyway. Cayte From katel at worldpath.net Fri Dec 7 19:25:29 2001 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] Failed Tests References: <000401c17f63$6fb35c80$67135512@mit.edu> Message-ID: <004601c17f7e$d83bf760$010a0a0a@cadence.com> ----- Original Message ----- From: "Tarjei Mikkelsen" To: "'Cayte'" ; "'Jeffrey Chang'" ; Sent: Friday, December 07, 2001 1:09 PM Subject: RE: [Biopython-dev] Failed Tests > >I think Tarjei added ValSeq since the tests. > > ValSeq? Nope, not me... > > Sorry, I was thinking of the Pathway/metatool stuff. Cayte From gec at compbio.berkeley.edu Fri Dec 7 15:55:14 2001 From: gec at compbio.berkeley.edu (Gavin E. Crooks) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] Failed Tests In-Reply-To: <001d01c17f7a$7ee8acc0$010a0a0a@cadence.com> References: <01120614593201.13517@sienna.berkeley.edu> <20011207090832.A644@krusty.stanford.edu> <001d01c17f7a$7ee8acc0$010a0a0a@cadence.com> Message-ID: <01120713045402.13517@sienna.berkeley.edu> > MetaTool worked on my system because its Windows/Dos which is not case > sensitive. > Ideally I should run these tests on the Unix system but I'm queasy about > running it since I don't own the computer. ( At work we have a MITS > department to fix crashed computers ). The tests worked on my local system. One possible improvement would be to use Continuous Integration. 
http://www.martinfowler.com/articles/continuousIntegration.html For example, we could have a daemon that runs once a day. It would do a clean installation, build and test of biopython, and send out warning emails if anything goes wrong. This would ensure that tests never stay broken for very long. I kind of like this idea, and I may have a go at implementing it sometime. Gavin From gec at compbio.berkeley.edu Fri Dec 7 16:15:35 2001 From: gec at compbio.berkeley.edu (Gavin E. Crooks) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] Failed Tests In-Reply-To: <01120713045402.13517@sienna.berkeley.edu> References: <01120614593201.13517@sienna.berkeley.edu> <001d01c17f7a$7ee8acc0$010a0a0a@cadence.com> <01120713045402.13517@sienna.berkeley.edu> Message-ID: <01120713202303.13517@sienna.berkeley.edu> test_metatool is now giving an even more mysterious error message!? Gavin ====================================================================== ERROR: test_metatool ---------------------------------------------------------------------- Traceback (most recent call last): File "run_tests.py", line 136, in runTest __import__(self.test_name) File "./test_metatool.py", line 32, in ? 
    print data
  File "/n/teal.berkeley.edu/home/gec/local/lib/python2.1/site-packages/Bio/MetaTool/Record.py", line 119, in __str__
  File "/n/teal.berkeley.edu/home/gec/local/lib/python2.1/site-packages/Bio/MetaTool/Record.py", line 51, in __str__
    if( self.matrix != None ):
  File "/n/teal.berkeley.edu/home/gec/local/lib/python2.1/site-packages/Numeric/UserArray.py", line 163, in __ne__
    def __ne__(self,other): return self._rc(not_equal(self.array,other))
SystemError: Objects/object.c:727: bad argument to internal function
----------------------------------------------------------------------
Ran 1 tests in 1.541s

From jchang at smi.stanford.edu Fri Dec 7 16:55:40 2001
From: jchang at smi.stanford.edu (Jeffrey Chang)
Date: Sat Mar 5 14:43:08 2005
Subject: [Biopython-dev] Failed Tests
In-Reply-To: <01120713045402.13517@sienna.berkeley.edu>
References: <01120614593201.13517@sienna.berkeley.edu> <20011207090832.A644@krusty.stanford.edu> <001d01c17f7a$7ee8acc0$010a0a0a@cadence.com> <01120713045402.13517@sienna.berkeley.edu>
Message-ID: <20011207135540.D644@krusty.stanford.edu>

On Fri, Dec 07, 2001 at 12:55:14PM -0800, Gavin E. Crooks wrote:
> For example, we could have a daemon that runs once a day. It would do a
> clean installation, build and test of biopython, and send out warning emails
> if anything goes wrong. This would ensure that tests never stay broken for
> very long.
>
> I kind of like this idea, and I may have a go at implementing it sometime.

Yeah, that would be cool. All the bio* projects have been needing such
a system for build and regression tests.
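
[Archive note] The once-a-day build-and-test daemon proposed in this thread could be sketched roughly as below. This is only an illustration under stated assumptions, not anything the list members wrote: it uses modern Python 3 and the standard library, the step list is a hypothetical placeholder (the thread does not say which checkout, install or test commands would be run), and the addresses are made up. Actually sending the message and scheduling the job (for example from cron) are left out.

```python
import subprocess
import sys
from email.message import EmailMessage


def run_step(name, command):
    """Run one build/test step; return (name, succeeded, combined output)."""
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return name, result.returncode == 0, result.stdout


def build_warning(results, sender, recipient):
    """Return an EmailMessage listing the failed steps, or None if all passed."""
    failed = [(name, output) for name, ok, output in results if not ok]
    if not failed:
        return None
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = "Nightly build: %d step(s) failed" % len(failed)
    body = "\n\n".join("== %s ==\n%s" % (name, output) for name, output in failed)
    message.set_content(body)
    return message


if __name__ == "__main__":
    # Hypothetical placeholder step; a real job would list the clean
    # checkout, install and regression-test commands here.
    steps = [("smoke", [sys.executable, "-c", "print('ok')"])]
    results = [run_step(name, command) for name, command in steps]
    # Made-up addresses, for illustration only.
    warning = build_warning(results, "builder@example.org", "dev-list@example.org")
    print("all steps passed" if warning is None else warning)
```

Keeping the "run a step" and "compose the warning" parts separate means the reporting logic can be checked without touching a mail server.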
If you're interested in working on this, or have ideas about how it
should be done, please contact Chris Dagdigian (dag@sonsorol.org) and
the folks on the website mailing list:
http://bioperl.org/mailman/listinfo/webteam

Jeff

From katel at worldpath.net Fri Dec 7 20:01:25 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:43:08 2005
Subject: [Biopython-dev] Align
References: <000701c17ae2$e42349c0$010a0a0a@cadence.com> <20011206080820.C45321@ci350185-a.athen1.ga.home.com>
Message-ID: <005b01c17f83$dd897120$010a0a0a@cadence.com>

----- Original Message -----
From: "Brad Chapman"
To:
Sent: Thursday, December 06, 2001 5:08 AM
Subject: Re: [Biopython-dev] Align

> Hi Cayte;
>
> > The sig in Generic.Alignment function, add_sequence,
> > def add_sequence(self, descriptor, sequence, start = None, end = None,
> > weight = 1.0):
> >
> > does not allow the caller to pass in the name of the sequence.
>
> Hmmm... this is what the "descriptor" argument is supposed to be for. Do
> you need to pass in more than just this?
>

Usually I think of name as a tag and descriptor as a line of text that
elaborates.

Cayte

From katel at worldpath.net Fri Dec 7 20:27:35 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:43:08 2005
Subject: [Biopython-dev] Failed Tests
References: <01120614593201.13517@sienna.berkeley.edu> <001d01c17f7a$7ee8acc0$010a0a0a@cadence.com> <01120713045402.13517@sienna.berkeley.edu> <01120713202303.13517@sienna.berkeley.edu>
Message-ID: <006301c17f87$87cf3220$010a0a0a@cadence.com>

----- Original Message -----
From: "Gavin E. Crooks"
To:
Cc: "Cayte"
Sent: Friday, December 07, 2001 1:15 PM
Subject: Re: [Biopython-dev] Failed Tests

> test_metatool is now giving an even more mysterious error message!?
>
> Gavin
>
> ======================================================================
> ERROR: test_metatool
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "run_tests.py", line 136, in runTest
>     __import__(self.test_name)
>   File "./test_metatool.py", line 32, in ?
>     print data
>   File "/n/teal.berkeley.edu/home/gec/local/lib/python2.1/site-packages/Bio/MetaTool/Record.py", line 119, in __str__
>   File "/n/teal.berkeley.edu/home/gec/local/lib/python2.1/site-packages/Bio/MetaTool/Record.py", line 51, in __str__
>     if( self.matrix != None ):
>   File "/n/teal.berkeley.edu/home/gec/local/lib/python2.1/site-packages/Numeric/UserArray.py", line 163, in __ne__
>     def __ne__(self,other): return self._rc(not_equal(self.array,other))
> SystemError: Objects/object.c:727: bad argument to internal function
> ----------------------------------------------------------------------

Looks like a possible versioning problem. My version is 20.1 but it
looks like 20.2 came down the pike in September.

I think you've got a great idea with continuous integration. It would
solve versioning too. Let me know if I can help even though my system
is Windows for now. Somehow I'm not enthusiastic about XP so I may go
for Linux.

Cayte

From jchang at smi.stanford.edu Tue Dec 11 13:26:51 2001
From: jchang at smi.stanford.edu (Jeffrey Chang)
Date: Sat Mar 5 14:43:08 2005
Subject: [Biopython-dev] ready for release?
Message-ID: <20011211102650.B412@krusty.stanford.edu>

Hello everybody,

We've got a lot of new stuff, so I think it's time to roll a new
release. This will still be an alpha release, which means that new
features are ok, as long as they're relatively bug-free.

For core developers, please let me know if this is a good time to do
it, when it might be possible (e.g. after this nasty core dump gets
fixed today :), or any other issues that might be related to sending
this code out into the world...
Jeff

From chapmanb at arches.uga.edu Tue Dec 11 13:59:05 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar 5 14:43:08 2005
Subject: [Biopython-dev] ready for release?
In-Reply-To: <20011211102650.B412@krusty.stanford.edu>
References: <20011211102650.B412@krusty.stanford.edu>
Message-ID: <20011211135905.A6612@ci350185-a.athen1.ga.home.com>

Hey Jeff;

Sweet. Glad we're getting this together. I just finished my last paper
early Monday morning and am rolling together some lab stuff I've been
working on right now, so I should have some time over the next week to
work on this. So it's a good time, I think :-)

> We've got a lot of new stuff, so I think it's time to roll a new
> release. This will still be an alpha release, which means that new
> features are ok, as long as they're relatively bug-free.

Okay, I've got a bunch of code that could be checked in. Here's the
list:

=> Generic Application Framework (Bio/Application). This is basically
what I wrote about previously; a general way to construct commandlines
for programs. This includes a commandline for BLAST
(Bio/Blast/Program.py) and functionality for running any commandline.
This Application stuff also interacts with BioCorba, so it is very
cool; I think :-)

=> Parsers and commandline interfaces for some Emboss primer-related
programs (primer3 and primersearch). Bio/Emboss/Primer.py and
Program.py plus some martel definitions.

=> Neural Network code (Bio/NeuralNetwork). Back propagation neural
networks, plus code to convert sequences as inputs into Neural
networks.

=> Basic Hidden Markov Models (Bio/HMM). This includes Standard and
Baum Welch trainers and Viterbi prediction, all based heavily on the
Durbin et al book.

=> Genetic Algorithm code (Bio/GA). This includes a fairly general
Genetic Algorithm framework, so isn't biology specific, but useful.

=> Drawing code that interacts with the reportlab pdf generation
library (Bio/Graphics). This makes it easier to draw pretty pictures
of chromosomes, and some other chart and graph stuff.

Whew, I think that's it. The code has all been used in real life
applications (which is why I wrote it :-), and has fairly good tests
written in the standard biopython style. The lacking thing is
documentation; I haven't been able to get myself up to writing docs
for a while (too many damn papers for classes, I guess :-).

What do people think? Do you want any of this? Which modules? Do you
want me to make a tarball of the code so you can look at it? If you
just want to glance, this is in CVS at:

http://bioinformatics.org/cgi-bin/cvsweb.cgi/biopy-pgml/

Let me know what you guys think. I'm very happy to donate this to
biopython if you want it, and think I should have time over the next
week to check it all in and everything.

All-done-blathering-now-ly yr's
Brad

From katel at worldpath.net Tue Dec 11 21:36:56 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:43:08 2005
Subject: [Biopython-dev] ready for release?
References: <20011211102650.B412@krusty.stanford.edu>
Message-ID: <007601c182b5$dec18340$010a0a0a@cadence.com>

----- Original Message -----
From: "Jeffrey Chang"
To:
Sent: Tuesday, December 11, 2001 10:26 AM
Subject: [Biopython-dev] ready for release?

> Hello everybody,
>
> We've got a lot of new stuff, so I think it's time to roll a new
> release. This will still be an alpha release, which means that new
> features are ok, as long as they're relatively bug-free.
>
> For core developers, please let me know if this is a good time to do
> it, when it might be possible (e.g. after this nasty core dump gets
> fixed today :), or any other issues that might be related to sending
> this code out into the world...

I downloaded the latest version of NumPy and made a change to my code
which fixed a problem Gavin pointed out. However, my tests rely on the
repr of Matrix.
A change in the latest rev of NumPy causes the printout of the matrices to be a little different. I'll need to submit a new baseline. Also, something on Gavin's system is causing test_nbrf to fail. I downloaded the nbrf files from the CVS browser and ran again and it passed on my Windows system. On Gavin's system it fails near the line feed. Cayte From idoerg at cc.huji.ac.il Tue Dec 11 19:33:13 2001 From: idoerg at cc.huji.ac.il (Iddo Friedberg) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] ready for release? In-Reply-To: <20011211102650.B412@krusty.stanford.edu> Message-ID: Fine with me. Unless there is a bug in one of "my" modules (Subsmat, FSSP) in which case I cannot do anything about it before the middle of next week. Iddo On Tue, 11 Dec 2001, Jeffrey Chang wrote: : Hello everybody, : : We've got a lot of new stuff, so I think it's time to roll a new : release.This will still be an alpha release, which means that new : features are ok, as long as they're relatively bug-free. : : For core developers, please let me knowif this is a good time to do : it, when it might be possible (e.g. after this nasty core dump gets : fixed today :), or any other issues that might be related to sending : this code out into the world... : : Jeff : _______________________________________________ : Biopython-dev mailing list : Biopython-dev@biopython.org : http://biopython.org/mailman/listinfo/biopython-dev : -- Iddo Friedberg | Tel: +972-2-6757374 Dept. of Molecular Genetics and Biotechnology | Fax: +972-2-6757308 The Hebrew University - Hadassah Medical School | email: idoerg@cc.huji.ac.il POB 12272, Jerusalem 91120 | Israel | http://bioinfo.md.huji.ac.il/marg/people-home/iddo/ From adalke at mindspring.com Wed Dec 12 05:21:59 2001 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] Martel changes Message-ID: <021e01c182f6$d5cafbe0$0301a8c0@josiah.dalkescientific.com> I'm doing some work with Martel again. 
Cayte asked earlier about a way to simplify working with Martel
callbacks. I outlined a possible simplification. I've implemented it.
It's called 'SimpleFields'. You can take a look at it at

http://www.biopython.org/~dalke/SimpleFields.py

It supports both callback and iterative styles. (See the module
docstring for examples of each.) I'm going to add it to CVS as soon as
I think up better names than 'SimpleFields' and 'groups'. (Although I
do like 'LAX' :)

Is anyone using the iterator facility in Martel? I would like to
change the API. Currently you pass it the factory function which
produces SAX handlers. I would rather just pass it a SAX handler, and
trust the handler to reset itself properly with the
startDocument/endDocument methods. (Those which don't can easily be
wrapped.)

The problem with the current API is when the handler needs parameters
then you need to create something which passes those parameters to
each instance. It's ugly, and it's common... I think. I also don't
like that the object is created for every record instead of reusing
the existing one.

I don't think anyone uses this feature, so I'll go ahead and change it
unless someone gives me a good reason otherwise.

Finally, I'm adding some common patterns to the top-level
Martel/__init__.py. These are for things like 'Word' which is

    def Word(name = None, attrs = None):
        exp = Re(r"\w+")
        if name is None:
            if attrs is not None:
                raise TypeError("....")
            return exp
        return Group(name, exp, attrs)

The idea is to make it easier to specify, say, a list of words on a
line

    format = Word("species") + Whitespace() + \
             Word("count") + Whitespace() + \
             ToEol("sequence")

Has anyone started building up a collection of those common patterns?
I've got Integer, SignedInteger, Float, Word, and Whitespace. I'll
probably add Spaces (for only " "), NonSpaces (up to a " ").

Comments on any of these?
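
[Archive note] For readers without Martel to hand, the idea of small named pattern helpers that are combined into a line format can be approximated with the standard `re` module. The sketch below is not Martel's API: the lower-case helper names `word`, `whitespace`, `to_eol` and `parse_line` are hypothetical stand-ins chosen for illustration, and named groups play the role of Martel's named fields.

```python
import re


def word(name):
    """Named group matching one or more word characters."""
    return r"(?P<%s>\w+)" % name


def whitespace():
    """One or more spaces or tabs (deliberately not newlines)."""
    return r"[ \t]+"


def to_eol(name):
    """Named group capturing everything up to the end of the line."""
    return r"(?P<%s>[^\n]*)" % name


# Same shape as the 'species count sequence' format in the message above.
line_format = re.compile(
    word("species") + whitespace()
    + word("count") + whitespace()
    + to_eol("sequence")
)


def parse_line(line):
    """Return a dict of the named fields, or None if the line does not match."""
    match = line_format.match(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_line("human 42 ACGT ACGT"))
```

Because each helper only returns a pattern fragment, new building blocks can be added without changing how formats are composed.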
Andrew dalke@dalkescientific.com From adalke at mindspring.com Wed Dec 12 05:23:44 2001 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] ready for release? Message-ID: <021f01c182f7$14855560$0301a8c0@josiah.dalkescientific.com> Jeff: >We've got a lot of new stuff, so I think it's time to roll a new >release. This will still be an alpha release, which means that new >features are ok, as long as they're relatively bug-free. > >For core developers, please let me know if this is a good time to do >it, when it might be possible (e.g. after this nasty core dump gets >fixed today :), or any other issues that might be related to sending >this code out into the world... Can you hold up until Friday? I want to get these last bits of Martel changes written, tested, and into CVS. Then I can make Johann happy by having a new Martel release. Andrew From jchang at smi.stanford.edu Wed Dec 12 12:52:12 2001 From: jchang at smi.stanford.edu (Jeffrey Chang) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] ready for release? In-Reply-To: <021f01c182f7$14855560$0301a8c0@josiah.dalkescientific.com> References: <021f01c182f7$14855560$0301a8c0@josiah.dalkescientific.com> Message-ID: <20011212095212.B304@krusty.stanford.edu> On Wed, Dec 12, 2001 at 03:23:44AM -0700, Andrew Dalke wrote: > Can you hold up until Friday? I want to get these last bits > of Martel changes written, tested, and into CVS. Then I can > make Johann happy by having a new Martel release. Yeah, no problem. Please have a contingency plan so that you can back things out if the changes are taking longer than planned. I'm going on vacation at the end of next week and would like to roll the release before then! 
:) Thanks, Jeff From jchang at smi.stanford.edu Wed Dec 12 13:07:27 2001 From: jchang at smi.stanford.edu (Jeffrey Chang) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] Martel changes In-Reply-To: <021e01c182f6$d5cafbe0$0301a8c0@josiah.dalkescientific.com> References: <021e01c182f6$d5cafbe0$0301a8c0@josiah.dalkescientific.com> Message-ID: <20011212100727.C304@krusty.stanford.edu> On Wed, Dec 12, 2001 at 03:21:59AM -0700, Andrew Dalke wrote: > Is anyone using the iterator facility in Martel? Yes. I'm using it in Bio/Medline/NLMMedlineXML to parse the XML-formatted PubMed records. Each XML file contains about ~30000 records and is too big to keep in memory at once. > I would like to change the API. Currently you pass > it the factory function which produces SAX handlers. > I would rather just pass it a SAX handler, and > trust the handler to reset itself properly with the > startDocument/endDocument methods. (Those which > don't can easily be wrapped.) > > The problem with the current API is when the handler > needs parameters then you need to create something > which passed those parameters to each instance. It's > ugly, and it's common... I think. I also don't like > that the object is created for every record instead > of reusing the existing one. Sure. Let me know if you do it, so that I can update my files accordingly. I don't think it'll be hard to handle what you describe. > Has anyone started building up a collection of those > common patterns? I've got Integer, SignedInteger, Float, > Word, and Whitespace. I'll probably add Spaces (for > only " "), NonSpaces (up to a " "). Sounds good. Looking through my code, other ones I use are Digits (more general name for Integer), Punctuation, and Unprintable(AnyBut(string.printable)). Actually, could you make more general equivalents of some of the names? 
For example, presumably Digits and Integer would match the same
things, but a lot of times you want to match some numerical characters
and calling it an integer might be a tad confusing...

Jeff

From adalke at mindspring.com Wed Dec 12 15:05:55 2001
From: adalke at mindspring.com (Andrew Dalke)
Date: Sat Mar 5 14:43:08 2005
Subject: [Biopython-dev] Martel changes
Message-ID: <02fd01c18348$6918c700$0301a8c0@josiah.dalkescientific.com>

Me:
>> Is anyone using the iterator facility in Martel?

Jeff:
>Yes. I'm using it in Bio/Medline/NLMMedlineXML to parse the
>XML-formatted PubMed records. Each XML file contains about ~30000
>records and is too big to keep in memory at once.

Do you pass in just the constructor (no args) or do you need to create
a factory function instance which knows how to pass in the args? Can
the handler object you use be reinitialized via calling
'startDocument'?

>Sure. Let me know if you do it, so that I can update my files
>accordingly. I don't think it'll be hard to handle what you describe.

It shouldn't be. I'm remembering the reasons I didn't do it that way
the first time, and I want to see if my concerns (mentioned above) are
true or not.

>Looking through my code, other ones I use are Digits
>(more general name for Integer), Punctuation,
>and Unprintable(AnyBut(string.printable)).
>
>Actually, could you make more general equivalents of some of the
>names? For example, presumably Digits and Integer would match the
>same things, but a lot of times you want to match some numerical
>characters and calling it an integer might be a tad confusing...

Ah! Yes, 'Digits' is better than 'Integer'. It also lets me replace
'SignedInteger' with 'Integer'.

When do you use Unprintable? When do you use Punctuation?

My 'Float' isn't very powerful, as it only understands
numbers of the form (with optional +/-)
  1
  1.
  1.2
  .2

It doesn't handle things like 1E-3, or IEEE values
like NaN or +Inf. I could (and probably should) support
the first of these.
I'm not sure if I should support the second. Andrew dalke@dalkescientific.com From katel at worldpath.net Thu Dec 13 20:26:46 2001 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] ready for release? References: <021f01c182f7$14855560$0301a8c0@josiah.dalkescientific.com> Message-ID: <003301c1843e$68ca1f00$010a0a0a@cadence.com> I just updated the MetaTool stuff to handle empty matrices with the latest rev of NumPy. Cayte From gec at compbio.berkeley.edu Thu Dec 13 18:41:15 2001 From: gec at compbio.berkeley.edu (Gavin E. Crooks) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] ready for release? In-Reply-To: <003301c1843e$68ca1f00$010a0a0a@cadence.com> References: <021f01c182f7$14855560$0301a8c0@josiah.dalkescientific.com> <003301c1843e$68ca1f00$010a0a0a@cadence.com> Message-ID: <0112131545390R.13517@sienna.berkeley.edu> All regression tests pass! Well, on my machine at any rate. Hopefully nothing will break before Jeff gets the release out! Gavin From jchang at smi.stanford.edu Fri Dec 14 01:47:45 2001 From: jchang at smi.stanford.edu (Jeffrey Chang) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] ready for release? In-Reply-To: <20011211135905.A6612@ci350185-a.athen1.ga.home.com> References: <20011211102650.B412@krusty.stanford.edu> <20011211135905.A6612@ci350185-a.athen1.ga.home.com> Message-ID: <20011213224745.B627@krusty.stanford.edu> On Tue, Dec 11, 2001 at 01:59:05PM -0500, Brad Chapman wrote: > Okay, I've got a bunch of code that could be checked in. Here's the > list: [cut impressive list of new functionality] > What do people think? Do you want any of this? Which modules? Do you > want me to make a tarball of the code so you can look at it? If you just > want to glance, this is in CVS at: It all looks like useful functionality. Please check it in, provided it's working!
:) Jeff From jchang at smi.stanford.edu Fri Dec 14 02:01:59 2001 From: jchang at smi.stanford.edu (Jeffrey Chang) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] Martel changes In-Reply-To: <02fd01c18348$6918c700$0301a8c0@josiah.dalkescientific.com> References: <02fd01c18348$6918c700$0301a8c0@josiah.dalkescientific.com> Message-ID: <20011213230159.C627@krusty.stanford.edu> On Wed, Dec 12, 2001 at 01:05:55PM -0700, Andrew Dalke wrote: > Me: > >> Is anyone using the iterator facility in Martel? > > Jeff: > >Yes. I'm using it in Bio/Medline/NLMMedlineXML to parse the > >XML-formatted PubMed records. Each XML file contains about ~30000 > >records and is too big to keep in memory at once. Oops, I just looked over the code. I'm in fact not using the iterator, but the RecordReader. Sorry about the confusion! [adding Word, Integer, ... as built-in expressions] > When do you use Unprintable? When do you use Punctuation? I use them both for matching things in English text. Sometimes the text contains unprintable characters from foreign character sets. > My 'Float' isn't very powerful, as it only understands > numbers of the form (with optional +/-) > 1 > 1. > 1.2 > .2 > > It doesn't handle things like 1E-3, or IEEE values > like NaN or +Inf. I could (and probably should) support > the first of these. I'm not sure if I should support the second. It gets pretty complicated, e.g. 1.315E2.24 Jeff From adalke at mindspring.com Fri Dec 14 07:22:18 2001 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] Martel changes Message-ID: <002b01c18499$f94b2a00$0301a8c0@josiah.dalkescientific.com> Jeff: >Oops, I just looked over the code. I'm in fact not using the >iterator, but the RecordReader. Sorry about the confusion! No problem, and fewer changes for you! Me: >> When do you use Unprintable? When do you use Punctuation? >I use them both for matching things in English text.
Sometimes the >text contains unprintable characters from foreign character sets. Okay, if you say it's useful, I'll add it. What do you define as punctuation? >> My 'Float' isn't very powerful, as it only understands >> numbers of the form (with optional +/-) >It gets pretty complicated, e.g. >1.315E2.24 That's not a valid floating point number -- the exponent must be an integer. BTW, I'm working on a 'Time' submodule, which should make it easier to parse time and date data structures. The language I used is based on strptime, plus some experimental extensions to make it easier for me to use. The idea is to make it easier to parse something like 1970-08-22 using a pattern like %(4-year)-%m-%d than having to write (?P\d{4})-(?P\d{2})-(?\d{2}) all the time. (Plus, the patterns I use are stricter, in that you can't use a day like "43".) For example, (with judicious newlines for clarity) >>> from Martel import Time >>> print Time.make_pattern("%m/%d/%Y") (?P(0[0-9]|1[012]))/ (?P(0[1-9]|[12][0-9]|3[01]))/ (?P\d{4}) >>> >>> parser = Time.make_expression("%(Jan) %(year)\n").make_parser() >>> from xml.sax import saxutils >>> parser.setContentHandler(saxutils.XMLGenerator()) >>> parser.parseString("Dec 2001\n") Dec 2001 >>> It's nearly done - only about an hour of work left. Then to add the useful patterns, and the SimpleFields (or whatever I decide to call it). I should be able to finish it by Friday .. today. The code is temporarily at http://www.biopython.org/~dalke/Time.py but it uses a new 'NullOp' Expression not yet in CVS for doing the 'make_expression' function. Andrew dalke@dalkescientific.com From chapmanb at arches.uga.edu Fri Dec 14 11:36:50 2001 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] ready for release? 
In-Reply-To: <20011213224745.B627@krusty.stanford.edu> References: <20011211102650.B412@krusty.stanford.edu> <20011211135905.A6612@ci350185-a.athen1.ga.home.com> <20011213224745.B627@krusty.stanford.edu> Message-ID: <20011214113650.A16036@ci350185-a.athen1.ga.home.com> [I blather on about the modules I wrote that could be checked in] > It all looks like useful functionality. Please check it in, provided > it's working! :) Okee dokee. Checked in. Whew. And I think it all works :-). I've checked this on a couple of computers and on Windows, so I think all the tests are cross-platform-good and all files are checked in. If other people could get the CVS and make sure all the tests pass on their computer, I would be very appreciative! You do need reportlab installed for the graphics tests to pass. (http://www.reportlab.com/download.html). By the way, I've just checked the current CVS on Windows and all tests pass. Yay! Thanks to everyone who worked on the cross-platform tests. So I think we're good to go from my side, as long as I didn't muck anything up with my checkins. Brad From adalke at mindspring.com Sat Dec 15 04:42:36 2001 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] Martel changes Message-ID: <004d01c1854c$d480bce0$0301a8c0@josiah.dalkescientific.com> Me to Jeff: >What do you define as punctuation? Duh! I see there's a "string.punctuation". "Punctuation" added to CVS. Also added: Digits == \d+ Word == \w+ Spaces == same as \s+ except not including newline Unprintable == AnyBut(string.printable) These all take an optional name and attributes for a Group. Changed "Integer" to "[+-]?\d+" (It had been the same as what Digits is now.) Removed SignedInteger. Added a new type of Expression -- NullOp. This simplified the implementation of Time.py New submodule "Time.py" for building patterns and/or expressions for parsing strings. Has a full regression test and docstring. 
Added "LAX" as a new way to handle "simple" XML records. Docstring may need some updating. (It's too late for me to think clearly enough to tell if the documentation is reasonable.) Also, additional documentation on the topic, which I sent earlier today to c.l.py, is attached to this email. Bug fixed! - someone in personal email pointed out the named group backreferences ("(?P=name)" construct) weren't working. Turned out I didn't even have a regression test for that case. Both problems now fixed. Regression tests added for all the new code. All tests pass. Some cleanup here and there. Excepting that it would be nice if others could check that my new code (and changes) really does work, I'm ready for a new release. Even ready for a new Martel release. Andrew dalke@dalkescientific.com -------------- next part -------------- An embedded message was scrubbed... From: "Andrew Dalke" Subject: Re: XML parsing besides SAX and DOM Date: Fri, 14 Dec 2001 14:13:06 -0700 Size: 4013 Url: http://portal.open-bio.org/pipermail/biopython-dev/attachments/20011215/6bfd2f7a/ReXMLparsingbesidesSAXandDOM.nws From chapmanb at arches.uga.edu Sat Dec 15 10:08:02 2001 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] Martel changes In-Reply-To: <004d01c1854c$d480bce0$0301a8c0@josiah.dalkescientific.com> References: <004d01c1854c$d480bce0$0301a8c0@josiah.dalkescientific.com> Message-ID: <20011215100802.A334@ci350185-a.athen1.ga.home.com> Andrew; Thanks for all the Martel changes -- the new additions look great. > Changed "Integer" to "[+-]?\d+" (It had been the same > as what Digits is now.) > > Removed SignedInteger. The only problem I noticed was that this broke Cayte's metatool parser since it used SignedInteger. I updated metatool to use Digits in place of Integer and Integer in place of SignedInteger. I think this is right based on reading your e-mails, and all tests pass now.
Cayte, can you doublecheck and make sure I did the right thing? I don't want to break any of your code. I attached a diff with the changes I made. Brad -- PGP public key available from http://pgp.mit.edu/ -------------- next part -------------- Index: metatool_format.py =================================================================== RCS file: /home/repository/biopython/biopython/Bio/MetaTool/metatool_format.py,v retrieving revision 1.2 retrieving revision 1.3 diff -c -r1.2 -r1.3 *** metatool_format.py 2001/09/13 00:28:47 1.2 --- metatool_format.py 2001/12/15 14:56:05 1.3 *************** *** 18,24 **** import string # Martel ! from Martel import Opt, Alt, Integer, SignedInteger, Group, Str, MaxRepeat from Martel import Any, AnyBut, RepN, Rep, Rep1, ToEol, AnyEol from Martel import Expression from Martel import RecordReader --- 18,24 ---- import string # Martel ! from Martel import Opt, Alt, Digits, Integer, Group, Str, MaxRepeat from Martel import Any, AnyBut, RepN, Rep, Rep1, ToEol, AnyEol from Martel import Expression from Martel import RecordReader *************** *** 32,40 **** lower_case_letter = Group( "lower_case_letter", Any( "abcdefghijklmnopqrstuvwxyz" ) ) digits = "0123456789" ! enzyme = Group( "enzyme", optional_blank_space + Integer() + optional_blank_space + Str( ':' ) + ToEol() ) ! reaction = Group( "reaction", optional_blank_space + Integer() + optional_blank_space + Str( ":" ) + ToEol() ) not_found_line = Group( "not_found_line", optional_blank_space + Str( "- not found -" ) + ToEol() ) --- 32,40 ---- lower_case_letter = Group( "lower_case_letter", Any( "abcdefghijklmnopqrstuvwxyz" ) ) digits = "0123456789" ! enzyme = Group( "enzyme", optional_blank_space + Digits() + optional_blank_space + Str( ':' ) + ToEol() ) ! 
reaction = Group( "reaction", optional_blank_space + Digits() + optional_blank_space + Str( ":" ) + ToEol() ) not_found_line = Group( "not_found_line", optional_blank_space + Str( "- not found -" ) + ToEol() ) *************** *** 54,61 **** reactions_list ) rev = Group( "rev", Opt( lower_case_letter ) ) ! version = Group( "version", Integer( "version_major") + Any( "." ) + ! Integer( "version_minor") + rev ) metatool_tag = Str( "METATOOL OUTPUT" ) metatool_line = Group( "metatool_line", metatool_tag + blank_space + Str( "Version" ) + blank_space + version + ToEol() ) --- 54,61 ---- reactions_list ) rev = Group( "rev", Opt( lower_case_letter ) ) ! version = Group( "version", Digits( "version_major") + Any( "." ) + ! Digits( "version_minor") + rev ) metatool_tag = Str( "METATOOL OUTPUT" ) metatool_line = Group( "metatool_line", metatool_tag + blank_space + Str( "Version" ) + blank_space + version + ToEol() ) *************** *** 66,85 **** metabolite_count_tag = Str( "INTERNAL METABOLITES:" ) metabolite_count_line = Group( "metabolite_count_line", metabolite_count_tag + ! blank_space + Integer( "num_int_metabolites" ) + ToEol() ) reaction_count_tag = Str( "REACTIONS:" ) reaction_count_line = Group( "reaction_count_line", reaction_count_tag + blank_space + ! Integer( "num_reactions" ) + ToEol() ) type_metabolite = Group( "type_metabolite", Alt( Str( "int" ), \ Str( "external" ) ) ) metabolite_info = Group( "metabolite_info", optional_blank_space + ! Integer() + blank_space + type_metabolite + blank_space + # Integer() + blank_space + Rep1( lower_case_letter ) + Rep1( AnyBut( white_space ) ) ) metabolite_line = Group( "metabolite_line", metabolite_info + ToEol() ) ! 
metabolites_summary = Group( "metabolites_summary", optional_blank_space + Integer() + blank_space + Str( "metabolites" ) + ToEol() ) metabolites_block = Group( "metabolites_block", Rep1( metabolite_line ) + metabolites_summary + Rep( blank_line ) ) --- 66,85 ---- metabolite_count_tag = Str( "INTERNAL METABOLITES:" ) metabolite_count_line = Group( "metabolite_count_line", metabolite_count_tag + ! blank_space + Digits( "num_int_metabolites" ) + ToEol() ) reaction_count_tag = Str( "REACTIONS:" ) reaction_count_line = Group( "reaction_count_line", reaction_count_tag + blank_space + ! Digits( "num_reactions" ) + ToEol() ) type_metabolite = Group( "type_metabolite", Alt( Str( "int" ), \ Str( "external" ) ) ) metabolite_info = Group( "metabolite_info", optional_blank_space + ! Digits() + blank_space + type_metabolite + blank_space + # Integer() + blank_space + Rep1( lower_case_letter ) + Rep1( AnyBut( white_space ) ) ) metabolite_line = Group( "metabolite_line", metabolite_info + ToEol() ) ! metabolites_summary = Group( "metabolites_summary", optional_blank_space + Digits() + blank_space + Str( "metabolites" ) + ToEol() ) metabolites_block = Group( "metabolites_block", Rep1( metabolite_line ) + metabolites_summary + Rep( blank_line ) ) *************** *** 87,99 **** graph_structure_heading = Group( "graph_structure_heading", optional_blank_space + Str( "edges" ) + blank_space + Str( "frequency of nodes" ) + ToEol() ) graph_structure_line = Group( "graph_structure_line", optional_blank_space + ! Integer( "edge_count" ) + blank_space + Integer( "num_nodes" ) + ToEol() ) graph_structure_block = Group( "graph_structure_block", \ graph_structure_heading + Rep( blank_line ) + Rep1( graph_structure_line ) + Rep( blank_line ) ) sum_is_constant_line = Group( "sum_is_constant_line", optional_blank_space + ! 
Integer() + optional_blank_space + Any( ":" ) + optional_blank_space + Rep1( AnyBut( white_space ) ) + Rep( blank_space + Any( "+" ) + blank_space + Rep1( AnyBut( white_space ) ) ) + optional_blank_space + Str( "=" ) + ToEol() ) --- 87,99 ---- graph_structure_heading = Group( "graph_structure_heading", optional_blank_space + Str( "edges" ) + blank_space + Str( "frequency of nodes" ) + ToEol() ) graph_structure_line = Group( "graph_structure_line", optional_blank_space + ! Digits( "edge_count" ) + blank_space + Digits( "num_nodes" ) + ToEol() ) graph_structure_block = Group( "graph_structure_block", \ graph_structure_heading + Rep( blank_line ) + Rep1( graph_structure_line ) + Rep( blank_line ) ) sum_is_constant_line = Group( "sum_is_constant_line", optional_blank_space + ! Digits() + optional_blank_space + Any( ":" ) + optional_blank_space + Rep1( AnyBut( white_space ) ) + Rep( blank_space + Any( "+" ) + blank_space + Rep1( AnyBut( white_space ) ) ) + optional_blank_space + Str( "=" ) + ToEol() ) *************** *** 114,121 **** reduced_system_tag = Group( "reduced_system_tag", Str( "REDUCED SYSTEM" ) ) reduced_system_line = Group( "reduced_system_line", reduced_system_tag + ! Rep1( AnyBut( digits ) ) + Integer( "branch_points" ) + ! Rep1( AnyBut( digits ) ) + Integer() + ToEol() ) kernel_tag = Group( "kernel_tag", Str( "KERNEL" ) ) kernel_line = Group( "kernel_line", kernel_tag + ToEol() ) --- 114,121 ---- reduced_system_tag = Group( "reduced_system_tag", Str( "REDUCED SYSTEM" ) ) reduced_system_line = Group( "reduced_system_line", reduced_system_tag + ! Rep1( AnyBut( digits ) ) + Digits( "branch_points" ) + ! Rep1( AnyBut( digits ) ) + Digits() + ToEol() ) kernel_tag = Group( "kernel_tag", Str( "KERNEL" ) ) kernel_line = Group( "kernel_line", kernel_tag + ToEol() ) *************** *** 134,146 **** elementary_modes_line = Group( "elementary_modes_line", \ elementary_modes_tag + ToEol() ) ! num_rows = Group( "num_rows", Integer() ) ! 
num_cols = Group( "num_cols", Integer() ) matrix_header = Group( "matrix_header", optional_blank_space + Str( "matrix dimension" ) + blank_space + Any( "r" ) + num_rows + blank_space + Any( "x" ) + blank_space + Any( "c" ) + num_cols + optional_blank_space + AnyEol() ) ! matrix_element = Group( "matrix_element", SignedInteger() ) matrix_row = Group( "matrix_row", MaxRepeat( optional_blank_space + matrix_element, \ "num_cols", "num_cols" ) + ToEol() ) matrix = Group( "matrix", MaxRepeat( matrix_row, "num_rows", "num_rows" ) ) --- 134,146 ---- elementary_modes_line = Group( "elementary_modes_line", \ elementary_modes_tag + ToEol() ) ! num_rows = Group( "num_rows", Digits() ) ! num_cols = Group( "num_cols", Digits() ) matrix_header = Group( "matrix_header", optional_blank_space + Str( "matrix dimension" ) + blank_space + Any( "r" ) + num_rows + blank_space + Any( "x" ) + blank_space + Any( "c" ) + num_cols + optional_blank_space + AnyEol() ) ! matrix_element = Group( "matrix_element", Integer() ) matrix_row = Group( "matrix_row", MaxRepeat( optional_blank_space + matrix_element, \ "num_cols", "num_cols" ) + ToEol() ) matrix = Group( "matrix", MaxRepeat( matrix_row, "num_rows", "num_rows" ) ) *************** *** 166,175 **** blank_space + Str( "reactions" ) + ToEol() ) branch_metabolite = Group( "branch_metabolite", optional_blank_space + Rep1( AnyBut( white_space ) ) + blank_space + ! RepN( Integer() + blank_space, 3 ) + Rep1( Any( "ir" ) ) + ToEol() ) non_branch_metabolite = Group( "non_branch_metabolite", optional_blank_space + Rep1( AnyBut( white_space ) ) + blank_space + ! RepN( Integer() + blank_space, 3 ) + Rep1( Any( "ir" ) ) + ToEol() ) branch_metabolite_block = Group( "branch_metabolite_block", \ metabolite_roles_heading + metabolite_role_cols + Rep( branch_metabolite ) ) --- 166,175 ---- blank_space + Str( "reactions" ) + ToEol() ) branch_metabolite = Group( "branch_metabolite", optional_blank_space + Rep1( AnyBut( white_space ) ) + blank_space + ! 
RepN( Digits() + blank_space, 3 ) + Rep1( Any( "ir" ) ) + ToEol() ) non_branch_metabolite = Group( "non_branch_metabolite", optional_blank_space + Rep1( AnyBut( white_space ) ) + blank_space + ! RepN( Digits() + blank_space, 3 ) + Rep1( Any( "ir" ) ) + ToEol() ) branch_metabolite_block = Group( "branch_metabolite_block", \ metabolite_roles_heading + metabolite_role_cols + Rep( branch_metabolite ) ) *************** *** 235,238 **** metabolite_count_block + reaction_count_block + stoichiometric_block + Opt( not_balanced_block ) + kernel_block + subsets_block + reduced_system_block + convex_basis_block + conservation_relations_block + ! elementary_modes_block ) \ No newline at end of file --- 235,238 ---- metabolite_count_block + reaction_count_block + stoichiometric_block + Opt( not_balanced_block ) + kernel_block + subsets_block + reduced_system_block + convex_basis_block + conservation_relations_block + ! elementary_modes_block ) From katel at worldpath.net Mon Dec 17 03:48:47 2001 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] Martel changes References: <004d01c1854c$d480bce0$0301a8c0@josiah.dalkescientific.com> <20011215100802.A334@ci350185-a.athen1.ga.home.com> Message-ID: <002301c186d7$a5b6dd40$010a0a0a@cadence.com> > Cayte, can you doublecheck and make sure I did the right thing? I > don't want to break any of your code. I attached a diff with the > changes I made. > Seems OK. Cayte From adalke at mindspring.com Mon Dec 17 01:10:23 2001 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] Martel changes Message-ID: <008a01c186c1$841281a0$0301a8c0@josiah.dalkescientific.com> Brad: >The only problem I noticed was that this broke Cayte's metatool >parser since it used SignedInteger. I updated metatool to use Digits >in place of Integer and Integer in place of SignedInteger. I think >this is right based on reading your e-mails, and all tests pass now. My apologies. 
I was so concerned with getting Martel out that I forgot to run the regression tests in biopython. I checked and it seems that's the only use of Integer or SignedInteger in the biopython release. >Cayte, can you doublecheck and make sure I did the right thing? I >don't want to break any of your code. I attached a diff with the >changes I made. Cayte answered this. I concur, although I haven't run the tests. Andrew From jchang at smi.stanford.edu Mon Dec 17 15:46:36 2001 From: jchang at smi.stanford.edu (Jeffrey Chang) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] rolling release tonight! Message-ID: <20011217124636.B330@krusty.stanford.edu> Hey Developers, The regression tests seem to be working, with the exception of the test_GraphicsXXX ones that are failing on my system because I don't have reportlab installed. I think the proper way to fix this is to skip tests on systems that don't have the required components installed. Unless someone wants to implement this today, I'm going to let this slide for this release, and put a note about it in the README file. I'm going to build this release tonight, unless I get a red flag from someone. If everything goes right, we'll have a shiny new biopython in the morning! Jeff From chapmanb at arches.uga.edu Mon Dec 17 16:35:47 2001 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] rolling release tonight! In-Reply-To: <20011217124636.B330@krusty.stanford.edu> References: <20011217124636.B330@krusty.stanford.edu> Message-ID: <20011217163547.A4422@ci350185-a.athen1.ga.home.com> Hey Jeff; > The regression tests seem to be working, with the exception of the > test_GraphicsXXX ones that are failing on my system because I don't > have reportlab installed. I think the proper way to fix this is to > skip tests on systems that don't have the required components > installed. 
Unless someone wants to implement this today, I'm going to > let this slide for this release, and put a note about it in the README > file. Okee dokee, I coded this in. There really isn't any skip support in pyunit (which appears to be a deliberate design decision), so now things look like this when there is an import problem: [...] test_GenBankFormat ... ok test_GraphicsChromosome ... Skipping test because of import error: No module named reportlab.pdfgen ok test_GraphicsDistribution ... Skipping test because of import error: No module named reportlab.pdfgen ok test_GraphicsGeneral ... Skipping test because of import error: No module named reportlab.pdfgen ok test_HMMCasino ... ok [...] Hopefully this works okay for ya. No major error, but at least some notice of skipping the test. > I'm going to build this release tonight, unless I get a red flag from > someone. If everything goes right, we'll have a shiny new biopython > in the morning! Sweet! Giving-biopython-to-all-my-relatives-for-Christmas-ly yr's, Brad -- PGP public key available from http://pgp.mit.edu/ From adalke at mindspring.com Mon Dec 17 16:52:43 2001 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] rolling release tonight! Message-ID: <00bd01c18745$283a7f20$0301a8c0@josiah.dalkescientific.com> Jeff: >I'm going to build this release tonight, unless I get a red flag from >someone. If everything goes right, we'll have a shiny new biopython >in the morning! There are a couple minor changes to Martel; mostly in the documentation and the setup.py. They do not affect the build but I'll work on finishing them up this afternoon so there can be an independent Martel release in parallel to the spiffy new biopython. There's also a couple really minor code changes to make with the new code, so shouldn't affect anyone. I'll send email when it's ready, and if it's after 6pm my time I'll be surprised. 
Andrew From adalke at mindspring.com Tue Dec 18 00:20:06 2001 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] rolling release tonight! Message-ID: <000c01c18783$a825ce40$0301a8c0@josiah.dalkescientific.com> Me: >I'll send email when it's ready, and if it's after 6pm my >time I'll be surprised. I'm surprised. Well, I did spend more time working on the documentation (in the README). The biggest problem was that distutils doesn't let me install some of the test data files where I thought they should go. But it's for the best, as now the regression tests aren't installed at all; they're only part of the build. I can also make a 'setup.py sdist' and that works. Oh, I added two more definitions to Martel Martel.ToSep -- parse up to a separator character (or one of several separator characters) Martel.DelimitedFields -- parse text separated by a delimiter character (or characters) and made the default iterator return LAX records. For example, the easiest way to parse /etc/passwd, say, to print out which account uses which shell, is import Martel format = Martel.Rep(Martel.Group("record", Martel.DelimitedFields("field", ":"))) infile = open("/etc/passwd") for record in format.make_iterator("record").iterateFile(infile): print record["field"][0], "uses", record["field"][-1] Andrew From jchang at smi.stanford.edu Tue Dec 18 03:24:12 2001 From: jchang at smi.stanford.edu (Jeffrey Chang) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] rolling release tonight! In-Reply-To: <20011217163547.A4422@ci350185-a.athen1.ga.home.com> References: <20011217124636.B330@krusty.stanford.edu> <20011217163547.A4422@ci350185-a.athen1.ga.home.com> Message-ID: <20011218002412.C320@krusty.stanford.edu> On Mon, Dec 17, 2001 at 04:35:47PM -0500, Brad Chapman wrote: > > The regression tests seem to be working, with the exception of the > > test_GraphicsXXX ones that are failing on my system because I don't > > have reportlab installed.
I think the proper way to fix this is to > > skip tests on systems that don't have the required components > > installed. Unless someone wants to implement this today, I'm going to > > let this slide for this release, and put a note about it in the README > > file. > > Okee dokee, I coded this in. There really isn't any skip support in > pyunit (which appears to be a deliberate design decision), so now > things look like this when there is an import problem: Great! Thanks for getting on this so fast! The release is now out. Thanks, Jeff From adalke at mindspring.com Fri Dec 21 06:02:17 2001 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] format autodection Message-ID: <003a01c18a0e$f4bd33a0$0301a8c0@josiah.dalkescientific.com> Hey all, I'm getting back to working on Biopython. I want to spend some time on the file parsing code. (Like, duh! :) The topics I want to work on next include: - automatic file identification - iterating through records in a file - support for different record types - converting/writing records to a given format I'll send an email for each point, starting now. I have some ideas on file identification. In theory, Martel could be used by just |'ing the terms, except that: - some files may be parsable by multiple formats - a Martel definition parses the whole file, when file type identification need only parse part of the file - it's a linear search What I'm toying around with is something like this: def _recognizeFile(parser, infile): pos = infile.tell() err_h = ... something which can distinguish between a bad parse, and a successful one where unparsed text remains (I'm changing Martel to distinguish the two.)
parser.setErrorHandler(err_h) try: try: parser.parseFile(infile) except Martel.Parser.ParserError: pass finally: infile.seek(pos) return err_h.successful_parse class Format: def __init__(self, format_name, expression, recognize_expression = None, provider_url = None, documentation_url = None, description, short_description, maintainer...): if recognize_expression is None: recognize_expression = expression self.expression = expression ... def recognizeFile(self, infile): if _recognizeFile(self.recognize_expression.make_parser(), infile): return self return None class RecognizeFormats: def __init__(self, recognize_expression, formats = None): ... def recognizeFile(self, infile): if _recognizeFile(self.recognize_expression.make_parser(), infile): for format in self.formats: x = format.recognizeFile(infile) if x is not None: return x return None This makes it possible to say from bioformats import swissprot swissprot38 = Format("swissprot/version=38", expression = swissprot.swissprot38.format, recognize_expression = swissprot.swissprot38.record) swissprot39 = Format("swissprot/version=39", expression = swissprot.swissprot39.format, recognize_expression = swissprot.swissprot38.record) swissprot40 = Format("swissprot/version=40", expression = swissprot.swissprot40.format, recognize_expression = swissprot.swissprot38.record) swissprot = RecognizeFormats( Martel.Str("ID ") + Martel.ToEol() + \ Martel.Str("AC ") + Martel.ToEol(), [swissprot40, swissprot39, swissprot38]) swissprot_like = RecognizeFormats( Martel.Re(r"[^ ][^ ] "), [swissprot, ipi, ...]) # This has GenBank records in a row/ no header genbank_records = Format("genbank", ...) # This has the header for the Genbank release genbank_release = Format("genbank-release", ...)
genbank = RecognizeFormats(None, [genbank_records, genbank_release]) # Not saying this is the best prefilter pdb = RecognizeFormats(Martel.Re("ATOM |HETATM|HEADER"), [many variations]) sequence_format = RecognizeFormats(None, [swissprot_like, genbank, pdb, ...]) structure_format = RecognizeFormats(None, [pdb, mdl, ...]) any = RecognizeFormats(None, [sequence, alignment, structure]) The result can be used like this: format = sequence_format.recognizeFile(open("unknown.file")) print "It's a", format.name I've tried this out. It works. Given a file or string, I can get a Format definition which (claims to) parse it. There are several things I haven't figured out: 1) How are the formats named? I made up "swissprot/version=38". Is the version attribute enough? If there are other attributes, is there a canonical ordering of attributes? 2) Does the word "recognize" make sense in this context? I tried "identifier" but that's also a commonly used noun. (I chose "recognize" from a post of Thomas's from the end of summer.) 3) Is information about the intermediate nodes in the tree useful? 4) How are new formats registered? Manually? Or is there a way to autoadd them by dropping files in appropriately designated directories? 5) The top-level definitions require all the lower-level definitions to be available. If there are 50 formats, that might take a while. There needs to be some way to defer loading modules until the parent RecognizeFormats class is asked to recognize something. 6) Version detection depends on tell/seek working. There needs to be a simple wrapper for inputs (like URLs, and sys.stdin) which don't support that action. Jeff added something like this already. 7) What do I do with the format definition once I have it? 8) Does this idea make sense to others?
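The recognition tree described above can be tried out without Martel. What follows is only a rough sketch under stated assumptions: plain `re` patterns stand in for Martel expressions, only the first 4096 characters are examined, the method is spelled `recognize_file`, and the format names and patterns are made up for illustration. It is written for current Python, not the Python of this thread, and it is not the code being proposed for Biopython.

```python
import io
import re


def _recognize(pattern, infile, window=4096):
    """Check whether pattern matches at the current position, then rewind."""
    pos = infile.tell()
    try:
        head = infile.read(window)
    finally:
        infile.seek(pos)  # leave the file where we found it
    return pattern.match(head) is not None


class Format:
    """A leaf of the tree: one concrete format (hypothetical stand-in)."""

    def __init__(self, name, recognize_pattern):
        self.name = name
        self.recognize_pattern = re.compile(recognize_pattern)

    def recognize_file(self, infile):
        return self if _recognize(self.recognize_pattern, infile) else None


class RecognizeFormats:
    """An inner node: an optional cheap prefilter plus child formats."""

    def __init__(self, prefilter, formats):
        # prefilter=None means "always try the children"
        self.prefilter = re.compile(prefilter) if prefilter is not None else None
        self.formats = formats

    def recognize_file(self, infile):
        if self.prefilter is not None and not _recognize(self.prefilter, infile):
            return None
        for fmt in self.formats:
            found = fmt.recognize_file(infile)
            if found is not None:
                return found
        return None


# Illustrative tree; the names and patterns are assumptions, not real definitions.
swissprot_like = RecognizeFormats(
    r"[A-Z][A-Z]   ", [Format("swissprot", r"ID   .*\nAC   ")])
fasta = Format("fasta", r">")
sequence_format = RecognizeFormats(None, [swissprot_like, fasta])

infile = io.StringIO("ID   TEST_HUMAN\nAC   P00001;\n")
found = sequence_format.recognize_file(infile)
print("It's a", found.name)
```

Because every check rewinds with seek, this only works on seekable inputs, which is the concern raised in point 6; a non-seekable stream would first have to be buffered.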
Andrew dalke@dalkescientific.com From adalke at mindspring.com Fri Dec 21 06:02:26 2001 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] record iteration Message-ID: <003b01c18a0e$fa1463a0$0301a8c0@josiah.dalkescientific.com> Most data files are of this form:
    header ...   (optional)
    record ...   (one or more)
    footer ...   (optional)
Nearly everyone only wants to read the records from this file, using
a mechanism like this:

    for record in file:
        do_something(record)

and doesn't care about the header and footer information. In Martel
this can be done by passing in the tag name of the record boundary to
the make_iterator method.

    iterator = format.make_iterator("record")
    for record in iterator.parseFile(open(filename), Builder()):
        do_something(record.document)

If we standardize on the tag name of "record" then this will work for
everything. The existing formats I wrote do not use this standard
because they only allowed a tag name. They had things like
"swissprot38_record". With the changes I made this summer, Martel
grammars can include attributes for the element, as in:

    ...

So my proposal is to standardize on certain tag names, to be shared
across all of the Biopython/Martel grammars. These include:

    dataset
    record
    header
    footer

and allow for a standard scaffold for parsing sequence records.

BTW, those standard tag names should also include

    primary_id
    description (free-form text)
    sequence (single letter codes)
    sequence3 (three letter code)
    xref (cross reference to another database)
    ... others?

As we rework the format definitions, some of these will become
apparent. This starts getting into BioXML-type work.

                    Andrew
                    dalke@dalkescientific.com

From adalke at mindspring.com Fri Dec 21 06:03:03 2001
From: adalke at mindspring.com (Andrew Dalke)
Date: Sat Mar 5 14:43:08 2005
Subject: [Biopython-dev] building a data object
Message-ID: <003c01c18a0f$120a6c20$0301a8c0@josiah.dalkescientific.com>

Bioperl has only one sequence record data object. One of the points
behind Biopython's two parsing systems was to allow the building of
different objects without having to rewrite the parser as well.
(BioJava has a similar goal, but is more akin to the first Biopython
parser and not the Martel one.)

Take the example I gave in my previous post:

    iterator = format.make_iterator("record")
    for record in iterator.parseFile(open(filename), Builder()):
        do_something(record.document)

In this case, the 'Builder()' is an object which translates SAX events
from whichever format is given into a 'document' of whatever is
desired. For example, it could be a

    Swissprot2SeqRecordBuilder
    GenBank2LightweightSeqBuilder
    ...

Basically, there are two free variables -- input file type and object
to make. So this needs some sort of double dispatch mechanism.
(That's not strictly true. A GenBank specific data type may only
support being built from a GenBank record. For example, a GenBank
record to HTML converter need only support GenBank.)

Because of the combinatorial explosion, there won't be all that many
generalized intermediate formats. I can think of perhaps four:

  - a "standard" sequence record
  - a "lightweight" sequence record, when FASTA-style data is enough
    (If the tag names and semantics are consistent across the
    different formats, this can be nearly trivial.)
  - an alignment record
  - some sort of structure data type.

Since there is (will be) format detection, there needs to be some way
to determine the right builder given only the requested output type.
The implementation is something like this:

    def readFile(class_to_build, infile):
        format = set_of_allowed_possibilities.recognizeFile(infile)
        iterator = format.make_iterator("record")
        Builder = figure_out_builder(format, class_to_build)
        for record in iterator.parseFile(infile, Builder()):
            yield record.document

so someone can say

    from Bio import SeqRecord, IO
    for record in IO.readFile(SeqRecord.SeqRecord, open("unknown.dat")):
        do_something(record)

(there should also be a readString, for symmetry with the XML code in
Martel.)

I think the best way to implement 'figure_out_builder' is to ask the
class for it, perhaps via a static class method.

class_to_build.get_builder(format) then this requires either a registration system or some way to determine the builder's location as a module. (eg, the Builder to convert a "swissprot/version=38" format into a SeqRecord could be returned by calling Bio.bioformats.swissprot.SeqRecord.get_builder({"version": "38"}) ) Another way to do the API is to make 'readFile' a static method of the SeqRecord object. This gets rid of the 'IO' module. from Bio import SeqRecord for record in SeqRecord.SeqRecord.readFile(open("unknown.dat")): do_something(record) This looks funny to me, especially since Python doesn't really have static methods. Python 2.2 makes them easier to write. A third option is to use a function in the module namespace, as in from Bio import SeqRecord for record in SeqRecord.readFile(open("unknown.dat")): do_something(record) This is probably the most traditional and appropriate solution. On the other hand, the functionality can't be added automatically through inheritance, which makes it harder to remember what to do. There will need to be an explicit creation of the function, as in from Bio import IO readFile = IO.ReadFile(SeqRecord) Expanding even further, perhaps there should be an "io" object, with this and the write methods (next email): from Bio import SeqRecord for record in SeqRecord.io.readFile(open("unknown.dat")): do_something(record) My problem is that I know this is a double dispatch problem, but I don't know the right way to solve it. I can think of many - perhaps too many. :( Andrew dalke@dalkescientific.com From adalke at mindspring.com Fri Dec 21 06:03:09 2001 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] writing Message-ID: <003d01c18a0f$143bf220$0301a8c0@josiah.dalkescientific.com> Only one more email after this! (And it's a summary.) The opposite to reading is writing. I want to make file conversion easy. 
Here's the example in Bioperl's SeqIO perldoc:

    $format1 = shift;
    $format2 = shift || die "Usage: reformat format1 format2 < input > output";
    use Bio::SeqIO;
    $in = Bio::SeqIO->newFh(-format => $format1 );
    $out = Bio::SeqIO->newFh(-format => $format2 );
    print $out $_ while <$in>;

It should be just as easy for Biopython -- even easier since we have
autodetection.

    import sys
    from Bio import SeqRecord
    if len(sys.argv) != 2:
        sys.exit("Usage: reformat output_format < input > output")
    writer = SeqRecord.make_writer(sys.argv[1])
    for record in SeqRecord.readFile():
        writer.write(record)

(Same number of lines, about the same number of characters, and I
could have done

    map(SeqRecord.make_writer(sys.argv[1]).write, SeqRecord.readFile())

instead of the last three lines :)

Again, there needs to be some resolution system, to figure out the
output converter associated with a given format name. There's a twist
here that Bioperl doesn't capture - versions. People are going to
want the output in "swissprot" version and there may be support for
writing it in "swissprot/version=38" and "swissprot/version=39"
versions, so something needs to figure out that 39 is probably better
than 38 (or force the user to disambiguate).

There are a few other things I haven't figured out here. I make the
writer with 'make_writer'. This is a function in the SeqRecord module
scope. It looks like this:

    def make_writer(output_format = "fasta", outfile = sys.stdout):
        ...

The 'Writer' object created writes SeqRecord objects in the correct
format, on the given file handle.

I am somewhat worried that finer control may be needed, eg, for
"minimal" vs. "complete" output generation. I decided to defer
worrying until there is more than one output generator for a given
format.

I am not sure that "write" is the appropriate method name. There's
something to be said for "append", since that's the opposite of
iteration.
Ie results = [] for x in data: results.append(x) has exactly the same functional form as writer = make_writers() for x in data: writer.write(x) It's also possible that some writers will return strings, rather than write to a file, as in convert = toString(output_format) for x in data: sys.stdout.write(convert(x)) In this case you can see that 'write' in Python traditionally takes a string, not an object. On the other hand, it isn't obvious that 'append' is how to write a record, and nearly everyone will be writing them. I'm still thinking about that "io" object, used like this writer = SeqRecord.io.make_writer(sys.argv[1]) for record in SeqRecord.io.readFile(): writer.write(record) That makes it easier to standardize the interface, since integration is then a matter of: io = StandardIOFramework(SeqRecord) and 'io' can have io.register_reader(format, builder) io.register_writer(format, writer) builder = io.resolve_reader(format) writer = io.resolve_writer(format) for record in io.readFile(open("something.txt")): ... for record in io.readString("SFSDFSDFSDF"): ... 
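To make the registration and resolution calls above concrete, here is a rough sketch in present-day Python. Nothing in it is existing Biopython code: the class body, the make_writer signature, the FastaWriter helper and the "highest numeric version wins" rule are assumptions. It only illustrates how a bare name like "swissprot" could resolve to "swissprot/version=39" when both 38 and 39 are registered.

```python
import io


class StandardIOFramework:
    """Hypothetical registry tying format names to builders and writers."""

    def __init__(self, record_class):
        self.record_class = record_class  # kept only to mirror StandardIOFramework(SeqRecord)
        self._readers = {}
        self._writers = {}

    def register_reader(self, format_name, builder):
        self._readers[format_name] = builder

    def register_writer(self, format_name, writer):
        self._writers[format_name] = writer

    def resolve_reader(self, format_name):
        return self._resolve(self._readers, format_name)

    def resolve_writer(self, format_name):
        return self._resolve(self._writers, format_name)

    def make_writer(self, format_name, outfile):
        # Assumes each registered writer is a factory taking the output handle.
        return self.resolve_writer(format_name)(outfile)

    @staticmethod
    def _resolve(table, format_name):
        if format_name in table:
            return table[format_name]
        # Bare name such as "swissprot": prefer the highest registered version.
        # Assumes versions are plain integers, as in "swissprot/version=38".
        prefix = format_name + "/version="
        versions = [name for name in table if name.startswith(prefix)]
        if not versions:
            raise KeyError("no converter registered for %r" % format_name)
        best = max(versions, key=lambda name: int(name[len(prefix):]))
        return table[best]


class FastaWriter:
    """Toy writer; a record here is just a (name, sequence) tuple."""

    def __init__(self, outfile):
        self.outfile = outfile

    def write(self, record):
        name, sequence = record
        self.outfile.write(">%s\n%s\n" % (name, sequence))


seq_io = StandardIOFramework(tuple)
seq_io.register_writer("fasta", FastaWriter)
seq_io.register_writer("swissprot/version=38", "writer-38")  # placeholder objects
seq_io.register_writer("swissprot/version=39", "writer-39")

out = io.StringIO()
writer = seq_io.make_writer("fasta", out)
writer.write(("seq1", "ACGT"))
print(out.getvalue(), end="")
print(seq_io.resolve_writer("swissprot"))  # writer-39
```

Forcing the user to disambiguate instead would just mean raising an error when more than one version matches.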
                    Andrew
                    dalke@dalkescientific.com

From adalke at mindspring.com Fri Dec 21 06:04:00 2001
From: adalke at mindspring.com (Andrew Dalke)
Date: Sat Mar 5 14:43:08 2005
Subject: [Biopython-dev] parsing summary
Message-ID: <003e01c18a0f$31cc4e20$0301a8c0@josiah.dalkescientific.com>

To summarize:

I'm working on a way to minimize the amount of work needed to handle
the standard case of

    for record in data_file:
        do_something(record)
        write record to output_file

I think I have an API, which is easy to use

    from Bio import SeqRecord
    writer = SeqRecord.io.make_writer("genbank")
    for record in SeqRecord.io.readFile(open("unknown.dat")):
        do_something(record)
        writer.write(record)

and can handle different intermediate data types

    from Bio import SimpleSeq
    writer = SimpleSeq.io.make_writer("fasta")
    for record in SimpleSeq.io.readFile(open("unknown.dat")):
        do_something(record)
        writer.write(record)

And it's all built on powerful lower-level forms which are still
relatively easy to use.

The biggest problem I have is in registration of all the different
format and conversion types. Ideally, adding a new format shouldn't
affect performance until its presence is needed. That speaks for some
sort of file-based discovery mechanism. The simplest solution is to
load all files at once, but I expect that to yield poor performance.
So there needs to be some sort of deferred loading mechanism. Or at
least such a mechanism should not be precluded.

What I want to do requires coming up with standardized names and data
types. These include file formats, field types, and data structures.

Thank you for letting me write all this. It's helped clear up what my
bottlenecks are in this work. Hopefully you all have some ideas - or
you can say I'm trying to be too clever for my own good!
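One way the deferred loading could look, sketched in present-day Python: register only a "module:attribute" string per format, and import the module the first time that format is resolved. The LazyRegistry name and the location-string convention are assumptions for illustration, and the stdlib json and csv modules stand in for real format definition modules.

```python
import importlib


class LazyRegistry:
    """Hypothetical registry that imports a format's module on first use."""

    def __init__(self):
        self._locations = {}  # format name -> "module:attribute" string
        self._loaded = {}     # format name -> resolved object

    def register(self, format_name, location):
        # Cheap: nothing is imported at registration time.
        self._locations[format_name] = location

    def resolve(self, format_name):
        if format_name not in self._loaded:
            module_name, _, attribute = self._locations[format_name].partition(":")
            module = importlib.import_module(module_name)  # the deferred import
            self._loaded[format_name] = getattr(module, attribute)
        return self._loaded[format_name]

    def loaded_formats(self):
        return sorted(self._loaded)


registry = LazyRegistry()
# Stand-ins: stdlib modules play the role of format definition modules.
registry.register("json-demo", "json:dumps")
registry.register("csv-demo", "csv:writer")

print(registry.loaded_formats())  # []
dumps = registry.resolve("json-demo")
print(registry.loaded_formats())  # ['json-demo']
```

A file-based discovery step would then only need to produce the name-to-location table, for example by listing a directory, without importing anything.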
Andrew dalke@dalkescientific.com From jchang at smi.stanford.edu Thu Dec 20 22:40:35 2001 From: jchang at smi.stanford.edu (Jeffrey Chang) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] Re: [BioPython] parse IPI data with biopythons SwissProt parser In-Reply-To: <3C1F0A9B.9080803@proceryon.at> References: <3C1F0A9B.9080803@proceryon.at> Message-ID: <20011220214035.C1338@krusty.dsl.hstntx.swbell.net> (moved from biopython) Hi Wolfgang, These look like relatively minor changes. I'd like to incorporate them into the SProt.py file in the standard distribution, if you don't mind. However, I'm having a little bit of trouble reconstructing the patch from the description given. Do you mind sending me your SProt.py file with all the changes necessary? Thanks, Jeff On Tue, Dec 18, 2001 at 10:21:31AM +0100, Wolfgang Schueler wrote: > Hi all, > > the IPI database at EBI contains proteins from the human genome > from SWISS-PROT, TrEMBL, RefSeq and Ensembl and is available in a > SWISS-PROT format. > Nevertheless there are minor differences to real SWISS-PROT data which > prevent the use of the SWISS-PROT parser of Biopython1.00.a3 > > The following modifications of Sprot.py allowed the parsing of the > IPI-data (find IPI in http://www.ebi.ac.uk/IPI/IPIhelp.html). > > Maybe it is helpful for someone. 
> Wolfgang > > > > > # ws: changes in _RecordConsumer.date() for IPI > # _RecordConsumer.identification() for IPI > # _Scanner.scanReference() crashing SwissProt entry > # _Scanner.scanDT() for IPI > # _Scanner.scanDE() for IPI > > def _scan_dt(self, uhandle, consumer): > self._scan_line('DT', uhandle, consumer.date, exactly_one=1) > # self._scan_line('DT', uhandle, consumer.date, exactly_one=1) > #ws:2001-12-05----------------------------------------v========v---- # > IPI does not use 'last annotation update' > self._scan_line('DT', uhandle, consumer.date, one_or_more=1) # > # > ^========^------# > # self._scan_line('DT', uhandle, consumer.date, exactly_one=1) # > #^--------------------------------------------------------------------# > > > def _scan_de(self, uhandle, consumer): > #ws:2001-12-05-----------------------------------------------v========v---- > # IPI IPI00029727.2: no DE entry > self._scan_line('DE', uhandle, consumer.description, > any_number=1) # was one_or_more > #------------------------------------------------------------^========^ > def _scan_reference(self, uhandle, consumer): > while 1: > if safe_peekline(uhandle)[:2] != 'RN': > break > self._scan_rn(uhandle, consumer) > self._scan_rp(uhandle, consumer) > self._scan_rc(uhandle, consumer) > self._scan_rx(uhandle, consumer) > # ws:2001-12-05 added, entry exists with RL before RA > # ----------v==============================v > self._scan_rl(uhandle, consumer) > #-----------^==============================^ > > self._scan_ra(uhandle, consumer) > self._scan_rt(uhandle, consumer) > self._scan_rl(uhandle, consumer) > > > def identification(self, line): > cols = string.split(line) > self.data.entry_name = cols[1] > self.data.data_class = self._chomp(cols[2]) # don't want ';' > self.data.molecule_type = self._chomp(cols[3]) # don't want ';' > self.data.sequence_length = int(cols[4]) > > # data class can be 'STANDARD' or 'PRELIMINARY' > # ws:2001-12-05 added to be IPI conform 
-------------------------v=====v > if self.data.data_class not in ['STANDARD','PRELIMINARY','IPI']: > # ---------------------------------------------------------------^=====^ > raise SyntaxError, "Unrecognized data class %s is in > line\n%s" % \ > (self.data.data_class, line) > # molecule_type should be 'PRT' for PRoTein > if self.data.molecule_type != 'PRT': > raise SyntaxError, "Unrecognized molecule type %s in > line\n%s" % \ > (self.data.molecule_type, line) > > def date(self, line): > uprline = string.upper(line) > if string.find(uprline, 'CREATED') >= 0: > cols = string.split(line) > # ws:2001-12-05 added lines to prevent crash at (IPIrel. , created) !no > number given! > if self._chomp(cols[3]) == '': #<= > self.data.created = cols[1], 0 #<= > else: #<= > self.data.created = cols[1], int(self._chomp(cols[3])) > #-----------^=^-------------------------------------------------------- > elif string.find(uprline, 'LAST SEQUENCE UPDATE') >= 0: > cols = string.split(line) > # ws:2001-12-05 added lines to prevent crash at '(IPIrel. , created)' > !no number given! > if self._chomp(cols[3]) == '': > #<= > self.data.sequence_update = cols[1], 0 #<= > else: #<= > self.data.sequence_update = cols[1], > int(self._chomp(cols[3])) > #-----------^=^---------------------------------------------------------------- > elif string.find(uprline, 'LAST ANNOTATION UPDATE') >= 0: > cols = string.split(line) > # ws:2001-12-05 added lines to prevent crash at '(IPIrel. , created)' > !no number given! 
> if self._chomp(cols[3]) == '': > #<= > self.data.annotation_update = cols[1], 0 #<= > else: #<= > self.data.annotation_update = cols[1], > int(self._chomp(cols[3])) #<= > #-----------^=^---------------------------------------------------------------- > else: > raise SyntaxError, "I don't understand the date line %s" % line > > > _______________________________________________ > BioPython mailing list - BioPython@biopython.org > http://biopython.org/mailman/listinfo/biopython From katel at worldpath.net Thu Dec 27 18:59:57 2001 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] Martel question Message-ID: <004701c18f32$98a3e860$010a0a0a@cadence.com> I'm not sure what is wrong with my saf format. I used my usual approach when I'm stuck, put it aside for a few days and revisit it. But I'm still puzzled. My instrumentation shows its picking up a tag of "#\nBovine when is should pick up a comment line, then "Bovine".. Like a restriction enzyme cutting at the the wrong site. Could be a Dos issue? Andrew, can I send an attachment? Cayte From katel at worldpath.net Thu Dec 27 23:16:28 2001 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] for the nfinite Martel request queue Message-ID: <006301c18f56$6f152260$010a0a0a@cadence.com> In porting the ecell script from perl to python, I'd have to add case restrictions or sprinkle the code with structures like Alt ( Str( "reactor", "Reactor", "REACTOR" ) ) This is because perl has a one letter case insensitive option. Cayte From adalke at mindspring.com Sun Dec 30 08:37:12 2001 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:08 2005 Subject: [Biopython-dev] Bioformat module Message-ID: <001201c19137$16a315a0$0201a8c0@josiah.dalkescientific.com> Hey all, Here's the first go at a module based off the set of emails I wrote last week. 
It's at http://www.biopython.org/~dalke/Bioformat-0.1.tar.gz

No setup.py or anything fancy like that. Though you do need the
latest version of CVS Martel. (List of changes in the next email.)

In theory this provides a platform for:

  - automatic format recognition
  - using the format information to build a data structure
  - writing that data structure to another format

For example, these parts can be put together for simple, generic file
conversion, as in:

    import sys
    from Bio import SeqRecord
    writer = SeqRecord.io.make_writer(sys.stdout, "fasta")
    writer.writeHeader()  # needed for some formats
    for record in SeqRecord.io.read(open("file.unknown")):
        writer.write(record)
    writer.writeFooter()  # needed for some formats

(Actually, with the code as-is, this is done with

    from Bioformat import IO
    IO.io.convert(infile = open("file.unknown"), output_format = "fasta")

:)

The README includes some examples of how to use this module. Please
take a look. More after I have a chance to get some sleep. This
project was harder than I thought it would be. OTOH, it's something
that should be very exciting for the O'Reilly conference.

                    Andrew
                    dalke@dalkescientific.com

From adalke at mindspring.com Sun Dec 30 08:37:26 2001
From: adalke at mindspring.com (Andrew Dalke)
Date: Sat Mar 5 14:43:08 2005
Subject: [Biopython-dev] Martel changes
Message-ID: <001301c19137$1eca7ca0$0201a8c0@josiah.dalkescientific.com>

I needed to make a few changes to Martel to support the Bioformat
module I was working on. They are:

INCOMPATIBLE CHANGE:

  - the record readers now support attributes as well as a tag name.
    I forgot to make those changes last summer. This only affects
    HeaderFooter and ParseRecord formats. I couldn't figure out a
    nice way to make the API backwards compatible, so used my "it
    isn't 1.0" prerogative.

This affected a couple of the existing Biopython modules (needed to
add a {}). I fixed them up and all the regressions pass.
I was able to make the change in such a way that code using the old
API dies immediately, and it includes a hint on what needs to be
changed.

Other changes:

  - a few speed tweaks to the iterator code; my test case of reading a
    subset of sprot38 into a SeqRecord object is now 10% faster. (The
    'characters' callback is used a lot, so I shortened its path.)

  - the default iterator boundary tag is 'record'

  - it's possible for an expression to go to completion but allow some
    text to remain unparsed. This now throws a new exception (a
    subtype of the old one) to allow the handlers to do something
    different for that case. This is used for the Bioformat format
    recognition code.

  - Martel.SimpleRecordFilter is used by the Bioformat code to write a
    quick test filter, to determine if more identification work should
    be done.

                    Andrew
                    dalke@dalkescientific.com

From katel at worldpath.net Sun Dec 30 21:39:05 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:43:08 2005
Subject: [Biopython-dev] FilteredReader
Message-ID: <002801c191a4$54bfbb00$010a0a0a@cadence.com>

I added FilteredReader to prefilter text before passing to Martel.
The ECell input allows blank lines just about anywhere. Rather than
putting Alt( blank_line, read_line ) everywhere, I wrote a filter.

To make the routine general, it contains a variable called
filter_chain. The user can set it to a list of any low level filters
that have the same signature as the default filters.

Hopefully, I'm not duplicating Andrew's SimpleRecordFilter?
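For readers who have not seen the module, the filter_chain idea can be sketched roughly as below, in present-day Python. This is not the actual FilteredReader code: the filter signature (take a line, return the possibly changed line, or an empty string to drop it) and the two example filters are assumptions made up for illustration.

```python
import io


def remove_blank_lines(line):
    """Drop lines that hold only whitespace (hypothetical default filter)."""
    return "" if line.strip() == "" else line


def strip_trailing_whitespace(line):
    """Normalize line endings so the grammar sees no trailing blanks."""
    return line.rstrip() + "\n"


class FilteredReader:
    """Sketch of a reader that runs every line through a chain of filters."""

    def __init__(self, handle, filter_chain=(remove_blank_lines,)):
        self._handle = handle
        self.filter_chain = list(filter_chain)  # the user may replace this list

    def read(self):
        kept = []
        for line in self._handle:
            for line_filter in self.filter_chain:
                line = line_filter(line)
                if line == "":
                    break  # a filter dropped the line; skip the remaining filters
            if line:
                kept.append(line)
        return "".join(kept)


raw = io.StringIO("reactor A\n\n   \nreactor B  \n")
reader = FilteredReader(raw, [remove_blank_lines, strip_trailing_whitespace])
cleaned = reader.read()
print(cleaned, end="")
```

The cleaned text is what would then be handed to the Martel parser, so the grammar itself never needs to mention blank lines.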
Cayte From adalke at mindspring.com Sun Dec 30 18:54:03 2001 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] for the nfinite Martel request queue Message-ID: <001801c1918d$431a5520$0201a8c0@josiah.dalkescientific.com> Cayte: > In porting the ecell script from perl to python, I'd have to add case >restrictions or sprinkle the code with >structures like > Alt ( Str( "reactor", "Reactor", "REACTOR" ) ) Yeah, I've been using Re("[Rr][Ee][Aa][Cc][Tt][Oo][Rr]") for this, but it's cumbersome to do that by hand. There is a stubbed 'Martel.NoCase' which will eventually support this. Looks like it just needs to duplicate the expression then replace Str, Any, and Literal terms. Shouldn't be too hard, and the result will let you do NoCase(Str("reactor")) Andrew dalke@dalkescientific.com From jchang at smi.stanford.edu Mon Dec 31 02:18:18 2001 From: jchang at smi.stanford.edu (Jeffrey Chang) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] format autodection In-Reply-To: <003a01c18a0e$f4bd33a0$0301a8c0@josiah.dalkescientific.com> References: <003a01c18a0e$f4bd33a0$0301a8c0@josiah.dalkescientific.com> Message-ID: <20011230231818.F1032@krusty.stanford.edu> On Fri, Dec 21, 2001 at 04:02:17AM -0700, Andrew Dalke wrote: > 2) Does the word "recognize" make sense in this context? I tried > "identifier" but that's also a commonly used noun. (I choose > "recognize" from a post of Thomas's from the end of summer.)q I was a confused with what was going on in the code until I realized that there's actually two slightly different uses of the word "recognize." In the first use, > def _recognizeFile(parser, infile): recognize is used as a predicate for whether the parser can handle the format of the data in infile. In the second, > class RecognizeFormats: > [...] > def recognizeFile(self, infile): recognize selects between multiple formats and returns the appropriate one for the data. 
It would clear things up if one of them were renamed something else, e.g. the first use is renamed as "handlesFile" or "acceptsFile". > 6) Version detection depends on tell/seek working. There needs to be > a simple wrapper for inputs (like URLs, and sys.stdin) which don't > support that action. Jeff added something like this already. The file-like handle in File.py is incomplete for this purpose. It can only push back stuff as lines, and not as other blocks of data. It should not be hard to add that capability, though. > 8) Does this idea make sense to others? Yes! And it's sorely needed! :) Jeff From jchang at smi.stanford.edu Mon Dec 31 02:32:12 2001 From: jchang at smi.stanford.edu (Jeffrey Chang) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] record iteration In-Reply-To: <003b01c18a0e$fa1463a0$0301a8c0@josiah.dalkescientific.com> References: <003b01c18a0e$fa1463a0$0301a8c0@josiah.dalkescientific.com> Message-ID: <20011230233211.G1032@krusty.stanford.edu> On Fri, Dec 21, 2001 at 04:02:26AM -0700, Andrew Dalke wrote: > So my proposal is to standardize on certain tag names, to be shared > across all of the Biopython/Martel grammars. These include: > dataset > record > header > footer [...] > BTW, those standard tag names should also include > primary_id > description (free-form text) > sequence (single letter codes) > sequence3 (three letter code) > xref (cross reference to another database) > ... others? This all looks good. Do you have a sense on whether there should be a unique prefix or suffix to indicate that a standardized name is being used? e.g. _m_dataset or something like that. Since these names are pretty common, especially in this domain, it might be easy to use a standard tag name when none was intended... 
Jeff From adalke at mindspring.com Mon Dec 31 05:28:17 2001 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] for the nfinite Martel request queue Message-ID: <001201c191e5$dd359860$0201a8c0@josiah.dalkescientific.com> Me: >the result will let you do > > NoCase(Str("reactor")) Implemented. >>> from Martel import * >>> print NoCase(Str("reactor = ") + Re("[A-D]")) [Rr][Ee][Aa][Cc][Tt][Oo][Rr] = [A-Da-d] >>> Andrew From adalke at mindspring.com Mon Dec 31 05:36:24 2001 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] Bioformat module Message-ID: <001701c191e6$ff3bcfa0$0201a8c0@josiah.dalkescientific.com> Okay, I cleaned up the code and added support for the embl65 format. After fixing some bugs, I just dropped in the new format definitions and .. poof! I was building SeqRecords and writing FASTA. Code is at http://www.biopython.org/~dalke/Bioformats-0.2.py There's a small bit more cleanup to do. And documentation. I think it's at the stage where the code can be added to Biopython proper. I would like someone else to take a look at it first, if only to try it out. (It wouldn't hurt to also say "Wow! That's cool!" :) Next is to work on writing format definitions with tags that meet some sort of API. It really is cool that I could just drop in the embl format definition, which (with a minor change) met the minimal API needed to build SeqRecords - and have everything just work. Andrew From adalke at mindspring.com Mon Dec 31 05:50:08 2001 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] format autodection Message-ID: <002201c191e8$ea5713e0$0201a8c0@josiah.dalkescientific.com> Jeff: >I was a confused with what was going on in the code until I realized >that there's actually two slightly different uses of the word >"recognize." After some early attempts, I decided that "recognize" just wasn't the right word to use. 
I've decided to use "identify", and my solution to the confusion in
words is that identify returns a 'Format'.

    format = Bioformat.identify(open("file.dat"))
    if format is not None:
        print format.name

> In the first use,
>> def _recognizeFile(parser, infile):
>recognize is used as a predicate for whether the parser can handle the
>format of the data in infile.

I've kept that usage internally.

>In the second,
>> class RecognizeFormats:
>> [...]
>> def recognizeFile(self, infile):
>recognize selects between multiple formats and returns the appropriate
>one for the data.

This form is now known as 'identify'.

I wasn't explicitly aware of the distinction, but what happened to me
was it didn't scan well in English. I wrote some sample code and
tried to make the names fit the way I described what was going on. I
ended up with:

    "I want to identify the format used"
    "First, we see if this recognizes the format"

>It would clear things up if one of them were renamed something else,
>e.g. the first use is renamed as "handlesFile" or "acceptsFile".

Done.

>The file-like handle in File.py is incomplete for this purpose. It
>can only push back stuff as lines, and not as other blocks of data.
>It should not be hard to add that capability, though.

Yeah, I saw that. I've included a 'ReseekFile' which buffers
everything read, and allows reseeking to the original position (and
only the original position). It only supports the 'read' method,
since that's all Martel needs. It only allows tell() at the
beginning, and only allows seeks to that position. It has a new
method called 'nobuffer', which clears the buffer after it's all been
(re)read. This prevents the ReseekFile from storing everything even
after the file has been parsed.

>> 8) Does this idea make sense to others?
>
>Yes! And it's sorely needed! :)

Thanks!
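The ReseekFile behaviour described above can be approximated with the following sketch in present-day Python. It is written from the description only, not from the Bioformat source, so the internals, the error type, and the choice to report the start position as 0 are all assumptions.

```python
import io


class ReseekFile:
    """Sketch: wrap a read-only stream so it can be rewound to its start once."""

    def __init__(self, source):
        self._source = source
        self._buffer = ""       # text already pulled from the source
        self._offset = 0        # current read position inside the buffer
        self._buffering = True  # switched off by nobuffer()

    def read(self, size=-1):
        buffered = self._buffer[self._offset:]
        if size < 0:
            fresh = self._source.read()
        elif size > len(buffered):
            fresh = self._source.read(size - len(buffered))
        else:
            fresh = ""
            buffered = buffered[:size]
        if self._buffering:
            # Remember everything so a later seek(0) can replay it.
            self._buffer += fresh
            self._offset += len(buffered) + len(fresh)
        else:
            # Discard whatever has now been handed out; store nothing new.
            self._buffer = self._buffer[self._offset + len(buffered):]
            self._offset = 0
        return buffered + fresh

    def tell(self):
        if not self._buffering or self._offset != 0:
            raise OSError("tell() is only supported at the start position")
        return 0

    def seek(self, position, whence=0):
        if position != 0 or whence != 0 or not self._buffering:
            raise OSError("can only seek back to the start position")
        self._offset = 0

    def nobuffer(self):
        # From now on, buffered text is dropped as it is re-read.
        self._buffering = False


source = io.StringIO("ID   100K_RAT\nAC   Q62671;\n")  # stand-in for sys.stdin or a URL
infile = ReseekFile(source)
start = infile.tell()
peek = infile.read(5)    # format detection looks at the first few characters
infile.seek(start)       # rewind for the real parse
infile.nobuffer()        # stop keeping a copy once identification is done
everything = infile.read()
print(repr(peek))        # 'ID   '
```

Calling nobuffer() right after the rewind is the point of the design: identification only needs a replay of the beginning of the input, and the full parse should not keep a second copy of the whole file in memory.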
Now, take a look at the code to see what the result looks like :) Andrew From adalke at mindspring.com Mon Dec 31 06:03:37 2001 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] record iteration Message-ID: <002f01c191ea$ccbb8260$0201a8c0@josiah.dalkescientific.com> Me, on standardized Martel tag names. Jeff: >This all looks good. Do you have a sense on whether there should be a >unique prefix or suffix to indicate that a standardized name is being >used? Well, it's XML so the best solution is probably XML namespaces. They would look like this: biopython:sequence but I don't know enough about them. I've only been doing non-namespace work. To make things more fun, the SAX 2.0 API is slightly different for namespace tags as compared to non-namespace tags, and I don't know how they are supposed to work. The documentation I have is pre-2.0, and my "Python and XML" book hasn't arrived yet. > e.g. _m_dataset or something like that. Since these names are >pretty common, especially in this domain, it might be easy to use a >standard tag name when none was intended... My plan is several-fold: - make up my own tag names, since it's just us for now - document when/how they are used - look for existing tag names (BTW, I don't understand GAME all that well) - convince others to use Martel, or at least the ideas behind Martel (if no one else uses it, there won't be a namespace problem) - keep going blithely until there is a problem and/or until someone tells me how to use SAX 2.0-style namespaces - present the general Martel/parsing idea in my Biopython talk at the O'Reilly conference - bring up the specific problem in a lightning talk - convince others at the hackathon to help me, as I don't have enough breadth to get things right. Going to be a busy January. :) Andrew