From dalke at acm.org Thu Nov 2 23:36:52 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:53 2005
Subject: [Biopython-dev] unicode and sequences
Message-ID: <014801c0454f$b1b47320$19ac323f@josiah>

I was browsing through how Unicode works in Python 2.0. I found it
interesting that it's similar to how the biopython sequence class works,
in that unicode strings take a sequence of bytes and an optional
encoding, just like the sequence takes a string of bytes and an
alphabet.

There is a difference - I think the unicode code converts everything
into UTF-8 encoded Unicode. Still, I liked the similarity so wanted to
point it out :)

Andrew

From katel at worldpath.net Sat Nov 4 21:04:13 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:42:53 2005
Subject: [Biopython-dev] MaxRepeat
References: <014801c0454f$b1b47320$19ac323f@josiah>
Message-ID: <000b01c046cc$b34218e0$010a0a0a@cadence.com>

My unit tests for MaxRepeat, with one parameter, failed. I think the
problem is that the tech description shows the lower limit with a
default of 0. The code has no default for the lower limit.

Cayte

From katel at worldpath.net Sun Nov 5 18:41:13 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:42:53 2005
Subject: [Biopython-dev] Assert in Martel
Message-ID: <000901c04781$e340c540$010a0a0a@cadence.com>

The value in the invert parameter is undefined for values other than
0 or 1. 2 acts like 0, 3 acts like 1.

Cayte

From katel at worldpath.net Sun Nov 5 20:43:04 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:42:53 2005
Subject: [Biopython-dev] Martel Test Cases
Message-ID: <000901c04792$e88065e0$010a0a0a@cadence.com>

Today was a cold, rainy Sunday, perfect for coding and testing. :)
I committed a bunch of test cases for Martel. I should add more for
Group/GroupRef and some for the operator overloads.
My experience is that the unit test tool works well for this kind of
test but not for parsers, where you'd have to drag a lot of context
around.

Should we be thinking about a Gui for the new parsers?

Cayte

From dalke at acm.org Sun Nov 5 18:52:39 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:53 2005
Subject: [Biopython-dev] Re: MaxRepeat
Message-ID: <027801c04783$7c705b80$b9ab323f@josiah>

Cayte:
>I think the problem is that the tech description shows the lower limit
>with a default of 0. The code has no default for the lower limit.

You're right. It should have a default min value of 0.

> The value in the invert parameter is undefined for values other than
> 0 or 1. 2 acts like 0, 3 acts like 1.

Can you give me a test case? The "invert" option should be a test for
true/false using the Python definition of 0, [], (), {}, all being false
(as well as anything where __nonzero__ or __len__ returns 0) and 1, 2,
3, ... should all be considered true.

I looked over the code for how invert is used. There are a few problems
with it, like using 'exp.invert == 0' instead of 'not exp.invert', so
I'll fix those as well, but they won't give the behaviour you're talking
about.

> Today was a cold, rainy Sunday, perfect for coding and testing.

You sure you aren't in Santa Fe? :) We've had some snow in town,
although it's about 40 degrees so it isn't sticking. The freeze level
seems to be about 500 feet above us, at least, that's where the snow
line appears on the mountains.

Guess winter is settling on everyone - 'cept my family in Florida.

> My experience is that the unit test tool works well for this kind of
> test but not for parsers, where you'd have to drag a lot of context
> around.

I got that feeling as well. That's why there ended up being a lot of
test scaffolding for my regression tests, and my tests don't even check
to see if the code matched the right things (they just check to see if
it matched *something*).
I've been tempted to have a set of golden data, with a bunch of data
files for each grammar, then converting the parsed result to a canonical
XML form and comparing the result to the gold reference. I haven't gone
towards it since I get the feeling that that sort of regression code is
too fragile for code still under development.

> Should we be thinking about a Gui for the new parsers?

I'm not sure what that means. Something like the Tools/redemo.py in the
Python distribution? That is, a window with two text regions, one to
build the regexp and one containing the text to match. The regions that
match can be highlighted, perhaps with a mouseover to show the tag name
for a given region. Umm, that won't work since there can be many tags
describing a region.

Andrew

From katel at worldpath.net Mon Nov 6 01:20:33 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:42:53 2005
Subject: [Biopython-dev] Re: MaxRepeat
References: <027801c04783$7c705b80$b9ab323f@josiah>
Message-ID: <002001c047b9$ac38cf60$010a0a0a@cadence.com>

I added more test cases. test_n2 fails, probably something weird about
backslash. I can't post the test case, because MIME won't take it, but
I checked it in.

> > The value in the invert parameter is undefined for values other than
> > 0 or 1. 2 acts like 0, 3 acts like 1.
>
> Can you give me a test case? The "invert" option should be a test for
> true/false using the Python definition of 0, [], (), {}, all being false
> (as well as anything where __nonzero__ or __len__ returns 0) and 1, 2, 3,
> ...
> should all be considered true.
>
I can't reproduce it. Write it off as eyestrain.:)

Cayte

From chapmanb at arches.uga.edu Tue Nov 7 04:45:07 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar 5 14:42:53 2005
Subject: [Biopython-dev] Proposed addition to Standalone BLAST
In-Reply-To:
References: <14844.19384.149965.283975@taxus.athen1.ga.home.com>
Message-ID: <14855.53027.829310.153321@taxus.athen1.ga.home.com>

Jeff:
> Sure.
> Having some code that would help to diagnose errors in BLAST
> reports would be a very nice feature. Certainly more user friendly than
> having SyntaxError this or SyntaxError that.
>
> We would have to build this on top of the current exceptions, though.
> It's still nice to have the SyntaxErrors under the hood, as an explanation
> on why the parser is complaining in the first place.

Okay, I went ahead and tried to implement something to do what we are
talking about. The code is attached as a diff to the current
NCBIStandalone module. Basically, what I did was implement a class
BlastErrorParser that uses the regular BlastParser, but catches
SyntaxErrors and tries to figure out the problems with them. It will
also optionally save any BLAST reports that cause syntax errors to a
file (which I think is a useful feature if you want to look at the
records that are causing the errors in a big ol' file of BLAST results).

I use copy.deepcopy() to copy the handle, and since I was curious about
how this would affect the parsing time, I did a little timing test.
This wasn't anything scientific or anything, just a big BLAST report
that I had to parse which had errors in it. The results are:

using BlastErrorParser -> 1 hour and 31 minutes
    Starting parsing at: Mon Nov 6 22:38:32 2000
    Stopped parsing at: Tue Nov 7 00:09:04 2000

using BlastParser -> 1 hour and 30 minutes
    Starting parsing at: Tue Nov 7 00:37:56 2000
    Stopped parsing at: Tue Nov 7 02:07:57 2000

So I guess the overhead is minimal, and this makes me happy -- if
anyone else knows more about timings and wants to do tests, I would be
happy to hear about them.

Anyways, this does everything I was originally writing about wanting to
happen, and I like it, but I'd like to hear people's opinions and
comments on it. If people are for including it, then I can check it in
and also add a test that uses it to the regression tests.

Thanks for all the input on this so far!
Brad
-------------- next part --------------
*** NCBIStandalone.py.orig Thu Oct 12 13:32:21 2000
--- NCBIStandalone.py Mon Nov 6 22:28:16 2000
***************
*** 36,41 ****
--- 36,42 ----
  import re
  import popen2
  from types import *
+ import copy
  
  from Bio import File
  from Bio.ParserSupport import *
***************
*** 471,476 ****
--- 472,563 ----
          consumer.end_parameters()
  
+ class LowQualityBlastError(Exception):
+     """Error caused by running a low quality sequence through BLAST.
+ 
+     When low quality sequences (like GenBank entries containing only
+     stretches of a single nucleotide) are BLASTed, they will result in
+     BLAST generating an error and not being able to perform the BLAST.
+     search. This error should be raised for the BLAST reports produced
+     in this case.
+     """
+     pass
+ 
+ class BlastErrorParser:
+     """Attempt to catch and diagnose BLAST errors while parsing.
+ 
+     This utilizes the BlastParser module but adds an additional layer
+     of complexity on top of it by attempting to diagnose SyntaxError's
+     that may actually indicate problems during BLAST parsing.
+ 
+     Current BLAST problems this detects are:
+     o LowQualityBlastError - When BLASTing really low quality sequences
+     (ie. some GenBank entries which are just short streches of a single
+     nucleotide), BLAST will report an error with the sequence and be
+     unable to search with this. This will lead to a badly formatted
+     BLAST report that the parsers choke on. The parser will convert the
+     SyntaxError to a LowQualityBlastError and attempt to provide useful
+     information.
+     """
+     def __init__(self, bad_report_file = None):
+         """Initialize a parser that tries to catch BlastErrors.
+ 
+         Arguments:
+         o bad_report_file - An optional argument specifying a file to
+         write any reports that raise errors to. If not specified, these
+         reports will not be saved.
+         """
+         self._bad_report_file = bad_report_file
+         # if the report file exists, we want to clear the info in it
+         if self._bad_report_file and os.path.exists(self._bad_report_file):
+             tmp = open(self._bad_report_file, 'w')
+             tmp.close()
+ 
+         self._b_parser = BlastParser()
+ 
+     def parse(self, handle):
+         """Parse a handle, attempting to diagnose errors.
+         """
+         # copy the handle so we have it if we find an error
+         copy_handle = copy.deepcopy(handle)
+ 
+         try:
+             return self._b_parser.parse(handle)
+         except SyntaxError, msg:
+             # if we have a bad_report_file, save the info to it first
+             if self._bad_report_file:
+                 # copy the handle so we can write it
+                 error_handle = copy.deepcopy(copy_handle)
+                 # append the info to the file
+                 error_file = open(self._bad_report_file, 'a')
+                 error_file.write(error_handle.read())
+                 error_file.close()
+ 
+             # now we want to try and diagnose the error
+             self._diagnose_error(copy_handle, self._b_parser._consumer.data)
+ 
+             # if we got here we can't figure out the problem
+             # so we should pass along the syntax error we got
+             raise SyntaxError, msg
+ 
+     def _diagnose_error(self, handle, data_record):
+         """Attempt to diagnose an error in the passed handle.
+ 
+         Arguments:
+         o handle - The handle potentially containing the error
+         o data_record - The data record partially created by the consumer.
+         """
+         line = handle.readline()
+ 
+         while line:
+             # 'Searchingdone' instead of 'Searching......done' seems
+             # to indicate a failure to perform the BLAST due to
+             # low quality sequence
+             if line[:13] == 'Searchingdone':
+                 raise LowQualityBlastError("Blast failure occured on query: ",
+                                            data_record.query)
+             line = handle.readline()
+ 
  class BlastParser:
      """Parses BLAST data into a Record.Blast object.
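
The catch-and-diagnose pattern in the patch above can be sketched in
modern Python roughly as follows. This is only an illustration under
stated assumptions: `diagnose_report` and `parse_with_diagnosis` are
hypothetical names, the `parse` argument stands in for whatever parser
is being wrapped, and the report is read into a string once instead of
deep-copying the handle.

```python
import io


class LowQualityBlastError(Exception):
    """Raised when a report looks like a failed low quality query."""


def diagnose_report(text):
    """Scan report text for known failure signatures (hypothetical helper)."""
    for line in io.StringIO(text):
        # 'Searchingdone' instead of 'Searching......done' is the marker
        # the patch above looks for.
        if line.startswith("Searchingdone"):
            raise LowQualityBlastError("BLAST failure on a low quality query")


def parse_with_diagnosis(handle, parse, bad_report_handle=None):
    """Call parse() on the data; on SyntaxError, save the report and diagnose.

    `parse` is any callable taking a readable handle (an assumption made
    for this sketch, not the real BlastParser interface).
    """
    text = handle.read()  # read once, reuse the string as often as needed
    try:
        return parse(io.StringIO(text))
    except SyntaxError:
        if bad_report_handle is not None:
            bad_report_handle.write(text)
        diagnose_report(text)  # may raise a more specific error
        raise  # nothing recognized: pass the original error along
```

Keeping the raw text around means the bad report can be written out and
rescanned without any copying of handle objects.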
From jchang at SMI.Stanford.EDU Tue Nov 7 17:47:51 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar 5 14:42:53 2005
Subject: [Biopython-dev] Proposed addition to Standalone BLAST
In-Reply-To: <14855.53027.829310.153321@taxus.athen1.ga.home.com>
Message-ID:

> I use copy.deepcopy() to copy the handle

Are you sure you can copy file handles in this way? It's not working
for me using Python 2.0 on Solaris:

Python 2.0 (#1, Oct 17 2000, 12:05:31)
[GCC 2.8.1] on sunos5
Type "copyright", "credits" or "license" for more information.
>>> from Bio.Blast import NCBIStandalone
>>> parser = NCBIStandalone.BlastErrorParser()
>>> rec = parser.parse(open('bt001'))
Traceback (most recent call last):
  File "", line 1, in ?
  File "/home/jchang/lib/jchang/pylib/Bio/Blast/NCBIStandalone.py", line 1578, in parse
    copy_handle = copy.deepcopy(handle)
  File "/home/jchang/lib/python2.0/copy.py", line 147, in deepcopy
    raise error, \
copy.Error: un-deep-copyable object of type
>>>

I'm trying to parse blast test bt001.

+     def __init__(self, bad_report_file = None):
+         """Initialize a parser that tries to catch BlastErrors.
+
+         Arguments:
+         o bad_report_file - An optional argument specifying a file to
+         write any reports that raise errors to. If not specified, these
+         reports will not be saved.

Can we make this function take a handle instead of the name of a file?
That would allow people to use sys.stderr, if they want the bad files to
go to STDERR. The tradeoff is that it would place the burden of
creating a handle on the client.

Another option is to allow people to pass in either a file name or a
handle. While I'm not crazy about this, there is at least one instance
of this in Python (see uu.py), and tabnanny.py has a function that takes
the name of either a file or directory. Perhaps this is a case of
practicality beating purity.
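
The "either a file name or a handle" option could look something like
the sketch below, written in modern Python. The helper name
`report_target` is hypothetical and not part of Biopython; the only
assumption is that anything that is not a path-like value already
behaves like a writable handle.

```python
import os
from contextlib import contextmanager


@contextmanager
def report_target(target, mode="a"):
    """Yield a writable handle for a file name or an existing handle."""
    if isinstance(target, (str, bytes, os.PathLike)):
        # We opened it, so we are responsible for closing it.
        handle = open(target, mode)
        try:
            yield handle
        finally:
            handle.close()
    else:
        # The caller owns the handle (sys.stderr, StringIO, ...); leave it open.
        yield target
```

The cost of the convenience is the type check and the two ownership
rules; taking only a handle avoids both.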
Jeff

From dalke at acm.org Tue Nov 7 18:55:40 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:53 2005
Subject: [Biopython-dev] Proposed addition to Standalone BLAST
Message-ID: <031a01c04916$9ec20f00$89ac323f@josiah>

Jeff:
>Another option is to allow people to pass in either a file name or a
>handle. While I'm not crazy about this, there is at least one instance of
>this in Python (see uu.py), and tabnanny.py has a function that takes the
>name of either a file or directory. Perhaps this is a case of
>practicality beating purity.

Guess I'm a purist. (Does that mean I should be using Lisp? :)
Passing file handles is The Right Thing.

> That would allow people to use sys.stderr

Or a StringIO. I have the belief that if there's output it should be
useful, and if it's useful, it should be programmatically accessible.
Using file names is awkward and cumbersome, since you have to find some
writable directory (eg, mktemp and all the problems that entails).

I'm currently working with a Python library which uses a lot of file
names instead of handles. (It evolved from a set of shell scripts.)
It's pretty awkward since I have to wrap everything with functions or
objects which hide that it's referencing a file.

Otherwise, don't mind me - I haven't been following this thread.

Andrew
dalke@acm.org

From chapmanb at arches.uga.edu Tue Nov 7 19:30:55 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar 5 14:42:53 2005
Subject: [Biopython-dev] Proposed addition to Standalone BLAST
In-Reply-To:
References: <14855.53027.829310.153321@taxus.athen1.ga.home.com>
Message-ID: <14856.40639.744589.612069@taxus.athen1.ga.home.com>

Me:
> > I use copy.deepcopy() to copy the handle

Jeff checks on me:
> Are you sure you can copy file handles in this way? It's not working for
> me using Python 2.0 on Solaris:
[]

Ooops -- I should have checked a simplest case. Doh! Thanks for the
good catch.
Apparently, copy.deepcopy() can copy your magical File.StringHandles but
not regular ol' file handles. I was just using the output from the
iterator to parse, so I completely missed this.

A new version is attached which should work for this case -- it converts
things that aren't StringHandles to a StringHandle before proceeding.
This way there shouldn't be any extra overhead for using the iterator,
but it can handle taking a simple file.

[BlastErrorParser taking a file to write bad reports to]

Jeff asks:
> Can we make this function take a handle instead of the name of a file?
> That would allow people to use sys.stderr, if they want the bad files to
> go to STDERR. The tradeoff is that it would place the burden of creating
> a handle on the client.

Andrew agrees:
> Guess I'm a purist. (Does that mean I should be using Lisp? :)
> Passing file handles is The Right Thing.

Agreed on all counts. Biopython does use file handles for almost
everything, so not having a handle here is actually strange and awkward.
I've switched this over in the new attached patch.

Thanks for the comments! Please let me know of anything else at all.

Brad
-------------- next part --------------
*** NCBIStandalone.py.orig Thu Oct 12 13:32:21 2000
--- NCBIStandalone.py Tue Nov 7 19:17:35 2000
***************
*** 36,41 ****
--- 36,42 ----
  import re
  import popen2
  from types import *
+ import copy
  
  from Bio import File
  from Bio.ParserSupport import *
***************
*** 471,476 ****
--- 472,563 ----
          consumer.end_parameters()
  
+ class LowQualityBlastError(Exception):
+     """Error caused by running a low quality sequence through BLAST.
+ 
+     When low quality sequences (like GenBank entries containing only
+     stretches of a single nucleotide) are BLASTed, they will result in
+     BLAST generating an error and not being able to perform the BLAST.
+     search. This error should be raised for the BLAST reports produced
+     in this case.
+     """
+     pass
+ 
+ class BlastErrorParser:
+     """Attempt to catch and diagnose BLAST errors while parsing.
+ 
+     This utilizes the BlastParser module but adds an additional layer
+     of complexity on top of it by attempting to diagnose SyntaxError's
+     that may actually indicate problems during BLAST parsing.
+ 
+     Current BLAST problems this detects are:
+     o LowQualityBlastError - When BLASTing really low quality sequences
+     (ie. some GenBank entries which are just short streches of a single
+     nucleotide), BLAST will report an error with the sequence and be
+     unable to search with this. This will lead to a badly formatted
+     BLAST report that the parsers choke on. The parser will convert the
+     SyntaxError to a LowQualityBlastError and attempt to provide useful
+     information.
+     """
+     def __init__(self, bad_report_handle = None):
+         """Initialize a parser that tries to catch BlastErrors.
+ 
+         Arguments:
+         o bad_report_handle - An optional argument specifying a handle
+         where bad reports should be sent. This would allow you to save
+         all of the bad reports to a file, for instance. If no handle
+         is specified, the bad reports will not be saved.
+         """
+         self._bad_report_handle = bad_report_handle
+ 
+         self._b_parser = BlastParser()
+ 
+     def parse(self, handle):
+         """Parse a handle, attempting to diagnose errors.
+         """
+         if isinstance(handle, File.StringHandle):
+             shandle = handle
+         else:
+             shandle = File.StringHandle(handle.read())
+ 
+         # copy the handle so we have it if we find an error
+         copy_handle = copy.deepcopy(shandle)
+ 
+         try:
+             return self._b_parser.parse(shandle)
+         except SyntaxError, msg:
+             # if we have a bad_report_file, save the info to it first
+             if self._bad_report_handle:
+                 # copy the handle so we can write it
+                 error_handle = copy.deepcopy(copy_handle)
+                 # send the info to the error handle
+                 self._bad_report_handle.write(error_handle.read())
+ 
+             # now we want to try and diagnose the error
+             self._diagnose_error(copy_handle, self._b_parser._consumer.data)
+ 
+             # if we got here we can't figure out the problem
+             # so we should pass along the syntax error we got
+             raise SyntaxError, msg
+ 
+     def _diagnose_error(self, handle, data_record):
+         """Attempt to diagnose an error in the passed handle.
+ 
+         Arguments:
+         o handle - The handle potentially containing the error
+         o data_record - The data record partially created by the consumer.
+         """
+         line = handle.readline()
+ 
+         while line:
+             # 'Searchingdone' instead of 'Searching......done' seems
+             # to indicate a failure to perform the BLAST due to
+             # low quality sequence
+             if line[:13] == 'Searchingdone':
+                 raise LowQualityBlastError("Blast failure occured on query: ",
+                                            data_record.query)
+             line = handle.readline()
+ 
  class BlastParser:
      """Parses BLAST data into a Record.Blast object.

From katel at worldpath.net Wed Nov 8 02:36:09 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:42:53 2005
Subject: [Biopython-dev] platform independence for eol routines in Martel
Message-ID: <001d01c04956$910653e0$010a0a0a@cadence.com>

os.name gives the python name of the os. We could have a test and
different handling for posix and nt.
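
One alternative to branching on os.name, sketched here in modern Python,
is to normalize whichever end-of-line convention the data uses before it
reaches the parser. `normalize_eol` is a hypothetical helper, not
something in Martel; os.linesep is the standard library's constant for
the current platform's line separator.

```python
import os


def normalize_eol(text):
    """Convert DOS ("\r\n") and old Mac ("\r") line endings to "\n"."""
    # Replace the two-character form first so it is not split in two.
    return text.replace("\r\n", "\n").replace("\r", "\n")


# What this platform writes at the end of a line in text mode.
print(repr(os.linesep))
print(repr(normalize_eol("ID   first\r\nID   second\rID   third\n")))
```

With the input normalized, the grammar only ever has to match "\n",
whatever platform produced the file.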
Cayte

From jchang at SMI.Stanford.EDU Wed Nov 8 02:09:56 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar 5 14:42:53 2005
Subject: [Biopython-dev] platform independence for eol routines in Martel
In-Reply-To: <001d01c04956$910653e0$010a0a0a@cadence.com>
Message-ID:

I've always wanted an os.eol variable that's set to the proper end of
line character(s) for your platform. I think it's been brought up
before on comp.lang.python. I don't remember why the idea was shot
down.

Jeff

On Tue, 7 Nov 2000, Cayte wrote:

> os.name gives the python name of the os. We could have a test and
> different handling for posix and nt.
>
> Cayte
>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev@biopython.org
> http://biopython.org/mailman/listinfo/biopython-dev
>

From dalke at acm.org Wed Nov 8 03:02:40 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:53 2005
Subject: [Biopython-dev] platform independence for eol routines in Martel
Message-ID: <034301c0495a$5162fd20$89ac323f@josiah>

Jeff:
>I've always wanted an os.eol variable that's set to the proper end of line
>character(s) for your platform. I think it's been brought up before on
>comp.lang.python. I don't remember why the idea was shot down.

I think it's because "\n" is always supposed to be newline, and you need
to use the right open flags ("t" instead of "b") to get the translation.
The "\r", "\r\n", or "\n" is only used in binary mode.

Still, I agree.

Andrew

From jchang at SMI.Stanford.EDU Wed Nov 8 20:40:43 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar 5 14:42:53 2005
Subject: [Biopython-dev] Proposed addition to Standalone BLAST
In-Reply-To: <14856.40639.744589.612069@taxus.athen1.ga.home.com>
Message-ID:

Thanks for the updates. One more thing: Passing the results around as a
string would be essentially the same as doing deepcopies of a
StringHandle.
It would save the overhead of doing a deep copy of an object, and then
reading the results. The copy module is nice for arbitrary objects that
we don't know about a priori, but when we only deal with StringHandle's,
it's OK to just create one directly when we need it.

+     def parse(self, handle):
+         """Parse a handle, attempting to diagnose errors.
+         """
+         if isinstance(handle, File.StringHandle):
+             shandle = handle
+         else:
+             shandle = File.StringHandle(handle.read())

would be:
          results = handle.read()

+         try:
+             return self._b_parser.parse(shandle)

              return self._b_parser.parse(File.StringHandle(results))

+         except SyntaxError, msg:
+             # if we have a bad_report_file, save the info to it first
+             if self._bad_report_handle:
+                 # copy the handle so we can write it
+                 error_handle = copy.deepcopy(copy_handle)
+                 # send the info to the error handle
+                 self._bad_report_handle.write(error_handle.read())

              if self._bad_report_handle:
                  self._bad_report_handle.write(results)

etc

Thanks,
Jeff

From jchang at SMI.Stanford.EDU Fri Nov 10 20:37:05 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar 5 14:42:53 2005
Subject: [Biopython-dev] 0.90-d04 coming soon...
Message-ID:

Hello everybody,

I'm getting ready to make a 0.90-d04 release of Biopython. A few things
need to be done before it:

- Brad, I've checked in your BlastErrorParser. I'm saving the results as
  a string instead of a StringHandle. Could you please look this over and
  let me know if this is working and acceptable to you? Thanks.

- The test_gobase regression tests are failing. The output from the
  test_gobase.py file doesn't match the golden output. Cayte, could you
  look into this?

- The test_prodoc regression tests are failing. This is mostly my fault,
  as the previous version of Prodoc didn't allow copyrights at the end of
  records. However, this has been fixed. Cayte, do you mind having
  another go at the tests, and checking in the verified output?

- The test_align regression tests are failing for me.
  It complains that saxlib is missing. Do we need to install xmllib for
  this? I'm using Python 2.0. I thought this came with a SAX api?

- Brad, is your new similarity matrix code ready to check in?

Thanks,
Jeff

From katel at worldpath.net Sat Nov 11 05:12:32 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:42:53 2005
Subject: [Biopython-dev] 0.90-d04 coming soon...
References:
Message-ID: <003301c04bc7$e8c25980$010a0a0a@cadence.com>

----- Original Message -----
From: "Jeffrey Chang"
To:
Sent: Friday, November 10, 2000 5:37 PM
Subject: [Biopython-dev] 0.90-d04 coming soon...

> Hello everybody,
>
> - The test_gobase regression tests are failing. The output from the
> test_gobase.py file doesn't match the golden output. Cayte, could you
> look into this?
>
I ran test_gobase.py and then ran a diff between the output and the
file, test_gobase in output. The diff didn't show any differences. I
checked by eye too. Can you send me the output that is failing? Is it
platform dependent? Or did you retrieve the test htm files from gobase
again? Maybe there are changes? I need your output and input.

Cayte

From chapmanb at arches.uga.edu Sat Nov 11 07:17:17 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar 5 14:42:53 2005
Subject: [Biopython-dev] 0.90-d04 coming soon...
In-Reply-To:
References:
Message-ID: <14861.14541.155986.608067@taxus.athen1.ga.home.com>

Hi Jeff!

> I'm getting ready to make a 0.90-d04 release of Biopython.

Great! Do you need any help with anything (besides the points below, of
course :-)? Also, what is the deadline for rolling this? I just wanted
to write some more docs (PubMed and SwissProt are next on the list, I
think). I'm not going to hold you up on this, but if I can get them
done before the deadline, I'll try to do it.

Also, can we name it '0.90d04' and not '0.90-d04' (ie. no dash in
there). When I was playing with making rpms, rpm was complaining about
the dash in the name.

> - Brad, I've checked in your BlastErrorParser.
> I'm saving the results as a string instead of a StringHandle. Could
> you please look this over and let me know if this is working and
> acceptable to you? Thanks.

This looks great, Jeff, thanks much for checking it in! I definitely
agree that using the results directly instead of making them into a
StringHandle is much cleaner looking. Thanks again for all of your
suggestions on this.

Jeff:
> - The test_gobase regression tests are failing. The output from the
> test_gobase.py file doesn't match the golden output. Cayte, could you
> look into this?

Cayte:
> I ran test_gobase.py and then ran a diff between the output and
> the file, test_gobase in output. The diff didn't show any
> differences.

test_gobase in the regression test also fails for me (although just
running test_gobase.py works fine). My output/test_gobase file (which
should be exactly what is in CVS) looks like:

testing G405967.htm

And that's it, which explains why the regression test fails for me.
Cayte, perhaps you have a more recent copy of output/test_gobase than
what is in CVS?

> - The test_align regression tests are failing for me. It complains that
> saxlib is missing. Do we need to install xmllib for this? I'm using
> Python 2.0. I thought this came with a SAX api?

I think the saxlib errors are coming from Martel, which is not 2.0
friendly, yet. I attached a patch to Martel-0.3/Martel/Parser.py which
should make Martel work with only the 2.0 libraries (ie. no need to
install the PyXML package). I believe this should also work with 1.5.2
with PyXML 0.6.1 installed, but I haven't verified this.

If this patch doesn't fix anything and you still get errors from
test_align, could you send me your trace? I definitely want to fix any
problems with it!

> - Brad, is your new similarity matrix code ready to check in?

Well, actually this is all Iddo's code (substitution matrices) -- we've
just been talking back and forth about things and working on it
together. But, it is ready to go in.
It is sitting in my local copy working without a problem -- it also has
working tests and documentation (already in Tutorial.tex). Do you want
this to go in the next release? I think the code is good to go (it gets
the Brad-seal-of-approval :-), but it is up to you. Just give me the
word and I can check it in.

Thanks again for getting this together!

Brad
-------------- next part --------------
*** Parser.py.orig Mon Oct 9 06:41:10 2000
--- Parser.py Thu Oct 12 20:16:35 2000
***************
*** 30,36 ****
  """
  
  import urllib, pprint
! from xml.sax import saxlib
  import TextTools
  
  try:
--- 30,38 ----
  """
  
  import urllib, pprint
! from xml.sax import xmlreader
! from xml.sax import _exceptions
! from xml.sax import handler
  import TextTools
  
  try:
***************
*** 55,61 ****
  # The SAX startElements take an AttributeList as the second argument.
  # Martel's attributes are always empty, so make a simple class which
  # doesn't do anything and which I can guarantee won't be modified.
! class MartelAttributeList(saxlib.AttributeList):
      def getLength(self):
          return 0
      def getName(self, i):
--- 57,63 ----
  # The SAX startElements take an AttributeList as the second argument.
  # Martel's attributes are always empty, so make a simple class which
  # doesn't do anything and which I can guarantee won't be modified.
! class MartelAttributeList(xmlreader.AttributesImpl):
      def getLength(self):
          return 0
      def getName(self, i):
***************
*** 83,89 ****
          return alternative
  
  # singleton object shared amoung all startElement calls
! _attribute_list = MartelAttributeList()
  
  def _do_callback(s, begin, end, taglist, doc_handler):
--- 85,91 ----
          return alternative
  
  # singleton object shared amoung all startElement calls
! _attribute_list = MartelAttributeList([])
  
  def _do_callback(s, begin, end, taglist, doc_handler):
***************
*** 128,134 ****
          doc_handler.characters(s, begin, end-begin)
  
  # These exceptions are liable to change in the future
! class StateTableException(saxlib.SAXException):
      """used when a parse cannot be done"""
      pass
  
--- 130,136 ----
          doc_handler.characters(s, begin, end-begin)
  
  # These exceptions are liable to change in the future
! class StateTableException(_exceptions.SAXException):
      """used when a parse cannot be done"""
      pass
  
***************
*** 156,162 ****
  
      # Special case text for the base DocumentHandler since I know that
      # object does nothing and I want to test the method call overhead.
!     if doc_handler.__class__ != saxlib.DocumentHandler:
          # Send any tags to the client (there can be some even if there
          _do_callback(s, 0, pos, taglist, doc_handler)
  
--- 158,164 ----
  
      # Special case text for the base DocumentHandler since I know that
      # object does nothing and I want to test the method call overhead.
!     if doc_handler.__class__ != handler.ContentHandler:
          # Send any tags to the client (there can be some even if there
          _do_callback(s, 0, pos, taglist, doc_handler)
  
***************
*** 168,178 ****
      return None
  
  # This needs an interface like the standard XML parser
! class Parser(saxlib.Parser):
      """Parse the input data all in memory"""
  
      def __init__(self, tagtable, want_groupref_names = 0):
!         saxlib.Parser.__init__(self)
  
          assert type(tagtable) == type( () ), "mxTextTools only allows a tuple tagtable"
          self.tagtable = tagtable
--- 170,180 ----
      return None
  
  # This needs an interface like the standard XML parser
! class Parser(xmlreader.XMLReader):
      """Parse the input data all in memory"""
  
      def __init__(self, tagtable, want_groupref_names = 0):
!         xmlreader.XMLReader.__init__(self)
  
          assert type(tagtable) == type( () ), "mxTextTools only allows a tuple tagtable"
          self.tagtable = tagtable
***************
*** 206,239 ****
          XXX will be removed with the switch to Python 2.0, where parse()
          takes an 'InputSource'
          """
!         self.doc_handler.startDocument()
  
          if self.want_groupref_names:
              _match_group.clear()
  
          # parse the text and send the SAX events
!         result = _parse_elements(s, self.tagtable, self.doc_handler)
  
          if result is None:
              # Successful parse
!             self.doc_handler.endDocument()
!             return
!         elif isinstance(result, saxlib.SAXException):
              # could not parse record, and wasn't EOF
!             self.err_handler.fatalError(result)
!             return
  
          else:
              # Reached EOF
              pos = result
!             self.err_handler.fatalError(StateTableEOFException(pos))
!             return
  
      def close(self):
          pass
  
! class RecordParser(saxlib.Parser):
      """Parse the input data a record at a time"""
      def __init__(self, format_name, record_tagtable,
                   want_groupref_names,
                   make_reader, reader_args = ()):
--- 208,241 ----
          XXX will be removed with the switch to Python 2.0, where parse()
          takes an 'InputSource'
          """
!         self._cont_handler.startDocument()
  
          if self.want_groupref_names:
              _match_group.clear()
  
          # parse the text and send the SAX events
!         result = _parse_elements(s, self.tagtable, self._cont_handler)
  
          if result is None:
              # Successful parse
!             pass
!         elif isinstance(result, _exceptions.SAXException):
              # could not parse record, and wasn't EOF
!             self._err_handler.fatalError(result)
  
          else:
              # Reached EOF
              pos = result
!             self._err_handler.fatalError(StateTableEOFException(pos))
! 
!         # send an endDocument event even after errors
!         self._cont_handler.endDocument()
  
      def close(self):
          pass
  
! class RecordParser(xmlreader.XMLReader):
      """Parse the input data a record at a time"""
      def __init__(self, format_name, record_tagtable,
                   want_groupref_names,
                   make_reader, reader_args = ()):
***************
*** 249,255 ****
          reader_args - optional arguments to pass to make_reader after the
              input file object
          """
!         saxlib.Parser.__init__(self)
  
          self.format_name = format_name
          assert type(record_tagtable) == type( () ), \
--- 251,257 ----
          reader_args - optional arguments to pass to make_reader after the
              input file object
          """
!         xmlreader.XMLReader.__init__(self)
  
          self.format_name = format_name
          assert type(record_tagtable) == type( () ), \
***************
*** 272,305 ****
          """
          reader = apply(self.make_reader, (fileobj,) + self.reader_args)
!         self.doc_handler.startDocument()
  
          if self.want_groupref_names:
              _match_group.clear()
  
!         self.doc_handler.startElement(self.format_name, _attribute_list)
  
          filepos = 0  # XXX can get mixed up with DOS style "\r\n"
          while 1:
              record = reader.next()  # XXX what if an exception is raised?
              if record is None:
                  break
  
!             result = _parse_elements(record, self.tagtable, self.doc_handler)
              if result is None:
                  # Successfully read the record
                  continue
!             elif isinstance(result, saxlib.SAXException):
                  # Wrong format
!                 self.err_handler.fatalError(result)
                  return
              else:
                  # did not reach end of string
                  pos = filepos + result
!                 self.err_handler.fatalError(StateTableEOFException(pos))
              filepos = filepos + len(record)
  
!         self.doc_handler.endElement(self.format_name)
!         self.doc_handler.endDocument()
  
      def parse(self, systemId):
          """parse using the URL"""
--- 274,307 ----
          """
          reader = apply(self.make_reader, (fileobj,) + self.reader_args)
!         self._cont_handler.startDocument()
  
          if self.want_groupref_names:
              _match_group.clear()
  
!         self._cont_handler.startElement(self.format_name, _attribute_list)
  
          filepos = 0  # XXX can get mixed up with DOS style "\r\n"
          while 1:
              record = reader.next()  # XXX what if an exception is raised?
              if record is None:
                  break
  
!             result = _parse_elements(record, self.tagtable, self._cont_handler)
              if result is None:
                  # Successfully read the record
                  continue
!             elif isinstance(result, _exceptions.SAXException):
                  # Wrong format
!                 self._err_handler.fatalError(result)
                  return
              else:
                  # did not reach end of string
                  pos = filepos + result
!                 self._err_handler.fatalError(StateTableEOFException(pos))
              filepos = filepos + len(record)
  
!         self._cont_handler.endElement(self.format_name)
!         self._cont_handler.endDocument()
  
      def parse(self, systemId):
          """parse using the URL"""

From chapmanb at arches.uga.edu Sat Nov 11 12:49:10 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar 5 14:42:53 2005
Subject: [Biopython-dev] SwissProt parser
Message-ID: <14861.34454.85255.758400@taxus.athen1.ga.home.com>

Hello all;

I was writing docs for SwissProt, and noticed the parser breaking with
some of the sequences I was playing around with. I don't normally use
SwissProt, so I have no idea if these entries are representative of the
format or anything, but the following entries gave me problems:
'023729', '023730', '023731' (some nice Chalcone synthases).

The problem was that there is a reference to the NCBI taxonomy id in the
entries that the parser wasn't looking for. It occurs right after the
organism info and looks like:

OX   NCBI_TaxID=41205;

Anyways, I modified the parser so that it would accept this, and added
the possible information to the sequence class. It seems to work okay
with the entries I mentioned, and still passes the regression tests.
The patch for this is attached. Please let me know if there are any
problems with the patch or anything. Thanks!

Brad
-------------- next part --------------
*** SProt.py.orig Sun Jul 16 19:18:57 2000
--- SProt.py Sat Nov 11 12:30:16 2000
***************
*** 61,66 ****
--- 61,67 ----
      organelle          The origin of the sequence.
      organism_classification  The taxonomy classification. List of strings.
                         (http://www.ncbi.nlm.nih.gov/Taxonomy/)
+     taxonomy_id        NCBI taxonomy id
      references         List of Reference objects.
      comments           List of strings.
      cross_references   List of tuples (db, id1[, id2][, id3]). See the docs.
*************** *** 89,94 **** --- 90,96 ---- self.organism = '' self.organelle = '' self.organism_classification = [] + self.taxonomy_id = '' self.references = [] self.comments = [] self.cross_references = [] *************** *** 391,396 **** --- 393,402 ---- self._scan_line('OC', uhandle, consumer.organism_classification, one_or_more=1) + def _scan_ox(self, uhandle, consumer): + self._scan_line('OX', uhandle, consumer.taxonomy_id, + one_or_more=1) + def _scan_reference(self, uhandle, consumer): while 1: if safe_peekline(uhandle)[:2] != 'RN': *************** *** 462,467 **** --- 468,474 ---- _scan_os, _scan_og, _scan_oc, + _scan_ox, _scan_reference, _scan_cc, _scan_dr, *************** *** 540,545 **** --- 547,557 ---- cols = string.split(line, ';') for col in cols: self.data.organism_classification.append(string.lstrip(col)) + + def taxonomy_id(self, line): + line = self._chomp(string.rstrip(line[5:])) + descr, tax_id = string.split(line, '=') + self.data.taxonomy_id = tax_id def reference_number(self, line): rn = string.rstrip(line[5:]) From jchang at SMI.Stanford.EDU Sat Nov 11 15:34:58 2000 From: jchang at SMI.Stanford.EDU (Jeffrey Chang) Date: Sat Mar 5 14:42:53 2005 Subject: [Biopython-dev] 0.90-d04 coming soon... In-Reply-To: <14861.14541.155986.608067@taxus.athen1.ga.home.com> Message-ID: > Cayte: > > I ran test_gobase.py and then ran a diff between the output and > > the file, test_gobase in output. The diff didn't show any > > differences. > > test_gobase in the regression test also fails for me (although just > running test_gobase.py works fine). My output/test_gobase file (which > should be exactly what is in CVS) looks like: > > testing G405967.htm > > And that's it, which explains why the regressiont test fails for > me. Cayte, perhaps you have a more recent copy of output/test_gobase > then what is in CVS? Yep, that's what's happening to me as well. 
The output/test_gobase file contains only that single line, but running test_gobase.py generates 65 lines of output. It looks like output/test_gobase isn't up to date. Jeff From jchang at SMI.Stanford.EDU Sat Nov 11 16:04:36 2000 From: jchang at SMI.Stanford.EDU (Jeffrey Chang) Date: Sat Mar 5 14:42:53 2005 Subject: [Biopython-dev] SwissProt parser In-Reply-To: <14861.34454.85255.758400@taxus.athen1.ga.home.com> Message-ID: Heh. It looks like ExPASy snuck in a new tag for us. Thanks for the update and patch! It's now checked in. Jeff On Sat, 11 Nov 2000, Brad Chapman wrote: > Hello all; > I was writing docs for SwissProt, and noticed the parser > breaking with some of the sequences I was playing around with. I don't > normally use SwissProt, so I have no idea if these entries are > representative of the format or anything, but the following entries > gave me problems: '023729', '023730', '023731' (some nice Chalcone > synthases). > > The problem was that there is a reference to the NCBI taxonomy id in > the entries that the parser wasn't looking for. It occurs right after > the organism info and looks like: > > OX NCBI_TaxID=41205; > > Anyways, I modified the parser so that it would accept this, and added > the possible information to the sequence class. It seems to work okay > with the entries I mentioned, and still passes the regression > tests. The patch for this is attached. > > Please let me know if there are any problems with the patch or > anything. Thanks! > > Brad > > From katel at worldpath.net Sun Nov 12 01:16:37 2000 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:42:53 2005 Subject: [Biopython-dev] 0.90-d04 coming soon... References: Message-ID: <001f01c04c70$6cc26060$010a0a0a@cadence.com> > - The test_prodoc regression tests are failing. This is mostly my fault, > as the previous version of Prodoc didn't allow copyrights at the end of > records. However, this has been fixed. 
Cayte, do you mind having another
> go at the tests, and checking in the verified output?
>

One of the test files has a leading linefeed (decimal 10) that messes up
the start tag. I need to dig up a hex editor to remove it. For the future,
maybe Prodoc.py should strip white space before the first tag.

Cayte

From dalke at acm.org Sun Nov 12 04:56:21 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:53 2005
Subject: [Biopython-dev] 0.90-d04 coming soon...
Message-ID: <009b01c04c8e$d13bd440$43ac323f@josiah>

Brad:
>I think the saxlib errors are coming from Martel, which is not 2.0
>friendly, yet. I attached a patch to Martel-0.3/Martel/Parser.py which
>should make Martel work with only the 2.0 libraries (ie. no need to
>install the PyXML package). I believe this should also work with 1.5.2
>with PyXML 0.6.1 installed, but I haven't verified this.

There are saxlib errors with Martel-0.3 when using Python 2.0. Several
things changed between the old PyXML package and the new builtin module.
They include:

  o  switch from SAX 1.0 to 2.0 support
      - different methods (eg, 'setContentHandler' instead of
        'setDocumentHandler')
      - different method arguments (eg, 'characters(content)' instead
        of 'characters(text, start, size)')
  o  removal or renaming of several classes
      - DocumentHandler -> ContentHandler
      - no XML Canonicalization class
      - no 'BaseHandler'
      - no ErrorRaiser class - functionality merged into ErrorHandler
        and ErrorHandler now needs its __init__ to be called.

Brad's patch doesn't catch all of the problems. This evening I finally
switched all my code over to use Python 2.0 - at least enough that my
regression tests work :)

These changes should probably be included in the upcoming version.
However, they are *not* backwards compatible either to the Martel 0.3 API
or to Python 1.5.2. How does that affect the 0.90-d04 release? How does a
dependency on 2.0 affect a 1.0 release?

(Actually, I should say it's dependent on the PyXML package and not 1.5.2
per se.
It's still tricky because of the API changes between SAX 1.0 and SAX 2.0 and because I've started using Python 2.0 syntax, like "import X as Y".) I've also finished off the iterator support Brad wanted, excepting for some documentation. It works, but it's built on top of the callback method so will always be slower than the SAX-like interface - until someone spends the time needed to rewrite the code to talk to mxTextTools directly. Here's my to-do list for Martel, not all of which will be done for a hypothetical 1.0: o resolve the newline issue o interface for version detection - only need to read part of a file to determine the format/version - support categories? (Eg, "a PDB format" or "a sequence format") o cache tag tables for faster parser creation o attribute lists and XML namespaces - could be useful for version labels (eg, instead of - how to store in a regular expression pattern string - I just don't know enough about namespaces to know if I'm doing this one correctly. Any offers to help? o better debugging support - somehow identify the lastmost character attempted to parse (perhaps with a specialized tag table? Or modify mxTextTools?) - SAX Locator support o more formats, examples, testing, documentation, etc. However, I think the core API is now stable, which means it should be stable enough for people to starting writing parsers based off of it and not have things change from underneath. So Jeff, how would you like things to be scheduled? Andrew From dalke at acm.org Mon Nov 13 08:59:26 2000 From: dalke at acm.org (Andrew Dalke) Date: Sat Mar 5 14:42:53 2005 Subject: [Biopython-dev] Martel-0.35 Message-ID: <000a01c04d79$f0b43f60$b2ab323f@josiah> Okay, over the weekend I ended up not doing the work I was paid for and instead worked on Martel. It's available as usual from http://www.biopython.org/~dalke/Martel . This is version 0.35. Here's the change log from the README: Migrated to Python 2.0 and its xml package. 
No longer runs under older (1.x) Pythons. Added more RecordReaders (Until, CountLines, Nothing, Everything). Changed the RecordReader protocol to seed the line buffer (in the constructor) and to get the final state for the input file and line buffer (using remainder()). Needed to allow chaining of different reader types as with headers and footers. Added a HeaderFooter Parser for formats like Prosite and PIR which have a header and/or a footer with records in between. Renamed the StateTable exception to Parser exceptions and removed the EOF exception. Experimental Iterator support ("make_iterator") as an alternate for the pure SAX callback method. Improved error reporting. make_parser and make_iterator takes an optional "debug_level". Better error location is available with debug_level == 1 and if it == 2, print current match information to stdout. Warning: debug_level == 1 is about 11 times slower than debug_level == 0, which is why it is off by default. Support for both the 1.1 and 1.2 mxTextTools. For people like Brad who are learning how to use Martel, try "expression.make_parser(debug_level = 1)" or debug_level = 2. That really helps pin down where an error is likely located. BTW, I started a Prosite parser. The documentation isn't all that helpful, and I've already found a few errors. For example, in the prosite 39 release, PTS_EIIA_2 has a 5 digit date! (Interestingly, the online version from expasy.ch has only 4 digits but the INFO UPDATE is from 1995.) 
Andrew dalke@acm.org From johann at egenetics.com Mon Nov 13 09:44:05 2000 From: johann at egenetics.com (Johann Visagie) Date: Sat Mar 5 14:42:53 2005 Subject: [Biopython-dev] Martel-0.35 In-Reply-To: <000a01c04d79$f0b43f60$b2ab323f@josiah>; from dalke@acm.org on Mon, Nov 13, 2000 at 06:59:26AM -0700 References: <000a01c04d79$f0b43f60$b2ab323f@josiah> Message-ID: <20001113164405.A41426@fling.sanbi.ac.za> Andrew Dalke on 2000-11-13 (Mon) at 06:59:26 -0700: > > Here's the change log from the README: > > Migrated to Python 2.0 and its xml package. Just to make extra sure I understand: Does that mean Martel now only uses the xml package as installed as part of Python 2.0's standard libraries, and not the "extended" xml package as installed by PyXML 0.6.1 (a.k.a _xmlplus)? -- Johann From dalke at acm.org Mon Nov 13 15:16:29 2000 From: dalke at acm.org (Andrew Dalke) Date: Sat Mar 5 14:42:53 2005 Subject: [Biopython-dev] Martel-0.35 Message-ID: <00a501c04daf$f4938ce0$b2ab323f@josiah> Johann Visagie : >Just to make extra sure I understand: Does that mean Martel now only uses >the xml package as installed as part of Python 2.0's standard libraries, and >not the "extended" xml package as installed by PyXML 0.6.1 (a.k.a _xmlplus)? That is correct. There are no dependencies on PyXML. The core Martel code uses all stock Python 2.0. I figured that was a good thing. Andrew dalke@acm.org From chapmanb at arches.uga.edu Tue Nov 14 00:18:58 2000 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:42:53 2005 Subject: [Biopython-dev] Martel-0.35 In-Reply-To: <000a01c04d79$f0b43f60$b2ab323f@josiah> References: <000a01c04d79$f0b43f60$b2ab323f@josiah> Message-ID: <14864.52034.569146.154418@taxus.athen1.ga.home.com> Andrew: > Okay, over the weekend I ended up not doing the work I was paid for > and instead worked on Martel. Join the club :-). Seriously, thanks for this -- the new version looks great! > Migrated to Python 2.0 and its xml package. 
No longer runs under > older (1.x) Pythons. Thanks for catching all of the changes I missed for 2.0 support. This new version flushed out some errors I made in my Clustalw parser (changes are committed to CVS). > Experimental Iterator support ("make_iterator") as an alternate for > the pure SAX callback method. I had a chance to play with this a little, and seem to be grokking things a lot better. I modified my Martel based Fasta.py parser to use an iterator, so it now acts a little more like the biopython Fasta parser and only reads one record if a file is passed to it. Looks nice, although I definately need to play with it a lot more. > For people like Brad who are learning how to use Martel, try > "expression.make_parser(debug_level = 1)" or debug_level = 2. That > really helps pin down where an error is likely located. This is a really nice feature. Thanks, this'll be a big help. BTW, I took a minute to distutilize Martel (takes about as long as copying everything to site-packages :-), which I guess we'll need to do anyways to include it in the next release. I put everything into a Martel top level package, and install it like that. Anyways, do you want this? Thanks again for the new release. Brad From katel at worldpath.net Tue Nov 14 04:47:54 2000 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:42:53 2005 Subject: [Biopython-dev] 0.90-d04 coming soon... References: Message-ID: <001d01c04e20$05ae4180$010a0a0a@cadence.com> Further investigation of prodoc showed that it choked on TRAILING whitespace. The parser read the first record ok. pdoc00472.txt had some white space that caused the parser to look for another record. IMHO, white space between records should be ignored. I have some cut and paste errors to fix in gobase.py. 
Since they are in the comments they don't cause a failure but I don't want it to be too obvious that its a part-time effort.:) Cayte From jchang at SMI.Stanford.EDU Tue Nov 14 20:15:40 2000 From: jchang at SMI.Stanford.EDU (Jeffrey Chang) Date: Sat Mar 5 14:42:53 2005 Subject: [Biopython-dev] 0.90-d04 coming soon... In-Reply-To: <001d01c04e20$05ae4180$010a0a0a@cadence.com> Message-ID: > Further investigation of prodoc showed that it choked on TRAILING > whitespace. The parser read the first record ok. pdoc00472.txt had some > white space that caused the parser to look for another record. IMHO, white > space between records should be ignored. Agreed. I'll take a look at this soon. Thanks, Jeff From jchang at SMI.Stanford.EDU Tue Nov 14 20:36:29 2000 From: jchang at SMI.Stanford.EDU (Jeffrey Chang) Date: Sat Mar 5 14:42:53 2005 Subject: [Biopython-dev] 0.90-d04 coming soon... In-Reply-To: <14861.14541.155986.608067@taxus.athen1.ga.home.com> Message-ID: > Great! Do you need any help with anything (besides the points below, > of course :-)? Also, what is the deadline for rolling this? No real deadline, except for *very soon now*. I think I've got things handled for now, but I remember that you promised me to look into rpm's and windows binaries when the source release is made! :) [alignment/substitution code] > Do you want this to go in the next release? I think the code is > good to go (it gets the Brad-seal-of-approval :-), but it is up to > you. Just give me the word and I can check it in. Yes, if Iddo agrees as well. Please let me know if it's going in! Thanks, Jeff From chapmanb at arches.uga.edu Wed Nov 15 14:27:19 2000 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:42:53 2005 Subject: [Biopython-dev] 0.90-d04 coming soon... 
In-Reply-To: References: <14861.14541.155986.608067@taxus.athen1.ga.home.com> Message-ID: <14866.58263.897534.73422@taxus.athen1.ga.home.com> Jeff: > I remember that you promised me to look into rpm's > and windows binaries when the source release is made! :) Darn, I thought you forgot :-). Seriously, I looked at the documentation and tried to learn a little about rpms, and it appears as if you can make rpms using distutils as easily as: python setup.py bdist_rpm As far as I can tell (using rpm -qpl the.rpm), the rpm appears to be complete and in good order. So, I should have no problem making rpms for linuxppc (which is the only linux system I have access to) -- hopefully we can get people to volunteer for other systems as long as we can provide the simple instructions for them. Maybe we can ask about this on the main list once the new distribution is out. Windows will take me a little longer -- there are no docs in distutils, and I still need to learn myself some python on Windows. I will work on it though :-) [should SubsMat go in?] > Yes, if Iddo agrees as well. Please let me know if it's going in! Okee dokee, I just put it in, along with tests and an update on setup.py. Please let me know if any of the tests fail or if it gives you any problems. I'm ccing this to Iddo (not sure if he listens in on the dev list) but hopefully he can make a post about it on the main list and announce that it is in there for people to play with. Enjoy! Brad From idoerg at cc.huji.ac.il Thu Nov 16 05:29:33 2000 From: idoerg at cc.huji.ac.il (Iddo Friedberg) Date: Sat Mar 5 14:42:53 2005 Subject: [Biopython-dev] 0.90-d04 coming soon... In-Reply-To: <14866.58263.897534.73422@taxus.athen1.ga.home.com> Message-ID: Hi Brad & Jeff, OK, here's the announcement. I think it should be cut & pasted to the general announcement about 0.90-d04 Feel free to make changes in order to accomodate documentation pointers, or anything else. (The Align module accepted replacement matrix generator?) 
If I need to make a more elaborate announcement, let me know. ---------------------------- CUT HERE ---------------------------------- SubsMat: a module for generating substitution matrices from user data. Documentation is available on http://biopython.org/wiki/html/BioPython/SubsMat.html Accepted replacement matrices (the initial input for a substitution matrix) may be generated using the Align module. XXX documentation pointer? XXX FreqTable: a module for generating alphabet (amino-acid/nucleotide) frequency tables from user data. Documentation is available on: http://biopython.org/wiki/html/BioPython/FreqTable.html ----------------------------- END -------------------------------------- On Wed, 15 Nov 2000, Brad Chapman wrote: : : [should SubsMat go in?] : > Yes, if Iddo agrees as well.Please let me know if it's going in! : : Okee dokee, I just put it in, along with tests and an update on : setup.py. Please let me know if any of the tests fail or if it gives : you any problems. : : I'm ccing this to Iddo (not sure if he listens in on the dev list) but : hopefully he can make a post about it on the main list and announce : that it is in there for people to play with. : : Enjoy! : : Brad : : -- /* --- */main(c){float t,x,y,b=-2,a=b;for(;b-=a>2?.1/(a=-2):0,b<2; /* | */putchar(30+c),a+=.0503) for(x=y=c=0;++c<90&x*x+y*y<4;y=2* /* | */x*y+b,x=t)t=x*x-y*y+a;} /* --- ddo Friedberg */ From jchang at SMI.Stanford.EDU Sat Nov 18 02:05:03 2000 From: jchang at SMI.Stanford.EDU (Jeffrey Chang) Date: Sat Mar 5 14:42:54 2005 Subject: [Biopython-dev] next release closer (?) Message-ID: - with Martel 0.35, the alignment stuff now works - I've fixed the Prosite and Prodoc parsers so that they now ignore whitespace. Cayte, do you mind having another look at the test_prodoc regression test? Please verify the results and check in the output file. There's currently no output/test_prodoc - gobase is still failing the regression test. 
The output/test_gobase only contains one line, and the regression tests are generating more than that. - I don't remember if I addressed it before, but yes, Brad, we can drop the dash. The release will be called 0.90d04. :) Once these are fixed, we can go ahead with the release. Jeff From katel at worldpath.net Sat Nov 18 18:11:28 2000 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:42:54 2005 Subject: [Biopython-dev] next release closer (?) References: Message-ID: <000701c051b4$e2e51140$010a0a0a@cadence.com> ----- Original Message ----- From: "Jeffrey Chang" To: Sent: Friday, November 17, 2000 11:05 PM Subject: [Biopython-dev] next release closer (?) > - with Martel 0.35, the alignment stuff now works > > - I've fixed the Prosite and Prodoc parsers so that they now ignore > whitespace. Cayte, do you mind having another look at the > test_prodoc regression test? Please verify the results and check in the > output file. There's currently no output/test_prodoc > > - gobase is still failing the regression test. The output/test_gobase > only contains one line, and the regression tests are generating more than > that. > Should we change the baseline? The extra text contains information that tells whether gobase is providing the information it promised. Cayte From katel at worldpath.net Sat Nov 18 20:10:54 2000 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:42:54 2005 Subject: [Biopython-dev] next release closer (?) References: Message-ID: <001a01c051c5$91744fe0$010a0a0a@cadence.com> ----- Original Message ----- From: "Jeffrey Chang" To: Sent: Friday, November 17, 2000 11:05 PM Subject: [Biopython-dev] next release closer (?) > - with Martel 0.35, the alignment stuff now works > > - I've fixed the Prosite and Prodoc parsers so that they now ignore > whitespace. Cayte, do you mind having another look at the > test_prodoc regression test? Please verify the results and check in the > output file. 
There's currently no output/test_prodoc > Prodoc now passes the standalone test and I committed test_prodoc. With my upgrade to Python2, br_regrtest causes this output. Traceback (most recent call last): File "br_regrtest.py", line 36, in ? test_support = __import__("test/test_support") NameError: Case mismatch for module name test/test_support (filename c:\python20\lib\test_support.py) Its puzzling because only lower case is used as far as I can see. My environment is: TMP=c:\windows\TEMP TEMP=C:\windows\TEMP PROMPT=$p$g winbootdir=C:\WINDOWS COMSPEC=C:\WINDOWS\COMMAND.COM PATH=C:\JDK1.2.2\BIN;JUNIT3.2;C:\PROGRA~1\CYGNUS~1\ECOS\TOOLS\BIN;C:\BC5\BIN ;C:\ CYGNUS\CYGWIN~1\H-I586~1\BIN;C:\PROGRA~1\TCL\BIN;C:\PERL\BIN;C:\PYTHON20;C:\ WIND OWS;C:\WINDOWS;C:\WINDOWS\COMMAND;C:\PROGRA~1\NETWOR~1\MCAFEE~1;C:\PKWARE;C: \CVS JAXPHOME=C:\Program Files\JavaSoft\Jaxp1_0-ea1 PYTHONPATH=.;C:\PYTHON20\LIB\;C:\PYTHON20\WXPYTHON\;C:\BIOPYT~1.90-;C:\TEXTT O~1; C:\PYXML-~1.1;C:\BIOPYT~1.90-\TESTS VSL=C:\MODSOFT\VSL CLASSPATH=C:\PROGRAM FILES\JAVASOFT\JAXP1_0-EA1\JAXP.JAR;C:\;C:\BIOJAVA;C:\JUNIT 3.2\JUNIT.JAR;. CVSROOT=cvs@cvs.biopython.org windir=C:\WINDOWS BLASTER=A240 I5 D1 T4 CMDLINE=python br_regrtest.py Cayte From chapmanb at arches.uga.edu Sat Nov 18 17:10:48 2000 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:42:54 2005 Subject: [Biopython-dev] next release closer (?) In-Reply-To: <001a01c051c5$91744fe0$010a0a0a@cadence.com> References: <001a01c051c5$91744fe0$010a0a0a@cadence.com> Message-ID: <14870.65128.625316.93731@taxus.athen1.ga.home.com> Cayte writes: > With my upgrade to Python2, br_regrtest causes this output. > > Traceback (most recent call last): > File "br_regrtest.py", line 36, in ? > test_support = __import__("test/test_support") > NameError: Case mismatch for module name test/test_support > (filename c:\python20\lib\test_support.py) > > Its puzzling because only lower case is used as far as I can see. 
My > environment is: [windows] I just noticed this problem, since I was messing around trying to learn python on windows just this morning! I checked in a fix earlier today, so if you 'cvs update' you should get it. I just changed the offending line to: from test import test_support I'm not sure if there are reasons not to do it this way, but it seemed to make sense to me. Hopefully Andrew will speak up if there is a good reason not to have it this way. Brad From jchang at SMI.Stanford.EDU Sun Nov 19 03:39:23 2000 From: jchang at SMI.Stanford.EDU (Jeffrey Chang) Date: Sat Mar 5 14:42:54 2005 Subject: [Biopython-dev] next release closer (?) In-Reply-To: <000701c051b4$e2e51140$010a0a0a@cadence.com> Message-ID: > > - gobase is still failing the regression test. The output/test_gobase > > only contains one line, and the regression tests are generating more than > > that. > > > Should we change the baseline? The extra text contains information that > tells whether gobase is providing the information it promised. The baseline contains only: testing G405967.htm It's pretty uninformative, and it must be incomplete. Please check in the hand verified output from the regression tests. Thanks, Jeff From jchang at SMI.Stanford.EDU Sun Nov 19 03:48:26 2000 From: jchang at SMI.Stanford.EDU (Jeffrey Chang) Date: Sat Mar 5 14:42:54 2005 Subject: [Biopython-dev] next release closer (?) In-Reply-To: <001a01c051c5$91744fe0$010a0a0a@cadence.com> Message-ID: > Prodoc now passes the standalone test and I committed test_prodoc. I'm having a few problems with this suite of tests: - br_regrtest saves the name of the regression test in the first line of the output file. For example, the first line of output/test_seq is "test_seq". This seems to be missing with test_prodoc. - "python br_regrtest test_prodoc.py" fails because it can't find "Prosite/Doc/pdoc00472.txt". That file isn't in the CVS repository and needs to be added. 
test test_prodoc crashed -- exceptions.IOError : [Errno 2] No such file or
directory: 'Prosite/Doc/pdoc00472.txt'
1 test failed: test_prodoc

- The test_prodoc.py output contains the addresses of Reference objects.

references

This won't work, because the object address is going to be different from
computer to computer. Instead of the pointer, please print out the
reference, or at least enough of the string to know that it's parsed
correctly.

Thanks,
Jeff

From dalke at acm.org Sun Nov 19 04:21:02 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:54 2005
Subject: solving the newline problem (was Re: [Biopython-dev] Martel-0.3 available)
Message-ID: <001601c0520a$0adbc720$edab323f@josiah>

Let me restore context first. The question was how to handle different
newline conventions, where native text files on the Mac use '\015', on
unix use '\012' and MS use '\015\012'.

This convention is hidden somewhat behind the C file I/O layer. In text
mode it translates the local newline convention to the single character
'\n', which in ASCII is '\012'. In binary mode the input character stream
is not modified.

Martel uses '\n' as the end of line character and converts it to chr(10).
This requires the input be in ASCII, which is a good assumption. (I don't
expect to run Martel under an IBM 370 any time soon - that being an EBCDIC
machine :)

This means Martel should be able to run under an OS so long as the input
text data has been converted to use the local machine's line ending
convention and the file was opened in text mode, which is the default.
For example, ftp transfers must be done in ASCII mode instead of binary.

Networks make things more complicated. For example, an http connection
offers only the equivalent of ftp's binary mode, meaning there is no way
to negotiate local newline conventions and automatically convert as
needed. Similarly, files shared over NFS or SMB are not automatically
converted.
(Samba does have a flag to allow for automatic conversion, but I don't
believe it is used very often.)

On top of that, people are well known for being human - they are
inconsistent. I had considered a wrapper which would read the first few
characters of a file to determine the newline convention and convert as
needed. Some time ago Brad pointed out:

> There are times where people have generated files like this in my lab
> (the sequencer is running Windows, but they like to play around on
> the files on a Mac -- I still don't know how they got a mix of line
> breaks -- I think by cutting and pasting between files with different
> line breaks).

As another case, Roger Sayle pointed out to me yesterday that some of the
data files are made by concatenating other files. For example, by merging
the gbpri* from GenBank into one file. Suppose some of those files were
downloaded via FTP in ASCII mode and some in binary. Then the newline
convention changes throughout the merged file.

Since this does happen, it would be nice to handle this case gracefully.

Earlier I had outlined a few ways to solve the problem:

> 1) require the input to be converted to the local line ending and
> provide no support for doing so

Not graceful. No one likes this solution.

> 2) supply some adapters ("FromMac", "FromUnix", "FromDos") but don't
> use them; instead leaving the decision up to the client code

As proposed, this wouldn't work because the line convention can change.
Instead, it would need to be a "FromAny" which would allow any of the
three endings.

> 3) provide a tool which autodetects endings and uses the right
> adapter

My original thought was to read up to the first newline and use that
convention for the rest of the file. This would not work. Instead, the
"FromAny" converter would always have to check for all three endings.

> 4) http://members.nbci.com/_XOOM/meowing/python/index.html

I mentioned this library for two reasons.
First, I had heard it was faster than Python's readline() method. This is
true, but it is almost exactly as fast as Python's readlines(), which I
had been using, so it offers no performance benefit.

Second, I thought it allowed having all three of "\n", "\r" and "\r\n" as
the newline character. After investigation I found out that it doesn't.
You can change the end-of-line marker but it must still be a single
string.

It turns out that mxTextTools has a linesplit function which takes a
string and splits it into lines - allowing any of the three conventions.
As it is written it is not appropriate for Martel because it strips out
the newlines. A guarantee in Martel's design is that it must send all the
characters to the ContentHandler's "characters" method so they may be
counted. This allows indexing by just counting the number of characters
which have gone through. If the end of line characters are discarded, it
is impossible to know if the tossed text was one character or two. The
mxTextTools function is very easy to implement and this deficiency is
readily remedied.

(An alternate solution is to add a tell method to the parser which gets
mapped down to the file handle's tell method. This is a problem because
the readers are free to read ahead many characters for faster reading.
When tell is called, it would have to figure out where the callback is in
the parsing. This is complicated even more by text mode file handles on
MS, where tell works correctly and increases by two for "\r\n" even though
only a single character is returned.)

> 5) define an EOL = Re(r"\n|\r\n?")
>
> I don't like 5 because people will forget to use it.

Brad liked it because:

> 1. Easy to implement, and isn't very likely to break :-).
>
> 2. Provided the regexp would recognize Mac line breaks (hmmm, I'm not
> positive what those look like) then this could deal with files with
> multiple different types of line breaks without whining.

I ended up having a more serious problem with this option.
Martel allows what I call "RecordReaders" that are really two parsers in one. The first does a simple scan of the input stream to identify records and the second parses the records into SAX events. Together they create the same SAX events as a standard parser but use much less memory. (They only need enough memory to parse the most complex record, while the standard parsers parse the whole file at once and need roughly 10 times as much RAM as the input data.)

The input data files are line oriented so my RecordReaders used the file's "readlines" method with a sizehint to read a large but memory-bounded number of lines, then scanned those lines to identify the records. The lines are joined back together into one string and parsed with the second stage parser. This makes file reading about as fast as you can do with native Python.

However, readlines uses the local platform's definition of newline and there is no way to support all three conventions. If I had a Mac text file, which uses '\r' (ASCII 13), and tried to read a line under unix, I would get everything in the file as one line since there is no '\n' (ASCII 10) in the file.

So I'm left with the conclusion that I need to write a specialized reader which understands all three line conventions, rather like the 'FromAny' mentioned above. Unlike mxTextTools' linesplit function it would need to keep the end-of-line identifier. Unlike my RecordReaders, it couldn't use the readline or readlines methods but would have to call read directly.

Here's how the data would go through the system. Create a file object (open a file, use urllib to create a socket connection or use a StringIO). Wrap it inside a FromAny object, which uses the file's read() method to implement its own readlines() method, which supports the different newline conventions. The RecordReader uses those lines to find the records then merges them back into one string for the record parser. Very complicated, with lots of pure Python code to make things slow.
Hence, I didn't like it.

As I was looking through the QIO code I came up with an idea, which I think ultimately arises from the bioperl list. Bioperl's FASTA parser works by setting $/ (the line separator) to "\n>". This pushes the problem of record identification to Perl and quite simplifies the read loop. The QIO interface would allow the same simplification, so searching for a SWISS-PROT record could be turned into looking for the string "\nID ".

QIO doesn't support all three endings. I could modify the code, but then that would require (yet) another C extension. We're already including mxTextTools, which does text processing - why not use it? That's when I dug through the module and found the 'linesplit' function, which is written in pure Python using the taglist.

I hacked together some test code to try it out. It is attached. It parses SWISS-PROT records by looking for lines matching "//" followed by "\n", "\r\n" or "\r" and using them as end of record indicators. After some tweaking of the tagtable to remove a subtable call, I found out it was 15% *faster* than the readlines code. (I haven't yet tested it on MS to ensure it handles both text and binary reads, but it should. :)

It works on a large block of text at a time rather than splitting them apart into lines. The record parser uses a single block of text so the current RecordReaders need to string.join the lines back into a block. This new approach only needs to use a single subslice to get that text, so overall it should be a bit faster still.

WHAT DOES THIS GET US?

This new approach makes record identification much faster and allows the record readers to work on files containing a mix of any of the three standard line encodings. This means my objection to option 5 no longer includes any objections based on parsing performance. There are still some problems with usability. In binary mode, or with foreign text files, the parser can send back "\n", "\r\n" or "\r" characters as newlines.
The format definition must support them. The format definition for newline is simply "\n" which is insufficient. For example, suppose you just want to read the text of the DE line in SWISS-PROT. The current format definition might be:

  DE = Group("DE", Re("DE (?P<description>[^\n]*)\n"))

This would have to be replaced with

  DE = Group("DE", Re("DE (?P<description>[^\n\r]*)(\n|\r\n?)"))

There are two changes: one for "description" from [^\n] to [^\n\r] and the other from \n to \n|\r\n? . They are simple mechanical transformations but the need for them may be sufficiently different from common use that it would be nice to automate it or otherwise ignore their need.

I mentioned one possibility - define EOL = Re("\n|\r\n?"). Then the DE format definition becomes:

  DE = Group("DE", Re("DE (?P<description>[^\n\r]*)") + EOL)

This is simpler to type and less error prone than using the full, correct definition, but isn't as nice as "\n". It isn't standard so I think people will forget to put in the EOL in place of "\n". Finally, it doesn't fix the need to use [^\n\r].

Here is a solution which appears to make the problem disappear. If "\n" is ever found outside of a [] then replace it with "\n|\r\n?". If it is ever found inside of a [], then also include "\r". The problem is that it violates one of my basic design beliefs. Things which act different should not look the same. Other regular expression parsers do not support this conversion so I do not want to use it.

(Martel doesn't support backtracking inside of repeats. You may justifiably call it a violation of this belief. On the other hand, any solution which works in Martel should work using a normal regular expression engine, so the implementation is really a subset of existing behaviour and not a new behaviour.)

Here's another possibility. There are still some letters unused as escape sequences in both Perl and Python. What about defining \R to mean "platform-independent newline character"? When used outside of []s it gets turned into "\n|\r\n?"
and when used inside of []s is the same as [\r\n]. I chose \R because \N in perl is used for "named char". Its use would change the DE definition from

  DE = Group("DE", Re("DE (?P<description>[^\n]*)\n"))

to

  DE = Group("DE", Re("DE (?P<description>[^\R]*)\R"))

It is still a non-standard definition, which means it isn't as nice as I would like for it to be. However, I haven't found any other regular expression grammar which supports alternate newline conventions so there isn't really any standard to be standard to.

The only time it would be used is in the Martel definition. Converting the Martel expression back to a regular expression pattern would use the "\n|\r\n?" or "[\r\n]" descriptions, so the expression itself is still standard; the \R is simply a shorthand notation, like \n is itself shorthand for \012.

The conversion is mechanical and is in most cases a simple text substitution. That makes it easy to use, although its existence and need would need to be carefully documented and enforced with social pressure. ("You *do* know that \n doesn't work as well as \R, right?")

In closing, I've come up with a way to increase parsing performance and in a way which is platform independent and requires few changes in people's understanding of regular expression syntax. The first part (increased performance) does not affect what I consider to be the stable part of the API. The second part does change things from their commonly accepted use so I would like to hear any comments people may have about it.

Andrew
dalke@acm.org

P.S. In retrospect using mxTextTools for the record reading is obvious and solves quite a few problems I was having. I hate it when that happens because it makes me feel dim-witted. After all, I've been thinking about this problem for a long time - why was I stuck in the old solution? But that's the way things go :)

P.P.S. - and irrelevant FSU (my alma mater) beat Florida and will likely be ranked as the number 2 college team.
Miami's complaining that they'll be #3 after FSU even though they beat FSU. Of course, they forget '89 when FSU beat Miami but Miami was ranked #1.

-------------- next part --------------
# % python read2.py
# 80000 records found with readlines
# Time for readlines 118.07604301
# 80000 records found with find_record_ends
# Time for tagtables 100.489358068
# %

from Martel import Generate
from mx import TextTools as TT

tagtable = (
    # Is the current line the end of record marker?
    (None, TT.Word, "//", +5, +1),

    # Make sure it ends the line
    ("end", TT.Is, '\n', +1, -1),  # matches '\n'
    (None, TT.Is, '\r', +3, +1),
    ("end", TT.Is, '\n', +1, -3),
    ("end", TT.Skip, 0, -4, -4),

    # Not the end of record marker, so read to the end of line
    (None, TT.AllInSet, TT.invset('\r\n'), +1, +1),

    # Check if EOF
    (None, TT.EOF, TT.Here, +1, TT.MatchOk),

    # Not EOF, so scarf any newlines
    (None, TT.AllInSet, TT.set('\r\n'), TT.MatchFail, -7),
    )

def find_record_ends(text):
    # Each "end" tag is (name, start, end, subtags); keep the end offset.
    result, taglist, pos = TT.tag(text, tagtable)
    ends = []
    for tag in taglist:
        ends.append(tag[2])
    return ends

def test1():
    expect = (
        '//\n',
        'Andrew Dalke\n//\n',
        'was //\nhere\n//\n',
        '//\n'
        )
    text = "//\nAndrew Dalke\n//\nwas //\nhere\n//\n//\n"
    ends = find_record_ends(text)
    assert len(expect) == len(ends), (len(expect), len(ends))
    prev = 0
    for ex, end in map(None, expect, ends):
        s = text[prev:end]
        assert ex == s, (ex, s)
        prev = end
    print "expected lines found"

def test2():
    infile = open("/home/dalke/ftps/swissprot/sprot38.dat")
    s = ""
    count = 0
    while 1:
        data = infile.read(1000000)
        #print "Loop", count, len(s)
        if not data:
            break
        # Prepend the unconsumed tail of the previous block. The offsets
        # returned by find_record_ends index into this combined string.
        s = s + data
        ends = find_record_ends(s)
        if not ends:
            continue
        s = s[ends[-1]:]
        count = count + len(ends)
    assert not s, "still have data: %s" % repr(s[:200])
    print count, "records found with find_record_ends"

def test3():
    infile = open("/home/dalke/ftps/swissprot/sprot38.dat")
    count = 0
    while 1:
        lines = infile.readlines(1000000)
        if not lines:
            break
        #print "Loop", count
        for line in lines:
            if line == "//\n":
                count = count + 1
    print count, "records found with readlines"

def do_time():
    import time
    t1 = time.time()
    test3()
    t2 = time.time()
    print "Time for readlines", t2-t1

    t1 = time.time()
    test2()
    t2 = time.time()
    print "Time for tagtables", t2-t1

if __name__ == "__main__":
    #test1()
    #test2()
    do_time()

From chapmanb at arches.uga.edu Sun Nov 19 12:30:29 2000 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:42:54 2005 Subject: [Biopython-dev] Small change to NCBIWWW Message-ID: <14872.3637.919725.20317@taxus.athen1.ga.home.com> Hello all; I was using NCBIWWW.blast() to access the BLAST cgi script this morning and noticed that the parameters used to restrict the organism type to BLAST against weren't quite working right. The gi_list variable was included as a dictionary parameter, but wasn't actually being passed to the CGI script. Attached is a patch which fixes this (oooh, one line fix! Very impressive :-). and also adds support for the LIST_ORG box in which you can specify an arbitrary organism to blast against (ie. other organisms that aren't in their pull down box). Let me know if anything doesn't seem right with this. Both options now seem to work fine for me. Thanks! Brad -------------- next part -------------- *** NCBIWWW.py.orig Thu Oct 19 21:31:54 2000 --- NCBIWWW.py Sun Nov 19 11:54:59 2000 *************** *** 531,537 **** def blast(program, datalib, sequence, input_type='Sequence in FASTA format', ! double_window=None, gi_list='(None)', expect='10', filter='L', genetic_code='Standard (1)', mat_param='PAM30 9 1', other_advanced=None, ncbi_gi=None, overview=None, --- 531,538 ---- def blast(program, datalib, sequence, input_type='Sequence in FASTA format', ! double_window=None, gi_list='(None)', !
list_org = None, expect='10', filter='L', genetic_code='Standard (1)', mat_param='PAM30 9 1', other_advanced=None, ncbi_gi=None, overview=None, *************** *** 542,548 **** ): """blast(program, datalib, sequence, input_type='Sequence in FASTA format', ! double_window=None, gi_list='(None)', expect='10', filter='L', genetic_code='Standard (1)', mat_param='PAM30 9 1', other_advanced=None, ncbi_gi=None, overview=None, --- 543,550 ---- ): """blast(program, datalib, sequence, input_type='Sequence in FASTA format', ! double_window=None, gi_list='(None)', ! list_org = None, expect='10', filter='L', genetic_code='Standard (1)', mat_param='PAM30 9 1', other_advanced=None, ncbi_gi=None, overview=None, *************** *** 570,575 **** --- 572,579 ---- 'DATALIB' : datalib, 'SEQUENCE' : sequence, 'DOUBLE_WINDOW' : double_window, + 'GI_LIST' : gi_list, + 'LIST_ORG' : list_org, 'INPUT_TYPE' : input_type, 'EXPECT' : expect, 'FILTER' : filter, From jchang at SMI.Stanford.EDU Sun Nov 19 12:43:54 2000 From: jchang at SMI.Stanford.EDU (Jeffrey Chang) Date: Sat Mar 5 14:42:54 2005 Subject: [Biopython-dev] Small change to NCBIWWW In-Reply-To: <14872.3637.919725.20317@taxus.athen1.ga.home.com> Message-ID: Good catch! I incorporate the fix into the CVS tree. Thanks, Jeff On Sun, 19 Nov 2000, Brad Chapman wrote: > Hello all; > I was using NCBIWWW.blast() to access the BLAST cgi script this > morning and noticed that the parameters used to restrict the organism > type to BLAST against weren't quite working right. The gi_list > variable was included as a dictionary parameter, but wasn't actually > being passed to the CGI script. > > Attached is a patch which fixes this (oooh, one line fix! Very > impressive :-). and also adds support for the LIST_ORG box in which > you can specify an arbitrary organism to blast against (ie. other > organisms that aren't in their pull down box). > > Let me know if anything doesn't seem right with this. Both options now > seem to work fine for me. Thanks! 
> > Brad > > > From dalke at acm.org Mon Nov 20 01:01:10 2000 From: dalke at acm.org (Andrew Dalke) Date: Sat Mar 5 14:42:54 2005 Subject: solving the newline problem (was Re: [Biopython-dev] Martel-0.3 available) Message-ID: <002b01c052b7$48e47b40$fbab323f@josiah> Me: >It works on a large block of text at a time rather than splitting them >apart into lines. The record parser uses a single block of text so >the current RecordReaders need to string.join the lines back into a >block. This new approach only needs to use a single subslice to get >that text, so overall it should be a bit faster still. I've got a first pass at replacing the StartsWith RecordReader. The old reader (readlines and string.join) takes about 160 seconds to read sprot38.dat while the new one takes about 90 seconds. I also checked and they return identical results. >Here's another possibility. There are still some letters unused as escape >sequences in both Perl and Python. What about defining \R to mean >"platform-independent newline character"? When used outside of []s it >gets turned into "\n|\r\n?" and when used inside of []s is the same as >[\r\n]. I chose \R because \N in perl is used for "named char". I've got a first pass at this as well. sre_parse.py is very clean code to modify. The result seems to pass my regression tests. Still need to try it against real data on a non-unix platform. But that's all for the next day or so since I've got to get back to paying work now. Andrew dalke@acm.org From katel at worldpath.net Mon Nov 20 05:00:19 2000 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:42:54 2005 Subject: [Biopython-dev] next release closer (?) References: Message-ID: <002b01c052d8$b19ddb60$010a0a0a@cadence.com> ----- Original Message ----- From: "Jeffrey Chang" To: "Cayte" Cc: Sent: Sunday, November 19, 2000 12:48 AM Subject: Re: [Biopython-dev] next release closer (?) > > Prodoc now passes the standalone test and I committed test_prodoc. 
> > I'm having a few problems with this suite of tests: > > - br_regrtest saves the name of the regression test in the first line of > the output file. For example, the first line of output/test_seq is > "test_seq". This seems to be missing with test_prodoc. I'm still investigating this. > > - "python br_regrtest test_prodoc.py" fails because it can't find > "Prosite/Doc/pdoc00472.txt". That file isn't in the CVS repository and > needs to be added. > test test_prodoc crashed -- exceptions.IOError : [Errno 2] No such file or > direc > tory: 'Prosite/Doc/pdoc00472.txt' > 1 test failed: test_prodoc > I checked the file in. > - The test_prodoc.py output contains the addresses of Reference objects. > references > > > > This won't work, because the object address is going to be different from > computer to computer. Instead of the pointer, please print out the > reference, or at least enough of the string to know that it's parsed > correctly. > I fixed this. I still need to add a baseline for rebase. Cayte From katel at worldpath.net Tue Nov 21 02:53:04 2000 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:42:54 2005 Subject: [Biopython-dev] next release closer (?) References: Message-ID: <002a01c05390$15759f80$010a0a0a@cadence.com> ----- Original Message ----- From: "Jeffrey Chang" To: "Cayte" Cc: Sent: Sunday, November 19, 2000 12:48 AM Subject: Re: [Biopython-dev] next release closer (?) > > Prodoc now passes the standalone test and I committed test_prodoc. > > I'm having a few problems with this suite of tests: > > - br_regrtest saves the name of the regression test in the first line of > the output file. For example, the first line of output/test_seq is > "test_seq". This seems to be missing with test_prodoc. > Its also missing from test_seq and test_Fasta when I run them standalone. Is the test name inserted manually into the baseline files? If so, I'll also have to add it to test_rebase. I get an error from test_prosite. My OS is Win98. 
test test_prosite crashed -- exceptions.TypeError : an integer is required Cayte Cayte From jchang at SMI.Stanford.EDU Mon Nov 20 23:50:54 2000 From: jchang at SMI.Stanford.EDU (Jeffrey Chang) Date: Sat Mar 5 14:42:54 2005 Subject: [Biopython-dev] next release closer (?) In-Reply-To: <002a01c05390$15759f80$010a0a0a@cadence.com> Message-ID: [Jeff] > > - br_regrtest saves the name of the regression test in the first line of > > the output file. For example, the first line of output/test_seq is > > "test_seq". This seems to be missing with test_prodoc. [Cayte] > Its also missing from test_seq and test_Fasta when I run them > standalone. Is the test name inserted manually into the baseline files? If > so, I'll also have to add it to test_rebase. br_regrtest should do it automatically. From the biopython/Tests directory, run: python br_regrtest -v test_Fasta and the first line will be 'test_seq'. To generate the file in the output directory, do: python br_regrtest -g test_Fasta This will create a file in output/test_Fasta, whose first line will be 'test_Fasta'. This will need to be verified by hand in order for the regression tests to be accurate. Sorry about the confusion. > I get an error from test_prosite. My OS is Win98. > > test test_prosite crashed -- exceptions.TypeError : an integer is required I don't know. Andrew? One thing you can try, is to run: python test_prosite.py and see the full stack dump that's generated. Jeff From dalke at acm.org Tue Nov 21 00:38:24 2000 From: dalke at acm.org (Andrew Dalke) Date: Sat Mar 5 14:42:54 2005 Subject: [Biopython-dev] next release closer (?) Message-ID: <001201c0537d$45c30500$9bac323f@josiah> Jeff on the error Cayte's getting: >> I get an error from test_prosite. My OS is Win98. >> >> test test_prosite crashed -- exceptions.TypeError : an integer is required > >I don't know. Andrew? > >One thing you can try, is to run: >python test_prosite.py > >and see the full stack dump that's generated. 
I would need to see the stack trace. I cannot reproduce the error using the current CVS version. I don't see the string "an integer is required" anywhere in the Prosite code, nor in the rest of the biopython distribution. Looking at the source code for Python, that only arises during a conversion to int. So I would need to find out which call to int is failing and the text that it's trying to convert. Andrew dalke@acm.org From dalke at acm.org Tue Nov 21 00:59:33 2000 From: dalke at acm.org (Andrew Dalke) Date: Sat Mar 5 14:42:54 2005 Subject: solving the newline problem (was Re: [Biopython-dev] Martel-0.3 available) Message-ID: <005f01c05380$3a250d80$9bac323f@josiah> [Continuing the thread] mxTextTools is really fast, but it's very hard to write raw tagtables. It's all one state table with no symbolic jump labels. Blech. I finished up the first drafts of the new StartsWith and EndsWith RecordReaders. The new EndsWith parser is about 50% faster than the readlines based one. The source code is temporarily at http://www.biopython.org/~dalke/RecordReader.py for anyone who wants to review it. Not much yet in the way of comments, I'm afraid. Andrew From katel at worldpath.net Sat Nov 25 02:13:44 2000 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:42:54 2005 Subject: [Biopython-dev] mxTextTools Message-ID: <002701c056af$406b1900$010a0a0a@cadence.com> Andrew, do you have a Windows compiled version of mxTextTools. My VC++ CD disappeared and the old pyd no longer works with Python 2.00. Cayte From katel at worldpath.net Sat Nov 25 22:34:06 2000 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:42:54 2005 Subject: [Biopython-dev] Martel 3.5 recompile Message-ID: <000b01c05759$bc000e40$010a0a0a@cadence.com> My VC++ CD turned up, so I recompiled. The following stack trace appeared, when I ran m tests. C:\biopython-0.90-d03\Martel\UnitTests>python RunMartelTestCase.py Traceback (most recent call last): File "RunMartelTestCase.py", line 12, in ? 
import MartelTestCase File "MartelTestCase.py", line 23, in ? import Martel File "c:\biopyt~1.90-\Martel\__init__.py", line 3, in ? import Expression File "c:\biopyt~1.90-\Martel\Expression.py", line 25, in ? import Parser File "c:\biopyt~1.90-\Martel\Parser.py", line 34, in ? import TextTools File "c:\textto~1\TextTools.py", line 230, in ? def _replace3(text,what,with, NameError: There is no variable named 'FS' The recompile of mxTextTools.pyd gave 1 warning. LINK : warning LNK4049: locally defined symbol "_mxBMS_Type" imported Cayte From katel at worldpath.net Sat Nov 25 22:51:31 2000 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:42:54 2005 Subject: [Biopython-dev] Martel Message-ID: <001501c0575c$2abe5ba0$010a0a0a@cadence.com> I answered my own question, the FS is in the __init file, which was in a different path. Cayte From katel at worldpath.net Sat Nov 25 23:50:58 2000 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:42:54 2005 Subject: [Biopython-dev] Martel Unit Test Cases Message-ID: <000901c05764$87ca44a0$010a0a0a@cadence.com> The UnitTest cases pass now, on Martel 3.5, except for the newline and a test case involving backslashed backslashes ( test_n2 ). These also fail in version 3.0. Cayte From katel at worldpath.net Sun Nov 26 23:28:37 2000 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:42:54 2005 Subject: [Biopython-dev] next release closer (?) References: <001201c0537d$45c30500$9bac323f@josiah> Message-ID: <004501c0582a$84228760$010a0a0a@cadence.com> ----- Original Message ----- From: "Andrew Dalke" To: Sent: Monday, November 20, 2000 9:38 PM Subject: Re: [Biopython-dev] next release closer (?) > Jeff on the error Cayte's getting: > >> I get an error from test_prosite. My OS is Win98. > >> > >> test test_prosite crashed -- exceptions.TypeError : an integer is > required > > > >I don't know. Andrew? > > > >One thing you can try, is to run: > >python test_prosite.py > > > >and see the full stack dump that's generated. 
> > I would need to see the stack trace. I cannot reproduce the error using > the current CVS version. > > I don't see the string "an integer is required" anywhere in the Prosite > code, nor in the rest of the biopython distribution. Looking at the > source code for Python, that only arises during a conversion to int. > So I would need to find out which call to int is failing and the text > that it's trying to convert. > C:\biopython-0.90-d03\Tests>python test_prosite.py Patterns: 'A.' 'A' '(A)' Traceback (most recent call last): File "test_prosite.py", line 88, in ? m = p.search(Seq.Seq(x)) File "c:\biopyt~1.90-\Bio\Prosite\Pattern.py", line 168, in search m = self.grouped_re.search(buffer(seq.data), pos, endpos) TypeError: an integer is required Cayte From katel at worldpath.net Mon Nov 27 01:33:07 2000 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:42:54 2005 Subject: [Biopython-dev] next release closer (?) References: <001201c0537d$45c30500$9bac323f@josiah> <004501c0582a$84228760$010a0a0a@cadence.com> Message-ID: <005801c0583b$e8218700$010a0a0a@cadence.com> ----- Original Message ----- From: "Cayte" To: "Andrew Dalke" ; Sent: Sunday, November 26, 2000 8:28 PM Subject: Re: [Biopython-dev] next release closer (?) > > ----- Original Message ----- > From: "Andrew Dalke" > To: > Sent: Monday, November 20, 2000 9:38 PM > Subject: Re: [Biopython-dev] next release closer (?) > > > > Jeff on the error Cayte's getting: > > >> I get an error from test_prosite. My OS is Win98. > > >> > > >> test test_prosite crashed -- exceptions.TypeError : an integer is > > required > > > > > >I don't know. Andrew? > > > > > >One thing you can try, is to run: > > >python test_prosite.py > > > > > >and see the full stack dump that's generated. > > > > I would need to see the stack trace. I cannot reproduce the error using > > the current CVS version. 
> > > > I don't see the string "an integer is required" anywhere in the Prosite > > code, nor in the rest of the biopython distribution. Looking at the > > source code for Python, that only arises during a conversion to int. > > So I would need to find out which call to int is failing and the text > > that it's trying to convert. > > > C:\biopython-0.90-d03\Tests>python test_prosite.py > Patterns: 'A.' 'A' '(A)' > Traceback (most recent call last): > File "test_prosite.py", line 88, in ? > m = p.search(Seq.Seq(x)) > File "c:\biopyt~1.90-\Bio\Prosite\Pattern.py", line 168, in search > m = self.grouped_re.search(buffer(seq.data), pos, endpos) > TypeError: an integer is required > > Cayte > Its OK with the laest Pattern.py From dalke at acm.org Thu Nov 30 00:30:56 2000 From: dalke at acm.org (Andrew Dalke) Date: Sat Mar 5 14:42:54 2005 Subject: [Biopython-dev] next release closer (?) Message-ID: <013b01c05a8e$b7ea10c0$62ac323f@josiah> Cayte: >> C:\biopython-0.90-d03\Tests>python test_prosite.py >> Patterns: 'A.' 'A' '(A)' >> Traceback (most recent call last): >> File "test_prosite.py", line 88, in ? >> m = p.search(Seq.Seq(x)) >> File "c:\biopyt~1.90-\Bio\Prosite\Pattern.py", line 168, in search >> m = self.grouped_re.search(buffer(seq.data), pos, endpos) >> TypeError: an integer is required >> >> Cayte >> > Its OK with the laest Pattern.py I checked in the CVS logs since I wanted to ensure that it was a proper code fix and not some side effect of perhaps another bug. 
Looks like Brad fixed that on 2000/09/27 with the following:

< m = self.grouped_re.search(buffer(seq.data), pos, endpos)
---
> if endpos:
>     m = self.grouped_re.search(buffer(seq.data), pos, endpos)
> else:
>     m = self.grouped_re.search(buffer(seq.data), pos)
173c176,179
< m = self.grouped_re.match(buffer(seq.data), pos, endpos)
---
> if endpos:
>     m = self.grouped_re.match(buffer(seq.data), pos, endpos)
> else:
>     m = self.grouped_re.match(buffer(seq.data), pos)

This would indeed have caused the problem you identified, and updating to the newer version properly fixed it.

The base reason for the problem was a difference between Python 1.5.2's re module and 2.0's sre. In the first module, the "search" method is defined in Python as:

  def search(self, string, pos=0, endpos=None):

in the second, it's defined in C as

  int start = 0;
  int end = INT_MAX;
  ...
  if (!PyArg_ParseTupleAndKeywords(args, kw, "O|ii:search", kwlist,
                                   &string, &start, &end))

which when translated into Python is

  def search(self, string, pos=0, endpos=sys.maxint):

There's little anyone could have done to guard against this change in the underlying Python API.

Also, BTW, when we make the change to Python 2.0, I suggest changing Pattern.py's Prosite.search so that endpos defaults to sys.maxint instead of the None it does now. This keeps it compatible with the Python API and prevents the if-branches in the code - I don't like branches since they are harder to test fully.

Andrew

From dalke at acm.org Thu Nov 30 00:51:10 2000 From: dalke at acm.org (Andrew Dalke) Date: Sat Mar 5 14:42:54 2005 Subject: [Biopython-dev] next release closer (?) Message-ID: <015301c05a91$8b547660$62ac323f@josiah>

>> if endpos:
>>     m = self.grouped_re.search(buffer(seq.data), pos, endpos)
>> else:
>>     m = self.grouped_re.search(buffer(seq.data), pos)

Oops. Just realized this code contains a bug when endpos == 0. The test should instead be for

  if endpos is not None:
      ...

Fixed in CVS.
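To make the endpos == 0 case concrete, here is a small sketch of the three variants discussed above. It is only an illustration and not the Pattern.py code: the function names are invented for the example, it uses the plain re module on a string instead of a Seq object, and it is written for a current Python, where sys.maxsize stands in for the sys.maxint of Python 2.0.

```python
import re
import sys

pattern = re.compile("A.")

def search_truthy(seq, pos=0, endpos=None):
    # Original fix: endpos == 0 is falsy, so it is silently ignored.
    if endpos:
        return pattern.search(seq, pos, endpos)
    return pattern.search(seq, pos)

def search_is_not_none(seq, pos=0, endpos=None):
    # Corrected test: only a missing endpos falls back to the default.
    if endpos is not None:
        return pattern.search(seq, pos, endpos)
    return pattern.search(seq, pos)

def search_default(seq, pos=0, endpos=sys.maxsize):
    # Branch-free variant: the default mirrors the regex engine's own.
    return pattern.search(seq, pos, endpos)

# endpos=0 means "search an empty slice", so nothing should match.
assert search_truthy("AB", 0, 0) is not None       # the bug: still matches
assert search_is_not_none("AB", 0, 0) is None
assert search_default("AB", 0, 0) is None
```

The last variant has no branch to test at all, which is the appeal of matching the underlying default instead of using None.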
Andrew From chapmanb at arches.uga.edu Thu Nov 30 15:27:07 2000 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:42:54 2005 Subject: [Biopython-dev] next release closer (?) In-Reply-To: <013b01c05a8e$b7ea10c0$62ac323f@josiah> References: <013b01c05a8e$b7ea10c0$62ac323f@josiah> Message-ID: <14886.47131.653099.144288@taxus.athen1.ga.home.com> [Cayte's Prosite problem] Andrew: > I checked in the CVS logs since I wanted to ensure that it was a proper > code fix and not some side effect of perhaps another bug. Looks like > Brad fixed that on 2000/09/27 with the following: [change because of a different default argument in python 2.0] Doh! Sorry, that I didn't say anything about this -- I'd actually forgotten about this fix and it didn't cross my mind that Cayte's problem could be related to it. This is my fault, I should have posted to the dev list about this... [my "fix"] > Oops. Just realized this code contains a bug when endpos == 0. Double Doh! Thanks for the fix on this. I apologize again, I should have posted to the list on this -- I was just thinking I was making a "simple" change, but should have been more careful. Since that time I've become a lot more paranoid, and started posting patches for other people's code instead of fixing directly in CVS, and this is a good reason why I should do this. This brings up a point -- does anyone think it would be worthwhile to have CVS commits and log messages sent to the dev list? Bioperl has this and I think it's very worthwhile -- then for cases like this I would feel more comfortable going ahead with a small "fix" because I know Andrew would read the log... Then he could think: "hey, what's this punk doing messing with my code?" and go in and check up on the fix, if he feels like it. Just an idea, but maybe posting patches is better... 
I would really like to have bugs sent to the dev list when they come in -- I just noticed a couple from Iddo that I should have dealt with (I think that is all fixed now, regardless), but didn't realize were there. Whadda you all think about this? > Also, BTW, when we make the change to Python 2.0, I suggest changing > Pattern.py's Prosite.search so that endpos defaults to sys.maxint > instead of the None it does now. This keeps it compatible with the > Python API and prevents the if-branches in the code - I don't like > branches since they are harder to test fully. This is true -- your fix would be better, if you are not worried about 1.5.2 compatibility. As far as I can tell, we are officially requiring 2.0 and no one seems to mind, so if you personally aren't worried about people having to have 2.0 to use Prosite, then I give a big +1 to switching to the more stable code. This way I won't have to stay up nights worrying about more bugs in my "fixes" :-). Brad From dalke at acm.org Thu Nov 30 22:43:46 2000 From: dalke at acm.org (Andrew Dalke) Date: Sat Mar 5 14:42:54 2005 Subject: [Biopython-dev] next release closer (?) Message-ID: <005601c05b49$187c1860$73ac323f@josiah> Brad: >Doh! Sorry, that I didn't say anything about this -- I'd actually >forgotten about this fix and it didn't cross my mind that >Cayte's problem could be related to it. This is my fault, I should >have posted to the dev list about this... Couple of points. I know I tend to forget the details of things after a couple of months, and I don't expect others to have better memories. In this case, my first thought to the problem was that some string wasn't being converted to an integer - which wasn't the case - so there wasn't much of a clue to jog your memory. Secondly, even if you posted to the list two months ago when you did the fix, the odds of me (or anyone else) remembering is also pretty low. 
>This brings up a point -- does anyone think it would be worthwhile to
>have CVS commits and log messages sent to the dev list? Bioperl has
>this and I think it's very worthwhile

A couple of months ago there was a bug report on the bioperl list (not
the dev list, the general one). As I recall, someone reported a problem
in BLAST parsing where it didn't understand one of the fasta|id|label
forms. It turns out the code had been fixed and the problem was that
the person who reported the bug hadn't tried the newer bioperl release.

It took a while for there to be any response regarding the problem.
Part of the reason was the poor bug report, but the other, more major
part was likely that no one remembered that there had been a
change/fix. After all, it had been 6 months previous. This despite that
bioperl has the CVS email notifications and they have both more
developers and more people using the BLAST parser.

It was much easier just to go to the CVS logs for the appropriate file
and see all the changes at once; which is what I did to track down how
Cayte's problem disappeared. Therefore, I do not think that CVS email
notifications would really help out for this case.

That's not saying that email notifications don't have other uses. Two I
can think of are "hey, what's this punk doing messing with my code?"
and status updates. The first of these can be done with other tools,
like looking at which files changed when doing a cvs update, or using
the cvs log to see the list of changes.

I didn't use the best of phrases for the latter of these. It's an idea
I picked up from McConnell's "Rapid Development" (a book which I fully
recommend, btw). He suggests breaking a project up into
"mini-milestones", which are tasks that can be completed within a
couple of days. When the task is completed, the developer sends out a
short email to the group saying it's done. It might also point out how
to use the new feature or describe that it's 100 times faster than the
older code or ....
The result helps improve communications, helps the project manager
track the task timelines, and gives everyone a bit of good news that
things are getting done.

I think CVS updates are too fine grained for this level of
communications. They report on the changes done on a per-file basis and
not on a per-task or per-bug basis. When you read the email
notification you need to reconstruct what's going on. (You still need
to do that when looking at the cvs log, but then you can use cvs diff
to see the actual code changes and you have the code right there to
look through.)

Also, I get enough email as it is now - I don't want to get email for
every bug report (esp. ones like "Oops, fixed typo in 'protien'")

Therefore, I still don't think that automatic email notification of CVS
changes is all that useful an ability.

>-- then for cases like this I
>would feel more comfortable going ahead with a small "fix" because I
>know Andrew would read the log... Then he could think: and go in and check up on the
>fix, if he feels like it. Just an idea, but maybe posting patches is
>better...

>I would really like to have bugs sent to the dev list when they come
>in -- I just noticed a couple from Iddo that I should have dealt with
>(I think that is all fixed now, regardless), but didn't realize were
>there. Whadda you all think about this?

Bugs are different. Unless there's someone willing to triage bugs and
pass them on to the right person (and hopefully the person will
respond) it might as well go to everyone. Plus, as I've said, I don't
like having a lot of email so there's a negative feedback loop to
reduce the bug count :)

So I've no problems with this. Though in the future if there are both a
lot of bugs and a lot of different development, something will need to
be done to make sure there is some way to direct the right messages to
the right people. (Improving signal to noise.)
>> Also, BTW, when we make the change to Python 2.0, I suggest changing
>> Pattern.py's Prosite.search so that endpos defaults to sys.maxint
>> instead of the None it does now. This keeps it compatible with the
>> Python API and prevents the if-branches in the code - I don't like
>> branches since they are harder to test fully.

>As far as I can tell, we are officially requiring 2.0 and no one seems
>to mind,

I thought the switchover to 2.0 wasn't going to occur until after the
next release (the one that's coming closer (?) :) So I was going to
wait until then - so long as I remember.

> This way I won't have to stay up nights worrying about more bugs
> in my "fixes" :-).

There is an extreme viewpoint to this. As I understand XP, any desired
behaviour should have a test for it. This allows people to change the
code and - so long as the tests still pass - assume the changes are
valid. This doesn't work in the most literal sense since I could have
code like

    if endpos == 87655:
        endpos = endpos + 8

and there's no way people will write a test for every possible input
combination. On the other hand, it is a good practice to test boundary
conditions, so there could (should?) be a test for endpos = None and
endpos = 0. Had they been present, your bug would have been found right
away.

So one way to sleep more comfortably is to add regression tests. While
you then lose sleep worrying that you aren't testing everything, I've
found I gain more than I lose.

Andrew
dalke@acm.org
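
[Editor's sketch] The two suggestions in the message above (an integer default for endpos instead of None, and boundary tests for endpos = None and endpos = 0) can be illustrated roughly as follows. This is not the Pattern.py code: the names search_maxsize_default, search_none_default and check_boundaries are hypothetical, plain substring matching stands in for Prosite pattern matching, and Python 3's sys.maxsize is assumed in place of the sys.maxint mentioned in the thread.

```python
import sys


def search_maxsize_default(text, motif, pos=0, endpos=sys.maxsize):
    """Hypothetical stand-in for a search with an integer default.

    No branch is needed: slicing clamps endpos to len(text).
    """
    index = text[pos:endpos].find(motif)
    return None if index == -1 else pos + index


def search_none_default(text, motif, pos=0, endpos=None):
    """Hypothetical stand-in for a search whose endpos defaults to None."""
    # Compare against None explicitly: a truthiness test such as
    # "if not endpos" would also fire for endpos == 0.
    if endpos is None:
        endpos = len(text)
    index = text[pos:endpos].find(motif)
    return None if index == -1 else pos + index


def check_boundaries(search):
    """Boundary checks for endpos: default, 0, and either side of a match."""
    text = "ACDEFG"
    assert search(text, "DE") == 2               # default covers the whole string
    assert search(text, "DE", endpos=0) is None  # empty search window
    assert search(text, "DE", endpos=3) is None  # window "ACD" cuts the match
    assert search(text, "DE", endpos=4) == 2     # window "ACDE" contains it
    return True


if __name__ == "__main__":
    for candidate in (search_maxsize_default, search_none_default):
        check_boundaries(candidate)
    print("boundary checks passed")
```

Running the same boundary checks against both variants is the point of the sketch: the integer-default version has no branch to get wrong, while the None-default version only passes the endpos = 0 check if the None comparison is explicit.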