From jchang at SMI.Stanford.EDU  Fri Mar  2 09:58:05 2001
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:57 2005
Subject: [Biopython-dev] Upcoming release
In-Reply-To: <15001.57237.430446.843956@taxus.athen1.ga.home.com>
Message-ID: 

Oops, I knew that that was going to mess something up!  Better pain now
than later...

Jeff

On Sun, 25 Feb 2001, Brad Chapman wrote:

> [Problems with unigene tests]
> 
> > I couldn't find a reference to unigene_format in my latest version.
> > unigene_format.py is a fossil from my attempt to mesh it with Martel.
> > Andrew and I agreed that sgmllib would be a better choice for data
> > that makes heavy use of html.
> > 
> > I committed UniGene.py again in case the old version was still in the
> > database.  Let me know if you still have a problem.
> 
> Ah ha!  I think I know the problem.  Jeff moved the UniGene.py code to
> UniGene/__init__.py, since we decided to do that across all of the
> modules to make imports easier.
> UniGene/UniGene.py is officially deleted in CVS, so checkouts
> won't get it -- if you could copy your current code base to
> UniGene/__init__.py and commit that, then hopefully things will work
> again.
> 
> Brad
> 
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev@biopython.org
> http://biopython.org/mailman/listinfo/biopython-dev
> 

From jchang at SMI.Stanford.EDU  Fri Mar  2 10:10:07 2001
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:57 2005
Subject: [Biopython-dev] Upcoming release
In-Reply-To: <15000.63223.754238.992950@taxus.athen1.ga.home.com>
Message-ID: 

Hi Brad,

Thanks for doing this!  I've started running pyunit_testing, and it looks
really nice.  A few things to do before we incorporate it:

- it should respect command line arguments and run only the tests
specified.  For example, if I do "python pyunit_test.py test_translate,"
it should only run the test_translate test.

- in compare_output:
    # normalize the newlines in the two lines
    expected_line = string.strip(expected_line)
    output_line = string.strip(output_line)

It looks like this is stripping the whitespace from the lines before
they're being compared.  This may cause problems, because it won't catch
whitespace-related errors, such as formatting.  Instead, it should
convert the ending newlines into some canonical form, e.g. '\n'.

Are you going to have a chance to fix these *real soon*?  If so, then
perhaps we can squeeze it into this release.

Jeff

On Sun, 25 Feb 2001, Brad Chapman wrote:

> [I was working on a PyUnit framework for integrating the tests]
> 
> Jeff:
> > I'm glad someone's looking seriously into this!  It sounds like
> > something for the next release, though...
> 
> Okay, well I completely ignored your message and worked more on this
> :-).  In the shower this morning I thought of some ways to fix the
> problems we've been having, using the PyUnit framework I posted
> yesterday.
> 
> It seems like I've got the regression comparisons working now, so I
> implemented a "replacement" for br_regrtest.py that uses PyUnit.  The
> only downside of the comparisons now is that it reads the entire
> output into a string and then does the comparison, but I can't ever
> imagine that an output would be so incredibly huge that this would be
> a problem (otherwise the test should probably be split up!).
> 
> I used the fancy pyunit GUI stuff, so now the tests run by default
> with a little Tk GUI (should be nicer on Windows, and especially
> nicer for Macs).
> 
> This all works for me okay on both Unix and Windows.
> 
> What do people think?  Does anyone have time to look at this before
> next week's release, or do you all want to put it off until after?
> 
> BTW, I noticed some problems with the tests while doing this, which I
> can now attribute to actual problems:
> 
> o test_NCBIWWW is failing right now, due to problems in comparing the
> output (and these are not due to newline problems).  I looked in the
> logs to see what had changed, and it looks like Thomas checked in an
> output change, but there wasn't a corresponding change to the
> tests.
> 
> o test_SubsMat -- this seems to be failing on Windows due to the fact
> that Windows prints -0.00 where the expected output is 0.00.  I guess
> this is a Windows/UNIX difference.  It's probably not worth worrying
> about, since -0.00 and 0.00 are the same thing (as far as I know :-).
> 
> Brad
> 
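A minimal sketch of the comparison Jeff is suggesting above: canonicalize
only the line endings, so that trailing-whitespace and formatting errors
are still caught.  The helper name is hypothetical, not the actual
compare_output code:

def lines_match(expected_line, output_line):
    """Compare two lines after normalizing their line endings."""
    def canonical(line):
        # turn any line ending into a single '\n', but leave all other
        # whitespace (including trailing spaces) untouched
        if line.endswith("\r\n"):
            return line[:-2] + "\n"
        if line.endswith("\r"):
            return line[:-1] + "\n"
        return line
    return canonical(expected_line) == canonical(output_line)

assert lines_match("J. \r\n", "J. \n")     # ending style is ignored
assert not lines_match("J.\n", "J. \n")    # a trailing space still differs
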
From jchang at SMI.Stanford.EDU  Fri Mar  2 10:12:04 2001
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:57 2005
Subject: [Biopython-dev] release
Message-ID: 

Hello everybody,

Looks like a good time to do the release.  Please let me know if things
are still in flux.

Brad, how do you make the documentation?  Do you have time to do that,
or should I try and muddle through it?

Thanks,
Jeff

From chapmanb at arches.uga.edu  Fri Mar  2 16:39:00 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:57 2005
Subject: [Biopython-dev] Upcoming release
In-Reply-To: 
References: <15000.63223.754238.992950@taxus.athen1.ga.home.com>
Message-ID: <15008.4852.54946.519338@taxus.athen1.ga.home.com>

Hi Jeff!

[PyUnit based regression test]
> - it should respect command line arguments and run only the tests
> specified.  For example, if I do "python pyunit_test.py test_translate,"
> it should only run the test_translate test.

Good idea!  I missed this feature of br_regrtest.py.  I added this.  It
won't work with the GUI (i.e. if you want a nice GUI, you have to run
all of the tests), but otherwise it seems to work.

> - in compare_output:
>     # normalize the newlines in the two lines
>     expected_line = string.strip(expected_line)
>     output_line = string.strip(output_line)
> 
> It looks like this is stripping the whitespace from the lines before
> they're being compared.  This may cause problems, because it won't
> catch whitespace-related errors, such as formatting.  Instead, it
> should convert the ending newlines into some canonical form, e.g. '\n'.

Okay, I did this, although I don't know if I totally agree.  In my
opinion it should be string.rstrip(expected_line) (I shoulda used rstrip
in the first place!).  The reason I am for this is that I really think
we don't have to be that picky about the whitespace at the end of the
line.  A case in point is that changing this led test_prodoc to fail
with:

Output  : 'J.\n'
Expected: 'J. \n'

This just doesn't seem like a good reason to fail.  But right now you
get your way; I'm just arguing my point :-)

> Are you going to have a chance to fix these *real soon*?  If so, then
> perhaps we can squeeze it into this release.

Okay, is this real soon enough for ya? :-).  Instead of sending that
whole darn thing again, I just checked it into CVS.  I checked in the
file as Tests/run_tests.py, and also included the necessary stuff from
PyUnit.  Let me know if you don't like the names or anything and we can
change 'em.  I just thought this way might give some other people a
chance to CVS update and make sure this testing mechanism works for
them.

I also made a small change to setup.py, so now you can run:

python setup.py test

and run the tests.  This means when installing you can do:

python setup.py build
python setup.py install
python setup.py test

just like perl :-).  The fun never stops with distutils!

Brad
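For anyone curious how "python setup.py test" can be hooked up: a rough
sketch of a distutils test command, assuming a Tests/run_tests.py driver
like the one Brad describes.  The command class and the driver's
interface here are illustrative, not Biopython's actual setup.py:

import os
import sys
from distutils.core import setup, Command

class test_biopython(Command):
    """Hypothetical 'test' command that runs the regression suite."""
    description = "run the regression tests"
    user_options = []    # this sketch takes no options

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        # run everything from inside the Tests directory, the way the
        # regression driver expects
        start_dir = os.getcwd()
        os.chdir("Tests")
        sys.path.insert(0, os.getcwd())
        try:
            import run_tests        # hypothetical driver module
            run_tests.main([])      # run all of the tests
        finally:
            os.chdir(start_dir)

setup(name="example", version="0.1",
      cmdclass={"test": test_biopython})
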
From chapmanb at arches.uga.edu  Fri Mar  2 16:44:27 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:57 2005
Subject: [Biopython-dev] release
In-Reply-To: 
References: 
Message-ID: <15008.5179.165629.931533@taxus.athen1.ga.home.com>

Jeff:
> Looks like a good time to do the release.

Yup, seems good.  I guess there is only one request I have before
release: Can we fix the tests that are failing?  I think it would be
nice if people could install biopython and not have tests failing on
them :-).  It seems like just some minor adjustments are all we need to
do.

> Brad, how do you make the documentation?  Do you have
> time to do that, or should I try and muddle through it?

No problem, I can make it and send you the pdf.  I'll do that
tomorrow.

Brad

From jchang at SMI.Stanford.EDU  Fri Mar  2 19:14:57 2001
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:57 2005
Subject: [Biopython-dev] release
In-Reply-To: <15008.5119.970652.693456@taxus.athen1.ga.home.com>
Message-ID: 

[Brad, for release]
> Can we fix the tests that are failing?  I think it would be nice if
> people could install biopython and not have tests failing on them
> :-).  It seems like just some minor adjustments are all we need to do.

Yeah, definitely.  I thought they were working now?  No, oops, my
test_NCBIWWW and test_prodoc are breaking now.  Dang.  I'll take a look
at these...

> > Brad, how do you make the documentation?  Do you have
> > time to do that, or should I try and muddle through it?
> 
> No problem, I can make it and send you the pdf.  I'll do that
> tomorrow.

Thanks!
Jeff

From katel at worldpath.net  Tue Mar  6 22:14:37 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:57 2005
Subject: [Biopython-dev] reusable code in Genbank
Message-ID: <001d01c0a6b4$bf1f5680$010a0a0a@cadence.com>

Today, with over a foot of snow out the window and still falling, I
wasn't going anywhere, so I decided to take a closer look at GenBank and
Martel and TextTools.  _EventGenerator in GenBank looks like generic
glue code that could be used in my Kabat parser.  I could cut and paste
it, but duplicate code adds to the maintenance load.  Andrew, I suggest
moving it to Tools??

Cayte

From jchang at SMI.Stanford.EDU  Tue Mar  6 20:39:37 2001
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:57 2005
Subject: [Biopython-dev] reusable code in Genbank
In-Reply-To: <001d01c0a6b4$bf1f5680$010a0a0a@cadence.com>
Message-ID: 

Looks more appropriate for Bio/ParserSupport.py.

Jeff

On Tue, 6 Mar 2001, Cayte wrote:

> Today, with over a foot of snow out the window and still falling, I
> wasn't going anywhere, so I decided to take a closer look at GenBank
> and Martel and TextTools.  _EventGenerator in GenBank looks like
> generic glue code that could be used in my Kabat parser.  I could cut
> and paste it, but duplicate code adds to the maintenance load.
> Andrew, I suggest moving it to Tools??
> 
> Cayte
> 
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev@biopython.org
> http://biopython.org/mailman/listinfo/biopython-dev
> 

From dalke at acm.org  Fri Mar  9 01:13:09 2001
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:57 2005
Subject: [Biopython-dev] reusable code in Genbank
Message-ID: <00f401c0a860$04db94c0$2aac323f@josiah>

Cayte and everyone else,

I want to apologize for my lack of response over the last few weeks.  I
looked for, then moved into, my new house, followed closely by a couple
of weeks working in Sweden.  This week is full with three different
conferences, and next week I'll be at another job in San Francisco.  So
I haven't had, and won't soon have, time to reply to anyone's email.

BTW, I presented Martel at the Python conference yesterday.  Talked with
some interested people afterwards about both Martel and Biopython.

Andrew
dalke@acm.org

From chapmanb at arches.uga.edu  Sat Mar 10 14:19:45 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:57 2005
Subject: [Biopython-dev] reusable code in Genbank
In-Reply-To: 
References: <001d01c0a6b4$bf1f5680$010a0a0a@cadence.com>
Message-ID: <15018.32337.572335.52137@taxus.athen1.ga.home.com>

Cayte:
> _EventGenerator in GenBank looks like generic glue code that
> could be used in my Kabat parser.  I could cut and paste it, but
> duplicate code adds to the maintenance load.  Andrew, I suggest moving
> it to Tools??

Jeff:
> Looks more appropriate for Bio/ParserSupport.py.

I'm glad this code can be of more general use.  I just cleaned up
_EventGenerator a bit (it had some ugly GenBank-specific stuff in it)
and moved it into ParserSupport.py, as Jeff suggested.  It's called
EventGenerator(), and GenBank now uses it from ParserSupport.

One thing I was worried about is that EventGenerator is an XML handler,
so it derives from an XML handler class.  I didn't want to make
ParserSupport usable only if the user had XML installed (since
non-Martel-based parsers use ParserSupport too), so EventGenerator will
only be created if the XML import goes okay; otherwise an error message
will be printed.  A sketch of the idea follows this message.

Let me know if this works for your needs, Cayte, and if it seems okay
to everyone else.

Brad
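A rough sketch of the guarded definition Brad describes: EventGenerator
derives from an XML handler class, so it is only defined when the XML
package imports cleanly.  The class body below is illustrative, not the
actual ParserSupport.py code:

try:
    from xml.sax import handler
    _xml_available = 1
except ImportError:
    _xml_available = 0
    print("Warning: EventGenerator is not available -- install the "
          "XML package to use the Martel-based parsers.")

if _xml_available:
    class EventGenerator(handler.ContentHandler):
        """Turn Martel XML callbacks into biopython-style consumer events."""
        def __init__(self, consumer, interest_tags):
            handler.ContentHandler.__init__(self)
            self._consumer = consumer            # has one method per tag
            self._interest_tags = interest_tags  # tags worth reporting
            self._text = ""

        def characters(self, content):
            self._text = self._text + content

        def endElement(self, name):
            # hand the collected text to the consumer method of the same
            # name, if the consumer defines one
            if name in self._interest_tags and hasattr(self._consumer, name):
                getattr(self._consumer, name)(self._text)
            self._text = ""
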
From chapmanb at arches.uga.edu  Sun Mar 11 17:28:54 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:57 2005
Subject: [Biopython-dev] Biopython SwissProt bugs
Message-ID: <15019.64550.202020.918380@taxus.athen1.ga.home.com>

Hi OEyvind;

(My sincerest apologies if I'm mangling your first name)

Thanks for the bug reports for SwissProt (bugs 23 and 24 for those on
the dev list).  I fixed both of the problems, and have attached patches
against the latest CVS version (which should also work with the latest
release 1.00a1).  I'm not positive if they will apply cleanly to
0.90d04, but that might be a good excuse for you to upgrade to the most
recent release :-).

For the RX line problem, it looks to me like SwissProt now uses RX
lines like:

RX   PubMed=9603189;

in some places, where I guess they used to be:

RX   MEDLINE; 85132727.

This confused the parser, so I fixed it to check for and handle these
new cases.

For the second problem with the extra spaces in VARSPLIC lines, I just
added a special case to check for this and correct it.

These seem to work for me on the examples you provided -- if you could
confirm that they work for you, that would be super.

I'm sending the patch to the dev list so Jeff, the master o' SwissProt,
can look it over before I check the changes and added tests into CVS.

Thanks again for the bug reports.

Brad

-------------- next part --------------
*** SProt.py.orig	Sun Mar 11 16:01:22 2001
--- SProt.py	Sun Mar 11 17:14:53 2001
***************
*** 590,595 ****
--- 590,598 ----
      def reference_cross_reference(self, line):
          assert self.data.references, "RX: missing RN"
  
+         # The basic (older?) RX line is of the form:
+         # RX   MEDLINE; 85132727.
+         # but there are variants of this that need to be dealt with (see below)
          # CLD1_HUMAN in Release 39 and DADR_DIDMA in Release 33
          # have extraneous information in the RX line.  Check for
***************
*** 599,609 ****
          if ind >= 0:
              line = line[:ind]
  
!         cols = string.split(line)
!         assert len(cols) == 3, "I don't understand RX line %s" \
!                % line
!         self.data.references[-1].references.append(
!             (self._chomp(cols[1]), self._chomp(cols[2])))
  
      def reference_author(self, line):
          assert self.data.references, "RA: missing RN"
--- 602,634 ----
          if ind >= 0:
              line = line[:ind]
  
!         # RX lines can also be used of the form
!         # RX   PubMed=9603189;
!         # reported by edvard@farmasi.uit.no
!         # and these can be more complicated like:
!         # RX   MEDLINE=95385798; PubMed=7656980;
!         # We look for these cases first and deal with them
!         if string.find(line, "=") != -1:
!             cols = string.split(line)
!             assert len(cols) > 1, "I don't understand RX line %s" % line
! 
!             for info_col in cols[1:]:
!                 id_cols = string.split(info_col, "=")
!                 if len(id_cols) == 2:
!                     self.data.references[-1].references.append(
!                         (self._chomp(id_cols[0]), self._chomp(id_cols[1])))
!                 else:
!                     raise AssertionError("I don't understand RX line %s"
!                                          % line)
!         # otherwise we assume we have the type 'RX   MEDLINE; 85132727.'
!         else:
!             cols = string.split(line)
!             # normally we split into the three parts
!             if len(cols) == 3:
!                 self.data.references[-1].references.append(
!                     (self._chomp(cols[1]), self._chomp(cols[2])))
!             else:
!                 raise AssertionError("I don't understand RX line %s" % line)
  
      def reference_author(self, line):
          assert self.data.references, "RA: missing RN"
***************
*** 677,683 ****
--- 702,741 ----
          name, from_res, to_res, old_description = self.data.features[-1]
          del self.data.features[-1]
          description = "%s %s" % (old_description, description)
+ 
+         # special case -- VARSPLIC, reported by edvard@farmasi.uit.no
+         if name == "VARSPLIC":
+             description = self._fix_varsplic_sequences(description)
          self.data.features.append((name, from_res, to_res, description))
+ 
+     def _fix_varsplic_sequences(self, description):
+         """Remove unwanted spaces in sequences.
+ 
+         During line carryover, the sequences in VARSPLIC can get mangled
+         with unwanted spaces like:
+         'DISSTKLQALPSHGLESIQT -> PCRATGWSPFRRSSPC LPTH'
+         We want to check for this case and correct it as it happens.
+ """ + descr_cols = string.split(description, " -> ") + if len(descr_cols) == 2: + first_seq = descr_cols[0] + second_seq = descr_cols[1] + extra_info = '' + # we might have more information at the end of the + # second sequence, which should be in parenthesis + extra_info_pos = string.find(second_seq, " (") + if extra_info_pos != -1: + extra_info = second_seq[extra_info_pos:] + second_seq = second_seq[:extra_info_pos] + + # now clean spaces out of the first and second string + first_seq = string.replace(first_seq, " ", "") + second_seq = string.replace(second_seq, " ", "") + + # reassemble the description + description = first_seq + " -> " + second_seq + extra_info + + return description def sequence_header(self, line): cols = string.split(line) From chapmanb at arches.uga.edu Sun Mar 11 17:33:37 2001 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:42:57 2005 Subject: [Biopython-dev] Bugs forwarded to the dev list Message-ID: <15019.64833.255440.75502@taxus.athen1.ga.home.com> Hey all; I think I figured out what needs to be done so that bugs submitted to the bug tracking system will be forwarded to the dev list. By grepping over the bioperl-bug to find out what they do, I got an idea. I think we need to modify the file: /home/biopython-bugs/bug_tracking/incoming/.notify so that it contains the line: biopython-dev@biopython.org and I'm guessing that this will do it. I don't have permissions to do this and test it out -- could someone with higher power on the bioperl.org server (Jeff?) test out my hypothesis for me? Thanks much! Brad From jchang at SMI.Stanford.EDU Mon Mar 12 01:07:10 2001 From: jchang at SMI.Stanford.EDU (Jeffrey Chang) Date: Sat Mar 5 14:42:57 2005 Subject: [Biopython-dev] Bugs forwarded to the dev list In-Reply-To: <15019.64833.255440.75502@taxus.athen1.ga.home.com> Message-ID: Yep, I've done this. Let's see if this works... Jeff On Sun, 11 Mar 2001, Brad Chapman wrote: > Hey all; > > I think I figured out what needs to be done so that bugs submitted to > the bug tracking system will be forwarded to the dev list. By grepping > over the bioperl-bug to find out what they do, I got an idea. I think > we need to modify the file: > > /home/biopython-bugs/bug_tracking/incoming/.notify > > so that it contains the line: > > biopython-dev@biopython.org > > and I'm guessing that this will do it. I don't have permissions to do > this and test it out -- could someone with higher power on the bioperl.org > server (Jeff?) test out my hypothesis for me? > > Thanks much! > > Brad > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev@biopython.org > http://biopython.org/mailman/listinfo/biopython-dev > From jchang at SMI.Stanford.EDU Mon Mar 12 01:12:47 2001 From: jchang at SMI.Stanford.EDU (Jeffrey Chang) Date: Sat Mar 5 14:42:57 2005 Subject: [Biopython-dev] Biopython SwissProt bugs In-Reply-To: <15019.64550.202020.918380@taxus.athen1.ga.home.com> Message-ID: Thanks! I've applied this patch against the current CVS. Please test this and close the bug if it works. Also, it would be nice to have a sample entry for the regression testing suite. Jeff On Sun, 11 Mar 2001, Brad Chapman wrote: > Hi OEyvind; > > (My sincerest apologies if I'm mangling your first name) > > Thanks for the bug reports for SwissProt (bugs 23 and 24 for those on > the dev list). I fixed both of the problems, and have attached patches > against the latest CVS version (which should also work with the latest > release 1.00a1). 
> the latest release 1.00a1).  I'm not positive if they will apply
> cleanly to 0.90d04, but that might be a good excuse for you to upgrade
> to the most recent release :-).
> 
> For the RX line problem, it looks to me like SwissProt now uses RX
> lines like:
> 
> RX   PubMed=9603189;
> 
> in some places, where I guess they used to be:
> 
> RX   MEDLINE; 85132727.
> 
> This confused the parser, so I fixed it to check for and handle these
> new cases.
> 
> For the second problem with the extra spaces in VARSPLIC lines, I just
> added a special case to check for this and correct it.
> 
> These seem to work for me on the examples you provided -- if you could
> confirm that they work for you, that would be super.
> 
> I'm sending the patch to the dev list so Jeff, the master o'
> SwissProt, can look it over before I check the changes and added tests
> into CVS.
> 
> Thanks again for the bug reports.
> 
> Brad
> 

From katel at worldpath.net  Wed Mar 14 03:47:53 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:57 2005
Subject: [Biopython-dev] reusable code in Genbank
References: <001d01c0a6b4$bf1f5680$010a0a0a@cadence.com>
	<15018.32337.572335.52137@taxus.athen1.ga.home.com>
Message-ID: <003201c0ac63$786e2cc0$010a0a0a@cadence.com>

----- Original Message -----
From: "Brad Chapman" 
To: 
Sent: Saturday, March 10, 2001 11:19 AM
Subject: Re: [Biopython-dev] reusable code in Genbank

> Cayte:
> > _EventGenerator in GenBank looks like generic glue code that
> > could be used in my Kabat parser.  I could cut and paste it, but
> > duplicate code adds to the maintenance load.  Andrew, I suggest
> > moving it to Tools??
> 
> Jeff:
> > Looks more appropriate for Bio/ParserSupport.py.
> 
> I'm glad this code can be of more general use.  I just cleaned up
> _EventGenerator a bit (it had some ugly GenBank-specific stuff in it)
> and moved it into ParserSupport.py, as Jeff suggested.  It's called
> EventGenerator(), and GenBank now uses it from ParserSupport.
> 
> One thing I was worried about is that EventGenerator is an XML
> handler, so it derives from an XML handler class.  I didn't want to
> make ParserSupport usable only if the user had XML installed (since
> non-Martel-based parsers use ParserSupport too), so EventGenerator
> will only be created if the XML import goes okay; otherwise an error
> message will be printed.

Since this post, I've noticed more opportunities to reuse code.  I
think the class Reference.py in GenBank is close enough to the Kabat
references to use as a base class.  All of us may be able to create
more efficient code by exchanging ideas and refactoring, rather than
just each writing our own code.

Cayte

From chapmanb at arches.uga.edu  Wed Mar 14 05:33:21 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:57 2005
Subject: [Biopython-dev] reusable code in Genbank
In-Reply-To: <003201c0ac63$786e2cc0$010a0a0a@cadence.com>
References: <001d01c0a6b4$bf1f5680$010a0a0a@cadence.com>
	<15018.32337.572335.52137@taxus.athen1.ga.home.com>
	<003201c0ac63$786e2cc0$010a0a0a@cadence.com>
Message-ID: <15023.18673.285503.836646@taxus.athen1.ga.home.com>

Cayte:
> Since this post, I've noticed more opportunities to reuse code.  I
> think the class Reference.py in GenBank is close enough to the Kabat
> references to use as a base class.

For reusing representations of common objects like this, I think the
best way to go is to use the Seq, SeqRecord and SeqFeature stuff.  In
my mind, these are supposed to represent common reusable objects, like
references.  On the other hand, the Reference.py in GenBank is meant to
be an "exactly like GenBank" representation of a reference, for people
who only care about GenBank.  Because of this, this class will change
if GenBank changes.

So, if you want to reuse a reference object, I think the best thing to
do would be to use the Reference class in Bio/SeqFeature.py.  This is
what this class (and everything else in SeqFeature) was designed for;
a short example follows this message.  Let me know if this doesn't work
for what you need.

Brad
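A short example of the reuse Brad is suggesting -- fill in the generic
Bio.SeqFeature.Reference instead of a format-specific class.  The
attribute names follow the dir() listing Brad posts later in this
thread; the values are made up for illustration:

from Bio.SeqFeature import Reference

ref = Reference()
ref.authors = "Doe, J. and Smith, A."     # dummy reference data
ref.title = "An example reference"
ref.journal = "J. Example Biol. 1:1-10"
ref.pubmed_id = "12345"
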
From jchang at SMI.Stanford.EDU  Wed Mar 14 16:47:36 2001
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:57 2005
Subject: [Biopython-dev] SProt.py bug fixes
Message-ID: 

Hello everybody,

I've updated the Swiss-Prot parser to handle the format changes in the
most recent releases of TrEMBL and Swiss-Prot.  This includes:

- allows multiple '=' in RX line (patch from Brad)
- broken VARSPLIC in feature table (patch from Brad)
- allows multiple '=' in RC line
- RL line is now optional
- supports OX line continuations

Jeff

From chapmanb at arches.uga.edu  Wed Mar 14 18:23:04 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:57 2005
Subject: [Biopython-dev] New BLAST web page
Message-ID: <15023.64856.655441.500876@taxus.athen1.ga.home.com>

Hi Jeff;

Thanks for the SwissProt fixes -- I kind of suspected there might be
more changes than what I fixed, but I know nothing about SwissProt,
since I've never used it, so I'm glad an expert got the chance to look
at it!

I was just curious -- have you given any thought to messing around with
the new BLAST page and CGI script?  I started with what you had for the
old BLAST and modified it (what I have so far is below) for the new
format and variables, but am pretty stuck.  What I have right now will
just keep giving me the query page.

I didn't know if you had any suggestions or thoughts on this.  I'm not
sure if I am missing something fundamental, or if the new pages are
just harder to work with.  Thanks!

Brad

def new_blast(program, database, query,
              entrez_query = '(none)',
              filter = 'L',
              expect = '10',
              other_advanced = None,
              show_overview = 'on',
              ncbi_gi = 'on',
              format_object = 'alignment',
              format_type = 'html',
              descriptions = '100',
              alignments = '50',
              alignment_view = 'Pairwise',
              auto_format = 'on',
              cgi='http://www.ncbi.nlm.nih.gov/blast/Blast.cgi',
              timeout = 20):
    """Blast against the NCBI Blast web page.

    This uses the NCBI web page cgi script to BLAST, and returns a handle
    to the results.  See:

    http://www.ncbi.nlm.nih.gov/blast/html/blastcgihelp.html

    for more descriptions about the options.

    Options:
    o program - The name of the blast program to run (ie. blastn, blastx...)
    o database - The database to search against (ie. nr, dbest...)
    o query - The input for the search, which NCBI tries to autodetermine
    the type of.
    o entrez_query - A query to limit the sequences searched against.
    o filter - Filtering for the input sequence.
    o expect - The expect value cutoff to include.
    """
    params = {'PROGRAM' : program,
              'DATABASE' : database,
              'QUERY' : query,
              'ENTREZ_QUERY' : entrez_query,
              'FILTER' : filter,
              'EXPECT' : expect,
              'OTHER_ADVANCED': other_advanced,
              'SHOW_OVERVIEW' : show_overview,
              'NCBI_GI' : ncbi_gi,
              'FORMAT_OBJECT' : format_object,
              'FORMAT_TYPE' : format_type,
              'DESCRIPTIONS' : descriptions,
              'ALIGNMENTS' : alignments,
              'ALIGNMENT_VIEW' : alignment_view,
              'AUTO_FORMAT' : auto_format}
    variables = {}
    for k in params.keys():
        if params[k] is not None:
            variables[k] = str(params[k])
    # This returns a handle to the HTML file that points to the results.
    handle = NCBI._open(cgi, variables, get = 0)
    # Now parse the HTML from the handle and figure out how to retrieve
    # the results.
    refcgi, params = _parse_blast_ref_page(handle, cgi)

    start = time.time()
    while 1:
        # Sometimes the BLAST results aren't done yet.  Look at the page
        # to see if the results are there.  If not, then try again later.
        handle = NCBI._open(cgi, params, get=0)
        ready, results, refresh_delay = _parse_blast_results_page(handle)
        if ready:
            break
        # Time out if it's not done after timeout minutes.
        if time.time() - start > timeout*60:
            raise IOError, "timed out after %d minutes" % timeout
        # pause and try again.
        time.sleep(refresh_delay)
    return File.UndoHandle(File.StringHandle(results))

From katel at worldpath.net  Thu Mar 15 02:48:11 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:57 2005
Subject: [Biopython-dev] reusable code in Genbank
References: <001d01c0a6b4$bf1f5680$010a0a0a@cadence.com>
	<15018.32337.572335.52137@taxus.athen1.ga.home.com>
	<003201c0ac63$786e2cc0$010a0a0a@cadence.com>
	<15023.18673.285503.836646@taxus.athen1.ga.home.com>
Message-ID: <002f01c0ad24$49c2b960$010a0a0a@cadence.com>

----- Original Message -----
From: "Brad Chapman" 
To: 
Sent: Wednesday, March 14, 2001 2:33 AM
Subject: Re: [Biopython-dev] reusable code in Genbank

> Cayte:
> > Since this post, I've noticed more opportunities to reuse code.  I
> > think the class Reference.py in GenBank is close enough to the Kabat
> > references to use as a base class.
> 
> For reusing representations of common objects like this, I think the
> best way to go is to use the Seq, SeqRecord and SeqFeature stuff.  In
> my mind, these are supposed to represent common reusable objects, like
> references.  On the other hand, the Reference.py in GenBank is meant
> to be an "exactly like GenBank" representation of a reference, for
> people who only care about GenBank.  Because of this, this class will
> change if GenBank changes.

Seq.SeqRecord does not contain journal, author, or pubmed number.  Both
Kabat and GenBank references contain these components.

Cayte

From jchang at SMI.Stanford.EDU  Thu Mar 15 00:34:34 2001
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:57 2005
Subject: [Biopython-dev] New BLAST web page
In-Reply-To: <15023.64856.655441.500876@taxus.athen1.ga.home.com>
Message-ID: 

Hi Brad,

Thanks for looking at this!  Looks like our friends at NCBI have changed
the BLAST pages again and added a bunch of new options.  Looking at the
page source for the CGI page, they seem to have added a bunch of hidden
fields that the CGI script isn't happy to be without.

[the pasted HTML snippet showing the hidden <input> fields was eaten by
the list archive]

Just to check, I added a few lines to your new_blast function to see if
it would make the script happier:

    variables = {}
    for k in params.keys():
        if params[k] is not None:
            variables[k] = str(params[k])
    # This returns a handle to the HTML file that points to the results.
    variables['CLIENT'] = 'web'
    variables['SERVICE'] = 'plain'
    variables['PAGE'] = 'Proteins'
    variables['CMD'] = 'Put'
    handle = NCBI._open(cgi, variables, get = 0)

This gets it past the first page, and now gets to the page that tells
you to wait for the results.  However, it's been on that page for a
while, so I don't know if this is completely going to work, or if NCBI
is just slow now!

Jeff

On Wed, 14 Mar 2001, Brad Chapman wrote:

> Hi Jeff;
> 
> Thanks for the SwissProt fixes -- I kind of suspected there might be
> more changes than what I fixed, but I know nothing about SwissProt,
> since I've never used it, so I'm glad an expert got the chance to look
> at it!
> 
> I was just curious -- have you given any thought to messing around
> with the new BLAST page and CGI script?  I started with what you had
> for the old BLAST and modified it (what I have so far is below) for
> the new format and variables, but am pretty stuck.  What I have right
> now will just keep giving me the query page.
> 
> I didn't know if you had any suggestions or thoughts on this.  I'm not
> sure if I am missing something fundamental, or if the new pages are
> just harder to work with.  Thanks!
> 
> Brad
> 
> def new_blast(program, database, query,
>               entrez_query = '(none)',
>               filter = 'L',
>               expect = '10',
>               other_advanced = None,
>               show_overview = 'on',
>               ncbi_gi = 'on',
>               format_object = 'alignment',
>               format_type = 'html',
>               descriptions = '100',
>               alignments = '50',
>               alignment_view = 'Pairwise',
>               auto_format = 'on',
>               cgi='http://www.ncbi.nlm.nih.gov/blast/Blast.cgi',
>               timeout = 20):
>     """Blast against the NCBI Blast web page.
> 
>     This uses the NCBI web page cgi script to BLAST, and returns a
>     handle to the results.  See:
> 
>     http://www.ncbi.nlm.nih.gov/blast/html/blastcgihelp.html
> 
>     for more descriptions about the options.
> 
>     Options:
>     o program - The name of the blast program to run (ie. blastn, blastx...)
>     o database - The database to search against (ie. nr, dbest...)
>     o query - The input for the search, which NCBI tries to autodetermine
>     the type of.
>     o entrez_query - A query to limit the sequences searched against.
>     o filter - Filtering for the input sequence.
>     o expect - The expect value cutoff to include.
>     """
>     params = {'PROGRAM' : program,
>               'DATABASE' : database,
>               'QUERY' : query,
>               'ENTREZ_QUERY' : entrez_query,
>               'FILTER' : filter,
>               'EXPECT' : expect,
>               'OTHER_ADVANCED': other_advanced,
>               'SHOW_OVERVIEW' : show_overview,
>               'NCBI_GI' : ncbi_gi,
>               'FORMAT_OBJECT' : format_object,
>               'FORMAT_TYPE' : format_type,
>               'DESCRIPTIONS' : descriptions,
>               'ALIGNMENTS' : alignments,
>               'ALIGNMENT_VIEW' : alignment_view,
>               'AUTO_FORMAT' : auto_format}
>     variables = {}
>     for k in params.keys():
>         if params[k] is not None:
>             variables[k] = str(params[k])
>     # This returns a handle to the HTML file that points to the results.
>     handle = NCBI._open(cgi, variables, get = 0)
>     # Now parse the HTML from the handle and figure out how to retrieve
>     # the results.
>     refcgi, params = _parse_blast_ref_page(handle, cgi)
> 
>     start = time.time()
>     while 1:
>         # Sometimes the BLAST results aren't done yet.  Look at the
>         # page to see if the results are there.  If not, then try
>         # again later.
>         handle = NCBI._open(cgi, params, get=0)
>         ready, results, refresh_delay = _parse_blast_results_page(handle)
>         if ready:
>             break
>         # Time out if it's not done after timeout minutes.
>         if time.time() - start > timeout*60:
>             raise IOError, "timed out after %d minutes" % timeout
>         # pause and try again.
>         time.sleep(refresh_delay)
>     return File.UndoHandle(File.StringHandle(results))
> 
> 
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev@biopython.org
> http://biopython.org/mailman/listinfo/biopython-dev
> 

From chapmanb at arches.uga.edu  Thu Mar 15 02:02:34 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:57 2005
Subject: [Biopython-dev] reusable code in Genbank
In-Reply-To: <002f01c0ad24$49c2b960$010a0a0a@cadence.com>
References: <001d01c0a6b4$bf1f5680$010a0a0a@cadence.com>
	<15018.32337.572335.52137@taxus.athen1.ga.home.com>
	<003201c0ac63$786e2cc0$010a0a0a@cadence.com>
	<15023.18673.285503.836646@taxus.athen1.ga.home.com>
	<002f01c0ad24$49c2b960$010a0a0a@cadence.com>
Message-ID: <15024.26890.637062.60332@taxus.athen1.ga.home.com>

Cayte:
> Seq.SeqRecord does not contain journal, author, or pubmed number.
> Both Kabat and GenBank references contain these components.

Right, SeqRecord is a higher level class, since it needs to hold all
information about a sequence.  For the reference stuff, you need to
look at the Bio.SeqFeature.Reference class:

>>> from Bio.SeqFeature import Reference
>>> my_ref = Reference()
>>> dir(my_ref)
['authors', 'comment', 'journal', 'location', 'medline_id',
'pubmed_id', 'title']

Brad

From chapmanb at arches.uga.edu  Sun Mar 18 09:23:43 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:58 2005
Subject: [Biopython-dev] New BLAST web page
In-Reply-To: 
References: <15023.64856.655441.500876@taxus.athen1.ga.home.com>
Message-ID: <15028.50415.233255.76304@taxus.athen1.ga.home.com>

Hi Jeff;

Thanks for helping me out with this!

> Looks like our friends at NCBI have changed
> the BLAST pages again and added a bunch of new options.  Looking at
> the page source for the CGI page, they seem to have added a bunch of
> hidden fields that the CGI script isn't happy to be without.

Oooh tricky, hidden fields.  Thanks for the pointer on these; this gave
me the push I needed to get past where I was stuck.

> This gets it past the first page, and now gets to the page that tells
> you to wait for the results.  However, it's been on that page for a
> while, so I don't know if this is completely going to work, or if
> NCBI is just slow now!

Well, it didn't completely finish us up, but getting me past getting
back the same query page was all I needed :-).  Quite a few things have
changed in parsing the pages.  Since they are now using javascript
(bleah!), some of the information is in new places.

But, I think I got it all sorted out and have got things working.
Attached is a patch against the current CVS which seems to get
NCBIWWW.blast working for me again.  If this works well for people,
I'll be happy to check it in.

Also included in the patch is a change to the WWW parser.  It looks
like the format has changed yet again -- they now appear to be putting
the database before the Query=.  The new version seems to parse the new
stuff correctly, and passed all of the tests with the old versions.

So, with this patch it seems like we work again with NCBI Blast.  Whew!
I hope this works right for everyone else.

Thanks again for the help, Jeff!

Brad

-------------- next part --------------
*** NCBIWWW.py.orig	Sat Feb 10 08:32:39 2001
--- NCBIWWW.py	Sun Mar 18 09:09:05 2001
***************
*** 153,164 ****
--- 153,195 ----
          # Brad Chapman noticed a '<p>' line in BLASTN 2.1.1
          attempt_read_and_call(uhandle, consumer.noevent, start='<p>')
  
+         # 2.1.2 has the database right and blastform after the RID
+         database_read = 0
+         if attempt_read_and_call(uhandle, consumer.noevent, start = '<p>'):
+             self._scan_database_info(uhandle, consumer)
+             # read until we get to a <p> before the Query=
+             read_and_call_until(uhandle, consumer.noevent, start = '<p>')
+             read_and_call(uhandle, consumer.noevent, start = '<p>')
+             database_read = 1
+ 
          # Read the Query lines and the following blank line.
          read_and_call(uhandle, consumer.query_info, contains='Query=')
          read_and_call_until(uhandle, consumer.query_info, blank=1)
          read_and_call_while(uhandle, consumer.noevent, blank=1)
  
          # Read the database lines and the following blank line.
+         # only read the database if it hasn't already been read
+         if not(database_read):
+             self._scan_database_info(uhandle, consumer)
+ 
+         # Read the blast form, if it exists.
+         if attempt_read_and_call(uhandle, consumer.noevent,
+                                  contains='BLASTFORM'):
+             read_and_call_until(uhandle, consumer.noevent, blank=1)
+         elif attempt_read_and_call(uhandle, consumer.noevent,
+                                    start='<p>'):
+                 read_and_call_until(uhandle, consumer.noevent, blank=1)
+         # otherwise we'll need to scan a <pre> tag
+         else:
+             read_and_call(uhandle, consumer.noevent, start = '<pre>')
+ 
+ 
+         # Read the blank lines until the next section.
+         read_and_call_while(uhandle, consumer.noevent, blank=1)
+ 
+         consumer.end_header()
+ 
+     def _scan_database_info(self, uhandle, consumer):
          attempt_read_and_call(uhandle, consumer.noevent, start='<p>')
          read_and_call(uhandle, consumer.database_info, contains='Database')
          read_and_call(uhandle, consumer.database_info, contains='sequences')
***************
*** 166,183 ****
          read_and_call(uhandle, consumer.noevent,
                        contains='problems or questions')
  
-         # Read the blast form, if it exists.
-         if attempt_read_and_call(uhandle, consumer.noevent,
-                                  contains='BLASTFORM'):
-             read_and_call_until(uhandle, consumer.noevent, blank=1)
-         elif attempt_read_and_call(uhandle, consumer.noevent, start='<p>'):
-             read_and_call_until(uhandle, consumer.noevent, blank=1)
- 
-         # Read the blank lines until the next section.
-         read_and_call_while(uhandle, consumer.noevent, blank=1)
- 
-         consumer.end_header()
- 
      def _scan_rounds(self, uhandle, consumer):
          self._scan_descriptions(uhandle, consumer)
          self._scan_alignments(uhandle, consumer)
--- 197,202 ----
***************
*** 530,566 ****
  
          consumer.end_parameters()
  
  
! def blast(program, datalib, sequence,
!           input_type='Sequence in FASTA format',
!           double_window=None, gi_list='(None)',
!           list_org = None, expect='10',
!           filter='L', genetic_code='Standard (1)',
!           mat_param='PAM30     9       1',
!           other_advanced=None, ncbi_gi=None, overview=None,
!           alignment_view='0', descriptions=None, alignments=None,
!           email=None, path=None, html=None, 
!           cgi='http://www.ncbi.nlm.nih.gov/blast/blast.cgi',
!           timeout=20
!           ):
!     """blast(program, datalib, sequence,
!     input_type='Sequence in FASTA format',
!     double_window=None, gi_list='(None)',
!     list_org = None, expect='10',
!     filter='L', genetic_code='Standard (1)',
!     mat_param='PAM30     9       1',
!     other_advanced=None, ncbi_gi=None, overview=None,
!     alignment_view='0', descriptions=None, alignments=None,
!     email=None, path=None, html=None, 
!     cgi='http://www.ncbi.nlm.nih.gov/blast/blast.cgi',
!     timeout=20) -> handle
! 
!     Do a BLAST search against NCBI.  Returns a handle to the results.
!     timeout is the number of seconds to wait for the results before timing
!     out.  The other parameters are provided to BLAST.  A description
!     can be found online at:
!     http://www.ncbi.nlm.nih.gov/BLAST/newoptions.html
  
      """
      # NCBI Blast is hard to work with.  The user enters a query, and then
      # it returns a "reference" page which contains a button that the user
--- 549,605 ----
  
          consumer.end_parameters()
  
+ def blast(program, database, query,
+           entrez_query = '(none)',
+           filter = 'L',
+           expect = '10',
+           word_size = None,
+           ungapped_alignment = 'no',
+           other_advanced = None,
+           cdd_search = 'on',
+           composition_based_statistics = None,
+           matrix_name = None,
+           run_psiblast = None,
+           i_thresh = '0.001',
+           genetic_code = '1',
+           show_overview = 'on',
+           ncbi_gi = 'on',
+           format_object = 'alignment',
+           format_type = 'html',
+           descriptions = '100',
+           alignments = '50',
+           alignment_view = 'Pairwise',
+           auto_format = 'on',
+           cgi='http://www.ncbi.nlm.nih.gov/blast/Blast.cgi',
+           timeout = 20):
+     """Blast against the NCBI Blast web page.
+ 
+     This uses the NCBI web page cgi script to BLAST, and returns a handle
+     to the results. See:
+     
+     http://www.ncbi.nlm.nih.gov/blast/html/blastcgihelp.html
+ 
+     for more descriptions about the options.
+ 
+     Required Inputs:
+     o program - The name of the blast program to run (ie. blastn, blastx...)
+     o database - The database to search against (ie. nr, dbest...)
+     o query - The input for the search, which NCBI tries to autodetermine
+     the type of. Ideally, this would be a sequence in FASTA format.
+ 
+     General Options:
+     filter, expect, word_size, other_advanced
+ 
+     Formatting Options:
+     show_overview, ncbi_gi, format_object, format_type, descriptions,
+     alignments, alignment_view, auto_format
  
!     Protein specific options:
!     cdd_search, composition_based_statistics, matrix_name, run_psiblast,
!     i_thresh
  
+     Translated specific options:
+     genetic code
      """
      # NCBI Blast is hard to work with.  The user enters a query, and then
      # it returns a "reference" page which contains a button that the user
***************
*** 571,619 ****
      # page to figure out how to retrieve the results.  Then, it needs to
      # check the results to see if the search has been finished.
      params = {'PROGRAM' : program,
!               'DATALIB' : datalib,
!               'SEQUENCE' : sequence,
!               'DOUBLE_WINDOW' : double_window,
!               'GI_LIST' : gi_list,
!               'LIST_ORG' : list_org,
!               'INPUT_TYPE' : input_type,
!               'EXPECT' : expect,
                'FILTER' : filter,
                'GENETIC_CODE' : genetic_code,
!               'MAT_PARAM' : mat_param,
!               'OTHER_ADVANCED' : other_advanced,
                'NCBI_GI' : ncbi_gi,
!               'OVERVIEW' : overview,
!               'ALIGNMENT_VIEW' : alignment_view,
                'DESCRIPTIONS' : descriptions,
                'ALIGNMENTS' : alignments,
!               'EMAIL' : email,
!               'PATH' : path,
!               'HTML' : html
!               }
      variables = {}
      for k in params.keys():
          if params[k] is not None:
              variables[k] = str(params[k])
      # This returns a handle to the HTML file that points to the results.
!     handle = NCBI._open(cgi, variables, get=0)
      # Now parse the HTML from the handle and figure out how to retrieve
      # the results.
      refcgi, params = _parse_blast_ref_page(handle, cgi)
  
      start = time.time()
      while 1:
          # Sometimes the BLAST results aren't done yet.  Look at the page
          # to see if the results are there.  If not, then try again later.
          handle = NCBI._open(cgi, params, get=0)
!         ready, results, refresh_delay = _parse_blast_results_page(handle)
          if ready:
              break
          # Time out if it's not done after timeout minutes.
          if time.time() - start > timeout*60:
              raise IOError, "timed out after %d minutes" % timeout
!         # pause and try again.
!         time.sleep(refresh_delay)
      return File.UndoHandle(File.StringHandle(results))
  
  def _parse_blast_ref_page(handle, base_cgi):
--- 610,691 ----
      # page to figure out how to retrieve the results.  Then, it needs to
      # check the results to see if the search has been finished.
      params = {'PROGRAM' : program,
!               'DATABASE' : database,
!               'QUERY' : query,
!               'ENTREZ_QUERY' : entrez_query,
                'FILTER' : filter,
+               'EXPECT' : expect,
+               'WORD_SIZE' : word_size,
+               'UNGAPPED_ALIGNMENT' : ungapped_alignment,
+               'OTHER_ADVANCED': other_advanced,
+               'CDD_SEARCH' : cdd_search,
+               'COMPOSITION_BASED_STATISTICS' : composition_based_statistics,
+               'MATRIX_NAME' : matrix_name,
+               'RUN_PSIBLAST' : run_psiblast,
+               'I_THRESH' : i_thresh,
                'GENETIC_CODE' : genetic_code,
!               'SHOW_OVERVIEW' : show_overview,
                'NCBI_GI' : ncbi_gi,
!               'FORMAT_OBJECT' : format_object,
!               'FORMAT_TYPE' : format_type,
                'DESCRIPTIONS' : descriptions,
                'ALIGNMENTS' : alignments,
!               'ALIGNMENT_VIEW' : alignment_view,
!               'AUTO_FORMAT' : auto_format}
      variables = {}
      for k in params.keys():
          if params[k] is not None:
              variables[k] = str(params[k])
+             
+     variables['CLIENT'] = 'web'
+     variables['SERVICE'] = 'plain'
+     variables['CMD'] = 'Put'
+ 
+     if program.upper() == 'BLASTN':
+         variables['PAGE'] = 'Nucleotides'
+     elif program.upper() == 'BLASTP':
+         variables['PAGE'] = 'Proteins'
+     elif program.upper() in ['BLASTX', 'TBLASTN','TBLASTX']:
+         variables['PAGE'] = 'Translations'
+     else:
+         raise ValueError("Unexpected program name %s" % program)
+         
      # This returns a handle to the HTML file that points to the results.
!     handle = NCBI._open(cgi, variables, get = 0)
      # Now parse the HTML from the handle and figure out how to retrieve
      # the results.
      refcgi, params = _parse_blast_ref_page(handle, cgi)
  
+     # start with the initial recommended delay. Otherwise we get hit with
+     # an extra long delay right away
+     if params.has_key("RTOE"):
+         refresh_delay = int(params["RTOE"]) + 1
+         del params["RTOE"]
+     else:
+         refresh_delay = 5
+ 
+     cgi = refcgi
      start = time.time()
      while 1:
+         # pause before trying to get the results
+         time.sleep(refresh_delay)
+         
          # Sometimes the BLAST results aren't done yet.  Look at the page
          # to see if the results are there.  If not, then try again later.
          handle = NCBI._open(cgi, params, get=0)
!         ready, results, refresh_delay, cgi = _parse_blast_results_page(handle)
!         
          if ready:
              break
          # Time out if it's not done after timeout minutes.
          if time.time() - start > timeout*60:
              raise IOError, "timed out after %d minutes" % timeout
! 
!     # now get the results page and return it
!     # -- the "ready" page from before is just a check page
!     result_handle = NCBI._open(refcgi, params, get=0)
!     results = result_handle.read()
!     
      return File.UndoHandle(File.StringHandle(results))
  
  def _parse_blast_ref_page(handle, base_cgi):
***************
*** 635,654 ****
                  if attr == 'ACTION':
                      self.cgi = urlparse.urljoin(self.cgi, value)
          def do_input(self, attributes):
!             # parse the "INPUT" tags to try and find the reference ID (RID)
!             is_rid = 0
!             rid = None
              for attr, value in attributes:
                  attr, value = string.upper(attr), string.upper(value)
!                 if attr == 'NAME' and value == 'RID':
!                     is_rid = 1
                  elif attr == 'VALUE':
!                     rid = value
!             if is_rid and rid:
!                 self.params['RID'] = rid
                  
      parser = RefPageParser(base_cgi)
!     parser.feed(handle.read())
      if not parser.params.has_key('RID'):
          raise SyntaxError, "Error getting BLAST results: RID not found"
      return parser.cgi, parser.params
--- 707,731 ----
                  if attr == 'ACTION':
                      self.cgi = urlparse.urljoin(self.cgi, value)
          def do_input(self, attributes):
!             # parse out all of the different inputs we are interested in
!             inputs = ["RID", "RTOE", "CLIENT", "CMD", "PAGE",
!                       "EXPECT", "DESCRIPTIONS", "ALIGNMENTS", "AUTO_FORMAT"]
! 
!             cur_input = None
!             
              for attr, value in attributes:
                  attr, value = string.upper(attr), string.upper(value)
!                 if attr == 'NAME' and value in inputs:
!                     cur_input = value
                  elif attr == 'VALUE':
!                     if cur_input is not None:
!                         if value:
!                             self.params[cur_input] = value
                  
      parser = RefPageParser(base_cgi)
!     html_info = handle.read()
!     
!     parser.feed(html_info)
      if not parser.params.has_key('RID'):
          raise SyntaxError, "Error getting BLAST results: RID not found"
      return parser.cgi, parser.params
***************
*** 659,679 ****
          def __init__(self):
              sgmllib.SGMLParser.__init__(self)
              self.ready = 0
              self.refresh = 5
          def handle_comment(self, comment):
!             comment = string.lower(comment)
!             if string.find(comment, 'status=ready') >= 0:
                  self.ready = 1
          _refresh_re = re.compile('REFRESH_DELAY=(\d+)', re.IGNORECASE)
!         def do_meta(self, attributes):
!             for attr, value in attributes:
!                 m = self._refresh_re.search(value)
!                 if m:
!                     self.refresh = int(m.group(1))
      results = handle.read()
      parser = ResultsParser()
      parser.feed(results)
!     return parser.ready, results, parser.refresh
  
  
  def blasturl(program, datalib, sequence,
--- 736,794 ----
          def __init__(self):
              sgmllib.SGMLParser.__init__(self)
              self.ready = 0
+             self.refresh_cgi = None
              self.refresh = 5
+ 
          def handle_comment(self, comment):
!             # determine if it is ready
!             if string.find(comment.lower(), 'status=ready') >= 0:
                  self.ready = 1
+             # otherwise, we need to parse for the delay and url
+             elif string.find(comment, 'location.href') >= 0:
+                 self.refresh_cgi, self.refresh = self._find_cgi_info(comment)
+ 
          _refresh_re = re.compile('REFRESH_DELAY=(\d+)', re.IGNORECASE)
!         def _find_cgi_info(self, comment):
!             """Find the refresh CGI string and refresh delay from a comment.
! 
!             We are parsing a comment string like:
!             setTimeout('location.href =
!             "http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?
!             CMD=Get&RID=984874645-19210-15659&CHECK_STATUS_ONLY=yes&
!             REFRESH_DELAY=106&AUTO_FORMAT=yes&KEY=20111";',106000);
! 
!             Arguments:
! 
!             o comment - A comment which is assumed to have been checked to
!             have the refresh delay cgi string in it.
!             """
!             # find where the cgi string starts
!             href_string = 'location.href = "'
!             cgi_start_pos = string.find(comment, href_string)
!             assert cgi_start_pos is not -1, \
!                    "Unable to parse the start of the refresh cgi."
!             # the cgi starts at the end of the location.href stuff
!             cgi_start_pos += len(href_string)
! 
!             # find the end pos of the cgi string
!             cgi_end_pos = string.find(comment, '"', cgi_start_pos)
!             assert cgi_end_pos is not -1, \
!                    "Unable to parse end of refresh cgi."
! 
!             refresh_cgi = comment[cgi_start_pos:cgi_end_pos]
! 
!             # parse the refresh delay out of the comment
!             m = self._refresh_re.search(refresh_cgi)
!             assert m, "Failed to parse refresh time from %s" % refresh_cgi
!             refresh = int(m.group(1))
! 
!             return refresh_cgi, refresh
!                     
      results = handle.read()
+     
      parser = ResultsParser()
      parser.feed(results)
!     return parser.ready, results, parser.refresh, parser.refresh_cgi
  
  
  def blasturl(program, datalib, sequence,
From jchang at SMI.Stanford.EDU  Sun Mar 18 19:13:17 2001
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:58 2005
Subject: [Biopython-dev] New BLAST web page
In-Reply-To: <15028.50415.233255.76304@taxus.athen1.ga.home.com>
Message-ID: 

Hi Brad,

Thanks for these fixes.  I've applied the patches and am trying to search
for sequences against the NCBI web site.  Unfortunately, I seem to be only
getting a piece of the results back!  The output cuts off suddenly in the
alignments section.

----------------
[...]
sp|P15398|RPA1_SCHPO
DNA-DIRECTED RNA POLYMERASE I 190 KDA POLYPEPTIDE
          Length = 1689

 Score = 28.9 bits (63), Expect = 9.0
 Identities = 12/38 (31%), Positives 

------------------

However, the web version seems to be doing the same thing, so it's very
likely not your patches.  Are you seeing the same thing?

I'm going to wait and see if this clears up over the next few days, before
I commit the patch.

Thanks,
Jeff



On Sun, 18 Mar 2001, Brad Chapman wrote:

> Hi Jeff;
> Thanks for helping me out with this! 
> 
> > Looks like our friends at NCBI have changed
> > the BLAST pages again and added a bunch of new options.  Looking at the
> > page source for the CGI page, they seem to have added a bunch of hidden
> > fields that the CGI script isn't happy to be without.
> 
> Oooh tricky, hidden fields.  Thanks for the pointer on these; this
> gave me the push I needed to get past where I was stuck.
> 
> > This gets it past the first page, and now gets to the page that tells you
> > to wait for the results.  However, it's been on that page for a while, so
> > I don't know if this is completely going to work, or if NCBI is just slow
> > now!
> 
> Well, it didn't completely finish us up, but getting me past getting
> back the same query page was all I needed :-). Quite a few things have 
> changed in parsing the pages. Since they are now using javascript
> (bleah!), some of the information is in new places.
> 
> But, I think I got it all sorted out and have got things working.
> Attached is a patch against the current CVS which seems to get
> NCBIWWW.blast working for me again.  If this works well for people,
> I'll be happy to check it in.
> 
> Also included in the patch is a change to the WWW parser.  It looks
> like
> the format has changed yet again -- they now appear to be putting the
> database before the Query=. The new version seems to parse the new
> stuff correctly, and passed all of the tests with the old versions.
> 
> So, with this patch it seems like we work again with NCBI Blast. Whew!
> I hope this works right for everyone else.
> 
> Thanks again for the help, Jeff!
> 
> Brad
> 
> 


From chapmanb at arches.uga.edu  Sun Mar 18 23:17:33 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:58 2005
Subject: [Biopython-dev] New BLAST web page
In-Reply-To: 
References: <15028.50415.233255.76304@taxus.athen1.ga.home.com>
	
Message-ID: <15029.34909.670828.465569@taxus.athen1.ga.home.com>

Hi Jeff;

> Thanks for these fixes.  I've applied the patches and am trying to search
> for sequences against the NCBI web site.  Unfortunately, I seem to be only
> getting a piece of the results back!  The output cuts off suddenly in the
> alignments section.

[...snip...]

> However, the web version seems to be doing the same thing, so it's very
> likely not your patches.  Are you seeing the same thing?

Hmmm, I'm not seeing this... The entire thing seems to come through
for me. I'm using the Doc/examples/www_blast.py script to do the
testing on this, if that is any help to you. NCBI was timing out
a few times for me today, so I guess things might be flakey over
there. I'll blame it on all of the web sysadmins being up too late 
last night drinking green beer :-)

> I'm going to wait and see if this clears up over the next few days, before
> I commit the patch.

Let me know if it doesn't get any better for you. It is no problem to
wait on the patch -- I'd much rather be sure it is working right!
Thanks for looking at it. 

Talk to you soon.
Brad


From dalke at acm.org  Mon Mar 19 06:41:54 2001
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:58 2005
Subject: [Biopython-dev] ANN: mindy-0.1
Message-ID: <001101c0b069$9d8b9880$81ac323f@josiah>

WARNING! First attempt at a generalized indexer for bioinformatics
formats using Martel.  All code experimental and subject to change
with even less notice than usual!

Code is available from
   http://www.biopython.org/~dalke/mindy-0.1.tar.gz


For the last few weeks I've been thinking about how to use Martel as
part of a generalized database indexer.  Martel of course does all the
required parsing, so it's a matter of converting the results into some
indexable format.

My first idea was to use the iterator interface to pull out a record
then pass it to XSLT to convert it into an indexed form.  E.g.

 <entry>
  <entry_name>1433_CAEEL</entry_name>
  <ac_number>P41932</ac_number>
  <ac_number>Q21537</ac_number>
   ...
 </entry>

However, that proved slow because
  -  my XSLTs were taking roughly a second to process a record
       (although that was to convert the record into a SProt-like
        data structure; ie, convert all data into purely semantic XML)
  - I still don't know how to use XSLT very effectively
  - the iterator interface is built on top of the callback interface
       so is slower

Instead I acknowledged the common case where all needed fields
are strictly contained in an element and wrote a content handler which
lets you say something like

   The primary identifier is the content of the 'entry_name' element
   The record contains aliases, which are located in the 'ac_number'
   element.
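
To make that concrete, here's a minimal sketch of such a handler --
not mindy's actual code, and the class name is made up -- written
against the plain SAX API:

from xml.sax import handler

class IdAliasHandler(handler.ContentHandler):
    # Collect the text of 'entry_name' (the primary identifier)
    # and 'ac_number' (aliases) elements from a record.
    def __init__(self):
        handler.ContentHandler.__init__(self)
        self.identifier = None
        self.aliases = []
        self._inside = None   # element currently being captured
        self._text = ""

    def startElement(self, name, attrs):
        if name == "entry_name" or name == "ac_number":
            self._inside = name
            self._text = ""

    def characters(self, content):
        if self._inside is not None:
            self._text = self._text + content

    def endElement(self, name):
        if name == self._inside:
            if name == "entry_name":
                self.identifier = self._text
            else:
                self.aliases.append(self._text)
            self._inside = None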

I then made a command-line interface which works like

% mindy_index.py --format Martel.formats.swissprot38.format \
  --record-tag swissprot38_record --dbname swiss \
  --identifier entry_name --alias ac_number --progress 100 \
  /home/dalke/ftps/swissprot/sprot38.dat

% mindy_search --dbname swiss --identifier 100K_RAT
ID   100K_RAT       STANDARD;      PRT;   889 AA.
AC   Q62671;
DT   01-NOV-1997 (Rel. 35, Created)
 ...


The indexing system uses Robin Dunn's bsddb3 interface on top of
Sleepycat Berkeley DB package.  You can get them from (respectively)

  http://pybsddb.sourceforge.net/
  http://www.sleepycat.com/download.html


To index my copy of swissprot38 using
  entry_name as the primary identifier
  ac_number as an alias
took
597.380u 145.080s 14:31.19 85.2%        0+0k 0+0io 55346pf+0w

so just under 15 minutes.  From previous timings, reading the database
and getting the id, ac and sequence fields takes about 9 minutes, so
the overhead specifically for indexing is roughly 6 minutes, or about 40%
of the total time.  This is likely due to my inexperience in working with BSDDB
and can likely be reduced by a few minutes.

The final index data size is a bit over 10MB:

% ls -l .mindy_dbhome/
total 9052
-rw-r-----    1 dalke    users        8192 Mar 19 11:08 __db.001
-rw-r-----    1 dalke    users      270336 Mar 19 11:08 __db.002
-rw-r-----    1 dalke    users      319488 Mar 19 11:08 __db.003
-rw-r--r--    1 dalke    users    10645504 Mar 19 11:08 swiss


The lookup time is very fast.  The command-line test is actually
limited by python's startup time.  I haven't tried timing inside of
Python.

% time env PYTHONPATH=/home/dalke/src python mindy_search.py \
  --dbname swiss --identifier YU13_MYCTU --show-record=0 > /dev/null
0.130u 0.020s 0:00.19 78.9%     0+0k 0+0io 596pf+0w

% time env PYTHONPATH=/home/dalke/src python mindy_search.py \
  --dbname swiss --identifier YU13_MYCTU --show-record=1 > /dev/null
0.130u 0.030s 0:00.19 84.2%     0+0k 0+0io 597pf+0w

For details on how to use the programs, run
  mindy_index.py --help
  mindy_search.py --help


The name "mindy" is derived from "Martel INDexer".


BUGS/TO DO/THOUGHTS:

There is no attempt at normalization, so searches are case and
whitespace sensitive.  This is easy to fix for the common case of
"string.lower everything and toss all ignorable whitespace".  I just
haven't done it.
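
A sketch of that normalization, for the record (just the idea, not
code that's in mindy):

import string

def normalize_key(text):
    # string.lower everything and toss all ignorable whitespace
    return string.join(string.split(string.lower(text)), " ")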

The XSLT and Python function caller indexers are not implemented.

Most things aren't documented.

Haven't tested support for dealing with multiple files.

Would working with compressed files be useful?  (Even if slower for
record retrieval?)

Would like to be able to add new files to a database.

Would like to remove/update files in a database.

Would like to spawn off multiple indexers to take advantage of
multiprocessor machines - perhaps one indexer per file?  BSDDB can
support this sort of interface.

Haven't done any performance tuning.  Indeed, this is my first use of
BSDDB.

Haven't tested the 'keywords' section.

Could add a simple query language....

...But then more general-purpose tools should be used (mySQL?
PostgreSQL?)

What about categories, like:
  name/* for any name
  name/swissprot-id for a swissprot-id
  reference/title contains "sequence analysis"
  xref/embl/embl-id is U05038
Okay, those really need a real database, although DOM/XPATH can
  handle some of them.  Hmm, see eXist.sourceforge.net and no
  doubt others.

Bugs section is incomplete :)

Enjoy!

                    Andrew
                    dalke@acm.org
P.S.
  I'm back from all my travels so I'll be catching up on
things (back email, bills, etc.) over the next few days.
Just thought you all would like to know if I end up sending
replies to old messages :)



From jchang at SMI.Stanford.EDU  Tue Mar 20 12:15:36 2001
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:58 2005
Subject: [Biopython-dev] New BLAST web page
In-Reply-To: <15029.34909.670828.465569@taxus.athen1.ga.home.com>
Message-ID: 

Everything seems to have cleared up now.  I've committed the patch, and
blast now works against the NCBI web site again.  Thanks for the work,
Brad!

Jeff



On Sun, 18 Mar 2001, Brad Chapman wrote:

> Hi Jeff;
> 
> > Thanks for these fixes.  I've applied the patches and am trying to search
> > for sequences against the NCBI web site.  Unfortunately, I seem to be only
> > getting a piece of the results back!  The output cuts off suddenly in the
> > alignments section.
> 
> [...snip...]
> 
> > However, the web version seems to be doing the same thing, so it's very
> > likely not your patches.  Are you seeing the same thing?
> 
> Hmmm, I'm not seeing this... The entire thing seems to come through
> for me. I'm using the Doc/examples/www_blast.py script to do the
> testing on this, if that is any help to you. NCBI was timing out
> a few times for me today, so I guess things might be flakey over
> there. I'll blame it on all of the web sysadmins being up too late 
> last night drinking green beer :-)
> 
> > I'm going to wait and see if this clears up over the next few days, before
> > I commit the patch.
> 
> Let me know if it doesn't get any better for you. It is no problem to
> wait on the patch -- I'd much rather be sure it is working right!
> Thanks for looking at it. 
> 
> Talk to you soon.
> Brad
> 
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev@biopython.org
> http://biopython.org/mailman/listinfo/biopython-dev
> 


From thomas at cbs.dtu.dk  Tue Mar 20 20:09:24 2001
From: thomas at cbs.dtu.dk (thomas@cbs.dtu.dk)
Date: Sat Mar  5 14:42:58 2005
Subject: [Biopython-dev] Who is going to BOSC ?
Message-ID: <15031.65348.823345.310920@delphinus.cbs.dtu.dk>

Hej,

Has any of you planned to go to the BOSC and/or ISMB2001 meeting in Copenhagen this
summer ?

http://www.open-bio.org/bosc2001/
http://ismb01.cbs.dtu.dk/

-thomas
-- 
Sicheritz-Ponten Thomas, Ph.D  CBS, Department of Biotechnology
thomas@biopython.org           The Technical University of Denmark
CBS:  +45 45 252489            Building 208, DK-2800 Lyngby
Fax   +45 45 931585            http://www.cbs.dtu.dk/thomas

	De Chelonian Mobile ... The Turtle Moves ...


From dalke at acm.org  Tue Mar 20 23:21:35 2001
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:58 2005
Subject: [Biopython-dev] Who is going to BOSC ?
Message-ID: <01f201c0b1be$6bc132a0$8eac323f@josiah>

Thomas:
> Has any of you planned to go to the BOSC and/or
> ISMB2001 meeting in Copenhagen this summer ?

I will be at BOSC and perhaps also at ISMB.

Anyone looking to split a room?

                    Andrew
                    dalke@acm.org



From chapmanb at arches.uga.edu  Sat Mar 24 13:39:45 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:58 2005
Subject: [Biopython-dev] Who is going to BOSC ?
In-Reply-To: <15031.65348.823345.310920@delphinus.cbs.dtu.dk>
References: <15031.65348.823345.310920@delphinus.cbs.dtu.dk>
Message-ID: <15036.59889.851441.975040@taxus.athen1.ga.home.com>

Hey Thomas;
 
> Has any of you planned to go to the BOSC and/or ISMB2001 meeting 
> in Copenhagen this summer ?

Yup. I'm planning to be there -- I'm really looking forward to it --
should be fun and productive. I'm already trying to get together
important stuff like money :-).

Brad



From chapmanb at arches.uga.edu  Sun Mar 25 12:23:44 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:58 2005
Subject: [Biopython-dev] ANN: mindy-0.1
In-Reply-To: <001101c0b069$9d8b9880$81ac323f@josiah>
References: <001101c0b069$9d8b9880$81ac323f@josiah>
Message-ID: <15038.10656.690338.212569@taxus.athen1.ga.home.com>

Hi Andrew;

> WARNING! First attempt at a generalized indexer for bioinformatics
> formats using Martel.  All code experimental and subject to change
> with even less notice than usual!

Well, despite this friendly encouragement, I decided to take a look at 
mindy anyways :-).
 
> For the last few weeks I've been thinking about how to use Martel as
> part of a generalized database indexer.  Martel of course does all the
> required parsing, so it's a matter of converting the results into some
> indexable format.

This works great for me! I got this working with the GenBank format
and checked some experimental code into Bio.GenBank to make indexing
using mindy similar to the current indexing system. Using this
requires that you put mindy inside a mindy directory on your
PYTHONPATH and make it importable by adding a __init__.py.

Adding this allowed me to mess around with Mindy using code I already
had which used the standard indexer -- some comments on this are below.
 
> Instead I acknowledged the common case where all needed fields
> are strictly contained in an element and wrote a content handler which
> lets you say something like

I have one fairly common problem in the context of GenBank
records. Almost all of the time I want to index GenBank records using
the accession number (without the version). The problem with some
GenBank records is that they look like:

LOCUS       AC006837    87584 bp    DNA             PLN       05-APR-2000
DEFINITION  Arabidopsis thaliana chromosome II section 1 of 255 of the complete
            sequence. Sequence from clones F23H14.
ACCESSION   AC006837 AE002093
VERSION     AC006837.15  GI:6598619

and have two (or more) accession numbers. I think the second one is an 
old, now defunct, accession number for the same clone. The problem I
get with just indexing with mindy using "accession" is that everything 
will be indexed using the second accession number, and not the first
as I would like.

What do you think would be a good solution to this? Is it possible to
have multiple indexes pointing to the same record (ie. both AC006837
and AE002093 point to this record)? Am I stuck using XSLT or
something else for this case?

> The indexing system uses Robin Dunn's bsddb3 interface on top of
> Sleepycat Berkeley DB package.  You can get them from (respectively)
> 
>   http://pybsddb.sourceforge.net/
>   http://www.sleepycat.com/download.html

Just curious -- why'd you decide to use Berkeley DB?

> The lookup time is very fast.  

Yup, this is *really* nice!

> BUGS/TO DO/THOUGHTS:

> Would working with compressed files be useful?  (Even if slower for
> record retrieval?)

Yes, this would be really useful, at least for me. I always end up 
uncompressing and recompressing stuff before I work with them to keep
myself from filling up my hard disk. It would be nice not to have to
go through that cycle every time I switch between projects I'm
working on.

> Would like to be able to add new files to a database.
> 
> Would like to remove/update files in a database.

Yeah, both would be really nice! It seems like there is some support
for this (?) but I didn't play with it.

> Could add a simple query language....
> 
> ...But then more general-purpose tools should be used (mySQL?
> PostgreSQL?)

Hmm, would it be hard to support multiple backends? I don't really
know anything about Berkeley DB and just installed it blindly to use
this.


Another addition which I think would be nice is storing the size of
the indexed files. This would allow you to potentially skip an
indexing when index is called on a file. If a database is already
present for a file, it checks the stored size of the file versus the
current size of the file, and then skips a new indexing if it appears
up to date. This is what bioperl does, and I think it's very
useful. Anyways, here's a patch that stores this information. The
GenBank code I wrote uses this to check the size:

$ diff -u mindy_index.py.orig mindy_index.py
--- mindy_index.py.orig	Mon Mar 19 06:15:14 2001
+++ mindy_index.py	Sun Mar 25 10:26:13 2001
@@ -2,7 +2,7 @@
 
 See the usage for more information.
 """
-
+import os
 import sys
 from xml.sax import handler
 from bsddb3 import db, dbshelve
@@ -141,6 +141,7 @@
         self.keywords = keywords
         self.filename = None
         self._filenames = {}
+        self._file_sizes = {}
         self._abbrevs = {}
 
     def add_filename(self, filename):
@@ -153,6 +154,9 @@
         self._abbrevs[filename] = str(abbrev)
 
         self.mindy_data["filenames"] = self._filenames
+
+        self._file_sizes[filename] = os.path.getsize(filename)
+        self.mindy_data["file_sizes"] = self._file_sizes
 
     def use_filename(self, filename):
         if not self._abbrevs.has_key(filename):

Just another thought.

>   I'm back from all my travels so I'll be catching up on
> things (back email, bills, etc.) over the next few days.
> Just thought you all would like to know if I end up sending
> replies to old messages :)

Nice to have you back! BTW, since you are back and I have your
attention (hopefully :-), have you thought about adding Martel to the
CVS tree? I added support for installing it to the setup.py already,
so it should be almost "ready to go" if you are still in favor of
doing this.

Brad


From dalke at acm.org  Sun Mar 25 16:42:54 2001
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:58 2005
Subject: [Biopython-dev] ANN: mindy-0.1
Message-ID: <005301c0b574$8e207720$8bac323f@josiah>

>Well, despite this friendly encouragement, I decided to take a look at 
>mindy anyways :-).

Yeah, Jeff pointed out that as well.  I pasted in the README,
which was meant to tell people that they shouldn't have long
term plans to expect that the code would be usable without
changes.  But that's perhaps overkill for the posting, which
is meant to get people to pick up the idea for the long term.

> Using this
>requires that you put mindy inside a mindy directory on your
>PYTHONPATH and make it importable by adding a __init__.py.

And to say that some effort would be needed to make things work.
Yeah, I did all my work in a single directory.

>I think the second one is an 
>old, now defunct, accession number for the same clone. The problem I
>get with just indexing with mindy using "accession" is that everything 
>will be indexed using the second accession number, and not the first
>as I would like.

Are you using the accession as the primary key or as an alias?
I made the assumption there will always be a primary key which
is unique but that there can be many aliases.

If you want something other than that, you would need to use
XSLT or a Python function, whose interfaces I sketched out
but did not implement.

>Is it possible to
>have multiple indexes pointing to the same record (ie. both AC006837
>and AE002093 point to this record)? Am I stuck using XSLT or
>something else for this case?

Yes.  Call them aliases.

>Just curious -- why'd you decide to use Berkeley DB?

I considered the following choices:
  - Berkeley DB
  - mySQL
  - PostgreSQL
  - Oracle

The last three require knowledge of SQL, of which I
have very little, and I wanted to get things up very
quickly.  In addition, all I wanted to do was lookups,
and BSDDB does that very well.  Plus, I liked that
BSDDB works in the local process rather than talking
to a server.

I can envision interfaces to the other databases.  Perhaps
for the future.

>> Would working with compressed files be useful?

>Yes, this would be really useful, at least for me. I always end up 
>uncompressing and recompressing stuff before I work with them to keep
>myself from filling up my hard disk.

Easy enough I think to stick a bit of code on the beginning
of the read to tell if the file is compressed or not.  I
think Python now includes some in-built modules for reading
compressed files, else popen'ing through zcat or bzcat is
pretty easy.
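
Something like this sketch, say (gzip files start with the two magic
bytes \037\213; the function name is made up):

import gzip

def open_maybe_compressed(filename):
    # peek at the first two bytes to see if this is a gzip file
    f = open(filename, "rb")
    magic = f.read(2)
    f.close()
    if magic == "\037\213":
        return gzip.open(filename)
    return open(filename)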

>> Would like to be able to add new files to a database.
>> 
>> Would like to remove/update files in a database.
>
>Yeah, both would be really nice! It seems like there is some
>support for this (?) but I didn't play with it.

There is?  Huh, didn't know about that.  Yes, it would be
nice.  :)

>Hmm, would it be hard to support multiple backends? I don't really
>know anything about Berkeley DB and just installed it blindly to use
>this.

No, it wouldn't.  But I think when you start getting into
"real" databases (meaning ones with SQL) then people want
the ability to set up their own schemas, so the queries
they have go quickly.  Should the database created be
fully normalized (in which case queries can be very
complex and require a lot of joins) or denormalized (which
makes for easier queries but is easier to accidentally
leave in an invalid state)?

I don't think there is a solution, so the best is to
wait until someone has a need for it.  Then pay me to
write the interfaces :)  My need for now is indexed
searches, so I used a database system which is designed
for that task.  There is no possible confusion that the
result is usable for larger scale queries.

>Another addition which I think would be nice is storing the size of
>the indexed files.
>This would allow you to potentially skip an
>indexing when index is called on a file. 

Yeah, that would work.  Though there would need to be a way
to override that skipping.

>Nice to have you back! BTW, since you are back and I have your
>attention (hopefully :-), 

Oh, pardon.  You talking to me?  Sorry, I wasn't paying
attention.

>have you thought about adding Martel to the
>CVS tree?

Getting there.  Getting there.  My problem has always been
the difficulty of getting my linux box hooked up to the
world.  I finally gave in and bought some dedicated hardware
for it: http://www.egghead.com/category/inv/00042993/03297120.htm

By the end of the week I hope to start working on it.
OTOH, my laptop started acting flaky in the last few days :(
Have I mentioned that me and hardware don't get along?

                    Andrew



From thomas at cbs.dtu.dk  Mon Mar 26 01:53:46 2001
From: thomas at cbs.dtu.dk (Thomas Sicheritz-Ponten)
Date: Sat Mar  5 14:42:58 2005
Subject: [Biopython-dev] ANN: mindy-0.1
In-Reply-To: "Andrew Dalke"'s message of "Sun, 25 Mar 2001 14:42:54 -0700"
References: <005301c0b574$8e207720$8bac323f@josiah>
Message-ID: 

"Andrew Dalke"  writes:

> 
> >Just curious -- why'd you decide to use Berkeley DB?
> 
> I considered the following choices:
>   - Berkeley DB
>   - mySQL
>   - PostgreSQL
>   - Oracle
> 
> The last three require knowledge of SQL, of which I
> have very little, and I wanted to get things up very
> quickly.  In addition, all I wanted to do was lookups,
> and BSDDB does that very well.  Plus, I liked that
> BSDDB works in the local process rather than talking
> to a server.

Hmm. I don't think I understand what you are actually storing - how is the
indexing done ? Are you preparsing all entries during the indexing part, or
are you storing the positions of the entries via seek and get ?  (for a
simple position indexing tool ala TIGR's yank see getgene.py in biopython)
(that would also answer the alias question)
> 
> I can envision interfaces to the other databases.  Perhaps
> for the future.
> 
> >> Would working with compressed files be useful?
Always !!! - Does anybody know how to seek/tell in a gzipped file ?

> Easy enough I think to stick a bit of code on the beginning
> of the read to tell if the file is compressed or not.  I
> think Python now includes some in-built modules for reading
> compressed files, else popen'ing through zcat or bzcat is
> pretty easy.
from gzip import open ???

> No, it wouldn't.  But I think when you start getting into
> "real" databases (meaning ones with SQL) then people want
> the ability to set up their own schemas, so the queries
> they have go quickly.  Should the database created be
> fully normalized (in which case queries can be very
> complex and require a lot of joins) or denormalized (which
> makes for easier queries but is easier to accidentally
> leave in an invalid state)?

Be careful, you are heading from a "simple" indexing scheme to a pySRS :-)

> 
> I don't think there is a solution, so the best is to
> wait until someone has a need for it.  Then pay me to
> write the interfaces :)  My need for now is indexed
> searches, so I used a database system which is designed
> for that task.  There is no possible confusion that the
> result is usable for larger scale queries.
> 
> >Another addition which I think would be nice is storing the size of
> >the indexed files.
> >This would allow you to potentially skip an
> >indexing when index is called on a file. 

Uhuh ... I don't think so, especially not if just accession numbers or IDs
are changed (e.g. from a TREMBL ID to a SWISS ID), which could result in a
slightly changed db with the same size. Better to use checksums or the
indexed accession numbers/IDs (best solution, but takes more time).

> By the end of the week I hope to start working on it.
> OTOH, my laptop started acting flaky in the last few days :(
> Have I mentioned that me and hardware don't get along?

What laptop or hardware combination is causing you nightmares ?


seek-and-indexingly-y'rs
-thomas
-- 
Sicheritz-Ponten Thomas, Ph.D  CBS, Department of Biotechnology
thomas@biopython.org           The Technical University of Denmark
CBS:  +45 45 252489            Building 208, DK-2800 Lyngby
Fax   +45 45 931585            http://www.cbs.dtu.dk/thomas

	De Chelonian Mobile ... The Turtle Moves ...

From dalke at acm.org  Mon Mar 26 03:16:50 2001
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:58 2005
Subject: [Biopython-dev] ANN: mindy-0.1
Message-ID: <004901c0b5cd$205f1800$aaab323f@josiah>

Thomas:
>Hmm. I don't think I understand what you are actually storing -
>how is the indexing done ? Are you preparsing all entries
>during the indexing part, or are you storing the positions
>of the entries via seek and get ? 

RTS,L?  :)

I'm parsing all entries through Martel.  Record boundaries
have tag events (eg, 'beginElement("swissprot38_record", ...)'),
which can be used to tell where the record is located - so
long as the characters() are also counted.  This is used
to store start/end positions.  I do not save the text in
the database, although there's no reason not to do so, other
than the space duplication.  (There's no option to compress
the BSDDB.)

Inside of a record I look for text contained in other elements.
For example, text inside of 'entry_name' elements is used
for the primary key, and 'ac_number' is used to get a list
of aliases.  This is used to make lookup tables to get back
to the offsets, which are used to read the record from disk.
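
In other words, roughly this trick -- a sketch, not the actual mindy
code; it works because Martel reports every input character through
characters():

from xml.sax import handler

class RecordOffsetHandler(handler.ContentHandler):
    # Store (start, end) character positions for every record
    # by counting the text passed to characters().
    def __init__(self, record_tag):
        handler.ContentHandler.__init__(self)
        self._record_tag = record_tag
        self._pos = 0
        self._start = 0
        self.offsets = []   # list of (start, end) pairs

    def characters(self, content):
        self._pos = self._pos + len(content)

    def startElement(self, name, attrs):
        if name == self._record_tag:
            self._start = self._pos

    def endElement(self, name):
        if name == self._record_tag:
            self.offsets.append((self._start, self._pos))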


>(for a simple position indexing tool ala TIGR's yank see
> getgene.py in biopython) 

I hadn't realized that code was there.  Its ability to
index is at some level the same, but there are quite
a few differences.  The biggest is that it is based on
Martel, so potentially anything expressed in a Martel
grammer can be indexed.  getgene is hard coded to only
work with SWISS-PROT.  That's actually a very important 
difference because by standardizing the lowest level
parsing (identification of interesting regions) makes
everything else much easier.

There are a lot of other differences between the two
approaches.  For example, I put the ID and AC fields
in different effective namespaces just in case there
are AC and ID fields which are identical but apply to
different records.  This isn't a problem in SWISS-PROT,
but I remember a few years ago I did some tests on
GenBank and there were a few dozen records repeated in
those fields.  Even what I did is incomplete for the
case of a record with multiple aliases which are from
different naming schemes when someone wants to know
XYZ's name for a record as compared to ABC's.

> (that would also answer the alias question)

I'm not sure what that question is.  Also, it looks
like that code only reads a single accession number,
specifically, the first number on the last AC line
of a record.

            elif line[:3] == 'AC ':
                acc = string.split(line)[1]
                if acc[-1] ==';': acc = acc[:-1]

There can be multiple accession numbers.


>> >> Would working with compressed files be useful?
>Always !!! - Does anybody know how to seek/tell in a gzipped file ?

That would depend on how the file is laid out, and I don't
know enough about the details of gzip'ed files.  As an
example, I know that after some number of characters the
compression table is reset, partially in case there is
a skew in the distribution of character frequencies in the
input stream.  If the number of characters is based on
the output size rather than input, then it should be
possible to jump to the next block and see if it's too
far or not.

All theoretical.  Real life may vary, and I bet it does.
 
>from gzip import open ???

Right.  Is there something for bzip2?

>> Brad:
>> >Another addition which I think would be nice is storing the size of
>> >the indexed files.
>> >This would allow you to potentially skip an
>> >indexing when index is called on a file. 

>Uhuh ... I don't think so, especially not if just accession
>numbers or IDs are changed ...
>Better to use checksums or the
>indexed accession numbers/IDs (best solution, but takes more time)

The requested functionality is a way to detect quickly if a
file has changed and not do an update if it hasn't changed.
There are lots of ways to do it:
  - file size
  - modify timestamp
  - some hash value of the whole file

File size isn't perfect, as Thomas pointed out.  The timestamp
isn't perfect because a file can be copied without change, yet
the copy still gets a new timestamp.  A hash value of the full file
calls for reading the full file.

Basically, there's no perfect solution so it's going to
be a tradeoff.
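
For the hash option, a sketch with the md5 module (it does read every
byte, as noted; the function name is made up):

import md5

def file_fingerprint(filename):
    # hash the file in blocks so memory use stays small
    digest = md5.new()
    f = open(filename, "rb")
    while 1:
        block = f.read(65536)
        if not block:
            break
        digest.update(block)
    f.close()
    return digest.hexdigest()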

The other solution is to push the decision of when to
update to a different program, which is responsible for
deciding if a file has changed and calling the updater.
I prefer this one because that's my usual solution for
trade-off problems - let something else figure out what
to do.

There's nothing to say this controller program
couldn't also use bsddb.

                    Andrew
                    dalke@acm.org



From thomas at cbs.dtu.dk  Mon Mar 26 04:00:49 2001
From: thomas at cbs.dtu.dk (Thomas Sicheritz-Ponten)
Date: Sat Mar  5 14:42:58 2005
Subject: [Biopython-dev] pySRS (was Re: ANN: mindy-0.1)
In-Reply-To: "Andrew Dalke"'s message of "Mon, 26 Mar 2001 01:18:48 -0700"
References: <004a01c0b5cd$636bb4a0$aaab323f@josiah>
Message-ID: 

"Andrew Dalke"  writes:

> > Be careful, you are heading from a "simple" indexing
> > scheme to a pySRS :-)

Oh, great - pySRS was intended as a joke, but it seems to me we are
embarking on a new project ... fine with me :-)

> 
> What would that entail?  I'm actually pretty serious.
> What would be needed to be competitive with SRS?  From
> what I know of it, it provides:
>   1. A parsing system for identifying useful regions
>        of many formats
>   2. Icarus, a language used to implement the actions
>        of the matches in the parser
>   3. A generic data model for storing identifiers, cross
>        references, keywords and free text.
>   4. A database for storing and searching those models
>   5. A web based interface to the database
>   6. A basic set of analysis tools augmenting that interface
> 
> Martel provides 1.  Python provides 2.  I think 3 is
> pretty easy esp. by building off the data structures
> biopython already uses for these databases.  Does SRS
> have their own database for 4 or do they use an existing
> one?  In either case, off-the-shelf databases provide
> similar or better functionality.  I've done 5 and 6
> before, although complete solutions (like what
> bionavigator does) are much, much harder.

What is the advantage of Icarus ? (I have no idea). The only part I know is
that SRS uses a LOT of index files ...  

IMHO a biopython approach would include e.g. gdbm for a light version
and/or mysql or postgres for a more full-featured version (by the way, I
have moved from postgres to mysql for several reasons)
(gdbm is present on almost all unixes and can be used [similar to cPickle]
for FAST storage and retrieval of simple key-value data)

I think between 4 and 5-6 I would include a generic library/modules so that
5 and 6 could be easily extended to include web/tk/gtk/commandline/pipes
etc. Most of that (except the Icarus part) sounds familiar to me too.

So what would the perfect combination of tools look like ?

1) Martel, for the parsing system for identifying useful regions of many formats
2) Python, a language used to implement the actions of the matches in the parser
3) Biopython/Biocorba for generic data models for storing identifiers,
   cross references, keywords and free text.
4) Gdbm, MySQL databases for storing and searching those models
4b) A generic library/module encoding methods for interfacing pySRS
5) A web/Tk/Gtk based interface to the database
6) A basic set of analysis tools augmenting that interface
7) A basic set of methods optionally to be used in all other biopython
   modules (e.g. FASTA parser's rec.nice_title() could query the accession
   found in the rec-title field and substitute it for organism + gene name
   etc.)


are-we-going-to-annoy-thure?'ly y'rs
-thomas

pySRS 
SnRS: Snake Retrieval System
PSI-RS: Python System for Indexing and Retrieval of Sequences
PseudoSRS: python system for embedded .... ???hh  .... 

-- 
Sicheritz-Ponten Thomas, Ph.D  CBS, Department of Biotechnology
thomas@biopython.org           The Technical University of Denmark
CBS:  +45 45 252489            Building 208, DK-2800 Lyngby
Fax   +45 45 931585            http://www.cbs.dtu.dk/thomas

	De Chelonian Mobile ... The Turtle Moves ...

From thomas at cbs.dtu.dk  Mon Mar 26 04:08:41 2001
From: thomas at cbs.dtu.dk (Thomas Sicheritz-Ponten)
Date: Sat Mar  5 14:42:58 2005
Subject: [Biopython-dev] Who is going to BOSC ?
In-Reply-To: "Andrew Dalke"'s message of "Tue, 20 Mar 2001 21:21:35 -0700"
References: <01f201c0b1be$6bc132a0$8eac323f@josiah>
Message-ID: 

> Thomas:
> > Has any of you planned to go to the BOSC and/or
> > ISMB2001 meeting in Copenhagen this summer ?
> 

Maybe I should mention that our department is organizing ISMB2001.
So, I am involved in some parts of the preparation and will be at both BOSC
and ISMB2001.

Should we have a biopython-pub meeting ? 
(In case you didn't know - the Danish beers ARE among the best beers in the
world - I should know because I'm Austrian :-)

cheers
-thomas

-- 
Sicheritz-Ponten Thomas, Ph.D  CBS, Department of Biotechnology
thomas@biopython.org           The Technical University of Denmark
CBS:  +45 45 252489            Building 208, DK-2800 Lyngby
Fax   +45 45 931585            http://www.cbs.dtu.dk/thomas

	De Chelonian Mobile ... The Turtle Moves ...

From chapmanb at arches.uga.edu  Mon Mar 26 04:41:06 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:58 2005
Subject: [Biopython-dev] Who is going to BOSC ?
In-Reply-To: 
References: <01f201c0b1be$6bc132a0$8eac323f@josiah>
	
Message-ID: <15039.3762.356839.977282@taxus.athen1.ga.home.com>

Hee... I'm too tired and wired to respond to serious mail right now,
but I can handle this one :-)

Thomas:
> Maybe I should mention that our department is organizing ISMB2001.
> So, I am involved in some parts of the preparation and will be at both BOSC
> and ISMB2001.

Snazzy -- since you're involved with the organizing, we'll expect some
special perks for biopythoners. I'm sure you'll be able to hook us all 
up with a biopython-penthouse-suite for, um, entertaining prospective
investors :-)

> Should we have a biopython-pub meeting ? 
> (In case you didn't know - the Danish beers ARE among the best beers in the
> world - I should know because I'm Austrian :-)

I can give a big +1 for this idea. I don't know how the best beers in
the world can compare to the fine American beers I'm used to
drinking. I mean, can anything be as fine as the high quality Pabst 
Blue Ribbon? :-)

Brad


From dalke at acm.org  Mon Mar 26 04:58:37 2001
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:58 2005
Subject: [Biopython-dev] Who is going to BOSC ?
Message-ID: <008801c0b5db$5516d980$aaab323f@josiah>

Brad:
>I mean, can anything be as fine as the high quality Pabst 
>Blue Ribbon? :-)

PBR?  What about Schlitz?  Or Old Milwaukee?

Not that I drink beer, so either cider for me, or I'll
bring the tequila.  :)

                    Andrew



From dalke at acm.org  Mon Mar 26 05:05:11 2001
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:58 2005
Subject: [Biopython-dev] pySRS (was Re: ANN: mindy-0.1)
Message-ID: <008901c0b5dc$3fd17ca0$aaab323f@josiah>

Thomas:
> IMHO a biopython approach would include e.g. gdbm
> for a light version

Any reason for gdbm over Sleepycat's Berkeley DBM?
I admit to personal preference since an old boss of
mine is now at Sleepycat.  I also think it is more
powerful and has more development than gdbm.  For
example, bsddb can support multiple simultaneous
readers and writers, can do transactions, and allows
b-tree storage for ranged searches.

> (gdbm is present on almost all unixes and can be used
> [similar to cPickle] for FAST storage and retrieval
> of simple key-value data)

My background was mostly on SGIs which didn't have
gdbm installed.  I don't know about Solaris boxes
or other non-Linux/*BSD OSes.  Robin Dunn's interface
to bsddb includes a library for automatic pickling
to the database.  As I recall, part of Robin's work
is funded by Digital Creations so Zope can use bsddb
for its object database.

                    Andrew



From thomas at cbs.dtu.dk  Mon Mar 26 05:13:07 2001
From: thomas at cbs.dtu.dk (Thomas Sicheritz-Ponten)
Date: Sat Mar  5 14:42:58 2005
Subject: [Biopython-dev] Who is going to BOSC ?
In-Reply-To: "Andrew Dalke"'s message of "Mon, 26 Mar 2001 02:58:37 -0700"
References: <008801c0b5db$5516d980$aaab323f@josiah>
Message-ID: 

"Andrew Dalke"  writes:

> Brad:
> >I mean, can anything be as fine as the high quality Pabst 
> >Blue Ribbon? :-)
> 
> PBR?  What about Schlitz?  Or Old Milwaukee?

Hej - I talked about Beer (Zipfer, Carlsberg, Tuborg, Gösser etc.) .... not
these _*censored*_ Ale replicas ;-)


> 
> Not that I drink beer, so either cider for me, or I'll
> bring the tequila.  :)

Any Single Malt Fans present ?

-thomas

-- 
Sicheritz-Ponten Thomas, Ph.D  CBS, Department of Biotechnology
thomas@biopython.org           The Technical University of Denmark
CBS:  +45 45 252489            Building 208, DK-2800 Lyngby
Fax   +45 45 931585            http://www.cbs.dtu.dk/thomas

	De Chelonian Mobile ... The Turtle Moves ...

From jchang at SMI.Stanford.EDU  Mon Mar 26 10:45:58 2001
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:58 2005
Subject: [Biopython-dev] Who is going to BOSC ?
In-Reply-To: 
Message-ID: 

On 26 Mar 2001, Thomas Sicheritz-Ponten wrote:

> "Andrew Dalke"  writes:
> > PBR?  What about Schlitz?  Or Old Milwaukee?

Milwaukee's Best?  In college, that's the cheap kind we used to buy for
large parties.  We started calling it "The Beast."

> Hej - I talked about Beer (Zipfer, Carlsberg, Tuborg, Gösser etc.) .... not
> these _*censored*_ Ale replicas ;-)

I can vouch for this.  About the only thing I remember from my last trip
to Denmark was the tour of the Carlsberg brewery.  But I thought people in
Denmark just drank Carlsberg and Tuborg?

Jeff


From katel at worldpath.net  Tue Mar 27 03:12:44 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:58 2005
Subject: [Biopython-dev] ToEol in Martel
References: <008901c0b5dc$3fd17ca0$aaab323f@josiah>
Message-ID: <001801c0b695$b486f660$010a0a0a@cadence.com>

  ToEol is apparently stripping leading white space?

The Martel construct is
amino_acid_ref_journal_line = Martel.Group( "amino_acid_ref_journal_line",
                                Martel.Str( "AAREFJ" ) +
                                Martel.ToEol( "amino_acid_ref_journal" ) )

The source text is:
AAREFJ    1 J IMMUNOL 150: 4985-4995 (1993)


  The text in _Consumer.amino_acid_ref_journal is
1 J IMMUNOL 150: 4985-4995 (1993)


Is this how ToEol should work?


                          Cayte


From dalke at acm.org  Tue Mar 27 06:45:43 2001
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:58 2005
Subject: [Biopython-dev] ToEol in Martel
Message-ID: <002601c0b6b3$7570c5a0$b599fc9e@josiah>

Cayte:
>  ToEol is apparently stripping leading white space?

That's not what I get.  Here's my test:

>>> import Martel
>>> amino_acid_ref_journal_line = Martel.Group( "amino_acid_ref_journal_line",
...                                 Martel.Str( "AAREFJ" ) +
...                                 Martel.ToEol( "amino_acid_ref_journal" ) )
>>>
>>> parser = amino_acid_ref_journal_line.make_parser()
>>> from xml.sax import saxutils
>>> parser.setContentHandler(saxutils.XMLGenerator())
>>> parser.parseString("AAREFJ    1 J IMMUNOL 150: 4985-4995 (1993)\n")
<?xml version="1.0" encoding="iso-8859-1"?>
<amino_acid_ref_journal_line>AAREFJ<amino_acid_ref_journal>    1 J IMMUNOL 150: 4985-4995 (1993)
</amino_acid_ref_journal></amino_acid_ref_journal_line>

This has "<amino_acid_ref_journal>" include the spaces.

>Is this how ToEol should work?

No.  ToEol should store all the whitespace - and everything else -
up to the end of line character(s).

What happens when you do this same test?

                    Andrew



From chapmanb at arches.uga.edu  Tue Mar 27 14:04:35 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:58 2005
Subject: [Biopython-dev] ToEol in Martel
In-Reply-To: <002601c0b6b3$7570c5a0$b599fc9e@josiah>
Message-ID: 

Cayte:
> >  ToEol is apparently stripping leading white space?

Andrew:
> That's not what I get.  Here's my test:
[...Some convincing test results...]

This is just a random thought, but if Cayte is using the EventGenerator
class which I recently moved to Bio.ParserSupport, this *does* strip
whitespace before sending an event:

    # strip off whitespace and call the consumer
    callback_function = eval('self._consumer.' + name)
    info_to_pass = string.strip(self.info[name])
    callback_function(info_to_pass)

I guess whether I should do this or not is up for debate. I know Jeff has
some differing opinions (and a good example of why this can be bad), but I
took this approach since I was already dealing with enough of a mess with
GenBank that I didn't want to fight with whitespace as well... If this is
really a problem here, I can look at fixing it.

Brad





From sarah at k-k.oz.au  Thu Mar 29 19:35:45 2001
From: sarah at k-k.oz.au (Sarah Kummerfeld)
Date: Sat Mar  5 14:42:58 2005
Subject: [Biopython-dev] Reverse parsing
In-Reply-To: <001801c0b695$b486f660$010a0a0a@cadence.com>
Message-ID: 

I have just started using biopython (I'm doing a project on
modelling the evolution of genes through intragenic duplication) 
and was wondering whether there was already an elegant way to turn
sequence objects or seqRecord objects back into one of the
file formats like fasta?

I know that it's not exactly a taxing job, but it seems like
a logical addition that could be incorporated into either the 
SeqRecord class itself or the parser. 

Thanks,

Sarah



From jchang at SMI.Stanford.EDU  Thu Mar 29 23:31:15 2001
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:58 2005
Subject: [Biopython-dev] Reverse parsing
In-Reply-To: 
Message-ID: 

Hi Sarah,

No, there's nothing like that in biopython, although there should be.  
This sounds like something that should go into SeqIO.  We should keep a
set of functions there that can output SeqRecord objects into a variety of
formats.  Please let me know if you want to write the code!  :)
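
As a starting point, here's a minimal sketch of what one such output
function could look like -- plain strings in, fasta out, nothing
biopython-specific assumed:

def write_fasta(handle, title, sequence, width=60):
    # one fasta record: a '>' title line, then the sequence
    # wrapped at 'width' characters per line
    handle.write(">%s\n" % title)
    for i in range(0, len(sequence), width):
        handle.write(sequence[i:i + width] + "\n")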

Jeff


On Fri, 30 Mar 2001, Sarah Kummerfeld wrote:

> 
> I have just started using biopython (I'm doing a project on
> modelling the evolution of genes through intragenic duplication) 
> and was wondering whether there was already an elegant way to turn
> sequence objects or seqRecord objects back into one of the
> file formats like fasta?
> 
> I know that it's not exactly a taxing job, but it seems like
> a logical addition that could be incorporated into either the 
> SeqRecord class itself or the parser. 
> 
> Thanks,
> 
> Sarah
> 
> 
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev@biopython.org
> http://biopython.org/mailman/listinfo/biopython-dev
> 


From katel at worldpath.net  Fri Mar 30 03:28:07 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:58 2005
Subject: [Biopython-dev] ToEol in Martel
References: 
Message-ID: <002301c0b8f3$5a41cf80$010a0a0a@cadence.com>

----- Original Message -----
From: "Brad Chapman" 
> Cayte:
> > >  ToEol is apparently stripping leading white space?
>
> Andrew:
> > That's not what I get.  Here's my test:
> [...Some convincing test results...]
>
> This is just a random thought, but if Cayte is using the EventGenerator
> class which I recently moved to Bio.ParserSupport, this *does* strip
> whitespace before sending an event:
>

  Yes, I use EventGenerator.
> # strip off whitespace and call the consumer
> callback_function = eval('self._consumer.' + name)
> info_to_pass = string.strip(self.info[name])
> callback_function(info_to_pass)
>
> I guess whether I should do this or not is up for debate. I know Jeff has
> some differing opinions (and a good example of why this can be bad), but I
> took this approach since I was already dealing with enough of a mess with
> GenBank that I didn't want to fight with whitespace as well... If this is
> really a problem here, I can look at fixing it.
>
>

    Some of the Kabat reference fields have a 1 in column 6.  There isn't
much documentation, but it looks like a continuation flag to indicate that
the field is not the last field in the reference.  I like to preserve the
column structure because it's possible, though unlikely, for the field
contents to start with 1.

  One option is for me to subclass EventHandler and rewrite _make_callback.
But this is inelegant because of the convention that a leading underscore
means a private function.  An alternative solution is a flag, with a default
of false, that bypasses the code to strip whitespace.  This leaves the
interface alone but requires changes to the code.

  Just some ideas.

                                                        Cayte


From dalke at acm.org  Fri Mar 30 01:36:27 2001
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:58 2005
Subject: [Biopython-dev] ToEol in Martel
Message-ID: <00aa01c0b8e3$f59762c0$8295fc9e@josiah>

Brad:
>> This is just a random thought, but if Cayte is using
>> the EventGenerator class which I recently moved to
>> Bio.ParserSupport, this *does* strip whitespace before
>> sending an event:

Cayte:
>  Yes, I use EventGenerator.

I'm looking at the EventGenerator.  I get the idea
of what it does.

However, I have some problems with how it does it.
Specifically, how it handles whitespace.  By default
it strips whitespace from the ends of the element's
contents and it merges multiple elements with the same
tag name into a single string separated by whitespace.

Consider the case where someone stores alignment data
in a sequence format.  In that case, whitespace is
very significant, yet the default mode of EventGenerator
is to throw that information away and there is no
way for the client code to specify otherwise.

As an existence proof, Cayte ran into a case where that
significant information was tossed.

There are a couple of problems with the automatic joining
of the text of successive tags.  It is used for cases like:

This is a description field
using two lines.

which automatically gets turned into

This is a description field using two lines.

For that case it is useful, but consider

EKLAD
WERNDA

Do people expect
EKLAD WERNDA

over
EKLAD WERNDA

Even worse, I tried something like this with the XSLT
code to turn the description lines into a single field
for SWISS-PROT.  I ran into problems with records like:

DE  This record matches the Prosite pattern A-B-C-D-E-
DE  F-G-H-I but is a false positive.

When merged together, the pattern string becomes
A-B-C-D-E F-G-H-I

On the other hand, with the original callbacks, the
handler could be made smart enough to know that if
a line ends in a hyphen that it could be a split
pattern, or even a hyphenated word and attempt to
reconstruct the "real" form of the data.

Not trivial work, but doable with some heuristics.

But if the handler only gets the merged data fields
then it doesn't have the clue that the "-" is at the
end of a line, which makes this sort of cleanup
much harder or at least more error prone.

Brad:
>> I guess whether I should do this or not is up for debate.
>> I know Jeff has some differing opinions (and a good example
>> of why this can be bad),

I'm going to have to agree with Jeff on this.  Whitespace
must be treated as important, unless told otherwise.

One way to address it is with composition, or whatever
the appropriate GoF name is.  Have handlers take handlers,
where each handler modifies the data stream as needed.

  consumer = YourConsumer()
  handler = Sax2Consumer(consumer)
  cleanup = MergeFieldsAndStripWhitespace(handler)
  filter = FilterEvents(("some", "list", "of", "tags"), cleanup)
  parser.setContentHandler(filter)

In this case, filter only passes along the named tags,
cleanup does the merging of fields and whitespace stripping,
and Sax2Consumer translates the start/endElement calls to
the form expected by the consumer.

This may be too fine-grained, but the point is that there
are ways to make it easy to meet the different needs without
much extra work, while also being explicit on what's
actually going on with the data.  No surprises is a good
thing.  (Except for the surprise that there are no surprises :)

>> If this is really a problem here, I can look at fixing it.

Please consider that.

Cayte:
>  One option is for me to subclass EventHandler and rewrite
> _make_callback. But this is inelegant because of the
> convention that a leading underscore means a private function.

Agreed.  This is because it's doing two things, normalizing
the callback text and passing the result to the consumer.
That normalization needs to be accessible to change, while the
rest of it does not.

> An alternative solution is a flag, with a default
> of false, that bypasses the code to strip whitespace.

I do not like flags.  In almost all cases there is a
better way to do it.  For example, another way is to
call "self._cleanup" where "_cleanup" could be a method
which may be overridden in subclasses but defaults to
string.strip.

The same would need to be done with the join function,
although this is harder to do because it isn't a simple
choice of which character to use - some people will
not want that joining to occur.
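
Schematically, the _cleanup idea looks like this (class names are
illustrative, not the current EventGenerator code):

import string

class Generator:
    def _cleanup(self, text):
        # default normalization; subclasses may override
        return string.strip(text)

class VerbatimGenerator(Generator):
    def _cleanup(self, text):
        # keep the contents exactly as parsed
        return text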

Brad:
>> # strip off whitespace and call the consumer
>> callback_function = eval('self._consumer.' + name)

Please change this to use
   callback_function = getattr(self._consumer, name)
This is faster and safer.  Evals can do nasty things,
like if name is
   abc + __import__("shutil").rmtree("/")

Also, you might want to change
   if name in self.flags.keys():
to
   if self.flags.has_key(name):

Python 2.1 will allow (as I recall)
   if name in self.flags:

                    Andrew
                    dalke@acm.org



From dalke at acm.org  Fri Mar 30 01:40:56 2001
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:58 2005
Subject: [Biopython-dev] ToEol in Martel
Message-ID: <00c401c0b8e4$64eecdc0$8295fc9e@josiah>

Oops, should have written:

>Do people expect
>EKLAD WERNDA
>
>over
>EKLADWERNDA
               ^^^ No space here

                    Andrew



From chapmanb at arches.uga.edu  Fri Mar 30 20:40:02 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:58 2005
Subject: [Biopython-dev] ToEol in Martel
In-Reply-To: <00aa01c0b8e3$f59762c0$8295fc9e@josiah>
References: <00aa01c0b8e3$f59762c0$8295fc9e@josiah>
Message-ID: <15045.13682.530560.511561@taxus.athen1.ga.home.com>

Cayte and Andrew;

Thanks much for the feedback on EventGenerator. I've updated it
(changes in CVS) to, I think, handle your concerns. Generally, what I
did was change it so instead of trying to combine multiple with
spaces, it just collects up all of these lines and returns them as a
list. So, if we have Martel output that looks like:

EKLAD
WERNDA

EventGenerator will now call the consumer with a list like:

["EKLAD", "WERNDA"]

This way, you can deal with multiple lines on a case by case basis in
the consumer, if necessary. 

Additionally, I added the ability to pass a "finalizer" function to
EventGenerator which, if present, will be called before a list of
information is returned. This way, if you always want to reformat
things (as I do in GenBank), then you can still do this. The finalizer 
function gets passed the list of lines, and can do whatever it wants
with it.
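
For example, a finalizer for sequence data could just glue the lines
back together with nothing in between (a sketch of the kind of
function you'd pass in):

import string

def merge_sequence_lines(lines):
    # ["EKLAD", "WERNDA"] -> "EKLADWERNDA"
    return string.join(lines, "")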

If this doesn't seem like a good solution, let me know and we can work 
on it more. I added tests to test_ParserGenerator, if you want to see
specific cases (besides GenBank) of how it works.

Andrew:
> However, I have some problems with how it does it.
> Specifically, how it handles whitespace.  By default
> it strips whitespace from the ends of the element's
> contents and it merges multiple elements with the same
> tag name into a single string separated by whitespace.

I agree with you (and Jeff, and probably everyone else in the world
:-). I shouldn't muck with whitespace by default. Bad Brad, bad!
Hopefully the new EventGenerator code more faithfully keeps it.

> There are a couple of problems with the automatic joining
> of the text of successive tags.  It is used for cases like:

[...lots of examples of how auto joining can mess up...]

Yup, I'm in full agreement with you here as well. My GenBank solution
is definitely a quick-n-dirty one. I think it handles most cases
fairly well. Hopefully with this new format, specific heuristics for
specific cases can be introduced into the Biopython consumer classes,
if necessary. But, yeah, it is an ugly problem all around. 

I'm trying to push this problem back into the Consumer classes and not 
deal with it in EventGenerator. EventGenerator was just meant to do two 
things:

1. Be a general way to turn Martel events into Biopython-type events.

2. Handle stuff that runs over multiple lines, so that this kind of
code wouldn't have to go into Biopython-consumers.

I think it does these things a little better now :-)

[style changes]
> Please change this to use
>    callback_function = getattr(self._consumer, name)
> This is faster and safer.  Evals can do nasty things,
> like if name is
>    abc + __import__("shutil").rmtree("/")
> 
> Also, you might want to change
>    if name in self.flags.keys():
> to
>    if self.flags.has_key(name):

Thanks for the pointers. Style changes are always very welcome. I had
been fighting in the past to get getattr to work right (I'm pretty
sure you mentioned this to me previously). Thanks!

Thanks again for the feedback on this. Hope this solution is workable!
Brad