From chapmanb at arches.uga.edu Sun Oct 1 10:48:31 2000 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:42:52 2005 Subject: [Biopython-dev] Martel based replacement for Fasta _Scanner Message-ID: <200010011448.KAA83954@archa13.cc.uga.edu> Hey all; As I was talking about yesterday, I went ahead and generated a Martel-based replacement for the current _Scanner framework that Jeff wrote for Fasta parsing. I was just interested in doing this so that I could see how Martel based parsing could fit in with the nice Scanner/Consumer framework that Jeff set up. Basically the approach I took was to let Martel do the low level parsing, and then generate the appropriate scanner events using the SAX handler that looks at the XML generated by Martel. So basically all I did was rewrite the _Scanner to use Martel. I attached two files to this mail which shows this in action: 1. Fasta.py -> This is a replacement for Bio/Fasta/Fasta.py. It just replaces _Scanner and adds a SAX handler class to turn the Martel XML into Scanner events. 2. fasta_format.py -> This should be put in Bio/Fasta, and is the Martel based regexp for reading fasta files. My regular expressions suck, so this got pretty ugly, especially when I was trying to deal with that annoying dos line break stuff in the test suite. I'm quite open to suggestions for making this nicer! This should work almost exactly the same as the _Scanner class from before, except that it parses everything that gets fed into it (instead of just one record from a file, as before). So all of the tests work with the new parser, but test_Fasta will fail in the regression test because of this different behavior. Feedback on all of this would be very welcome! Brad -------------- next part -------------- A non-text attachment was scrubbed... 
Name: Fasta.py Type: application/x-unknown Size: 11129 bytes Desc: not available Url : http://portal.open-bio.org/pipermail/biopython-dev/attachments/20001001/ff075c6e/Fasta.bin -------------- next part -------------- A non-text attachment was scrubbed... Name: fasta_format.py Type: application/x-unknown Size: 1219 bytes Desc: not available Url : http://portal.open-bio.org/pipermail/biopython-dev/attachments/20001001/ff075c6e/fasta_format.bin From chapmanb at arches.uga.edu Sun Oct 1 19:52:25 2000 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:42:52 2005 Subject: [Biopython-dev] Parser problem with blastpgp v. 2.0.13? In-Reply-To: Message-ID: <200010012352.TAA24976@archa12.cc.uga.edu> Iddo wrote: > I already submitted a bug report (#16). Basically, i cannot seem to > work the NCBIStandalone parser with the output I get. I did run it on > similar btXXX files, and that seemed to go well. > > I am using blastpgp V 2.0.13 Hmmm... I took a quick look at this, and I think this is the problem. It looks from the comments that Jeff has only tested this with v 2.0.10 and v 2.0.11 so it looks like the output has changed somewhat (of course!). I think the problem is that _scan_masterslave_alignment isn't figuring out that it should stop reading alignments, so it is trying to convert '....' into an integer, which obviously didn't work so hot. The new break between between rounds is a Searching.... line, instead of the Database line that _scan_masterslave_alignment is looking for, so if you add a check to break on finding Searching..., then the parse seems to complete okay. I was playing with this to look at the results, and it also looks like the record isn't giving up the data from the multiple alignments, so I also had a quick patch to fix this. Here are the patches, against CVS, that seem to make things look okay for me. Jeff is the master of Blast, so it is up to him to approve these (or let me know where I went wrong :-). Hope this helps. 
Brad *** NCBIStandalone.py.orig Sun Oct 1 18:36:01 2000 --- NCBIStandalone.py Sun Oct 1 19:47:27 2000 *************** *** 329,335 **** consumer.start_alignment() while 1: line = safe_readline(uhandle) ! if line[:10] == ' Database': uhandle.saveline(line) break elif is_blank_line(line): --- 329,340 ---- consumer.start_alignment() while 1: line = safe_readline(uhandle) ! # PSIBlast 2.0.13 appears to have a Searching... line after ! # rounds instead of a Database line ! if line[:9] == 'Searching': ! uhandle.saveline(line) ! break ! elif line[:10] == ' Database': uhandle.saveline(line) break elif is_blank_line(line): *************** *** 1178,1184 **** _AlignmentConsumer.end_alignment(self) if self._alignment is not None: self._round.alignments.append(self._alignment) ! elif self._multiple_alignment is not None: self._round.multiple_alignment = self._multiple_alignment def end_hsp(self): --- 1183,1189 ---- _AlignmentConsumer.end_alignment(self) if self._alignment is not None: self._round.alignments.append(self._alignment) ! if self._multiple_alignment is not None: self._round.multiple_alignment = self._multiple_alignment def end_hsp(self): From jchang at SMI.Stanford.EDU Mon Oct 2 00:16:20 2000 From: jchang at SMI.Stanford.EDU (Jeffrey Chang) Date: Sat Mar 5 14:42:52 2005 Subject: [Biopython-dev] Parser problem with blastpgp v. 2.0.13? In-Reply-To: <200010012352.TAA24976@archa12.cc.uga.edu> Message-ID: Hi Iddo, Yes, there's definitely a bug here. The problem is that the master-slave alignment code wasn't built with PSI-BLAST in mind. When it encounters a master-slave alignment, it keeps on parsing until it finds the database section. Unfortunately, this breaks with PSI-BLAST, since it generates many alignments in a single run. The fix, as Brad noted, is to allow the masterslave code to recognize psi-blast output. This has been checked in and will go into the next developmental release. Thanks for the report and the fix! 
Jeff On 1 Oct 2000, Brad Chapman wrote: > Iddo wrote: > > I already submitted a bug report (#16). Basically, i cannot seem to > > work the NCBIStandalone parser with the output I get. I did run it on > > similar btXXX files, and that seemed to go well. > > > > I am using blastpgp V 2.0.13 > > Hmmm... I took a quick look at this, and I think this is the problem. It > looks from the comments that Jeff has only tested this with v 2.0.10 and > v 2.0.11 so it looks like the output has changed somewhat (of course!). > > I think the problem is that _scan_masterslave_alignment isn't figuring > out that it should stop reading alignments, so it is trying to convert > '....' into an integer, which obviously didn't work so hot. > > The new break between between rounds is a Searching.... line, instead of > the Database line that _scan_masterslave_alignment is looking for, so if > you add a check to break on finding Searching..., then the parse seems to > complete okay. > > I was playing with this to look at the results, and it also looks like > the record isn't giving up the data from the multiple alignments, so I > also had a quick patch to fix this. > > Here are the patches, against CVS, that seem to make things look okay for > me. Jeff is the master of Blast, so it is up to him to approve these (or > let me know where I went wrong :-). Hope this helps. > > Brad > > *** NCBIStandalone.py.orig Sun Oct 1 18:36:01 2000 > --- NCBIStandalone.py Sun Oct 1 19:47:27 2000 > *************** > *** 329,335 **** > consumer.start_alignment() > while 1: > line = safe_readline(uhandle) > ! if line[:10] == ' Database': > uhandle.saveline(line) > break > elif is_blank_line(line): > --- 329,340 ---- > consumer.start_alignment() > while 1: > line = safe_readline(uhandle) > ! # PSIBlast 2.0.13 appears to have a Searching... line after > ! # rounds instead of a Database line > ! if line[:9] == 'Searching': > ! uhandle.saveline(line) > ! break > ! 
elif line[:10] == ' Database': > uhandle.saveline(line) > break > elif is_blank_line(line): > *************** > *** 1178,1184 **** > _AlignmentConsumer.end_alignment(self) > if self._alignment is not None: > self._round.alignments.append(self._alignment) > ! elif self._multiple_alignment is not None: > self._round.multiple_alignment = self._multiple_alignment > > def end_hsp(self): > --- 1183,1189 ---- > _AlignmentConsumer.end_alignment(self) > if self._alignment is not None: > self._round.alignments.append(self._alignment) > ! if self._multiple_alignment is not None: > self._round.multiple_alignment = self._multiple_alignment > > def end_hsp(self): > > > > > > > > > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev@biopython.org > http://biopython.org/mailman/listinfo/biopython-dev > From dalke at acm.org Mon Oct 9 07:45:41 2000 From: dalke at acm.org (Andrew Dalke) Date: Sat Mar 5 14:42:52 2005 Subject: [Biopython-dev] Martel-0.3 available Message-ID: <003f01c031e6$7480a180$99d6cdcf@josiah> Martel-0.3 is finally available. This finishes the major cleanup, has a more SAX-like interface, fixes various problems, adds a framework and code for parsing a record at a time, and has first draft documentation about both the internals and a tutorial on how to write a parser. Excepting bug fixes, this is the last version which will work with Python 1.5.2. The next one will work with 2.0 and use its new xml package. There are several changes in this version which are incompatible with the 0.25 release - mostly so that the SAX names are correct (eg, now using DocumentHandler instead of ContentHandler, which was just wrong). There should be very few API breakages in future versions other than the support for the new XML module and SAX 2.0 and a change/simplification in how to access parsers for specific formats and format versions. Martel can be found at http://www.biopython.org/~dalke/Martel/ . 
Links to the new documentation are available from that page.

Andrew Dalke
dalke@acm.org

From chapmanb at arches.uga.edu Mon Oct 9 15:49:28 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar 5 14:42:52 2005
Subject: [Biopython-dev] Martel-0.3 available
In-Reply-To: <003f01c031e6$7480a180$99d6cdcf@josiah>
Message-ID: <200010091949.PAA124060@archa13.cc.uga.edu>

Andrew wrote:
> Martel-0.3 is finally available.

Cool! Thanks for this.

> a tutorial on how to write a parser.

Very nice. I wish my Fasta parser looked as nice as yours :-).

One thing that had me concerned about your parser, though, was the use of Str("\n") to detect the end of lines. I was using this with lots o' luck with all of my unix formatted files, but it didn't seem to work right for me when I was using it on the Windows formatted (I think) files in the Fasta test directory. I ended up having to use Martel.MaxRepeat(Martel.Re("[\s]"), 0, 2) to detect end o' lines, which seems to work properly, but is pretty ugly looking. Yours seemed to work okay though at detecting the end of the lines, so I'm not positive what is going on... Hmmm, I don't know, I'll have to look at this more, I guess. I don't really know anything at all about line-break madness.

> Excepting bug fixes, this is the last version which will work with
> Python 1.5.2. The next one will work with 2.0 and use its new xml package.

Since I'm using 2.0 right now, I made the necessary changes to get it working for me with just the xml packages. The changes I made were in Parser.py, and are attached as Parser.diff, in case they will be of any use to you in making these changes. BTW, pyXML-0.6.1 is out, so hopefully now 1.5.2 with PyXML should work interchangeably with python2.0 alone.

> mostly so that the SAX names are correct (eg, now
> using DocumentHandler instead of ContentHandler, which was just
> wrong).

Really?
I didn't even see DocumentHandler in 2.0 -- I think that ContentHandler is DocumentHandler (at least in 2.0), but I'm not positive. Hard to follow all of the changes in that stuff...

Thanks again for the new version -- I'm looking forward to having some time to play around with it more :-).

Brad
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Parser.diff
Type: application/x-unknown
Size: 9161 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/biopython-dev/attachments/20001009/a2109b64/Parser.bin

From dalke at acm.org Mon Oct 9 20:15:34 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:52 2005
Subject: [Biopython-dev] Martel-0.3 available
Message-ID: <002e01c0324f$372a5420$46d5cdcf@josiah>

Brad:
> One thing that had me concerned about your parser, though, was the
> use of Str("\n") to detect end of new lines. I was using this with
> lots o' luck with all of my unix formatted files, but it didn't seem
> to work right for me when I was using it on the Windows formatted (I
> think) files in the Fasta test directory. I ended up having to use
> Martel.MaxRepeat(Martel.Re("[\s]"), 0, 2) to detect end o' lines,
> which seems to work properly, but it pretty ugly looking.

Yeah, I'm worried about that as well, but I haven't really looked at the problem. Dug around for a bit now. Under MS Windows, reading a native file (which "od -c" shows as having "\r\n"), open("test.dat").read() only shows "\n", so it's been translated as I expect. Using open("test.dat", "rb").read() shows the "\r\n".

So as long as the file is read in text mode and is used on an OS with the same line endings, it will be fine. However, it does mean my byte counts will be off, depending on your viewpoint :(

There might be a problem with interoperability between different OSes.
That could be addressed in one of several ways:
  1) require the input to be converted to the local line ending and
     provide no support for doing so
  2) supply some adapters ("FromMac", "FromUnix", "FromDos") but don't
     use them; instead leaving the decision up to the client code
  3) provide a tool which autodetects endings and uses the right adapter
  4) http://members.nbci.com/_XOOM/meowing/python/index.html
  5) define an EOL = Re(r"\n|\r\n?")

I prefer 2-4, but would like to stick with 1 for now. I don't like 5 because people will forget to use it.

> I don't really know anything at all about line-break madness.

I've been a unix weenie for too long, and agree with you.

> I didn't even see DocumentHandler in 2.0 -- I think that
> ContentHandler is DocumentHandler (at least in 2.0), but I'm not
> positive. Hard to follow all of the changes in that stuff...

According to my XML book, it's DocumentHandler, and it works with DOM and the other XML tools, so it's likely correct.

Andrew

From dalke at acm.org Wed Oct 11 21:43:51 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:52 2005
Subject: [Biopython-dev] Martel-0.3 available
Message-ID: <018e01c033ed$e0ef0380$d5ab323f@josiah>

Brad:
> Really? I didn't even see DocumentHandler in 2.0 -- I think that
> ContentHandler is DocumentHandler (at least in 2.0), but I'm not
> positive. Hard to follow all of the changes in that stuff...

*sigh* I figured out the problem. DocumentHandler is the SAX 1.0 interface while ContentHandler is 2.0. My XML book, which is less than a year old, covers the 1.0 interface only. SAX 2 also adds methods for namespace support like startPrefixMapping. (huh?)

So Martel uses the old SAX API and I've got to figure out the new one. Yippee.
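For anyone following the naming confusion, here is a minimal sketch of a SAX 2 style handler as exposed by Python's standard xml.sax package in present-day Python. TextCollector and collect are hypothetical names, the element names are borrowed from the SwissProt builder in this thread, and the XML input is made up for illustration; none of this is Martel's own code. Note that characters() receives the text directly, unlike the characters(ch, start, length) signature used by the saxlib DocumentHandler code elsewhere in this thread.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Collect the text of selected elements (hypothetical helper)."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.results = []      # (element name, text) pairs, in document order
        self._buffer = None    # None while outside a wanted element

    def startElement(self, name, attrs):
        if name in self.wanted:
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name in self.wanted and self._buffer is not None:
            self.results.append((name, "".join(self._buffer)))
            self._buffer = None


def collect(xml_bytes, wanted):
    """Parse xml_bytes and return the text of the wanted elements."""
    handler = TextCollector(wanted)
    xml.sax.parseString(xml_bytes, handler)
    return handler.results
```

A call such as collect(data, ["entry_name", "sequence"]) returns only the pairs for those two elements; everything else in the document is ignored by the handler.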
Andrew

From dalke at acm.org Thu Oct 12 03:55:37 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:52 2005
Subject: [Biopython-dev] Martel timings
Message-ID: <000901c03421$e502b960$85ab323f@josiah>

I'm starting to compare Martel parsing with the existing biopython code. I wrote a Martel document handler called SwissProtBuilder.py (attached) which creates Bio.SwissProt.Record objects.

The output is comparable to the existing code, although this new code is wrong. (There are a couple of minor things I need to fix in the grammar to make the parsing easy.)

The timings are also comparable. The biopython code is about 8% slower than the Martel code. The Martel code takes about 25 minutes to parse sprot38.

Because of the new RecordReader, it only needed about 4MB of memory. I assume the biopython code is at least that good.

One of the reasons for the good performance on the Martel side is that I'm pruning the expression tree to get rid of events which aren't handled by the callback object. That eliminates a lot of function call overhead. I also turned a long if/elif chain in endElement into a dispatch table, which saved more time because of the conversion from an O(N) lookup to O(1).

It turns out there is a bug in the pruning code because the RecordReader doesn't prune its children. It doesn't cause an error but just a slowdown, so I didn't notice it until now. I've included a patch with this email which brings Martel-0.3 up to my internal development version.

Andrew
dalke@acm.org
-------------- next part --------------
"""SwissProtBuilder - create a biopython Bio.SwissProt.Record

This is a first attempt at a Martel interface to create SwissProt
records. It is incomplete because Martel's SWISS-PROT format
definition is a bit lacking, although not enough to affect timings.

I have a test data set which is the first 200024 lines of sprot38.
It takes this code 59.9 seconds to parse the file while the existing
biopython code takes 65.1 seconds, so about 8% faster. There is still
some performance I can eke out of this.

All of sprot38 takes around 25 minutes to parse. The mxTextTools
analysis takes about 10 minutes so the rest is spent in callbacks and
creation code.
"""

import string
from Bio.SwissProt import KeyWList, SProt
from xml.sax import saxlib

# These are elements whose text I want to get
capture_names = ("entry_name", "data_class_table", "molecule_type",
                 "sequence_length", "ac_number", "day", "month", "year",
                 "description", "gene_names", "organism_species",
                 "organelle", "organism_classification",
                 "reference_number", "reference_position",
                 "reference_comment", "bibliographic_database_name",
                 "bibliographic_identifier", "reference_author",
                 "reference_title", "reference_location", "comment_text",
                 "database_identifier", "primary_identifier",
                 "secondary_identifier", "status_identifier", "keyword",
                 "ft_name", "ft_from", "ft_to", "ft_description",
                 "molecular_weight", "crc32", "sequence",
                 )

# These are all of the elements events I'm interested in
select_names = capture_names + \
               ("swissprot38_record", "DT_created", "DT_seq_update",
                "DT_ann_update", "reference", "feature", "ID",
                "reference", "DR", "comment")

class SwissProtBuilder(saxlib.DocumentHandler):
    def __init__(self):
        self.records = []
        self.capture = 0

    def startElement(self, name, attrs):
        # Arranged in order of most used to least
        if name in capture_names:
            self.capture = 1
            self.text = ""
        elif name == "reference":
            self.reference = SProt.Reference()
        elif name == "feature":
            self.ft_desc = ""
        elif name == "comment":
            self.comment = ""
        elif name == "swissprot38_record":
            self.record = SProt.Record()
        elif name == "DT_created":
            self.in_date = "created"
            self.date = []
        elif name == "DT_seq_update":
            self.in_date = "sequence_update"
            self.date = []
        elif name == "DT_ann_update":
            self.in_date = "annotation_update"
            self.date = []

    def characters(self, ch, start, length):
        if self.capture:
            self.text = self.text + ch[start:start+length]

    def endElement(self, name):
        # Doing the dispatch like this instead of a chain of if/elif
        # statements saved me about 15% because the lookup time goes
        # from O(N) to O(1)
        f = getattr(self, "end_" + name, None)
        if f is not None:
            f()
        if self.capture:
            del self.text
            self.capture = 0

    def end_swissprot38_record(self):
        self.record.sequence = string.replace(self.record.sequence, " ", "")
        # Delete for now since I'm just doing timings
        #self.records.append(self.record)
        #print self.record
        del self.record

    def end_entry_name(self):
        self.record.entry_name = self.text

    def end_data_class_table(self):
        self.record.data_class = self.text

    def end_molecule_type(self):
        self.record.molecule_type = self.text

    def end_sequence_length(self):
        # Used in both the ID and the SQ lines
        self.seq_len = int(self.text)

    def end_ID(self):
        self.record.sequence_length = self.seq_len

    def end_ac_number(self):
        self.record.accessions.append(self.text)

    def end_day(self):
        self.date.append(self.text)

    def end_month(self):
        self.date.append(self.text)

    def end_year(self):
        self.date.append(self.text)
        setattr(self.record, self.in_date, "%s-%s-%s" % tuple(self.date))

    def end_description(self):
        if self.record.description == "":
            self.record.description = self.text
        else:
            self.record.description = self.record.description + self.text

    def end_gene_names(self):
        # XXX parser isn't correct
        self.record.gene_name = self.text

    def end_organism_species(self):
        # XXX parser isn't correct
        self.record.organism = self.text

    def end_organelle(self):
        # XXX parser isn't correct
        self.record.organelle = self.text

    def end_organism_classification(self):
        # XXX parser isn't correct
        self.record.organism_classification.extend(\
            string.split(self.text[:-1], "; "))

    def end_reference(self):
        self.record.references.append(self.reference)
        del self.reference

    def end_reference_number(self):
        self.reference.number = int(self.text)

    def end_reference_position(self):
        # XXX Why is this a list?
        self.reference.positions.append(self.text)

    def end_reference_comment(self):
        # XXX needs to be list of (token, text)
        self.reference.comments.append(self.text)

    def end_bibliographic_database_name(self):
        self.bib_db_name = self.text

    def end_bibliographic_identifier(self):
        self.reference.references.append( (self.bib_db_name, self.text) )

    def end_reference_author(self):
        if self.reference.authors:
            self.reference.authors = self.reference.authors + " " + self.text
        else:
            self.reference.authors = self.text

    def end_reference_title(self):
        if self.reference.title:
            self.reference.title = self.reference.title + " " + self.text
        else:
            self.reference.title = self.text

    def end_reference_location(self):
        if self.reference.location:
            self.reference.location = self.reference.location + " " + self.text
        else:
            self.reference.location = self.text

    def end_comment_text(self):
        if self.comment:
            self.comment = self.comment + " " + self.text
        else:
            self.comment = self.text

    def end_comment(self):
        self.record.comments.append(self.comment)

    def end_database_identifier(self):
        self.db_id = self.text

    def end_primary_identifier(self):
        self.ids = [self.text]

    def end_secondary_identifier(self):
        self.ids.append(self.text)

    def end_status_identifier(self):
        self.ids.append(self.text)

    def end_DR(self):
        self.record.cross_references.append( (self.db_id,) + tuple(self.ids))

    def end_keyword(self):
        # XXX parser isn't correct
        kw = string.split(self.text[:-1], "; ")
        self.record.keywords.extend(kw)

    def end_feature(self):
        self.record.features.append(
            (self.ft_name, self.ft_from, self.ft_to, self.ft_desc) )

    def end_ft_name(self):
        self.ft_name = string.rstrip(self.text)

    def end_ft_from(self):
        self.ft_from = string.lstrip(self.text) # Jeff first tries int ...

    def end_ft_to(self):
        self.ft_to = string.lstrip(self.text) # Jeff first tries int ...

    def end_ft_description(self):
        if self.ft_desc:
            self.ft_desc = self.ft_desc + " " + self.text
        else:
            self.ft_desc = self.text

    def end_molecular_weight(self):
        self.mw = int(self.text)

    def end_crc32(self):
        self.record.seqinfo = (self.seq_len, self.mw, self.text)

    def end_sequence(self):
        # Strip out spaces in end_swissprot38_record
        self.record.sequence = self.record.sequence + self.text

def test():
    from Martel.formats import swissprot38
    from xml.sax import saxutils
    import Martel
    import time
    t1 = time.time()
    # Send only the events which the callback will use
    # (saves another 32% of performance, after doing the if/elif speedup)
    format = Martel.select_names(swissprot38.format, select_names)
    parser = format.make_parser()
    dh = SwissProtBuilder()
    parser.setDocumentHandler(dh)
    eh = saxutils.ErrorRaiser()
    parser.setErrorHandler(eh)
    #infile = open("/home/dalke/src/Martel/examples/sample.swissprot")
    #infile = open("/home/dalke/ftps/swissprot/sprot38.dat")
    infile = open("/home/dalke/ftps/swissprot/smaller_sprot38.dat")
    t2 = time.time()
    parser.parseFile(infile)
    t3 = time.time()
    print "startup", t2-t1
    print "eval", t3-t2

if __name__ == "__main__":
    test()
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Martel-0.3.patch
Type: application/octet-stream
Size: 1053 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/biopython-dev/attachments/20001012/a846b25c/Martel-0.3.obj

From chapmanb at arches.uga.edu Thu Oct 12 14:10:36 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar 5 14:42:52 2005
Subject: [Biopython-dev] Changes in WWW BLAST format
Message-ID: <200010121810.OAA84314@archa12.cc.uga.edu>

Hey all;

I was working on the BLAST documentation, and while whipping up an example for WWWBlast, noticed that it seems like the html output has changed (again). It looks like the culprits are a new

tag, and a tag in place of a blank line. The output version is 2.1.1, I believe. The diff at the end of this seems to fix things for me, but of course I wanted to run it by the master o' blast first before committing. If you can't reproduce the parsing problem, please let me know and I'll send a script that demonstrates it. Yours in blasting, Brad *** NCBIWWW.py.orig Sat Aug 12 16:23:24 2000 --- NCBIWWW.py Tue Oct 10 21:15:06 2000 *************** *** 148,153 **** --- 148,156 ---- # Read the RID line, for version 2.0.12 (2.0.11?) and above. attempt_read_and_call(uhandle, consumer.noevent, start='RID') + # 2.1.1 seems to have another

here + attempt_read_and_call(uhandle, consumer.noevent, start='

') + # Read the Query lines and the following blank line. read_and_call(uhandle, consumer.query_info, contains='Query=') read_and_call_until(uhandle, consumer.query_info, blank=1) *************** *** 204,211 **** read_and_call(uhandle, consumer.noevent, blank=1) # Read the descriptions and the following blank line. ! read_and_call_until(uhandle, consumer.description, blank=1) ! read_and_call_while(uhandle, consumer.noevent, blank=1) consumer.end_descriptions() --- 207,214 ---- read_and_call(uhandle, consumer.noevent, blank=1) # Read the descriptions and the following blank line. ! read_and_call_until(uhandle, consumer.description, contains = ') ! read_and_call_while(uhandle, consumer.noevent, contains = '') consumer.end_descriptions() From chapmanb at arches.uga.edu Thu Oct 12 14:27:24 2000 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:42:52 2005 Subject: [Biopython-dev] Proposed addition to Standalone BLAST Message-ID: <200010121827.OAA33200@archa10.cc.uga.edu> Hello again; More blast stuff from me -- can you tell I've had to parse a lot of BLAST reports recently? :-). Anyways, I've been using the standalone BLAST parser to parse some big ol' BLAST runs that I'm doing, and I noticed that occassionally blastall will report an error while running. This a pretty uninformative error, and will generally either say something about being unable to calculate parameters during the BLAST. Well, I investigated further and found out that BLAST quits trying to run a search when it gets to a junk sequence like this: >gi|9854647|gb|BE599574.1|BE599574 PI1_77_C09.g1_A002 Pathogen induced 1 (PI1) Sorghum bicolor cDNA, mRNA sequence TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTTTTTTTTTTTTTTTT TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAAA Right, so this is useless junk sequence and BLAST is right to bomb out on it. The report that BLAST generates on something like this is attached. 
Basically, the problem is that the report is truncated and is missing all of the statistics at the end. This causes the parser to run out of lines without finding the statistics it is looking for and generate a SyntaxError.

What I'd like to propose is that the parser generate a new exception for these kinds of reports, a NCBIStandalone.BlastError exception, indicating a failure in Blast, not in the parser. The reason I want to do this is that I would like to rig the exception up to return the query that failed in this way, so that I can easily send some messages to the owners of these sequences, asking them to kindly remove the sequence from GenBank.

Anyways, attached is a patch (NCBIStandalone.diff) that implements this type of exception-raising behavior for the BlastParser, which allows you to parse like this:

    try:
        b_record = iterator.next()
    except NCBIStandalone.BlastError, info:
        print 'Got a blast error on query', info[1]

Do people think this is a good idea and something that can get into the standalone parser? Comments are very welcome!

Brad
-------------- next part --------------
A non-text attachment was scrubbed...
Name: problem.blast
Type: application/x-unknown
Size: 834 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/biopython-dev/attachments/20001012/60782c48/problem.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: NCBIStandalone.diff Type: application/x-unknown Size: 2024 bytes Desc: not available Url : http://portal.open-bio.org/pipermail/biopython-dev/attachments/20001012/60782c48/NCBIStandalone.bin From chapmanb at arches.uga.edu Thu Oct 12 20:38:05 2000 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:42:52 2005 Subject: [Biopython-dev] Martel-0.3 available In-Reply-To: <002e01c0324f$372a5420$46d5cdcf@josiah> Message-ID: <200010130038.UAA49974@archa10.cc.uga.edu> Andrew wrote: [my worries about different types of line breaks] > There might be a problem with interoperability between difference > OSes. > That could be addressed in one of several ways: > 1) require the input to be converted to the local line ending and > provide no support for doing so > 2) supply some adapters ("FromMac", "FromUnix", "FromDos") but don't > use them; instead leaving the decision up to the client code > 3) provide a tool which autodetects endings and uses the right > adapter > 4) http://members.nbci.com/_XOOM/meowing/python/index.html > 5) define an EOL = Re(r"\n|\r\n?") > > I prefer 2-4, but would like to stick with 1 for now. I don't like 5 > because people will forget to use it. Hmmm, I don't know, I think I like 5 best of all of these options. There is definately the problem of people forgetting, as you mention, but it does have a number of bonuses: 1. Easy to implement, and isn't very likely to break :-). 2. Provided the regexp would recognize Mac line breaks (hmmm, I'm not positive what those look like) then this could deal with files with multiple different types of line breaks without whining. There are times where people have generated files like this in my lab (the sequencer is running Windows, but they like to play around on the files on a Mac -- I still don't know how they got a mix of line breaks -- I think by cutting and pasting between files with different line breaks). 
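Option 5 from Andrew's list, quoted above, can be tried outside Martel. The following is a minimal sketch that uses Python's standard re module as a stand-in for Martel's Re (an assumption; Martel's own matching is not shown here), and split_lines is a hypothetical helper name. The pattern \n|\r\n? matches a Unix "\n", a DOS "\r\n", and a bare "\r" as used by classic Mac OS, so a file with mixed line breaks still splits cleanly.

```python
import re

# Same pattern as option 5, compiled with the standard library instead of
# Martel's Re (stand-in, not Martel code).
EOL = re.compile(r"\n|\r\n?")


def split_lines(text):
    """Split text on Unix, DOS, or classic Mac line endings, even when mixed."""
    lines = EOL.split(text)
    # A trailing line ending leaves one empty string at the end; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    return lines
```

Because "\r\n?" is tried as a unit, a DOS "\r\n" pair is consumed as one line ending rather than two, so no spurious empty lines appear between records.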
Anyways, the point is that the regexp can deal with "worst case" scenarios, whereas the other options can bomb out.

Anyways, that is why I am for 5, especially as a short-term solution over 1.

Brad

From chapmanb at arches.uga.edu Thu Oct 12 20:51:45 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar 5 14:42:52 2005
Subject: [Biopython-dev] Martel timings
In-Reply-To: <000901c03421$e502b960$85ab323f@josiah>
Message-ID: <200010130051.UAA134710@archa13.cc.uga.edu>

Andrew wrote:
> I'm starting to compare Martel parsing with the existing biopython
> code. I wrote a Martel document handler called SwissProtBuilder.py
> (attached) which creates Bio.SwissProt.Record objects.
>
> The biopython code is about 8%
> slower than the Martel code. The Martel code takes about 25 minutes to
> parse sprot38.

Cool! It is great to hear they are both comparable in terms of times. I'm definitely not a speed freak myself, but it is very nice to have a slight speed improvement (and at least not a speed decrease) on switching over to Martel based parser stuff.

> Because of the new RecordReader, it only needed about 4MB of memory.
> I assume the biopython code is at least that good.

Hmmm, one side note about RecordReader. I really like the way you can interface with the parsers in multiple ways in the current Biopython parsing. I think it is really useful to be able to iterate over a record and get the record back, instead of automatically having to parse it (I find this useful for pulling a "bad" record out of a big file of records).

Do you think there is a way to make the RecordReader act similar to the Iterators in this regard? Right now, the fact that it is reading things one record at a time is kind of hidden inside the parse, and I'm not exactly positive how you can make the record reader just return the raw info making up the record that is being parsed.

BTW, I like the StartsWith, EndsWith in the new RecordReader!
When I was doing the FASTA stuff I couldn't figure out any way to recognize new files with only the EndsWith behavior :-). > One of the reasons for the good performance on the Martel side is that > I'm pruning the expression tree to get rid of events which aren't > handled by the callback object. That eliminates a lot of function call > overhead. Very cool idea to reduce the size of the XML generated and returned. Nifty stuff! Brad From dalke at acm.org Fri Oct 13 01:54:46 2000 From: dalke at acm.org (Andrew Dalke) Date: Sat Mar 5 14:42:52 2005 Subject: [Biopython-dev] Martel timings Message-ID: <01fc01c034da$1837d140$85ab323f@josiah> Brad: > Hmmm, one side note about RecordReader. I really like the way you can > interface with the parsers in multiple ways in the current Biopython > parsing. I think it is really useful to be able to iterate over a record > and get the record back, instead of automatically having to parse it I've got experimental code for that at http://www.biopython.org/~dalke/SaxRecords.py It uses a new thread for the callback object and a Queue to send parsed records back to the iterator interface in the originating thread. Currently it loses memory if the thread doesn't go to completion because the new thread is sitting there waiting for the queue to empty. > (I find this useful for pulling a "bad" record out of a big file of > records). That's a bit different topic. Currently all errors are "fatalError"s, which under the SAX spec means the parser must stop. However, SAX also supports "error"s, which are recoverable. (Of course, the error handler can raise an exception, which causes a dead stop in the parser.)
Huh, there's some bugs in the record parser code: elif isinstance(result, saxlib.SAXException): # Wrong format self.err_handler.fatalError(result) return else: # did not reach end of string pos = filepos + result self.err_handler.fatalError(StateTableEOFException(pos)) That last branch should do a "return" to meet the spec, and as I learned yesterday, both need to send an "endDocument" event after the fatalError. And I do need to fix the following to give some sort of error event. record = reader.next() # XXX what if an exception is raised? > Do you think there is a way to make the RecordReader act similar to > the Iterators in this regard? So yes. Convert the "fatalError" events to "error" and do recovery by skipping to the next record. Then have the SaxRecords code, which does the Iterator-like interface, return the right information for problematical records. Umm, what does the Iterator do for bad records? It looks like it raises an exception, but allows you to call next() to get the next record? That's reasonable to me (since I think I can support it :) I'll work on it; unless you want to do it? > BTW, I like the StartsWith, EndsWith in the new RecordReader! When I was > doing the FASTA stuff I couldn't figure out any way to recognize new > files with only the EndsWith behavior :-). Thanks! If you didn't notice, it also plays some tricks to read ahead many lines, which should give better overall performance. The File.UndoHandle isn't as tricky but has better guarantees of where it is in the file and it allows undos, which Martel doesn't need. I bet changing the code to read ahead multiple lines would speed up the existing biopython code. >> pruning the expression tree > reduce the size of the XML generated and returned. Good point - I hadn't even thought about how it affect XML output. I was more concerned about reducing function call overhead. 
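The "raise an exception for a bad record, but let next() carry on" behaviour discussed above can be sketched in plain Python. This is only an illustration and not Martel or Biopython code: RecordError, split_records, RecoveringIterator and parse_id are hypothetical names, and the "//" terminator is assumed from the SWISS-PROT style records mentioned in this thread.

```python
class RecordError(Exception):
    """Hypothetical error that keeps the raw text of the record that failed."""

    def __init__(self, message, raw):
        super().__init__(message)
        self.raw = raw


def split_records(text, terminator="//\n"):
    """Yield raw records, each one ending with the terminator line."""
    lines = []
    for line in text.splitlines(keepends=True):
        lines.append(line)
        if line == terminator:
            yield "".join(lines)
            lines = []
    if lines:
        # Trailing text without a terminator is handed back as a last record.
        yield "".join(lines)


class RecoveringIterator:
    """Parse one record per next() call; a bad record does not end iteration."""

    def __init__(self, text, parse):
        self._records = split_records(text)
        self._parse = parse

    def __iter__(self):
        return self

    def __next__(self):
        # The raw record is consumed before parsing, so after a failure
        # the following next() call simply moves on to the next record.
        raw = next(self._records)
        try:
            return self._parse(raw)
        except ValueError as exc:
            raise RecordError(str(exc), raw) from exc


def parse_id(raw):
    """Toy parser (an assumption for the demo): return the ID line's value."""
    first = raw.splitlines()[0]
    if not first.startswith("ID "):
        raise ValueError("record does not start with an ID line")
    return first[3:].strip()
```

Calling next() by hand and catching RecordError keeps a long parse alive, and the raw attribute gives back the text of the record that failed so it can be inspected later.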
Andrew dalke@acm.org From jchang at SMI.Stanford.EDU Fri Oct 13 02:16:28 2000 From: jchang at SMI.Stanford.EDU (Jeffrey Chang) Date: Sat Mar 5 14:42:52 2005 Subject: [Biopython-dev] Changes in WWW BLAST format In-Reply-To: <200010121810.OAA84314@archa12.cc.uga.edu> Message-ID: Great! Thanks for catching this. Could you send the output? I'd like to add it to the suite of blast tests. > *************** > *** 204,211 **** > read_and_call(uhandle, consumer.noevent, blank=1) > > # Read the descriptions and the following blank line. > ! read_and_call_until(uhandle, consumer.description, blank=1) > ! read_and_call_while(uhandle, consumer.noevent, blank=1) > > consumer.end_descriptions() > > --- 207,214 ---- > read_and_call(uhandle, consumer.noevent, blank=1) > > # Read the descriptions and the following blank line. > ! read_and_call_until(uhandle, consumer.description, contains = > ') > ! read_and_call_while(uhandle, consumer.noevent, contains = > '') > > consumer.end_descriptions() I don't know for sure, but wouldn't this break compatibility with older formats? Jeff From chapmanb at arches.uga.edu Fri Oct 13 11:58:45 2000 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:42:52 2005 Subject: [Biopython-dev] Changes in WWW BLAST format In-Reply-To: Message-ID: <200010131558.LAA73594@archa11.cc.uga.edu> Jeff: > Could you send the output? I'd like to add it to the suite of blast tests. Surely, it's attached. There is also an example in Doc/examples (www_blast.py) which will give it you as well (I found the format change when writing that example). [patch with no respect for back-compatibility] Jeff: > I don't know for sure, but wouldn't this break compatibility with > older formats? *slaps self in forehead* Doh! Respect for old formats. Sorry, I guess I was just too excited that I actually figured out how to fix the blast parser :-). Here's a better patch which supports the new change, and passes all of the tests with the old formats. 
Let me know if there are any better ways to do things. Thanks for catching my mistake. Now I remember why you are the master of blast (not that I really needed any reminding :-). Brad *** NCBIWWW.py.orig Sat Aug 12 16:23:24 2000 --- NCBIWWW.py Fri Oct 13 11:47:12 2000 *************** *** 149,154 **** --- 149,155 ---- attempt_read_and_call(uhandle, consumer.noevent, start='RID') # Read the Query lines and the following blank line. + read_and_call_until(uhandle, consumer.noevent, contains='Query=') read_and_call(uhandle, consumer.query_info, contains='Query=') read_and_call_until(uhandle, consumer.query_info, blank=1) read_and_call_while(uhandle, consumer.noevent, blank=1) *************** *** 203,212 **** start='Sequences producing') read_and_call(uhandle, consumer.noevent, blank=1) ! # Read the descriptions and the following blank line. ! read_and_call_until(uhandle, consumer.description, blank=1) ! read_and_call_while(uhandle, consumer.noevent, blank=1) ! consumer.end_descriptions() def _scan_alignments(self, uhandle, consumer): --- 204,220 ---- start='Sequences producing') read_and_call(uhandle, consumer.noevent, blank=1) ! # Read the descriptions ! read_and_call_while(uhandle, consumer.description, ! blank = 0, contains = ' ! if attempt_read_and_call(uhandle, consumer.noevent, blank = 1): ! read_and_call_while(uhandle, consumer.noevent, blank = 1) ! # otherwise we've got a (introduced in 2.1.1) ! else: ! read_and_call_while(uhandle, consumer.noevent, contains = '') ! consumer.end_descriptions() def _scan_alignments(self, uhandle, consumer): -------------- next part -------------- A non-text attachment was scrubbed... 
Name: m_cold_blast.out Type: application/x-unknown Size: 19423 bytes Desc: not available Url : http://portal.open-bio.org/pipermail/biopython-dev/attachments/20001013/2e58541a/m_cold_blast.bin From katel at worldpath.net Sat Oct 14 04:08:42 2000 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:42:52 2005 Subject: [Biopython-dev] Gui References: <01fc01c034da$1837d140$85ab323f@josiah> Message-ID: <009601c035b5$f8976d00$010a0a0a@cadence.com> I just integrated SeqGui.py with Translate.py and Transcribe.py. To support the testing, I also wrote new unit tests for Transcribe.py in TranscribeTestCase.py. Cayte From katel at worldpath.net Sat Oct 14 21:57:53 2000 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:42:52 2005 Subject: [Biopython-dev] clustalw question Message-ID: <003201c0364b$55dbf720$c3dc85d0@g0fjl> clustal_format.py only allows asterisks and spaces in the last line of an alignment. I just ran an alignment from: http://www2.ebi.ac.uk/clustalw/ The equivalent line contained colons and periods, too. The regexp is match_stars = Martel.Group("match_stars", Martel.Re("[ \*]+") + Martel.Opt(Martel.Str("\n"))) I'll send the output if you like. Cayte -------------- next part -------------- An HTML attachment was scrubbed... URL: http://portal.open-bio.org/pipermail/biopython-dev/attachments/20001014/e6822020/attachment.htm From chapmanb at arches.uga.edu Sun Oct 15 03:35:06 2000 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:42:52 2005 Subject: [Biopython-dev] clustalw question In-Reply-To: <003201c0364b$55dbf720$c3dc85d0@g0fjl> Message-ID: <200010150737.DAA129956@archa13.cc.uga.edu> Cayte wrote: > clustal_format.py only allows asterisks and spaces in the last line > of an alignment. I just ran an alignment from: > > http://www2.ebi.ac.uk/clustalw/ > > The equivalent line contained colons and periods, too. Thanks for trying it out, and thanks for the catch! I'll happily fix it to accept this output. 
> The regexp is > > match_stars = Martel.Group("match_stars", > Martel.Re("[ \*]+") + > Martel.Opt(Martel.Str("\n"))) So, for a quick fix, you can change the second line to: Martel.Re("[ :\*\.]+") > I'll send the output if you like. Please do, and I'll add it to the test suite and fix the parser. I just poked around a bit to see what that line actually means, and stars are identical residues, colons are conserved substitutions and periods are semi-conserved substitutions. Neat! I never saw these since I have been using Clustalw to align nucleic acids and not proteins. Thanks again for catching this! Brad From katel at worldpath.net Sun Oct 15 15:44:19 2000 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:42:52 2005 Subject: [Biopython-dev] removing boiler plate Message-ID: <007c01c036e0$50645220$98dc85d0@g0fjl> In using Martel, how do we strip boiler plate that may vary from site to site? Things like user instructions, legends for graphics, etc. Cayte -------------- next part -------------- An HTML attachment was scrubbed... URL: http://portal.open-bio.org/pipermail/biopython-dev/attachments/20001015/24915ca0/attachment.htm From dalke at acm.org Sun Oct 15 22:52:28 2000 From: dalke at acm.org (Andrew Dalke) Date: Sat Mar 5 14:42:52 2005 Subject: [Biopython-dev] removing boiler plate Message-ID: <000a01c0371c$36ddee60$a4ab323f@josiah> Cayte: > In using Martel, how do we strip boiler plate that may vary from site to site? > Things like user instructions, legends for graphics, etc. That's going to depend on the boiler plate. For example, suppose there's an arbitrary amount of header text which is site specific, followed by the site independent text. Suppose also that the transition occurs with a line containing 5 =s ("====="). You can use Re(".*\n") to grab all of the header lines, but this will also grab the "=====\n" line. Instead, use a negative lookahead assertion to match all lines except the =s line, as in Re("(?!=====).*\n").
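The lookahead can be tried with Python's standard re module directly, without Martel. A small sketch: the sample text and variable names are made up for illustration, and the outer (?:...)* group stands in for repeating the single-line pattern.

```python
import re

# Made-up sample: two site-specific header lines, the "=====" separator,
# then the site-independent part.
text = (
    "Welcome to the Example site\n"
    "Results are provided as is\n"
    "=====\n"
    "record data starts here\n"
)

# Zero or more whole lines, none of which begins with "=====".
header = re.compile(r"(?:(?!=====).*\n)*")

match = header.match(text)
boilerplate = match.group(0)   # the header lines only
rest = text[match.end():]      # begins at the "=====" line

print(repr(boilerplate))
print(repr(rest))
```

Because the lookahead consumes nothing, the separator line is still available to whatever pattern comes next.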
Of course, you'll want to get all of those lines, so header = Rep(Re("(?!=====).*\n")) The re documentation covers both positive and negative lookaheads. Andrew From jchang at SMI.Stanford.EDU Thu Oct 19 19:44:01 2000 From: jchang at SMI.Stanford.EDU (Jeffrey Chang) Date: Sat Mar 5 14:42:52 2005 Subject: [Biopython-dev] Changes in WWW BLAST format In-Reply-To: <200010131558.LAA73594@archa11.cc.uga.edu> Message-ID: Thanks for the update! I've incorporated your suggested changes. Please let me know if this works out. Jeff On 13 Oct 2000, Brad Chapman wrote: > Jeff: > > Could you send the output? I'd like to add it to the suite of blast > tests. > > Surely, it's attached. There is also an example in Doc/examples > (www_blast.py) which will give it you as well (I found the format change > when writing that example). > > [patch with no respect for back-compatibility] > Jeff: > > I don't know for sure, but wouldn't this break compatibility with > > older formats? > > *slaps self in forehead* Doh! Respect for old formats. Sorry, I guess I > was just too excited that I actually figured out how to fix the blast > parser :-). > > Here's a better patch which supports the new change, and passes all of > the tests with the old formats. Let me know if there are any better ways > to do things. Thanks for catching my mistake. Now I remember why you are > the master of blast (not that I really needed any reminding :-). > > Brad > > > *** NCBIWWW.py.orig Sat Aug 12 16:23:24 2000 > --- NCBIWWW.py Fri Oct 13 11:47:12 2000 > *************** > *** 149,154 **** > --- 149,155 ---- > attempt_read_and_call(uhandle, consumer.noevent, start='RID') > > # Read the Query lines and the following blank line. 
> + read_and_call_until(uhandle, consumer.noevent, > contains='Query=') > read_and_call(uhandle, consumer.query_info, contains='Query=') > read_and_call_until(uhandle, consumer.query_info, blank=1) > read_and_call_while(uhandle, consumer.noevent, blank=1) > *************** > *** 203,212 **** > start='Sequences producing') > read_and_call(uhandle, consumer.noevent, blank=1) > > ! # Read the descriptions and the following blank line. > ! read_and_call_until(uhandle, consumer.description, blank=1) > ! read_and_call_while(uhandle, consumer.noevent, blank=1) > ! > consumer.end_descriptions() > > def _scan_alignments(self, uhandle, consumer): > --- 204,220 ---- > start='Sequences producing') > read_and_call(uhandle, consumer.noevent, blank=1) > > ! # Read the descriptions > ! read_and_call_while(uhandle, consumer.description, > ! blank = 0, contains = ' ! > ! # two choices here, either blanks lines or a > ! if attempt_read_and_call(uhandle, consumer.noevent, blank = 1): > ! read_and_call_while(uhandle, consumer.noevent, blank = 1) > ! # otherwise we've got a (introduced in 2.1.1) > ! else: > ! read_and_call_while(uhandle, consumer.noevent, contains = > '') > ! > consumer.end_descriptions() > > def _scan_alignments(self, uhandle, consumer): From jchang at SMI.Stanford.EDU Thu Oct 19 19:57:40 2000 From: jchang at SMI.Stanford.EDU (Jeffrey Chang) Date: Sat Mar 5 14:42:53 2005 Subject: [Biopython-dev] Proposed addition to Standalone BLAST In-Reply-To: <200010121827.OAA33200@archa10.cc.uga.edu> Message-ID: I'm not sure what's going on, but it looks like BLAST may be masking out low-complexity regions and ending up with little or nothing to search with. Unfortunately, there's nothing in the output that clearly tells us what's going on. For example, it'd be nice if there were a message explaining why the parameters are missing. Although something's clearly wrong here, I'm hesitant to try and diagnose the error within the parser. 
I don't know what's a real syntax error and what's a BLAST error. However, perhaps we can push the error detection higher up. Possible solutions might be: 1) developed a Parser that could catch a SyntaxError, do some diagnostics on the Record, and then raise a BlastError 2) make the parameters section optional in the Scanner, and then let the user either check the Record, or adapt the Consumer to check Would either of these be helpful? Or something else? Jeff On 12 Oct 2000, Brad Chapman wrote: > Hello again; > More blast stuff from me -- can you tell I've had to parse a lot > of BLAST reports recently? :-). > > Anyways, I've been using the standalone BLAST parser to parse > some big ol' BLAST runs that I'm doing, and I noticed that occassionally > blastall will report an error while running. This a pretty uninformative > error, and will generally either say something about being unable to > calculate parameters during the BLAST. Well, I investigated further and > found out that BLAST quits trying to run a search when it gets to a junk > sequence like this: > > >gi|9854647|gb|BE599574.1|BE599574 PI1_77_C09.g1_A002 Pathogen induced 1 > (PI1) Sorghum bicolor cDNA, mRNA sequence > TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTTTTTTTTTTTTTTTT > TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT > TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAAA > > Right, so this is useless junk sequence and BLAST is right to bomb out on > it. > > The report that BLAST generates on something like this is attached. > Basically, the problem is that a truncated report missing all of the > statistics at the end. This causes the parser to run out of lines without > finding the statistics it is looking for and generate a SyntaxError. > > What I'd like to propose is that the parser generate a new exception for > these kind of reports, a NCBIStandalone.BlastError exception, indicating > a failure in Blast, not in the parser. 
> > The reason I want to do this is that I would like to rig the exception up > to return the query that failed in this way, so that I can easily send > some messages to the owners of these sequences, asking them to kindly > remove the sequence from GenBank. > > Anyways, attached is a patch (NCBIStandalone.diff) that implements this > type of exception-raising behavior for the BlastParser, which allows you > to parse like this: > > try: > b_record = iterator.next() > except NCBIStandalone.BlastError, info: > print 'Got a blast error on query', info[1] > > Do people think this is a good idea and something that can get into the > standalone parser? Comments are very welcome! > > Brad From chapmanb at arches.uga.edu Sun Oct 29 11:09:28 2000 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:42:53 2005 Subject: [Biopython-dev] Proposed addition to Standalone BLAST In-Reply-To: References: <200010121827.OAA33200@archa10.cc.uga.edu> Message-ID: <14844.19384.149965.283975@taxus.athen1.ga.home.com> [I was having problems with BLAST failing on really low-quality sequences from GenBank] Jeff: > I'm not sure what's going on, but it looks like BLAST may be masking out > low-complexity regions and ending up with little or nothing to search > with. Unfortunately, there's nothing in the output that clearly tells us > what's going on. For example, it'd be nice if there were a message > explaining why the parameters are missing. Agreed. The BLAST report doesn't look like there is really any problem (it just looks like it didn't find any hits). There are error messages in the xterm when you are running it from the command line, but they aren't very helpful either, since they don't have any info about which sequences are failing. > Although something's clearly wrong here, I'm hesitant to try and diagnose > the error within the parser. I don't know what's a real syntax error and > what's a BLAST error. This is a very good point. 
We don't want to clutter the parser trying to deal with BLAST errors. > However, perhaps we can push the error detection higher up. Possible > solutions might be: > 1) developed a Parser that could catch a SyntaxError, do some diagnostics > on the Record, and then raise a BlastError I really like this option, and think this is a good way to go. I have been doing something semi-similar to find the bad records in my big BLAST files, which basically involves: 1. Using the iterator (without a parser) to grab records one at a time from the file. 2. Copying the handle so we can parse it and have an extra copy to work with later. 3. Parse the record I got. If I get a SyntaxError, figure out what is wrong with the record (right now I've just been writing it out to a file). I actually wrote about this in the documentation (section 3.1.7) so that should give you a better idea of what exactly I'm trying to do. What do you think about generalizing this somehow to get the kind of functionality you are talking about? I'm not sure if there is a better way to do it, and I don't know how much overhead is introduced by copying the handle. So I'm very open to suggestions on this... Thanks! Brad From chapmanb at arches.uga.edu Sun Oct 29 11:35:56 2000 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:42:53 2005 Subject: [Biopython-dev] Martel timings Message-ID: <14844.20972.867166.469991@taxus.athen1.ga.home.com> [Martel thread] I wrote: > > Hmmm, one side note about RecordReader. I really like the way you can > > interface with the parsers in multiple ways in the current Biopython > > parsing.
I think it is really useful to be able to iterate over a > record > > and get the record back, instead of automatically having to parse it Andrew: > I've got experimental code for that at > http://www.biopython.org/~dalke/SaxRecords.py > > It uses a new thread for the callback object and a Queue to send > parsed > records back to the iterator interface in the originating thread. > Currently it loses memory if the thread doesn't go to completion because > the new thread is sitting there waiting for the queue to empty. Hmmmmm, I admit I am having lots of problems grokking this -- I think my mind must be really cloudy. I just can't exactly see why using threads is the best way. The way that Biopython parsers work is: 1. Get a handle with the next record in a big file. 2. If a parser is passed, parse the handle and return the results. Otherwise (no parser), return the handle itself. This seems to make more sense (ie. simpler for my simple mind :-), but I'm not sure -- what are your thoughts? [helpful description of errors in Martel] I wrote: > > Do you think there is a way to make the RecordReader act similar to > > the Iterators in this regard? Andrew: > So yes. Convert the "fatalError" events to "error" and do recovery by > skipping to the next record. Then have the SaxRecords code, which > does the > Iterator-like interface, return the right information for > problematical > records. > > Umm, what does the Iterator do for bad records? It looks like it > raises > an exception, but allows you to call next() to get the next record? > That's reasonable to me (since I think I can support it :) Yup, that's the way Iterator works, which would be very nice. It would be a serious pain to have a huge parse completely die near the end because of a single bad record. There is also the issue I was just discussing with Jeff about getting back bad records and trying to find why they are bad (ie.
in BLAST output, but I would imagine it might be helpful in other cases as well -- badly formatted GenBank entries that the parser doesn't like?). > I'll work on it; unless you want to do it? I can try, although I'm not exactly positive about the best way to proceed. This is related (at least in my mind) with the other problem I was discussing with Jeff... [cool new stuff in the RecordReader] > Thanks! If you didn't notice, it also plays some tricks to read ahead > many lines, which should give better overall performance. The > File.UndoHandle isn't as tricky but has better guarantees of where > it is in the file and > it allows undos, which Martel doesn't need. I bet changing the code > to read ahead multiple lines would speed up the existing biopython code. Yeah, this stuff is very cool. My mind is still kind of blown away by both this and Jeff's File.UndoHandle stuff -- it is really nifty that you can do so much cool stuff with the handles! Brad From katel at worldpath.net Sun Oct 29 20:12:47 2000 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:42:53 2005 Subject: [Biopython-dev] Martel Message-ID: <000701c0420e$84e5c540$010a0a0a@cadence.com> I was just writing some unit tests, with my tool, for Martel. It failed the AtEnd test on Windows. I wonder if this is one of those Unix/Dos things? Cayte From dalke at acm.org Sun Oct 29 20:15:12 2000 From: dalke at acm.org (Andrew Dalke) Date: Sat Mar 5 14:42:53 2005 Subject: [Biopython-dev] Martel Message-ID: <011401c04210$84c6b9a0$13ac323f@josiah> Cayte: > I was just writing some unit tests, with my tool, for Martel. It failed >the AtEnd test on Windows. I wonder if this is one of those Unix/Dos >things? Yes, it is. The current code requires "\n" and doesn't allow "\r\n". I still haven't sat down to figure out the details of unix vs. dos end-of-line conventions.
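For reference, the end-of-line pattern proposed earlier in this thread as option 5, \n|\r\n?, can be checked with Python's standard re module. This is a sketch outside Martel; split_lines and the sample string are assumptions made up for the demonstration.

```python
import re

# Same pattern as the proposed EOL definition: Unix "\n", DOS "\r\n",
# or a bare Mac-style "\r".
EOL = re.compile(r"\n|\r\n?")


def split_lines(text):
    """Split text on any of the three line endings, dropping a final empty piece."""
    lines = EOL.split(text)
    if lines and lines[-1] == "":
        lines.pop()
    return lines


# One string mixing all three conventions.
mixed = "ID   X\r\nDE   dos line\rSQ   mac line\n//\n"
print(split_lines(mixed))
```

A "\r\n" pair is consumed as a single ending because the optional "\n" is taken greedily after "\r", so mixed files do not produce spurious blank lines.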
Andrew From dalke at acm.org Mon Oct 30 02:58:13 2000 From: dalke at acm.org (Andrew Dalke) Date: Sat Mar 5 14:42:53 2005 Subject: [Biopython-dev] Martel timings Message-ID: <016d01c04247$418f5e80$13ac323f@josiah> Brad: >Hmmmmm, I admit I am having lots of problems grokking this -- I think >my mind must be really cloudy. I just can't exactly see why using >threads is the best way. It's the solution for a somewhat different problem. Suppose you have an arbitrary SAX interface, where you cannot change the event generation code, and want to turn it into an iterator interface. One way to implement it is to store the events in a list and then, after the callbacks are finished, scan it to produce the records. This has the problem of storing all of the events before processing them, so there can be some memory problems. Another way is to spawn off a new thread and do the processing there. When a record is processed, send it over to the original thread. (I believe this would work even better using Stackless Python.) This is the most general but is (as you noticed) more complex. I said "somewhat different problem" because we have control over the Martel definitions. There's already a specialization (RecordParser) which has better memory usage for record oriented data. By definition, that means it can be used to convert all of a record's callback events into a list of events, as in the first possibility, then scan the list to create records. So what I've done is add a new method to the expression objects called "make_iterator" just like they have a "make_parser" method. The make_iterator takes a string, which is the tag name used at the start&end of the record. The object returned has parse(...), parseString(...) and parseFile(...) methods, just like the parser object returned from "make_parser", except that each also takes a second parameter which is used to make records.
That description is easier to understand as code: iterator = format.make_iterator("swissprot38_record") for record in iterator.parseString(text, make_biopython_record): ... The implementation uses an EventStream protocol. An EventStream has a '.next()' method, which returns a list of events. If there are no events, it returns None. In the standard case, the EventStream converts all of the input into a list of events and returns it. For a record reader, each call of next reads a record and returns its events. The EventStream object is passed to the Iterator class's constructor, which is a forward iterator for reading records (the 'for record in ...' part of the above). When *its* .next() is called, it starts processing the list of available events, calling the EventStream if more events are needed. As it scans the list, it looks for the start and end tags. Everything inside of those tags is passed to the SAX parser object created by the factory object passed in (the 'make_biopython_record'). It also sends startDocument/endDocument events. The Iterator's next() method returns the created SAX parser objects. Again, it's easier to use than describe. This approach, BTW, is vaguely similar to the pulldom of Paul Prescod's. The nice thing about the "make_iterator" API is that it supports both this event stream approach and also allows threads, if there's no way to modify the parser code. >Andrew: >> Umm, what does the Iterator do for bad records? It looks like it >> raises >> an exception, but allows you to call next() to get the next record? >> That's reasonable to me (since I think I can support it :) > >Yup, that's the way Iterator works, which would be very nice. It would >be a serious pain to have a huge parse completely die near the end >because of a single bad record. After reflection, I've come to a different conclusion about how to handle bad records. It's really easy to make a new format which handles swissprot records as well as errors.
format = ParseRecords(swissprot38.format | Rep(Group("bad_record", Re("^((?!//)[^\n]*\n)*//\n"))), EndsWith("//\n")) (I don't have the source code available now, so the syntax is probably a bit off.) Then the SAX parser for records just needs to know how to handle swissprot38_record and bad_record records. I like this because I like strict code, where you have to be explicit to tell it how to ignore errors. Plus, if you want to do some recovery with data extraction, it could switch to a different syntax which might not be as strict (like '(?P..) (?P[^\n])\n') >> I'll work on it; unless you want to do it? > >I can try, although I'm not exactly positive about the best way to >proceed. I've got the iterator code mostly working. I'm doing documentation and adding more regression tests. How about when I finish I send you a version to test out? Don't ask when :( Andrew dalke@acm.org From thomas at cbs.dtu.dk Mon Oct 30 22:20:45 2000 From: thomas at cbs.dtu.dk (thomas@cbs.dtu.dk) Date: Sat Mar 5 14:42:53 2005 Subject: [Biopython-dev] GenBank parser ? Message-ID: <14846.14989.117057.444180@delphinus.cbs.dtu.dk> Hej All, Do we - or someone else - have a genbank parser ? I remember something came up in the news groups, but I cannot find it anymore ... thx -thomas -- Sicheritz Ponten Thomas E. CBS, Department of Biotechnology thomas@biopython.org The Technical University of Denmark CBS: +45 45 252489 Building 208, DK-2800 Lyngby Fax +45 45 931585 http://www.cbs.dtu.dk/thomas De Chelonian Mobile ... The Turtle Moves ... From jchang at SMI.Stanford.EDU Mon Oct 30 14:25:28 2000 From: jchang at SMI.Stanford.EDU (Jeffrey Chang) Date: Sat Mar 5 14:42:53 2005 Subject: [Biopython-dev] GenBank parser ? In-Reply-To: <14846.14989.117057.444180@delphinus.cbs.dtu.dk> Message-ID: No. The current plan is to use this as a test case for Martel. Any takers? :) Jeff On Tue, 31 Oct 2000 thomas@cbs.dtu.dk wrote: > Hej All, > > Do we - or someone else - have a genbank parser ?
I remember something came > up in the news groups, but I cannot find it anymore ... > > thx > -thomas > > > -- > Sicheritz Ponten Thomas E. CBS, Department of Biotechnology > thomas@biopython.org The Technical University of Denmark > CBS: +45 45 252489 Building 208, DK-2800 Lyngby > Fax +45 45 931585 http://www.cbs.dtu.dk/thomas > > De Chelonian Mobile ... The Turtle Moves ... > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev@biopython.org > http://biopython.org/mailman/listinfo/biopython-dev > From chapmanb at arches.uga.edu Mon Oct 30 16:26:10 2000 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:42:53 2005 Subject: [Biopython-dev] GenBank parser ? In-Reply-To: References: <14846.14989.117057.444180@delphinus.cbs.dtu.dk> Message-ID: <14845.59250.69115.532080@taxus.athen1.ga.home.com> Thomas: > > Do we - or someone else - have a genbank parser ? I remember something came > > up in the news groups, but I cannot find it anymore ... Jeff: > No. There are a couple of ways around this that I have found which allow you to still use python to get at GenBank: 1. Use jpython and the biojava libraries for parsing GenBank. I attached a file which shows a basic example of doing this. 2. Use the python BioCorba interface (biopython-corba). I use a bioperl based server and a biopython client, and this works quite well, at least for what I'm doing (parsing out CDS info). Sometime soon I hope to make a new release of biopython-corba with documentation on how to do stuff like this. I just need to revise the docs, and do some more testing to make sure everything in CVS is kosher. If you are interested in trying this way, I would definitely be willing to help (hey, it would be quite exciting to have someone using biopython-corba besides me :-). Jeff: > The current plan is to use this as a test case for Martel. Any > takers?
:)

I think one of our biggest sticking points is that we don't really
have anything in terms of feature classes, which would be really
useful to parse the GenBank files into. It seems pretty tricky to
write classes which can deal with all of the possible complexities of
the GenBank (and EMBL) formats, so it would be nice to think of and
implement some feature classes which do this first. There was an
interesting discussion about some of this on the biocorba list (in the
October archives under the threads 'Biocorba IDL -- Clarifications'
and 'SeqFeatures and the EMBL IDL').

Anyways, I don't have much time at the moment to work on this 100%,
but would be willing to do part o' the coding/hashing things out if
other people are willing to work on it as well. I think once we have a
feature class, the GenBank parser won't be too incredibly horrible to
do from Martel (fingers crossed :-).

Brad

-------------- next part --------------
#!/usr/bin/env jpython
"""Read info from GenBank files.

This uses jpython and biojava (http://www.biojava.org) to read from a
GenBank file. This is basically a jpython translation of
demos/seq/TestGenbank.java"""
# standard python libs
import os

# java stuff
from java.io import *

# biojava
from org.biojava.bio.seq.io import *
from org.biojava.bio import *
from org.biojava.bio.symbol import *
from org.biojava.bio.seq import *

# set up the files
file = os.path.join('test.gb')
gb_file = File(file)
reader = BufferedReader(InputStreamReader(FileInputStream(gb_file)))

# set up biojava stuff to parse the files
alphabet = DNATools.getDNA()
seq_factory = SimpleSequenceFactory()
parser = alphabet.getParser("token")
gb_format = GenbankFormat()

iterator = StreamReader(reader, gb_format, parser, seq_factory)

# print the name and number of features of each sequence in the file
while iterator.hasNext():
    seq = iterator.nextSequence()
    print 'name:', seq.getName()
    print 'num features:', seq.countFeatures()

From thomas at cbs.dtu.dk Tue Oct 31 02:36:14 2000
From: thomas at cbs.dtu.dk (thomas@cbs.dtu.dk)
Date: Sat Mar 5 14:42:53 2005
Subject: [Biopython-dev] genbank parser
Message-ID: <14846.30318.641161.786999@delphinus.cbs.dtu.dk>

ok - now I remember where I have seen a genbank parser ...

Object-oriented parsing of biological databases with Python
Chenna Ramu, Christine Gemuend and Toby J. Gibson
Bioinformatics, Volume 16, Issue 7, Pages 628-638 : July 2000

http://shag.embl-heidelberg.de:8000/Biopy/

testing it right now ...

c ya
-thomas

--
Sicheritz Ponten Thomas E.  CBS, Department of Biotechnology
thomas@biopython.org        The Technical University of Denmark
CBS: +45 45 252489          Building 208, DK-2800 Lyngby
Fax  +45 45 931585          http://www.cbs.dtu.dk/thomas

De Chelonian Mobile ... The Turtle Moves ...

From jchang at SMI.Stanford.EDU Tue Oct 31 21:58:20 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar 5 14:42:53 2005
Subject: [Biopython-dev] Proposed addition to Standalone BLAST
In-Reply-To: <14844.19384.149965.283975@taxus.athen1.ga.home.com>
Message-ID:

> What do you think about generalizing this somehow to get the kind of
> functionality you are talking about?
I'm not sure if there is a better
> way to do it, and I don't know how much overhead is introduced by
> copying the handle. So I'm very open to suggestions on this...

Sure. Having some code that would help to diagnose errors in BLAST
reports would be a very nice feature. Certainly more user friendly
than having SyntaxError this or SyntaxError that. We would have to
build this on top of the current exceptions, though. It's still nice
to have the SyntaxErrors under the hood, as an explanation on why the
parser is complaining in the first place.

How are you copying the handle? If you read the contents of the handle
as a string (ummm, could be iffy parsing PSI-BLAST on RAM-starved
machines, but probably not a problem), and then wrapped a StringHandle
around it, there should be little overhead aside from the string
containing the blast results.

Jeff
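
A minimal sketch of the handle-copying idea described above: read the
report into a string once, then wrap fresh in-memory handles around
that string whenever a second pass is needed. This is only an
illustration under stated assumptions: io.StringIO stands in for the
StringHandle mentioned in the message, and copy_handle and the sample
report text are hypothetical, not part of Biopython.

```python
import io


def copy_handle(handle):
    """Return (text, new_handle) for a file-like object.

    Hypothetical helper, not a Biopython function. The whole report is
    read into memory once; io.StringIO stands in for a StringHandle.
    """
    text = handle.read()
    return text, io.StringIO(text)


# Illustrative stand-in for a BLAST report; not real BLAST output.
original = io.StringIO("BLASTP 2.0.13\nSearching....\n")

text, parse_handle = copy_handle(original)

# A parser would consume parse_handle. If it raised a SyntaxError, the
# saved text could be wrapped in a fresh handle and scanned again to
# work out what went wrong.
first_line = parse_handle.readline()
diagnostic_handle = io.StringIO(text)
```

The only extra cost is the one string holding the report, which matches
the point about overhead made in the message; a very large PSI-BLAST
report would still have to fit in memory.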