From jchang at smi.stanford.edu Wed Jan 2 01:03:32 2002 From: jchang at smi.stanford.edu (Jeffrey Chang) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] Bioformat module In-Reply-To: <001701c191e6$ff3bcfa0$0201a8c0@josiah.dalkescientific.com> References: <001701c191e6$ff3bcfa0$0201a8c0@josiah.dalkescientific.com> Message-ID: <20020101220332.C499@krusty.stanford.edu> On Mon, Dec 31, 2001 at 03:36:24AM -0700, Andrew Dalke wrote: > Code is at http://www.biopython.org/~dalke/Bioformats-0.2.py OK, I've read the README. I'll say it. "Wow! That's cool!" :) This'll really simplify things a lot for people, to have a uniform API for loading and parsing data. Jeff From adalke at mindspring.com Wed Jan 2 08:27:40 2002 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] mixins Message-ID: <003101c19391$413c5ae0$0201a8c0@josiah.dalkescientific.com> Been working on mixins all night. The idea is that only parts of a file are important -- you may just want the sequence, or the cross references, or whatever. If those fields are consistently tagged (been working on that as well) then standard parsers can be used for the different segments. Some of the experimental mixins I have are:

  dbid -- gives the primary/secondary accessions
  description -- gives the main description text
  dbxref -- cross references to other databases
  features -- sequence features
  sequence -- sequence data

A problem with the standard SAX method is that it uses a centralized set of methods, like 'startElement'. Mixins can't each define their own startElement since only one is called. So I made a DispatchHandler which converts calls like startElement('spam', {}) into start_spam('spam', {}). In that way, the different handlers can listen only for their associated event. And for the 'characters' method, I have a stack-based way to start and stop saving characters.
When a mixin is done, it calls a specific function back in the handler; so far these start with 'add_'. But wait, there's more! Jeff pointed out namespace support, which XML supports with a syntax like 'ns:spam'. It's kinda cumbersome using a ':' in a method name, so I've translated that to "ns__spam" when I do the dispatching. This lets people define a new builder with something like:

  class FastaBuilder(dbid, description, sequence, SaveText, DispatchHandler):
      def __init__(self):
          ... call __init__ on the bases
      def start_record(self, tag, attrs):
          self.id = None
          self.description = None
          self.seq = None
      def add_dbid(self, dbid):
          ...
      def add_sequence(self, seq):
          ...
      def end_record(self, tag):
          self.document = FastaRecord(self.id, self.description, self.seq)

Now, writing that list of mixins is cumbersome, so I used new.classobj so you can define:

  FastaBuilderBase = MixinBuilder(dbid, description, sequence)
  class FastaBuilder(FastaBuilderBase):
      ...

Another problem with mixins is that they share the same __dict__. That can lead to hard-to-track-down mixups. So I've written a way for a mixin to acquire methods from another handler, but not share the same __dict__. It looks like this:

  class Handle_sequence(Callback):
      def start_bioformat__sequence(self, tag, attrs):
          self.alphabet = attrs.get("alphabet", "any")
      def end_bioformat__sequence(self, tag):
          seq = Sequence based on the alphabet and the characters
          self.callback(seq)

  # Here's the mixin
  class sequence:
      def __init__(self):
          acquire(self, Handle_sequence(self, self.add_sequence))
      def add_sequence(self):
          pass

The 'acquire' function pulls off all methods starting with 'start_' and 'end_' and sticks them in the mixin's namespace. So it looks like the sequence class implements things but it's really Handle_sequence. And there's no possibility of 'self.alphabet' being overridden by anyone else. (It's actually slightly more complicated than this because the acquisition can put on its own prefix, which helps with code reuse.) Finally!
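[A minimal sketch of the dispatching idea described above, with hypothetical class and tag names -- an illustrative reconstruction, not the actual Bioformats code:]

```python
# Sketch of the dispatch idea: a SAX-style handler that turns
# startElement("spam", attrs) into a call to start_spam(tag, attrs),
# translating "ns:spam" into start_ns__spam, so that mixins can each
# listen only for their own tags.  Names here are illustrative.

class DispatchHandler:
    def startElement(self, tag, attrs):
        method = getattr(self, "start_" + tag.replace(":", "__"), None)
        if method is not None:
            method(tag, attrs)

    def endElement(self, tag):
        method = getattr(self, "end_" + tag.replace(":", "__"), None)
        if method is not None:
            method(tag)

class SequenceListener(DispatchHandler):
    """Hypothetical listener that only cares about bioformat:sequence tags."""
    def __init__(self):
        self.events = []

    def start_bioformat__sequence(self, tag, attrs):
        self.events.append(("start", tag, attrs.get("alphabet", "any")))

    def end_bioformat__sequence(self, tag):
        self.events.append(("end", tag))

handler = SequenceListener()
handler.startElement("bioformat:sequence", {"alphabet": "protein"})
handler.endElement("bioformat:sequence")
handler.startElement("unrelated", {})   # no matching method, silently ignored
```

[The same getattr-based lookup covers both the plain-tag case and the 'ns:spam' to 'ns__spam' translation.]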
Since Python is fully introspective, the DispatchHandler can peer through the class hierarchy to figure out all of the methods which are defined, and map them back to their proper SAX tags. This list of tags can then be used to build a new expression tree with all the other, unused tags filtered out. What that means is, if you want to get more fields, stick in a new mixin, and everything works automatically to get those fields, with only the expected slowdown associated with the extra work to identify and parse those fields. The less data you want, the faster it is. With the bare minimum (what's needed to convert the data into FASTA format), my test set of SWISS-PROT 38 takes (estimated) 10 minutes. With everything needed for the SProt data structure, it's slightly under 30 minutes, which is about what the current code requires. (Times estimated by extrapolation of my smaller test set.) Code is in a state of flux and not really worth others looking at it right now. I'll work on it more tomorrow. Hope to make it available on Friday. Andrew From chapmanb at arches.uga.edu Thu Jan 3 22:22:51 2002 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] Bioformat module In-Reply-To: <001701c191e6$ff3bcfa0$0201a8c0@josiah.dalkescientific.com> References: <001701c191e6$ff3bcfa0$0201a8c0@josiah.dalkescientific.com> Message-ID: <20020103222251.A3135@ci350185-a.athen1.ga.home.com> Hey Andrew and all; [Bioformats] > I think it's at the stage where the code can be added > to Biopython proper. I would like someone else to > take a look at it first, if only to try it out. (It > wouldn't hurt to also say "Wow! That's cool!" :) I'll second the "Wow! That's cool" from Jeff :-). I like the way things are going with the Bioformats module. I just had some time to play with it, and it is very nice. After some small modifications to the GenBank format, I got GenBank minimally working with it.
Snazzy -- conversions to Fasta format:

  >>> from Bioformats import registry
  >>> infile = open("/home/chapmanb/bioppjx/biopython/Tests/GenBank/iro.gb")
  >>> format = registry["sequence"].identify(infile)
  >>> print format.name
  genbank
  >>> from Bioformats import IO
  >>> infile.seek(0)
  >>> writer = IO.io.make_writer(format = "fasta")
  >>> for record in IO.io.readFile(infile):
  ...     print record
  ...     writer.write(record)
  >AL109817.1
  cacaggcccagagccactcctgcctacaggttctgagggctcaggggacctcctgggccctcaggctcttta
  gctgagaataagggccctgagggaactacctgcttctcacatccccgggtctctgaccatctgctgtgtgcc
  [...]

I like it! Attached is the format registration stuff that goes in Bioformats/formats/genbank.py for anyone who is interested in duplicating this. I'm definitely +1 on checking this into CVS. It seems along the same spirit as what Thomas was working on in Bio/SeqIO/generic, but integrates well with Martel. I'm not sure if I really have the full picture of everything yet, but from what I see it looks good! I'm excited about the mixin stuff as well -- it seems like it'll really simplify a lot of repetitive coding for adding new formats. Too bad I already did all the repetitive coding for GenBank :-). At-least-coding-monkeys-get-lots-of-bananas-ly yr's, Brad -- PGP public key available from http://pgp.mit.edu/ From adalke at mindspring.com Fri Jan 4 05:37:51 2002 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] Bioformat module Message-ID: <000601c1950b$dd337340$0201a8c0@josiah.dalkescientific.com> Brad: >I'll second the "Wow! That's cool" from Jeff :-). Thanks! to both of you. And I guess you're running a 2.2 version of Python, since I have some 'yield' statements in there. :) > After some >small modifications to the GenBank format, I got GenBank minimally >working with it. There's going to be a few more changes. I've been working on standard tag names for things like identifiers, cross-references, sequence, and features (with qualifiers).
Seems to work well with SWISS-PROT and EMBL. The idea is to do

  Std.dbid(UntilSep(delimiter = ";"), {"type": "accession"})

and it puts in the correct tags. (BTW, I'm going to change "delimiter" to "sep".)

>Attached is the format registration stuff that
>goes in Bioformats/formats/genbank.py for anyone who is interested
>in duplicating this.

Wasn't attached.

>>>> infile.seek(0)

Shouldn't need that. The identification code should always reseek the file to the beginning after it's finished.

>I'm definitely +1 on checking this into CVS. It seems along the
>same spirit as what Thomas was working on in Bio/SeqIO/generic, but
>integrates well with Martel.

It was. I looked through the mailings to make sure I read his (and others') discussions. It's also (IMNSHO) much better than the Bioperl and BioJava codes because it can handle non-sequence formats, like BLAST results, as well. Should it be under Bio (Bio.Bioformats) or parallel to it? Unlike Martel, I don't see it as being distributed outside of Biopython, so I would think under.
There was also the annoyance of having to __ all object variables in the hopes of not getting conflicts with other classes. So I used a different approach which actually makes things easier to understand, I hope. Like I said, tomorrow evening... Hopefully. Andrew dalke@dalkescientific.com From chapmanb at arches.uga.edu Fri Jan 4 07:03:43 2002 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] Bioformat module In-Reply-To: <000601c1950b$dd337340$0201a8c0@josiah.dalkescientific.com> References: <000601c1950b$dd337340$0201a8c0@josiah.dalkescientific.com> Message-ID: <20020104070342.B3135@ci350185-a.athen1.ga.home.com> Andrew: > Thanks! to both of you. And I guess you're running a > 2.2 version of Python, since I have some 'yield' statements > in there. :) Yup, always on the cutting edge :-). Though, you may want to watch out for 'em -- don't think it'll make people too happy to have code that requires the brand-spanking-new 2.2. Especially since I still see messages with people using 1.5.2 (gack!). > There's going to be a few more changes. Definitely understood. I was just interested in learning what was going on so I thought I would add GenBank to the fray. I should eventually work on a GenBank writer, etc. But not right now :-). > >Attached is the format registration stuff that > >goes in Bioformats/formats/genbank.py for anyone who is interested > >in duplicating this. > > Wasn't attached. Whoops! I can just add this to CVS (if you don't mind), once you check things in. >> >>> infile.seek(0) > > Shouldn't need that. The identification code should always > reseek the file to the beginning after it's finished. Cool. Good to know -- I was just going directly off what was in the README. Don't want to stray too far from the path and get lost! > Should it be under Bio (Bio.Bioformats) or parallel to it? > Unlike Martel, I don't see it as being distributed outside > of Biopython, so I would think under.
And I think the > Biopython code will have hooks to it as well. Okay, so under > it is. Sounds good to me. Idly, do you want Bio.Bioformats or Bio.Formats? Brad -- PGP public key available from http://pgp.mit.edu/ From jchang at smi.stanford.edu Fri Jan 4 11:43:03 2002 From: jchang at smi.stanford.edu (Jeffrey Chang) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] Bioformat module In-Reply-To: <20020104070342.B3135@ci350185-a.athen1.ga.home.com> References: <000601c1950b$dd337340$0201a8c0@josiah.dalkescientific.com> <20020104070342.B3135@ci350185-a.athen1.ga.home.com> Message-ID: <20020104084303.B189@krusty.stanford.edu> On Fri, Jan 04, 2002 at 07:03:43AM -0500, Brad Chapman wrote: > Sounds good to me. Idly, do you want Bio.Bioformats or Bio.Formats? Hmmm... That would be: Bio.Bioformats vs Bio.Formats Bio.Bioformats.Format vs Bio.Formats.Format Bio.Bioformats.formats vs Bio.Formats.formats Actually, I would favor a refactoring that would put end-user modules like IO, Writer, Registry right under Bio. This would be consistent with the idea discussed in the last BOSC of having a wider tree to make it easier for people to find things. Jeff From adalke at mindspring.com Sat Jan 5 10:06:41 2002 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] Bioformat module Message-ID: <001f01c195fa$962ed9e0$0201a8c0@josiah.dalkescientific.com> Brad: >Yup, always on the cutting edge :-). Though, you may want to watch >out for 'em -- don't think it'll make people too happy to have code >that requires the brand-spanking-new-2.2. Especially since I still >see messages with people using 1.5.2 (gack!). I'm not going to be able to look at getting the code to work under older Pythons, at least not for a week or so. There are two places which need to be fixed: - weakref was introduced in 2.1 and is used to prevent cyclical data structures - yield was introduced in 2.2 and is used as part of the iteration return value. 
The first can't be fixed very easily, but I don't think it will leak all that much memory. (Need to investigate.) The second is easy to fix - I just need to make an iterator adapter. But no time for now. :( BTW, I now have 90 minutes to talk at the O'Reilly conference. I'm taking suggestions for what to present. Andrew From adalke at mindspring.com Sat Jan 5 10:24:29 2002 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] Bioformat module Message-ID: <002001c195fd$11a253c0$0201a8c0@josiah.dalkescientific.com> Jeff:

>Hmmm... That would be:
>Bio.Bioformats vs Bio.Formats
>Bio.Bioformats.Format vs Bio.Formats.Format
>Bio.Bioformats.formats vs Bio.Formats.formats
>
>Actually, I would favor a refactoring that would put end-user modules
>like IO, Writer, Registry right under Bio. This would be consistent
>with the idea discussed in the last BOSC of having a wider tree to
>make it easier for people to find things.

Very good point, and I had forgotten about that discussion. Okay, I did a somewhat hybrid solution and put a 'Format' in front of a couple names but otherwise I merged the two trees together. The new modules are:

  Format -- information about a format
  FormatRegistry -- knows how to use the format information
  FormatIO -- knows how to use the format registry
  Std -- defines standard XML tags
  Dispatch -- a set of classes to make it easier to mix and match handlers
  StdHandler -- a set of standard handlers, which use the standard tags to build portions of the data
  ReseekFile -- helps reading from files which don't allow reseeking to the beginning of the file
  _FmtUtils -- internal support modules
  Writer -- base class for the output writers
  formatdefs -- high-level description of the formats
  expressions -- low-level Martel expressions for the formats
  builders -- makes data structures from Martel events
  writers -- turns data structures into output

__init__ contains 'formats', which is an instance of a FormatRegistry.
It reads the 'formatdefs' directory to get the configuration information. SeqRecord contains an 'io' object, which is an instance of FormatIO. As it is right now, the format support is rather weak. There are two formats -- swissprot/38 and an embl variation. There is one output format, FASTA. Here's an example of use:

  >>> filename = "/home/sac/bioperl-live-sac/t/data/roa1.swiss"
  >>> from Bio import SeqRecord
  >>> for record in SeqRecord.io.readFile(open(filename)):
  ...     print record.id
  ...     print record.seq
  ...
  ROA1_HUMAN
  SKSESPKEPEQLRKLFIGGLSFETTDESLRSHFEQWGTLTDCVVMRDPNTKR.....
  >>>

Here's another (the description really is on a single line):

  >>> SeqRecord.io.convert(open(filename))
  >ROA1_HUMAN HETEROGENEOUS NUCLEAR RIBONUCLEOPROTEIN A1 (HELIX- DESTABILIZING PROTEIN) (SINGLE-STRAND BINDING PROTEIN) (HNRNP CORE PROTEIN A1).
  SKSESPKEPEQLRKLFIGGLSFETTDESLRSHFEQWGTLTDCVVMRDPNTKRSRGFGFVTYATVEEVDAAMN
  ARPHKVDGRVVEPKRAVSREDSQRPGAHLTVKKIFVGGIKEDTEEHHLRDYFEQYGKIEVIEIMTDRGSGKK
  RGFAFVTFDDHDSVDKIVIQKYHTVNGHNCEVRKALSKQEMASASSSQRGRSGSGNFGGGRGGGFGGNDNFG
  RGGNFSGRGGFGGSRGGGGYGGSGDGYNGFGNDGGYGGGGPGYSGGSRGYGSGGQGYGNQGSGYGGSGSYDS
  YNNGGGRGFGGGSGSNFGGGGSYNDFGNYNNQSSNFGPMKGGNFGGRSSGPYGGGGQYFAKPRNQGGYGGSS
  SSSSYGSGRRF
  >>>

Quickly speaking, here's what's going on:

1) format detection

The 'formatdefs' contain a description of the different formats. Some formats are really lists of other formats, in a tree. The tree structure looks like this:

  sequence
  |- embl
  |- swissprot
  |  `- swissprot/38
  `- others

The SeqRecord.io contains the default reader format, which is "sequence". The sequence format tries each of its children. Eventually, 'swissprot/38' works, which is returned as the format definition.

2) find the builder

The SeqRecord.io contains a canonical name for the data type. In this case it's "SeqRecord". The file format has its own canonical name, which is "swissprot/38". They also have what I call an abbrev name, which is a name that can be used in the file system.
The format's abbrev name is 'sprot38'. So the initial builder is found in

  Bio/builders/SeqRecord/sprot38.py

However, this doesn't exist. Here's where the hierarchy comes into play. The hierarchy must be such that if Y is a child of X then all the tags which are defined in X must have the same meanings in Y. In that way, the parser to build from X can be used to build from Y. In other words,

  Bio/builders/SeqRecord/swissprot.py
  Bio/builders/SeqRecord/sequence.py

if one exists, should be just as usable as .../sprot38.py. This reduces the O(NxN) problem to an O(N) problem. The Bio.Std module defines standard tag names.

3) the format contains the Martel grammar, so once the builder is found, the file can be parsed.

When a record is parsed, the content handler (the builders) must end up with a ".document" property. This is the object to use for a record. It's also what the DOM objects use. By using this convention I know how to get the 'record' from the builders, to return in the for loop.

4) Output conversion is also done with canonical names. In this case, the SeqRecord also defines a default output format. (If not found, it searches down the hierarchy tree instead of up.) Writers have the following protocol:

  writeHeader() -- usually does nothing
  write(record) -- write a record
  writeFooter() -- usually does nothing

5) The Dispatch classes are designed to help with making new data structures easily. It's too complicated to explain right now.

6) To add a new format definition:
  a) make sure you understand the hierarchy requirement
  b) take a look at the swissprot and embl expressions/, to see how to use the Std module to define tags. (I need to think about 'style' for a while more.)
  c) edit the formatdefs directory to add the new format configuration.

Okay, I can't go any further. I still need to pack for my trip. Be back next week. Enjoy!
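[The builder lookup in step 2 could be sketched roughly as follows; find_builder, the lineage list, and the directory layout are illustrative stand-ins, not the real Bioformats API:]

```python
import os

# Sketch of the step-2 builder lookup: try the format's own abbrev name
# first (e.g. builders/SeqRecord/sprot38.py), then walk up the format
# hierarchy (sprot38 -> swissprot -> sequence).  This works because a
# parent format's tags must mean the same thing in every child format,
# so a parent's builder can parse a child's output.  Hypothetical code.

def find_builder(data_type, format_lineage, base_dir="Bio/builders"):
    """Return the first existing builder module path, most specific first.

    format_lineage is ordered child-to-parent, e.g.
    ["sprot38", "swissprot", "sequence"].
    """
    for abbrev in format_lineage:
        path = os.path.join(base_dir, data_type, abbrev + ".py")
        if os.path.exists(path):
            return path
    return None
```

[A linear walk up one hierarchy per format is what turns the N-formats-times-N-data-types problem into an O(N) one.]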
Andrew dalke@dalkescientific.com From Y.Benita at pharm.uu.nl Mon Jan 7 07:01:31 2002 From: Y.Benita at pharm.uu.nl (Yair Benita) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] Errors in compilation for Mac Message-ID: Hi guys, I am trying to make the new release for the Mac. I get a compilation error in cfastpairwisemodule.c: Error : cannot convert 'char *' to 'unsigned char *' cfastpairwisemodule.c line 113 direction_matrix = (char *)NULL; Project: cfastpairwise.mcp, Target: cfastpairwise, Source File: cfastpairwisemodule.c This error is repeated several times. Any suggestions how to fix it? Thanks, Yair -- Yair Benita Pharmaceutical Proteomics Utrecht University From idoerg at cc.huji.ac.il Mon Jan 7 09:05:36 2002 From: idoerg at cc.huji.ac.il (Iddo Friedberg) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] Errors in compilation for Mac In-Reply-To: Message-ID: Hi Yair, For want of a better answer, could it be that the parameters with which you are running your C compiler are a bit too strict, typewise? Seems like the compiler refuses to cast a char to an unsigned char, which on most compilers can be mitigated to a 'warning' level error, which does not halt compilation. Try looking at your compilation parameters. Don't-know-mac-but-had-my-bout-with-C-comp'ly yours, Iddo On Mon, 7 Jan 2002, Yair Benita wrote: : Hi guys, : I am trying to make the new release for the Mac. : I get a compilation error in cfastpairwisemodule.c: : : Error : cannot convert : 'char *' to : 'unsigned char *' : cfastpairwisemodule.c line 113 direction_matrix = (char *)NULL; : Project: cfastpairwise.mcp, Target: cfastpairwise, Source File: : cfastpairwisemodule.c : : This error is repeated several times. : Any suggestions how to fix it? 
: Thanks, : Yair : -- : Yair Benita : Pharmaceutical Proteomics : Utrecht University : : _______________________________________________ : Biopython-dev mailing list : Biopython-dev@biopython.org : http://biopython.org/mailman/listinfo/biopython-dev : -- Iddo Friedberg | Tel: +972-2-6757374 Dept. of Molecular Genetics and Biotechnology | Fax: +972-2-6757308 The Hebrew University - Hadassah Medical School | email: idoerg@cc.huji.ac.il POB 12272, Jerusalem 91120 | Israel | http://bioinfo.md.huji.ac.il/marg/people-home/iddo/ From Y.Benita at pharm.uu.nl Mon Jan 7 12:00:42 2002 From: Y.Benita at pharm.uu.nl (Yair Benita) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] What is reportlab.lib Message-ID: Hi guys, Still working on the Mac version of the new biopython. What is the module reportlab.lib? It wasn't in the biopython files and it's not in my python library. It is requested from test_GraphicsChromosome.py. Besides that, all tests pass. Iddo, you were right about the compiler, thanks for the tip. Yair -- Yair Benita Pharmaceutical Proteomics Utrecht University From chapmanb at arches.uga.edu Mon Jan 7 15:48:32 2002 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] What is reportlab.lib In-Reply-To: Message-ID: Hi Yair! > Still working on the Mac version of the new biopython. Great to hear you are working on this! > What is the module reportlab.lib? This is the reportlab pdf generation library, which is just needed for some Graphics stuff. You can get it from: http://reportlab.com/download.html I'm pretty sure the module is all python, so I think all you should need to do is download it, unzip it, and stick it in site-packages. Hopefully everything will work okay on Macs... > Besides that, all tests pass. Great to hear!
So we must be doing pretty well at cross-platformness :-) Brad From jchang at smi.stanford.edu Mon Jan 7 17:40:46 2002 From: jchang at smi.stanford.edu (Jeffrey Chang) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] Errors in compilation for Mac In-Reply-To: References: Message-ID: <20020107144011.A1590@krusty.stanford.edu> Thanks for reporting this. Iddo is correct, that this is probably harmless and nothing to worry about. In fact, I don't think gcc has a warning to catch this case. Nevertheless, I went through the file and changed the code to be more careful with typecasts. Thanks, Jeff On Mon, Jan 07, 2002 at 01:01:31PM +0100, Yair Benita wrote: > Hi guys, > I am trying to make the new release for the Mac. > I get a compilation error in cfastpairwisemodule.c: > > Error : cannot convert > 'char *' to > 'unsigned char *' > cfastpairwisemodule.c line 113 direction_matrix = (char *)NULL; > Project: cfastpairwise.mcp, Target: cfastpairwise, Source File: > cfastpairwisemodule.c > > This error is repeated several times. > Any suggestions how to fix it? > Thanks, > Yair > -- > Yair Benita > Pharmaceutical Proteomics > Utrecht University > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev@biopython.org > http://biopython.org/mailman/listinfo/biopython-dev From Y.Benita at pharm.uu.nl Tue Jan 8 05:12:29 2002 From: Y.Benita at pharm.uu.nl (Yair Benita) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] MacBiopython available soon Message-ID: Hi guys, I have almost everything ready. Still some problems with the reportlab module. I don't have more time to spend before the weekend. If you want to post MacBiopython already, everything else is ready. Let me know. 
Yair -- Yair Benita Pharmaceutical Proteomics Utrecht University From jchang at smi.stanford.edu Tue Jan 8 12:21:44 2002 From: jchang at smi.stanford.edu (Jeffrey Chang) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] MacBiopython available soon In-Reply-To: References: Message-ID: <20020108092144.A778@krusty.stanford.edu> Does everything install, except for reportlab? If so, that's probably ok. Correct me if I'm wrong Brad, but I don't think very many people on the mac will be needing this functionality? Jeff On Tue, Jan 08, 2002 at 11:12:29AM +0100, Yair Benita wrote: > Hi guys, > I have almost everything ready. Still some problems with the reportlab > module. > I don't have more time to spend before the weekend. If you want to post > MacBiopython already, everything else is ready. Let me know. > Yair > -- > Yair Benita > Pharmaceutical Proteomics > Utrecht University > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev@biopython.org > http://biopython.org/mailman/listinfo/biopython-dev From chapmanb at arches.uga.edu Tue Jan 8 12:30:57 2002 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] MacBiopython available soon In-Reply-To: <20020108092144.A778@krusty.stanford.edu> References: <20020108092144.A778@krusty.stanford.edu> Message-ID: <20020108123057.A20508@ci350185-a.athen1.ga.home.com> Yair: > > I have almost everything ready. Still some problems with the reportlab > > module. > > I don't have more time to spend before the weekend. If you want to post > > MacBiopython already, everything else is ready. Let me know. That would be great -- I think you should just go ahead and post what you've got -- I can take a look at the reportlab/graphics problems and see if I can figure them out. Jeff: > Does everything install, except for reportlab? If so, that's probably > ok. 
Correct me if I'm wrong Brad, but I don't think very many people > on the mac will be needing this functionality? Yeah, I definitely wouldn't consider non-working reportlab stuff a major problem. I can see if I can figure anything out and iron out any glitches before the next release so that we won't have problems with it. No one will need reportlab unless they want graphics, which is a very minor component. pdf-generation-is-only-necessary-around-poster-making-time-ly yr's, Brad From Y.Benita at pharm.uu.nl Tue Jan 8 12:38:25 2002 From: Y.Benita at pharm.uu.nl (Yair Benita) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] MacBiopython download Message-ID: OK then, everything works except: 1. Reportlab module - I will not fix it unless someone asks for it. 2. Local blast. It is compiled in the most lame way for the Mac. I have to go and fix stuff in the blast code and then recompile. It's too much work for now, but it's on my list. MacBiopython 1.00a4 can be downloaded from: http://homepage.mac.com/ybenita Yair -- Yair Benita Pharmaceutical Proteomics Utrecht University From chapmanb at arches.uga.edu Tue Jan 8 20:32:32 2002 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] MacBiopython download In-Reply-To: References: Message-ID: <20020108203232.A12021@ci350185-a.athen1.ga.home.com> Hi Yair; > 1. Reportlab module - I will not fix it unless someone asks for it. No problem. I think that sounds like a good plan. If I have time I'll try to take a look at it, since I introduced that dependency. > 2. Local blast. It is compiled in the most lame way for the Mac. I have to > go and fix stuff in the blast code and then recompile. It's too much work for > now, but it's on my list. Sounds like a good plan. > MacBiopython 1.00a4 can be downloaded from: > http://homepage.mac.com/ybenita Thanks much for getting it together! I've included it on the biopython download page.
By the way, this release is compiled for python 2.1? Just want to make sure I've got the right information. Thanks again. Brad -- PGP public key available from http://pgp.mit.edu/ From Y.Benita at pharm.uu.nl Wed Jan 9 04:54:51 2002 From: Y.Benita at pharm.uu.nl (Yair Benita) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] MacBiopython download In-Reply-To: <20020108203232.A12021@ci350185-a.athen1.ga.home.com> Message-ID: on 9/1/2002 2:32, Brad Chapman at chapmanb@arches.uga.edu wrote: > Thanks much for getting it together! I've included it on the > biopython download page. By the way, this release is compiled for > python 2.1? Just want to make sure I've got the right information. The new biopython release was compiled for MacPython 2.2, not for 2.1; that's one of the reasons reportlab does not work. I get an error that it was compiled for 2.1, so I have to recompile all their extensions for 2.2. Yair -- Yair Benita Pharmaceutical Proteomics Utrecht University From thomas at cbs.dtu.dk Fri Jan 11 16:09:08 2002 From: thomas at cbs.dtu.dk (Thomas Sicheritz-Ponten) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] WU-Blast parser ? Message-ID: Hej, Does someone have a working WU-blast parser or are there any plans to make an NCBIstandalone-compatible parser for WU-Blast? cheers -thomas -- Sicheritz-Ponten Thomas, Ph.D CBS, Department of Biotechnology thomas@biopython.org The Technical University of Denmark CBS: +45 45 252489 Building 208, DK-2800 Lyngby Fax +45 45 931585 http://www.cbs.dtu.dk/thomas De Chelonian Mobile ... The Turtle Moves ... From adalke at mindspring.com Mon Jan 14 21:08:37 2002 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] WU-Blast parser ? Message-ID: <043d01c19d69$8bd7ff00$0201a8c0@josiah.dalkescientific.com> Hej Thomas, >Does someone have a working WU-blast parser or are there any plans to make an >NCBIstandalone-compatible parser for WU-Blast?
I started working on one last week for Martel. If you send me some sample data files it would help. I've dug up a few, but I don't know how much that format changes over time. I'm also trying to get the Martel tag names to be similar to the Bioperl ones, with the thought of normalizing both the different homology search results across Biopython as well as between Biopython and Bioperl. Andrew From adalke at mindspring.com Tue Jan 15 09:36:16 2002 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] changes Message-ID: <000801c19dd2$01687340$0201a8c0@josiah.dalkescientific.com> I've committed some changes in CVS:

- added 'fasta' reader
- modified the 'genbank' reader to take the new-style Std tags (also fixed the "" bug in feature qualifier values) (the parsing is about 1/2 the performance of SWISS-PROT; haven't figured out why)
- added a 'block' parser, but no builder yet
- added the beginnings of a 'blast' parser
- added a DBXRef module for database cross references
- a couple additions to Martel:
  - SkipLinesUntil / SkipLinesTo ... equivalent to while not pattern.match(line): line = infile.next()
  - can now iterate HeaderFooter records
- the SeqRecord now stores features and database cross references
- the SeqRecord stores a Seq instead of a string
- genbank and swissprot records set the correct alphabet type
- the 'parse' and 'identify' commands now take XML 'source' objects, which can be a URL or a file handle.

Huh. Guess I was busy this evening. Still need to run full regressions against the new formats. BTW, the FASTA reader tries to parse the NCBI fields and generate appropriate dbxref fields for the SeqRecord.
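[A plain-Python equivalent of the SkipLinesUntil idea quoted in the change list -- skip_lines_until is a hypothetical name, not the Martel API, and the sample lines are made-up illustrations:]

```python
import re

# Plain-Python equivalent of the SkipLinesUntil idea: consume lines from
# an iterator until one matches the pattern, returning the matching line.
# Illustrative sketch only; Martel expressed this as a grammar element.

def skip_lines_until(lines, pattern):
    regexp = re.compile(pattern)
    for line in lines:
        if regexp.match(line):
            return line
    return None  # pattern never matched

# usage on made-up SWISS-PROT-like lines
lines = iter(["junk\n", "more junk\n", "ID   ROA1_HUMAN\n", "AC   X00000\n"])
first = skip_lines_until(lines, r"ID   ")
```

[After the call, the iterator is positioned just past the matching line, which is what a record parser wants.]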
Andrew dalke@dalkescientific.com From thomas at cbs.dtu.dk Sun Jan 13 11:09:49 2002 From: thomas at cbs.dtu.dk (Thomas Sicheritz-Ponten) Date: Sat Mar 5 14:43:09 2005 Subject: ARGHHH [Biopython-dev] changes In-Reply-To: "Andrew Dalke"'s message of "Tue, 15 Jan 2002 07:36:16 -0700" References: <000801c19dd2$01687340$0201a8c0@josiah.dalkescientific.com> Message-ID: ... That's bad .... A CVS update breaks _all_ existing code which uses the Fasta reader! (because of the 2.2-specific things .. weakref, generators etc.) _URGENT_: Does anybody know how to undo a CVS update? at-hyperspeed-reading-the-CVS-manual-for-the-first-time'ly yr's -thomas -- Sicheritz-Ponten Thomas, Ph.D CBS, Department of Biotechnology thomas@biopython.org The Technical University of Denmark CBS: +45 45 252489 Building 208, DK-2800 Lyngby Fax +45 45 931585 http://www.cbs.dtu.dk/thomas De Chelonian Mobile ... The Turtle Moves ... From chapmanb at arches.uga.edu Wed Jan 16 11:13:36 2002 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:43:09 2005 Subject: ARGHHH [Biopython-dev] changes In-Reply-To: References: <000801c19dd2$01687340$0201a8c0@josiah.dalkescientific.com> Message-ID: <20020116111336.A56180@ci350185-a.athen1.ga.home.com> > A CVS update breaks _all_ existing code which uses the Fasta reader! > (because of the 2.2-specific things .. weakref, generators etc.) Hmmm, I thought I fixed this... Looks like Andrew backed out my change. Not sure why. Maybe he fixed the 2.2-specific stuff, but I guess not. Don't know. Anyways, if you want to fix things now, the only intrusive part is in Bio/SeqRecord. Just comment out the FormatIO-related stuff at the top of this file, and you should be able to use everything except the new stuff just fine.
No-CVS-mojo-needed-to-comment-out-bad-parts-ly yr's, Brad From adalke at mindspring.com Wed Jan 16 11:29:53 2002 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:09 2005 Subject: ARGHHH [Biopython-dev] changes Message-ID: <008a01c19eab$0793af80$0201a8c0@josiah.dalkescientific.com> >A CVS update breaks _all_ existing code which uses the Fasta reader! >(because of the 2.2-specific things .. weakref, generators etc.) weakref was added in 2.1. I got rid of the 'yield' statements in my code .... but forgot to get rid of the __future__ statement. I've committed changes to make 'Bio.Fasta' importable under 2.0: [dalke@pw600a ~]$ date Wed Jan 16 11:13:47 EST 2002 [dalke@pw600a ~]$ python2.0 Python 2.0 (#4, Dec 8 2000, 21:23:00) [GCC egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)] on linux2 Type "copyright", "credits" or "license" for more information. >>> from Bio import Fasta >>> >_URGENT_: Does anybody know how to undo a CVS update? You can check out based on a date. Or you can update to the newest code in CVS. Sorry about the problems. I tested a few things with 2.1 on my machine but not all the modules, and I didn't try under 2.0 or older at all until just now. Andrew
Jeff From adalke at mindspring.com Wed Jan 16 11:56:44 2002 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:09 2005 Subject: ARGHHH [Biopython-dev] changes Message-ID: <009601c19eae$c7ca3320$0201a8c0@josiah.dalkescientific.com> >How much easier do the >2.1/2.2 features make your lives? Where would you use them, and how >much backwards compatibility would it break? The features I use in 2.1 are: warnings - to generate warnings weakref - to build complex data structures that are appropriately garbage collected The features I would like to use from 2.2 are: yield - can in some cases be done with adapters, but there are a few places where it's nice iterators - have been using adapters, but it's nice I would also like the new __slots__ mechanism in 2.2, but I've been using __getattr__ tricks long enough that it comes naturally to me, but not to others. Andrew From chapmanb at arches.uga.edu Wed Jan 16 12:20:56 2002 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:43:09 2005 Subject: ARGHHH [Biopython-dev] changes In-Reply-To: <20020116084133.A483@krusty.stanford.edu> References: <008a01c19eab$0793af80$0201a8c0@josiah.dalkescientific.com> <20020116084133.A483@krusty.stanford.edu> Message-ID: <20020116122056.A56273@ci350185-a.athen1.ga.home.com> Jeff asks: > Right now, Biopython requires a minimum of Python 2.0. Is it time to > up the specs again? What do people think? How much easier do the > 2.1/2.2 features make your lives? Where would you use them, and how > much backwards compatibility would it break? I'm all for moving forward with the minimum requirement if people think it helps them. The number one thing I like about 2.1 is the warning framework. 2.2 has the nice new Iterators and Generators which I would be keen on learning to use. It's probably too early to require 2.2, but starting to require 2.1 would help move us towards requiring 2.2 sooner :-).
Newer-is-always-better-than-older-except-when-it-comes-to-bourbon-ly yr's, Brad From adalke at mindspring.com Wed Jan 16 12:39:53 2002 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:09 2005 Subject: ARGHHH [Biopython-dev] changes Message-ID: <00d401c19eb4$ce8af680$0201a8c0@josiah.dalkescientific.com> Brad: >It's probably too early to require 2.2, but >starting to require 2.1 would help move us towards requiring 2.2 sooner >:-). I agree. If my clients are any judge, they waited until 2.1 to install a 2.x Python, and they don't want to upgrade just yet. Andrew From jchang at smi.stanford.edu Wed Jan 16 12:51:18 2002 From: jchang at smi.stanford.edu (Jeffrey Chang) Date: Sat Mar 5 14:43:09 2005 Subject: ARGHHH [Biopython-dev] changes In-Reply-To: <00d401c19eb4$ce8af680$0201a8c0@josiah.dalkescientific.com> References: <00d401c19eb4$ce8af680$0201a8c0@josiah.dalkescientific.com> Message-ID: <20020116095118.A484@krusty.stanford.edu> What about you, Thomas? What version are you using, and what do you think about nudging the requirements up to 2.1? I'm currently using an alpha of 2.2 and really like the generators. I'm using them in some other code, but haven't required them in Biopython yet. Jeff On Wed, Jan 16, 2002 at 10:39:53AM -0700, Andrew Dalke wrote: > Brad: > >It's probably too early to require 2.2, but > >starting to require 2.1 would help move us towards requiring 2.2 sooner > >:-). > > I agree. If my clients are any judge, they waited until 2.1 > to install a 2.x Python, and they don't want to upgrade just yet.
> > Andrew > > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev@biopython.org > http://biopython.org/mailman/listinfo/biopython-dev From thomas at cbs.dtu.dk Mon Jan 14 03:15:06 2002 From: thomas at cbs.dtu.dk (Thomas Sicheritz-Ponten) Date: Sat Mar 5 14:43:09 2005 Subject: ARGHHH [Biopython-dev] changes In-Reply-To: Jeffrey Chang's message of "Wed, 16 Jan 2002 09:51:18 -0800" References: <00d401c19eb4$ce8af680$0201a8c0@josiah.dalkescientific.com> <20020116095118.A484@krusty.stanford.edu> Message-ID: Jeffrey Chang writes: > What about you, Thomas? What version are you using, and what do you > think about nudging the requirements up to 2.1? > > I'm currently using an alpha of 2.2 and really like the generators. > I'm using them in some other code, but haven't required them in > Biopython yet. I personally always like to play around with the latest toys, but for the last half year I've become a little lazy. Mostly because any Python update has to be done at the same time on all of my machines/accounts which run PyPhy (currently Denmark, Sweden, England, USA and Canada ... ) Right now I'm running 2.0, but I am willing to upgrade .. I don't think we should require 2.1; either we stay at 2.0 or climb the whole way to 2.2. That would cause the least inconvenience for users (and administrators) (IMHO) So what are the coolest features in 2.1 and 2.2? cheers -thomas -- Sicheritz-Ponten Thomas, Ph.D CBS, Department of Biotechnology thomas@biopython.org The Technical University of Denmark CBS: +45 45 252489 Building 208, DK-2800 Lyngby Fax +45 45 931585 http://www.cbs.dtu.dk/thomas De Chelonian Mobile ... The Turtle Moves ...
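[Editor's note: to make Thomas's question concrete, here is a minimal sketch of what the 2.2-style generator feature discussed in this thread buys for something like a Fasta reader. The function name and shape are hypothetical illustrations, not the Bio.Fasta API.]

```python
def parse_fasta(handle):
    """Yield (title, sequence) pairs from a FASTA file-like object.

    A 'yield'-based reader: records stream out one at a time, with no
    callback or adapter machinery. Hypothetical sketch, not Bio.Fasta.
    """
    title, seq = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(seq)   # emit the finished record
            title, seq = line[1:], []       # start the next record
        elif line:
            seq.append(line)
    if title is not None:
        yield title, "".join(seq)           # emit the final record
```

The caller just writes `for title, seq in parse_fasta(open("test.fasta")):` -- this is the kind of simplification that, pre-yield, required the adapter classes Andrew mentions.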
From katel at worldpath.net Thu Jan 17 18:13:27 2002 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] Effective ways to use Martel Message-ID: <001801c19fac$934d6a00$010a0a0a@cadence.com> Background: Andrew took the time to look into a pesky bug in my saf_format (not yet committed). The consumer occasionally received tags that were sliced at the wrong point. Think of an enzyme that attaches to the wrong base by a few bases. Andrew pointed out a bug where I assumed a name was always 14 characters where it should be a maximum of 14 characters. However, this did not explain the bug. Further investigation showed the problem to be a limitation of EventGenerator. My format used embedded tags and EventGenerator does not support them. Andrew recommended his new Dispatch module as an alternative. Should this be documented? Andrew says in the future Dispatch will be the preferred tool. But without documentation, what keeps users from using the old technique and running into the same issue? Cayte From adalke at mindspring.com Fri Jan 18 07:36:31 2002 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] my latest trick Message-ID: <002b01c1a01c$c2cc9380$0201a8c0@josiah.dalkescientific.com> The FormatIO system now supports BLAST and WU-BLAST, although it isn't yet finished (missing a few details, like locations and the sequence name). Here's what it looks like. First, autodetection of the file format: >>> import Bio >>> format = Bio.formats["search"].identify("blastp.2.wu") >>> format.name 'wu-blastp' >>> It isn't figuring things out from the extension - it's testing the contents of the file. Here's proof. >>> import os >>> os.symlink("blastp.2.wu", "unknown.dat") >>> Bio.formats["search"].identify("unknown.dat").name 'wu-blastp' >>> And I can load this into memory as a 'Search' object.
>>> from Bio import Search >>> search = Search.io.read("unknown.dat").next() # The 'next' is because the FormatIO system assumes multiple # records. I have an idea of how to fix that. >>> search.query.description 'YEL060C vacuolar protease B' >>> search.algorithm.name 'BLASTP' >>> len(search.hits) 12 >>> >>> for hit in search.hits: ... print hit.name, hit.description ... print len(hit.hsps), "hsps" ... for hsp in hit.hsps: ... print hsp.query.seq ... print hsp.homology.seq ... print hsp.subject.seq ... (Lots printed out) Many of the fields are stored, like: >>> for k, v in search.statistics.items(): ... print k, "=", repr(v) ... total_time = ' 0.95u 0.08s 1.03t Elapsed: 00:00:01' num_threads = 1 posted_date = ' 3:27 PM PST Mar 9, 1998' start_time = 'Mon Mar 9 15:57:59 1998' num_dfa_states = ' 569 (112 KB)' total_dfa_size = ' 358 KB (384 KB)' database = ' /tmpblast/PDBUNIQ' release_date = ' unknown' format = ' BLAST' num_sequences_in_database = 2335 num_sequences_satisying_E = 12 end_time = 'Mon Mar 9 15:58:00 1998' neighborhood_generation_time = ' 0.01u 0.00s 0.01t Elapsed: 00:00:00' num_letters_in_database = 479690 search_cpu_time = ' 0.90u 0.05s 0.95t Elapsed: 00:00:01' database_title = ' PDBUNIQ' >>> >>> hsp.query.positives 56 >>> hsp.query.frac_positives 0.45528455284552843 >>> Normalization is still a problem, as you can see from the untrimmed strings. And I don't quite get everything; it's pretty tedious. (There are a few fields in the hit I also don't handle, like what do I do with 'P(2)' compared to 'P'? Someone with a better technical understanding of the details of the algorithms needs to help me. Perhaps in Tucson.) The biggest thing missing is failure cases. The data files I found were all for successful runs. The format expressions get hairy. The biggest problems occur when formatY is almost like formatX except for a small change in the bottom of the expression tree. Then the whole tree needs to be reconstructed, which is noisy.
I'm thinking about possibilities like blastn = Martel.replace_group(blastp, "hsp_info", blastn_hsp_info) All these tree editing methods are ad hoc. I keep wondering what it would be like to convert the tree to a DOM and then use the DOM methods to manipulate the structure. That'll have to wait for Martel version 2. Andrew dalke@dalkescientific.com From Y.Benita at pharm.uu.nl Tue Jan 22 04:05:34 2002 From: Y.Benita at pharm.uu.nl (Yair Benita) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] Bug in urllib of macpython Message-ID: Due to a bug in urllib of MacPython, most of the WWW modules are not functioning well. The bug is known and will be taken care of by the python developers. However, I had to use another function to replace the command urllib.urlopen(). The following files were changed: NCBI.py ExPAsy.py InterPro.py SCOP.py Besides these files, is there any other place where the urllib.urlopen command is used? Yair -- Yair Benita Pharmaceutical Proteomics Utrecht University From adalke at mindspring.com Tue Jan 22 20:32:34 2002 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] latest checkin Message-ID: <00e201c1a3ad$d564b060$0201a8c0@josiah.dalkescientific.com> Okay, I've updated CVS to my latest working version of Martel and the Bioformat code. Most of this commit was related to performance. I can go into details, but it isn't worth it since it isn't there any more. I can currently convert sprot38.dat from SWISS-PROT to a full SeqRecord in just under 20 minutes (extrapolated). This is down from about 28 minutes, so roughly 30% faster. If I really had to, I figured out a hack approach that might shave another few minutes off the time. The ncbi blast parser takes about 4.7 seconds to parse my main test file while Jeff's code takes 3.7 seconds. I managed to whittle it down from 15 seconds, so I'm pretty happy.
The hack approach might get me another half second, but I'm not doing quite everything Jeff does, so no predictions that I can match his performance. The biggest performance problem is Python's function call overhead. For something like <name>Andrew</name> there are normally at least three function calls: startElement("name", {}) characters("Andrew") endElement("name") The ContentHandler can be written as a chain of if/else statements, def startElement(self, tag, attrs): if tag == .. ... elif tag == "name": self.s = "" self.save_characters = 1 ... def characters(self, s): if self.save_characters: self.s += s def endElement(self, tag): ... elif tag == "name": print "I have", self.s self.save_characters = 0 The problem with this is that the comparison tests ("==") are linear in the number of tags. Even if sorted to present the most common tags first, there's still quite a bit of overhead -- perhaps a few dozen equality tests for a handful of lines of "real" code. One thing I tried last year was a "dispatch" handler, which looks like this: def startElement(self, tag, attrs): f = getattr(self, "start_" + tag, None) if f is not None: f(tag, attrs) def endElement(self, tag): f = getattr(self, "end_" + tag, None) if f is not None: f(tag) ... def start_name(self, tag, attrs): self.s = "" self.save_characters = 1 def end_name(self, tag): print "I have", self.s self.save_characters = 0 This replaces the equality tests with a getattr - which is a dictionary lookup - and an extra function call. This turns out to be faster, at least in my test. In the latest Martel distribution, I've one-upped this.
When the Dispatch.Dispatcher() starts up, it introspects itself and finds all the method names which start with "start_" and "end_", then builds a table mapping tag name to function, so the dispatch doesn't need to create an intermediate string: def startElement(self, tag, attrs): f = self._start_table.get(tag) if f is not None: f(tag, attrs) I then tweaked the Martel.Parser so if the ContentHandler is a Dispatcher then specialized parser code reaches into the class to get its _start_table and _end_table. (A la a C++ "friend" class.) This reduces function call overhead in two ways: - there isn't the intermediate startElement/endElement call - if there's no handler for that tag then there are no function calls at all. These tricks really make an appreciable performance difference, don't change the normal API, and aren't all that hard to understand, which is why I can justify breaking some OO boundaries. I think there's one other performance problem in the code, which is that state information is passed around through class attributes. Attribute lookups go through a dict lookup while regular local variables are a constant-time lookup and quite a bit faster. I can't think of any way to get around this, except that new-style Python objects support __slots__, which might be faster. Andrew dalke@dalkescientific.com From adalke at mindspring.com Tue Jan 22 20:53:32 2002 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] Effective ways to use Martel Message-ID: <00e901c1a3b0$c34c1780$0201a8c0@josiah.dalkescientific.com> Cayte: >My format used embedded tags and EventGenerator does not >support them. Andrew recommended his new Dispatch module >as an alternative. The new Dispatch module; an example. Start by reading previous email. Here's a simple format definition for a FASTA file.
from Martel import Str, Group, UntilEol, AssertNot, Rep, AnyEol header = (Str(">") + Group("description", UntilEol()) + AnyEol()) seqline = (AssertNot(Str(">")) + Group("sequence", UntilEol()) + AnyEol()) record = Group("record", header + Rep(seqline)) format = Rep(record) Suppose you want to print the sequence length and the header definition. Here's how to do it with the Dispatcher. from Martel import Dispatch class SeqLength(Dispatch.Dispatcher): def start_record(self, tag, attrs): self.seqlen = 0 def start_description(self, tag, attrs): self.save_characters() def end_description(self, tag): self.description = self.get_characters() def start_sequence(self, tag, attrs): self.save_characters() def end_sequence(self, tag): self.seqlen += len(self.get_characters()) def end_record(self, tag): print self.seqlen, "in", self.description This Dispatcher is a regular ContentHandler, so it is used like this, assuming that "test.fasta" contains a FASTA file. p = format.make_parser() p.setContentHandler(SeqLength()) p.parse(open("test.fasta")) On my test data set, it looks like this: 378 in AK1H_ECOLI/114-431 389 in AKH_HAEIN/114-431 389 in AKH1_MAIZE/117-440 378 in AK2H_ECOLI/112-431 381 in AK1_BACSU/66-374 411 in AK2_BACST/63-370 411 in AK2_BACSU/63-373 411 in AKAB_CORFL/63-379 411 in AKAB_MYCSM/63-379 377 in AK3_ECOLI/106-407 391 in AK_YEAST/134-472 The new thing in this example is the "save_characters()" and "get_characters()". This is a stack-based approach for getting all the characters between a start-tag and an end-tag. So long as the calls are balanced, many different elements can get characters without stepping on each other's toes. Hmmm, need an example which shows this support for overlaps. > Should this be documented? Andrew says in the future Dispatch >will be the preferred tool. But without documentation, what >keeps users from using the >old technique and running into the same issue? Yes, it should be documented.
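[Editor's note: the overlap example asked for above might look like the following. This is only a sketch of the stack-based capture idea, not the actual Martel Dispatch implementation: each save_characters() pushes a fresh buffer, characters() feeds every active buffer, and get_characters() pops the innermost one, so nested elements capture overlapping text independently.]

```python
class CharacterStack:
    """Sketch of stack-based character capture with overlap support.

    Balanced save_characters()/get_characters() calls let an outer
    element (e.g. <record>) and an inner element (e.g. <sequence>)
    both collect text from the same characters() events.
    Illustrative only -- not the Martel Dispatch code.
    """
    def __init__(self):
        self._stack = []

    def save_characters(self):
        self._stack.append([])          # push a fresh buffer

    def characters(self, s):
        for buf in self._stack:         # every active buffer sees the text
            buf.append(s)

    def get_characters(self):
        return "".join(self._stack.pop())  # pop the innermost buffer
```

Here the outer capture sees everything the inner one saw plus the text before and after it, which is exactly the overlap case the SeqLength example relies on.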
It also depends on whether the work I've been doing has gotten to the point where we can start thinking about deprecating the existing code. In which case the documentation is easy - at the top of the module say "DEPRECATED - SEE XXX.py" Andrew dalke@dalkescientific.com From katel at worldpath.net Wed Jan 23 00:01:01 2002 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] Effective ways to use Martel References: <00e901c1a3b0$c34c1780$0201a8c0@josiah.dalkescientific.com> Message-ID: <004801c1a3ca$f51fae60$010a0a0a@cadence.com> ----- Original Message ----- From: "Andrew Dalke" To: Sent: Tuesday, January 22, 2002 5:53 PM Subject: Re: [Biopython-dev] Effective ways to use Martel > Cayte: > >My format used embedded tags and EventGenerator does not > >support them. Andrew recommended his new Dispatch module > >as an alternative. > I plan to use Dispatch for saf, but before you got back to me I started porting the ECell perl script. I'd like to finish while perl is fresh in my mind, because I have to relearn the cryptic codes every time I revisit perl. Cayte From mark at acoma.Stanford.EDU Wed Jan 23 15:36:51 2002 From: mark at acoma.Stanford.EDU (Mark Lambrecht) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] Prosite Message-ID: Hi, Thanks for all the excellent Biopython code. I used the Prosite parser and it breaks on a number of CC and MA lines. Maybe there is a new version of the prosite.dat file? We added some code to the Bio/Prosite/__init__.py, and commented it with ## (lambrecht/dyoo). Then everything works again but possibly doesn't use the information in these lines. I attached the __init__.py. Could you take a look? Thanks!!
Mark -------------------------------------------------------------------------- Mark Lambrecht Postdoctoral Research Fellow The Arabidopsis Information Resource FAX: (650) 325-6857 Carnegie Institution of Washington Tel: (650) 325-1521 ext.397 Department of Plant Biology URL: http://arabidopsis.org/ 260 Panama St. Stanford, CA 94305 -------------------------------------------------------------------------- -------------- next part -------------- # Copyright 1999 by Jeffrey Chang. All rights reserved. # This code is part of the Biopython distribution and governed by its # license. Please see the LICENSE file that should have been included # as part of this package. # Copyright 2000 by Jeffrey Chang. All rights reserved. # This code is part of the Biopython distribution and governed by its # license. Please see the LICENSE file that should have been included # as part of this package. """Prosite This module provides code to work with the prosite.dat file from Prosite. http://www.expasy.ch/prosite/ Tested with: Release 15.0, July 1998 Release 16.0, July 1999 Classes: Record Holds Prosite data. PatternHit Holds data from a hit against a Prosite pattern. Iterator Iterates over entries in a Prosite file. Dictionary Accesses a Prosite file using a dictionary interface. ExPASyDictionary Accesses Prosite records from ExPASy. RecordParser Parses a Prosite record into a Record object. _Scanner Scans Prosite-formatted data. _RecordConsumer Consumes Prosite data to a Record object. Functions: scan_sequence_expasy Scan a sequence for occurrences of Prosite patterns. index_file Index a Prosite file for a Dictionary. _extract_record Extract Prosite data from a web page. _extract_pattern_hits Extract Prosite patterns from a web page. 
""" __all__ = [ 'Pattern', 'Prodoc', ] from types import * import string import re import sgmllib from Bio import File from Bio import Index from Bio.ParserSupport import * from Bio.WWW import ExPASy from Bio.WWW import RequestLimiter class Record: """Holds information from a Prosite record. Members: name ID of the record. e.g. ADH_ZINC type Type of entry. e.g. PATTERN, MATRIX, or RULE accession e.g. PS00387 created Date the entry was created. (MMM-YYYY) data_update Date the 'primary' data was last updated. info_update Date data other than 'primary' data was last updated. pdoc ID of the PROSITE DOCumentation. description Free-format description. pattern The PROSITE pattern. See docs. matrix List of strings that describes a matrix entry. rules List of rule definitions. (strings) NUMERICAL RESULTS nr_sp_release SwissProt release. nr_sp_seqs Number of seqs in that release of Swiss-Prot. (int) nr_total Number of hits in Swiss-Prot. tuple of (hits, seqs) nr_positive True positives. tuple of (hits, seqs) nr_unknown Could be positives. tuple of (hits, seqs) nr_false_pos False positives. tuple of (hits, seqs) nr_false_neg False negatives. (int) nr_partial False negatives, because they are fragments. (int) COMMENTS cc_taxo_range Taxonomic range. See docs for format cc_max_repeat Maximum number of repetitions in a protein cc_site Interesting site. list of tuples (pattern pos, desc.) cc_skip_flag Can this entry be ignored? DATA BANK REFERENCES - The following are all lists of tuples (swiss-prot accession, swiss-prot name) dr_positive dr_false_neg dr_false_pos dr_potential Potential hits, but fingerprint region not yet available. dr_unknown Could possibly belong pdb_structs List of PDB entries. 
""" def __init__(self): self.name = '' self.type = '' self.accession = '' self.created = '' self.data_update = '' self.info_update = '' self.pdoc = '' self.description = '' self.pattern = '' self.matrix = [] self.rules = [] self.nr_sp_release = '' self.nr_sp_seqs = '' self.nr_total = (None, None) self.nr_positive = (None, None) self.nr_unknown = (None, None) self.nr_false_pos = (None, None) self.nr_false_neg = None self.nr_partial = None self.cc_taxo_range = '' self.cc_max_repeat = '' self.cc_site = [] self.cc_skip_flag = '' self.dr_positive = [] self.dr_false_neg = [] self.dr_false_pos = [] self.dr_potential = [] self.dr_unknown = [] self.pdb_structs = [] class PatternHit: """Holds information from a hit against a Prosite pattern. Members: name ID of the record. e.g. ADH_ZINC accession e.g. PS00387 pdoc ID of the PROSITE DOCumentation. description Free-format description. matches List of tuples (start, end, sequence) where start and end are indexes of the match, and sequence is the sequence matched. """ def __init__(self): self.name = None self.accession = None self.pdoc = None self.description = None self.matches = [] def __str__(self): lines = [] lines.append("%s %s %s" % (self.accession, self.pdoc, self.name)) lines.append(self.description) lines.append('') if len(self.matches) > 1: lines.append("Number of matches: %s" % len(self.matches)) for i in range(len(self.matches)): start, end, seq = self.matches[i] range_str = "%d-%d" % (start, end) if len(self.matches) > 1: lines.append("%7d %10s %s" % (i+1, range_str, seq)) else: lines.append("%7s %10s %s" % (' ', range_str, seq)) return string.join(lines, '\n') class Iterator: """Returns one record at a time from a Prosite file. Methods: next Return the next record from the stream, or None. """ def __init__(self, handle, parser=None): """__init__(self, handle, parser=None) Create a new iterator. handle is a file-like object. parser is an optional Parser object to change the results into another form. 
If set to None, then the raw contents of the file will be returned. """ if type(handle) is not FileType and type(handle) is not InstanceType: raise ValueError, "I expected a file handle or file-like object" self._uhandle = File.UndoHandle(handle) self._parser = parser def next(self): """next(self) -> object Return the next Prosite record from the file. If no more records, return None. """ # Skip the copyright info, if it's the first record. line = self._uhandle.peekline() if line[:2] == 'CC': while 1: line = self._uhandle.readline() if not line: break if line[:2] == '//': break if line[:2] != 'CC': raise SyntaxError, \ "Oops, where's the copyright?" lines = [] while 1: line = self._uhandle.readline() if not line: break lines.append(line) if line[:2] == '//': break if not lines: return None data = string.join(lines, '') if self._parser is not None: return self._parser.parse(File.StringHandle(data)) return data class Dictionary: """Accesses a Prosite file using a dictionary interface. """ __filename_key = '__filename' def __init__(self, indexname, parser=None): """__init__(self, indexname, parser=None) Open a Prosite Dictionary. indexname is the name of the index for the dictionary. The index should have been created using the index_file function. parser is an optional Parser object to change the results into another form. If set to None, then the raw contents of the file will be returned. """ self._index = Index.Index(indexname) self._handle = open(self._index[Dictionary.__filename_key]) self._parser = parser def __len__(self): return len(self._index) def __getitem__(self, key): start, len = self._index[key] self._handle.seek(start) data = self._handle.read(len) if self._parser is not None: return self._parser.parse(File.StringHandle(data)) return data def __getattr__(self, name): return getattr(self._index, name) class ExPASyDictionary: """Access PROSITE at ExPASy using a read-only dictionary interface. 
""" def __init__(self, delay=5.0, parser=None): """__init__(self, delay=5.0, parser=None) Create a new Dictionary to access PROSITE. parser is an optional parser (e.g. Prosite.RecordParser) object to change the results into another form. If set to None, then the raw contents of the file will be returned. delay is the number of seconds to wait between each query. """ self.parser = parser self.limiter = RequestLimiter(delay) def __len__(self): raise NotImplementedError, "Prosite contains lots of entries" def clear(self): raise NotImplementedError, "This is a read-only dictionary" def __setitem__(self, key, item): raise NotImplementedError, "This is a read-only dictionary" def update(self): raise NotImplementedError, "This is a read-only dictionary" def copy(self): raise NotImplementedError, "You don't need to do this..." def keys(self): raise NotImplementedError, "You don't really want to do this..." def items(self): raise NotImplementedError, "You don't really want to do this..." def values(self): raise NotImplementedError, "You don't really want to do this..." def has_key(self, id): """has_key(self, id) -> bool""" try: self[id] except KeyError: return 0 return 1 def get(self, id, failobj=None): try: return self[id] except KeyError: return failobj raise "How did I get here?" def __getitem__(self, id): """__getitem__(self, id) -> object Return a Prosite entry. id is either the id or accession for the entry. Raises a KeyError if there's an error. """ # First, check to see if enough time has passed since my # last query. self.limiter.wait() try: handle = ExPASy.get_prosite_entry(id) except IOError: raise KeyError, id try: handle = File.StringHandle(_extract_record(handle)) except ValueError: raise KeyError, id if self.parser is not None: return self.parser.parse(handle) return handle.read() class RecordParser(AbstractParser): """Parses Prosite data into a Record object. 
""" def __init__(self): self._scanner = _Scanner() self._consumer = _RecordConsumer() def parse(self, handle): self._scanner.feed(handle, self._consumer) return self._consumer.data class _Scanner: """Scans Prosite-formatted data. Tested with: Release 15.0, July 1998 """ def feed(self, handle, consumer): """feed(self, handle, consumer) Feed in Prosite data for scanning. handle is a file-like object that contains prosite data. consumer is a Consumer object that will receive events as the report is scanned. """ if isinstance(handle, File.UndoHandle): uhandle = handle else: uhandle = File.UndoHandle(handle) while 1: line = uhandle.peekline() if not line: break elif is_blank_line(line): # Skip blank lines between records uhandle.readline() continue elif line[:2] == 'ID': self._scan_record(uhandle, consumer) elif line[:2] == 'CC': self._scan_copyrights(uhandle, consumer) else: raise SyntaxError, "There doesn't appear to be a record" def _scan_copyrights(self, uhandle, consumer): consumer.start_copyrights() self._scan_line('CC', uhandle, consumer.copyright, any_number=1) self._scan_terminator(uhandle, consumer) consumer.end_copyrights() def _scan_record(self, uhandle, consumer): consumer.start_record() for fn in self._scan_fns: fn(self, uhandle, consumer) # In Release 15.0, C_TYPE_LECTIN_1 has the DO line before # the 3D lines, instead of the other way around. # Thus, I'll give the 3D lines another chance after the DO lines # are finished. if fn is self._scan_do.im_func: self._scan_3d(uhandle, consumer) consumer.end_record() def _scan_line(self, line_type, uhandle, event_fn, exactly_one=None, one_or_more=None, any_number=None, up_to_one=None): # Callers must set exactly one of exactly_one, one_or_more, or # any_number to a true value. I do not explicitly check to # make sure this function is called correctly. # This does not guarantee any parameter safety, but I # like the readability. The other strategy I tried was have # parameters min_lines, max_lines. 
if exactly_one or one_or_more: read_and_call(uhandle, event_fn, start=line_type) if one_or_more or any_number: while 1: if not attempt_read_and_call(uhandle, event_fn, start=line_type): break if up_to_one: attempt_read_and_call(uhandle, event_fn, start=line_type) def _scan_id(self, uhandle, consumer): self._scan_line('ID', uhandle, consumer.identification, exactly_one=1) def _scan_ac(self, uhandle, consumer): self._scan_line('AC', uhandle, consumer.accession, exactly_one=1) def _scan_dt(self, uhandle, consumer): self._scan_line('DT', uhandle, consumer.date, exactly_one=1) def _scan_de(self, uhandle, consumer): self._scan_line('DE', uhandle, consumer.description, exactly_one=1) def _scan_pa(self, uhandle, consumer): self._scan_line('PA', uhandle, consumer.pattern, any_number=1) def _scan_ma(self, uhandle, consumer): # ZN2_CY6_FUNGAL_2, DNAJ_2 in Release 15 # contain a CC line buried within an 'MA' line. Need to check # for that. while 1: if not attempt_read_and_call(uhandle, consumer.matrix, start='MA'): line1 = uhandle.readline() line2 = uhandle.readline() uhandle.saveline(line2) uhandle.saveline(line1) if line1[:2] == 'CC' and line2[:2] == 'MA': read_and_call(uhandle, consumer.comment, start='CC') else: break def _scan_ru(self, uhandle, consumer): self._scan_line('RU', uhandle, consumer.rule, any_number=1) def _scan_nr(self, uhandle, consumer): self._scan_line('NR', uhandle, consumer.numerical_results, any_number=1) def _scan_cc(self, uhandle, consumer): self._scan_line('CC', uhandle, consumer.comment, any_number=1) def _scan_dr(self, uhandle, consumer): self._scan_line('DR', uhandle, consumer.database_reference, any_number=1) def _scan_3d(self, uhandle, consumer): self._scan_line('3D', uhandle, consumer.pdb_reference, any_number=1) def _scan_do(self, uhandle, consumer): self._scan_line('DO', uhandle, consumer.documentation, exactly_one=1) def _scan_terminator(self, uhandle, consumer): self._scan_line('//', uhandle, consumer.terminator, exactly_one=1) _scan_fns = 
[ _scan_id, _scan_ac, _scan_dt, _scan_de, _scan_pa, _scan_ma, _scan_ru, _scan_nr, _scan_ma, ## (lambrecht/dyoo) is this right? _scan_nr, ## (lambrecht/dyoo) is this right? _scan_cc, _scan_dr, _scan_3d, _scan_do, _scan_terminator ] class _RecordConsumer(AbstractConsumer): """Consumer that converts a Prosite record to a Record object. Members: data Record with Prosite data. """ def __init__(self): self.data = None def start_record(self): self.data = Record() def end_record(self): self._clean_record(self.data) def identification(self, line): cols = string.split(line) if len(cols) != 3: raise SyntaxError, "I don't understand identification line\n%s" % \ line self.data.name = self._chomp(cols[1]) # don't want ';' self.data.type = self._chomp(cols[2]) # don't want '.' def accession(self, line): cols = string.split(line) if len(cols) != 2: raise SyntaxError, "I don't understand accession line\n%s" % line self.data.accession = self._chomp(cols[1]) def date(self, line): uprline = string.upper(line) cols = string.split(uprline) # Release 15.0 contains both 'INFO UPDATE' and 'INF UPDATE' if cols[2] != '(CREATED);' or \ cols[4] != '(DATA' or cols[5] != 'UPDATE);' or \ cols[7][:4] != '(INF' or cols[8] != 'UPDATE).': raise SyntaxError, "I don't understand date line\n%s" % line self.data.created = cols[1] self.data.data_update = cols[3] self.data.info_update = cols[6] def description(self, line): self.data.description = self._clean(line) def pattern(self, line): self.data.pattern = self.data.pattern + self._clean(line) def matrix(self, line): self.data.matrix.append(self._clean(line)) def rule(self, line): self.data.rules.append(self._clean(line)) def numerical_results(self, line): cols = string.split(self._clean(line), ';') for col in cols: if not col: continue qual, data = map(string.lstrip, string.split(col, '=')) if qual == '/RELEASE': release, seqs = string.split(data, ',') self.data.nr_sp_release = release self.data.nr_sp_seqs = int(seqs) elif qual == '/FALSE_NEG': 
self.data.nr_false_neg = int(data) elif qual == '/PARTIAL': self.data.nr_partial = int(data) ## (lambrecht/dyoo) added temporary fix for qual //MATRIX_TYPE in CC elif qual =='/MATRIX_TYPE': pass elif qual in ['/TOTAL', '/POSITIVE', '/UNKNOWN', '/FALSE_POS']: m = re.match(r'(\d+)\((\d+)\)', data) if not m: raise error, "Broken data %s in comment line\n%s" % \ (repr(data), line) hits = tuple(map(int, m.groups())) if(qual == "/TOTAL"): self.data.nr_total = hits elif(qual == "/POSITIVE"): self.data.nr_positive = hits elif(qual == "/UNKNOWN"): self.data.nr_unknown = hits elif(qual == "/FALSE_POS"): self.data.nr_false_pos = hits else: raise SyntaxError, "Unknown qual %s in comment line\n%s" % \ (repr(qual), line) def comment(self, line): cols = string.split(self._clean(line), ';') for col in cols: # DNAJ_2 in Release 15 has a non-standard comment line: # CC Automatic scaling using reversed database # Throw it away. (Should I keep it?) if not col or col[:17] == 'Automatic scaling': continue qual, data = map(string.lstrip, string.split(col, '=')) if qual in ('/MATRIX_TYPE', '/SCALING_DB', '/AUTHOR', '/FT_KEY', '/FT_DESC'): continue ## (lambrecht/dyoo) This is a temporary fix until we know what ## to do here if qual == '/TAXO-RANGE': self.data.cc_taxo_range = data elif qual == '/MAX-REPEAT': self.data.cc_max_repeat = data elif qual == '/SITE': pos, desc = string.split(data, ',') self.data.cc_site = (int(pos), desc) elif qual == '/SKIP-FLAG': self.data.cc_skip_flag = data else: raise SyntaxError, "Unknown qual %s in comment line\n%s" % \ (repr(qual), line) def database_reference(self, line): refs = string.split(self._clean(line), ';') for ref in refs: if not ref: continue acc, name, type = map(string.strip, string.split(ref, ',')) if type == 'T': self.data.dr_positive.append((acc, name)) elif type == 'F': self.data.dr_false_pos.append((acc, name)) elif type == 'N': self.data.dr_false_neg.append((acc, name)) elif type == 'P': self.data.dr_potential.append((acc, name)) elif 
type == '?': self.data.dr_unknown.append((acc, name)) else: raise SyntaxError, "I don't understand type flag %s" % type def pdb_reference(self, line): cols = string.split(line) for id in cols[1:]: # get all but the '3D' col self.data.pdb_structs.append(self._chomp(id)) def documentation(self, line): self.data.pdoc = self._chomp(self._clean(line)) def terminator(self, line): pass def _chomp(self, word, to_chomp='.,;'): # Remove the punctuation at the end of a word. if word[-1] in to_chomp: return word[:-1] return word def _clean(self, line, rstrip=1): # Clean up a line. if rstrip: return string.rstrip(line[5:]) return line[5:] def scan_sequence_expasy(seq=None, id=None, exclude_frequent=None): """scan_sequence_expasy(seq=None, id=None, exclude_frequent=None) -> list of PatternHit's Search a sequence for occurrences of Prosite patterns. You can specify either a sequence in seq or a SwissProt/trEMBL ID or accession in id. Only one of those should be given. If exclude_frequent is true, then the patterns with the high probability of occurring will be excluded. """ if (seq and id) or not (seq or id): raise ValueError, "Please specify either a sequence or an id" handle = ExPASy.scanprosite1(seq, id, exclude_frequent) return _extract_pattern_hits(handle) def _extract_pattern_hits(handle): """_extract_pattern_hits(handle) -> list of PatternHit's Extract hits from a web page. Raises a ValueError if there was an error in the query. 
""" class parser(sgmllib.SGMLParser): def __init__(self): sgmllib.SGMLParser.__init__(self) self.hits = [] self.broken_message = 'Some error occurred' self._in_pre = 0 self._current_hit = None self._last_found = None # Save state of parsing def handle_data(self, data): if string.find(data, 'try again') >= 0: self.broken_message = data return elif data == 'illegal': self.broken_message = 'Sequence contains illegal characters' return if not self._in_pre: return elif not string.strip(data): return if self._last_found is None and data[:4] == 'PDOC': self._current_hit.pdoc = data self._last_found = 'pdoc' elif self._last_found == 'pdoc': if data[:2] != 'PS': raise SyntaxError, "Expected accession but got:\n%s" % data self._current_hit.accession = data self._last_found = 'accession' elif self._last_found == 'accession': self._current_hit.name = data self._last_found = 'name' elif self._last_found == 'name': self._current_hit.description = data self._last_found = 'description' elif self._last_found == 'description': m = re.findall(r'(\d+)-(\d+) (\w+)', data) for start, end, seq in m: self._current_hit.matches.append( (int(start), int(end), seq)) def do_hr(self, attrs): #
inside a
 section means a new hit.
            if self._in_pre:
                self._current_hit = PatternHit()
                self.hits.append(self._current_hit)
                self._last_found = None
        def start_pre(self, attrs):
            self._in_pre = 1
            self.broken_message = None   # Probably not broken
        def end_pre(self):
            self._in_pre = 0
    p = parser()
    p.feed(handle.read())
    if p.broken_message:
        raise ValueError, p.broken_message
    return p.hits
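The parser in _extract_pattern_hits above is a small state machine: inside a <pre> block, an <hr> opens a new hit and each subsequent non-blank chunk of text fills the next expected field in order (pdoc, accession, name, description). Here is a minimal, self-contained sketch of the same idea -- not the Biopython code; HitParser and the sample page are made up, and it uses html.parser, the modern counterpart of the sgmllib module used above:

```python
from html.parser import HTMLParser

class HitParser(HTMLParser):
    # Fields are expected in this fixed order after each <hr>.
    FIELDS = ['pdoc', 'accession', 'name', 'description']

    def __init__(self):
        HTMLParser.__init__(self)
        self.hits = []          # one dict per hit
        self._in_pre = 0
        self._current = None
        self._last = None       # index of the last field filled

    def handle_starttag(self, tag, attrs):
        if tag == 'pre':
            self._in_pre = 1
        elif tag == 'hr' and self._in_pre:
            # <hr> inside a <pre> section means a new hit.
            self._current = {}
            self.hits.append(self._current)
            self._last = None

    def handle_endtag(self, tag):
        if tag == 'pre':
            self._in_pre = 0

    def handle_data(self, data):
        if not self._in_pre or not data.strip() or self._current is None:
            return
        # Advance the cursor to the next expected field.
        nxt = 0 if self._last is None else self._last + 1
        if nxt < len(self.FIELDS):
            self._current[self.FIELDS[nxt]] = data.strip()
            self._last = nxt

p = HitParser()
p.feed("<pre><hr>PDOC00020<br>PS00018<br>EF_HAND"
       "<br>EF-hand calcium-binding domain</pre>")
```

The tags between fields matter: each tag boundary splits the text into a separate handle_data call, which is what lets a sequential cursor stand in for real field markup.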


def index_file(filename, indexname, rec2key=None):
    """index_file(filename, indexname, rec2key=None)

    Index a Prosite file.  filename is the name of the file.
    indexname is the name of the dictionary.  rec2key is an
    optional callback that takes a Record and generates a unique key
    (e.g. the accession number) for the record.  If not specified,
    the id name will be used.

    """
    if not os.path.exists(filename):
        raise ValueError, "%s does not exist" % filename

    index = Index.Index(indexname, truncate=1)
    index[Dictionary._Dictionary__filename_key] = filename
    
    iter = Iterator(open(filename), parser=RecordParser())
    while 1:
        start = iter._uhandle.tell()
        rec = iter.next()
        length = iter._uhandle.tell() - start
        
        if rec is None:
            break
        if rec2key is not None:
            key = rec2key(rec)
        else:
            key = rec.name
            
        if not key:
            raise KeyError, "empty key was produced"
        elif index.has_key(key):
            raise KeyError, "duplicate key %s found" % key

        index[key] = start, length
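index_file records, for each key, the (start, length) byte span of one record, so a Dictionary lookup later is a single seek plus read instead of a rescan of the whole file. A rough self-contained sketch of that offset-index idea -- build_index and the sample data are hypothetical, not the Biopython API:

```python
import io

def build_index(handle):
    """Map each record's ID token to its (start, length) byte span.

    Assumes every record starts with an 'ID   NAME; ...' line and
    ends with a '//' terminator line, as in prosite.dat.
    """
    index = {}
    start = handle.tell()
    key = None
    for line in iter(handle.readline, ''):
        if key is None:
            # e.g. 'ID   NIR_SIR; PATTERN.' -> 'NIR_SIR'
            key = line.split()[1].rstrip(';')
        if line.startswith('//'):
            end = handle.tell()
            index[key] = (start, end - start)
            start, key = end, None
    return index

data = "ID   PS1; PATTERN.\nDE   First.\n//\nID   PS2; PATTERN.\nDE   Second.\n//\n"
handle = io.StringIO(data)
idx = build_index(handle)

# Retrieving a record is then a single seek + read:
start, length = idx['PS1']
handle.seek(start)
record = handle.read(length)
```

Storing spans rather than record text keeps the index tiny regardless of how large the records themselves are.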

def _extract_record(handle):
    """_extract_record(handle) -> str

    Extract PROSITE data from a web page.  Raises a ValueError if no
    data was found in the web page.

    """
    # All the data appears between tags:
    # <pre>
    # ID   NIR_SIR; PATTERN.
    # </pre>
    class parser(sgmllib.SGMLParser):
        def __init__(self):
            sgmllib.SGMLParser.__init__(self)
            self._in_pre = 0
            self.data = []
        def handle_data(self, data):
            if self._in_pre:
                self.data.append(data)
        def do_br(self, attrs):
            if self._in_pre:
                self.data.append('\n')
        def start_pre(self, attrs):
            self._in_pre = 1
        def end_pre(self):
            self._in_pre = 0
    p = parser()
    p.feed(handle.read())
    if not p.data:
        raise ValueError, "No data found in web page."
    return string.join(p.data, '')

From jchang at smi.stanford.edu  Wed Jan 23 16:27:31 2002
From: jchang at smi.stanford.edu (Jeffrey Chang)
Date: Sat Mar  5 14:43:10 2005
Subject: [Biopython-dev] Bug in urllib of macpython
In-Reply-To: 
References: 
Message-ID: <20020123132731.B578@krusty.stanford.edu>

Doing a quick grep through the .py files in the current CVS, it looks
like the only other file to use it is FormatIO.

What is your workaround?

Jeff

On Tue, Jan 22, 2002 at 10:05:34AM +0100, Yair Benita wrote:
> Due to a bug in urllib of biopython most of the www are not functioning
> well.  The bug is known and will be taken care of by the python developers.
> However, I had to use another function to replace the command
> urllib.urlopen()
> 
> The following files were changed:
> NCBI.py
> ExPAsy.py
> InterPro.py
> SCOP.py
> 
> Besides these files, is there any other place where the urllib.urlopen
> command is used?
> 
> Yair
> -- 
> Yair Benita
> Pharmaceutical Proteomics
> Utrecht University
> 
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev@biopython.org
> http://biopython.org/mailman/listinfo/biopython-dev

From Y.Benita at pharm.uu.nl  Thu Jan 24 04:17:08 2002
From: Y.Benita at pharm.uu.nl (Yair Benita)
Date: Sat Mar  5 14:43:10 2005
Subject: [Biopython-dev] Bug in urllib of macpython
In-Reply-To: <20020123132731.B578@krusty.stanford.edu>
Message-ID: 

on 23/1/2002 22:27, Jeffrey Chang at jchang@smi.stanford.edu wrote:

> Doing a quick grep through the .py files in the current CVS, it looks
> like the only other file to use it is FormatIO.
> 
> What is your workaround?

In the beginning I had a while loop that waits till the file is fully
downloaded.  Now I have an easier solution.  Instead of:

handle = urllib.urlopen(fullcgi)

I use:

handle = open(urllib.urlretrieve(fullcgi)[0])

It appears to work fine now.

Yair
-- 
Yair Benita
Pharmaceutical Proteomics
Utrecht University

From jchang at smi.stanford.edu  Thu Jan 24 17:28:01 2002
From: jchang at smi.stanford.edu (Jeffrey Chang)
Date: Sat Mar  5 14:43:10 2005
Subject: [Biopython-dev] Prosite
In-Reply-To: 
References: 
Message-ID: <20020124142801.D372@krusty.stanford.edu>

Yep, it looks like Release 17 from last month introduced some format
changes that broke the parser.  I've updated the parser to handle the
new lines -- __init__.py is attached.  Please try this out and let me
know how it works.

Thanks for the report and the patch!

Jeff

On Wed, Jan 23, 2002 at 12:36:51PM -0800, Mark Lambrecht wrote:
> Hi,
> 
> Thanks for all the excellent Biopython code.
> I used the Prosite parser and it breaks on a number of CC and MA lines.
> Maybe there is a new version of the prosite.dat file?
> We added some code to the Bio/Prosite/__init__.py, and commented it with
> ## (lambrecht/dyoo)
> Then everything works again but possibly doesn't use the information in
> these lines.
> I attached the __init__.py
> Could you take a look ?
> 
> Thanks !!
> 
> Mark
> 
> 
> --------------------------------------------------------------------------
> Mark Lambrecht
> Postdoctoral Research Fellow
> The Arabidopsis Information Resource       FAX: (650) 325-6857
> Carnegie Institution of Washington         Tel: (650) 325-1521 ext.397
> Department of Plant Biology                URL: http://arabidopsis.org/
> 260 Panama St.
> Stanford, CA 94305
> --------------------------------------------------------------------------
> # Copyright 1999 by Jeffrey Chang.  All rights reserved.
> # This code is part of the Biopython distribution and governed by its
> # license.  Please see the LICENSE file that should have been included
> # as part of this package.
> 
> # Copyright 2000 by Jeffrey Chang.  All rights reserved.
> # This code is part of the Biopython distribution and governed by its
> # license.  Please see the LICENSE file that should have been included
> # as part of this package.
> 
> """Prosite
> 
> This module provides code to work with the prosite.dat file from
> Prosite.
> http://www.expasy.ch/prosite/
> 
> Tested with:
> Release 15.0, July 1998
> Release 16.0, July 1999
> 
> 
> Classes:
> Record                Holds Prosite data.
> PatternHit            Holds data from a hit against a Prosite pattern.
> Iterator              Iterates over entries in a Prosite file.
> Dictionary            Accesses a Prosite file using a dictionary interface.
> ExPASyDictionary      Accesses Prosite records from ExPASy.
> RecordParser          Parses a Prosite record into a Record object.
> 
> _Scanner              Scans Prosite-formatted data.
> _RecordConsumer       Consumes Prosite data to a Record object.
> 
> 
> Functions:
> scan_sequence_expasy  Scan a sequence for occurrences of Prosite patterns.
> index_file            Index a Prosite file for a Dictionary.
> _extract_record       Extract Prosite data from a web page.
> _extract_pattern_hits Extract Prosite patterns from a web page.
> 
> """
> __all__ = [
>     'Pattern',
>     'Prodoc',
>     ]
> from types import *
> import string
> import re
> import sgmllib
> from Bio import File
> from Bio import Index
> from Bio.ParserSupport import *
> from Bio.WWW import ExPASy
> from Bio.WWW import RequestLimiter
> 
> class Record:
>     """Holds information from a Prosite record.
> 
>     Members:
>     name           ID of the record.  e.g. ADH_ZINC
>     type           Type of entry.  e.g. PATTERN, MATRIX, or RULE
>     accession      e.g. PS00387
>     created        Date the entry was created.  (MMM-YYYY)
>     data_update    Date the 'primary' data was last updated.
>     info_update    Date data other than 'primary' data was last updated.
>     pdoc           ID of the PROSITE DOCumentation.
> 
>     description    Free-format description.
>     pattern        The PROSITE pattern.  See docs.
>     matrix         List of strings that describes a matrix entry.
>     rules          List of rule definitions.  (strings)
> 
>     NUMERICAL RESULTS
>     nr_sp_release  SwissProt release.
>     nr_sp_seqs     Number of seqs in that release of Swiss-Prot.  (int)
>     nr_total       Number of hits in Swiss-Prot.  tuple of (hits, seqs)
>     nr_positive    True positives.  tuple of (hits, seqs)
>     nr_unknown     Could be positives.  tuple of (hits, seqs)
>     nr_false_pos   False positives.  tuple of (hits, seqs)
>     nr_false_neg   False negatives.  (int)
>     nr_partial     False negatives, because they are fragments.  (int)
> 
>     COMMENTS
>     cc_taxo_range  Taxonomic range.  See docs for format
>     cc_max_repeat  Maximum number of repetitions in a protein
>     cc_site        Interesting site.  list of tuples (pattern pos, desc.)
>     cc_skip_flag   Can this entry be ignored?
> 
>     DATA BANK REFERENCES - The following are all
>                            lists of tuples (swiss-prot accession,
>                                             swiss-prot name)
>     dr_positive
>     dr_false_neg
>     dr_false_pos
>     dr_potential   Potential hits, but fingerprint region not yet available.
>     dr_unknown     Could possibly belong
> 
>     pdb_structs    List of PDB entries.
> 
>     """
>     def __init__(self):
>         self.name = ''
>         self.type = ''
>         self.accession = ''
>         self.created = ''
>         self.data_update = ''
>         self.info_update = ''
>         self.pdoc = ''
> 
>         self.description = ''
>         self.pattern = ''
>         self.matrix = []
>         self.rules = []
> 
>         self.nr_sp_release = ''
>         self.nr_sp_seqs = ''
>         self.nr_total = (None, None)
>         self.nr_positive = (None, None)
>         self.nr_unknown = (None, None)
>         self.nr_false_pos = (None, None)
>         self.nr_false_neg = None
>         self.nr_partial = None
> 
>         self.cc_taxo_range = ''
>         self.cc_max_repeat = ''
>         self.cc_site = []
>         self.cc_skip_flag = ''
> 
>         self.dr_positive = []
>         self.dr_false_neg = []
>         self.dr_false_pos = []
>         self.dr_potential = []
>         self.dr_unknown = []
> 
>         self.pdb_structs = []
> 
> class PatternHit:
>     """Holds information from a hit against a Prosite pattern.
> 
>     Members:
>     name           ID of the record.  e.g. ADH_ZINC
>     accession      e.g. PS00387
>     pdoc           ID of the PROSITE DOCumentation.
>     description    Free-format description.
>     matches        List of tuples (start, end, sequence) where
>                    start and end are indexes of the match, and sequence is
>                    the sequence matched.
> 
>     """
>     def __init__(self):
>         self.name = None
>         self.accession = None
>         self.pdoc = None
>         self.description = None
>         self.matches = []
>     def __str__(self):
>         lines = []
>         lines.append("%s %s %s" % (self.accession, self.pdoc, self.name))
>         lines.append(self.description)
>         lines.append('')
>         if len(self.matches) > 1:
>             lines.append("Number of matches: %s" % len(self.matches))
>         for i in range(len(self.matches)):
>             start, end, seq = self.matches[i]
>             range_str = "%d-%d" % (start, end)
>             if len(self.matches) > 1:
>                 lines.append("%7d %10s %s" % (i+1, range_str, seq))
>             else:
>                 lines.append("%7s %10s %s" % (' ', range_str, seq))
>         return string.join(lines, '\n')
> 
> class Iterator:
>     """Returns one record at a time from a Prosite file.
> 
>     Methods:
>     next   Return the next record from the stream, or None.
> 
>     """
>     def __init__(self, handle, parser=None):
>         """__init__(self, handle, parser=None)
> 
>         Create a new iterator.  handle is a file-like object.  parser
>         is an optional Parser object to change the results into another form.
>         If set to None, then the raw contents of the file will be returned.
> 
>         """
>         if type(handle) is not FileType and type(handle) is not InstanceType:
>             raise ValueError, "I expected a file handle or file-like object"
>         self._uhandle = File.UndoHandle(handle)
>         self._parser = parser
> 
>     def next(self):
>         """next(self) -> object
> 
>         Return the next Prosite record from the file.  If no more records,
>         return None.
> 
>         """
>         # Skip the copyright info, if it's the first record.
>         line = self._uhandle.peekline()
>         if line[:2] == 'CC':
>             while 1:
>                 line = self._uhandle.readline()
>                 if not line:
>                     break
>                 if line[:2] == '//':
>                     break
>                 if line[:2] != 'CC':
>                     raise SyntaxError, \
>                           "Oops, where's the copyright?"
> 
>         lines = []
>         while 1:
>             line = self._uhandle.readline()
>             if not line:
>                 break
>             lines.append(line)
>             if line[:2] == '//':
>                 break
> 
>         if not lines:
>             return None
> 
>         data = string.join(lines, '')
>         if self._parser is not None:
>             return self._parser.parse(File.StringHandle(data))
>         return data
> 
> class Dictionary:
>     """Accesses a Prosite file using a dictionary interface.
> 
>     """
>     __filename_key = '__filename'
> 
>     def __init__(self, indexname, parser=None):
>         """__init__(self, indexname, parser=None)
> 
>         Open a Prosite Dictionary.  indexname is the name of the
>         index for the dictionary.  The index should have been created
>         using the index_file function.  parser is an optional Parser
>         object to change the results into another form.  If set to None,
>         then the raw contents of the file will be returned.
> 
>         """
>         self._index = Index.Index(indexname)
>         self._handle = open(self._index[Dictionary.__filename_key])
>         self._parser = parser
> 
>     def __len__(self):
>         return len(self._index)
> 
>     def __getitem__(self, key):
>         start, len = self._index[key]
>         self._handle.seek(start)
>         data = self._handle.read(len)
>         if self._parser is not None:
>             return self._parser.parse(File.StringHandle(data))
>         return data
> 
>     def __getattr__(self, name):
>         return getattr(self._index, name)
> 
> class ExPASyDictionary:
>     """Access PROSITE at ExPASy using a read-only dictionary interface.
> 
>     """
>     def __init__(self, delay=5.0, parser=None):
>         """__init__(self, delay=5.0, parser=None)
> 
>         Create a new Dictionary to access PROSITE.  parser is an optional
>         parser (e.g. Prosite.RecordParser) object to change the results
>         into another form.  If set to None, then the raw contents of the
>         file will be returned.  delay is the number of seconds to wait
>         between each query.
> 
>         """
>         self.parser = parser
>         self.limiter = RequestLimiter(delay)
> 
>     def __len__(self):
>         raise NotImplementedError, "Prosite contains lots of entries"
>     def clear(self):
>         raise NotImplementedError, "This is a read-only dictionary"
>     def __setitem__(self, key, item):
>         raise NotImplementedError, "This is a read-only dictionary"
>     def update(self):
>         raise NotImplementedError, "This is a read-only dictionary"
>     def copy(self):
>         raise NotImplementedError, "You don't need to do this..."
>     def keys(self):
>         raise NotImplementedError, "You don't really want to do this..."
>     def items(self):
>         raise NotImplementedError, "You don't really want to do this..."
>     def values(self):
>         raise NotImplementedError, "You don't really want to do this..."
> 
>     def has_key(self, id):
>         """has_key(self, id) -> bool"""
>         try:
>             self[id]
>         except KeyError:
>             return 0
>         return 1
> 
>     def get(self, id, failobj=None):
>         try:
>             return self[id]
>         except KeyError:
>             return failobj
>         raise "How did I get here?"
> 
>     def __getitem__(self, id):
>         """__getitem__(self, id) -> object
> 
>         Return a Prosite entry.  id is either the id or accession
>         for the entry.  Raises a KeyError if there's an error.
> 
>         """
>         # First, check to see if enough time has passed since my
>         # last query.
>         self.limiter.wait()
> 
>         try:
>             handle = ExPASy.get_prosite_entry(id)
>         except IOError:
>             raise KeyError, id
>         try:
>             handle = File.StringHandle(_extract_record(handle))
>         except ValueError:
>             raise KeyError, id
> 
>         if self.parser is not None:
>             return self.parser.parse(handle)
>         return handle.read()
> 
> class RecordParser(AbstractParser):
>     """Parses Prosite data into a Record object.
> 
>     """
>     def __init__(self):
>         self._scanner = _Scanner()
>         self._consumer = _RecordConsumer()
> 
>     def parse(self, handle):
>         self._scanner.feed(handle, self._consumer)
>         return self._consumer.data
> 
> class _Scanner:
>     """Scans Prosite-formatted data.
> 
>     Tested with:
>     Release 15.0, July 1998
> 
>     """
>     def feed(self, handle, consumer):
>         """feed(self, handle, consumer)
> 
>         Feed in Prosite data for scanning.  handle is a file-like
>         object that contains prosite data.  consumer is a
>         Consumer object that will receive events as the report is scanned.
> 
>         """
>         if isinstance(handle, File.UndoHandle):
>             uhandle = handle
>         else:
>             uhandle = File.UndoHandle(handle)
> 
>         while 1:
>             line = uhandle.peekline()
>             if not line:
>                 break
>             elif is_blank_line(line):
>                 # Skip blank lines between records
>                 uhandle.readline()
>                 continue
>             elif line[:2] == 'ID':
>                 self._scan_record(uhandle, consumer)
>             elif line[:2] == 'CC':
>                 self._scan_copyrights(uhandle, consumer)
>             else:
>                 raise SyntaxError, "There doesn't appear to be a record"
> 
>     def _scan_copyrights(self, uhandle, consumer):
>         consumer.start_copyrights()
>         self._scan_line('CC', uhandle, consumer.copyright, any_number=1)
>         self._scan_terminator(uhandle, consumer)
>         consumer.end_copyrights()
> 
>     def _scan_record(self, uhandle, consumer):
>         consumer.start_record()
>         for fn in self._scan_fns:
>             fn(self, uhandle, consumer)
> 
>             # In Release 15.0, C_TYPE_LECTIN_1 has the DO line before
>             # the 3D lines, instead of the other way around.
>             # Thus, I'll give the 3D lines another chance after the DO lines
>             # are finished.
>             if fn is self._scan_do.im_func:
>                 self._scan_3d(uhandle, consumer)
>         consumer.end_record()
> 
>     def _scan_line(self, line_type, uhandle, event_fn,
>                    exactly_one=None, one_or_more=None, any_number=None,
>                    up_to_one=None):
>         # Callers must set exactly one of exactly_one, one_or_more, or
>         # any_number to a true value.  I do not explicitly check to
>         # make sure this function is called correctly.
> 
>         # This does not guarantee any parameter safety, but I
>         # like the readability.  The other strategy I tried was have
>         # parameters min_lines, max_lines.
> 
>         if exactly_one or one_or_more:
>             read_and_call(uhandle, event_fn, start=line_type)
>         if one_or_more or any_number:
>             while 1:
>                 if not attempt_read_and_call(uhandle, event_fn,
>                                              start=line_type):
>                     break
>         if up_to_one:
>             attempt_read_and_call(uhandle, event_fn, start=line_type)
> 
>     def _scan_id(self, uhandle, consumer):
>         self._scan_line('ID', uhandle, consumer.identification, exactly_one=1)
> 
>     def _scan_ac(self, uhandle, consumer):
>         self._scan_line('AC', uhandle, consumer.accession, exactly_one=1)
> 
>     def _scan_dt(self, uhandle, consumer):
>         self._scan_line('DT', uhandle, consumer.date, exactly_one=1)
> 
>     def _scan_de(self, uhandle, consumer):
>         self._scan_line('DE', uhandle, consumer.description, exactly_one=1)
> 
>     def _scan_pa(self, uhandle, consumer):
>         self._scan_line('PA', uhandle, consumer.pattern, any_number=1)
> 
>     def _scan_ma(self, uhandle, consumer):
>         # ZN2_CY6_FUNGAL_2, DNAJ_2 in Release 15
>         # contain a CC line buried within an 'MA' line.  Need to check
>         # for that.
>         while 1:
>             if not attempt_read_and_call(uhandle, consumer.matrix, start='MA'):
>                 line1 = uhandle.readline()
>                 line2 = uhandle.readline()
>                 uhandle.saveline(line2)
>                 uhandle.saveline(line1)
>                 if line1[:2] == 'CC' and line2[:2] == 'MA':
>                     read_and_call(uhandle, consumer.comment, start='CC')
>                 else:
>                     break
> 
>     def _scan_ru(self, uhandle, consumer):
>         self._scan_line('RU', uhandle, consumer.rule, any_number=1)
> 
>     def _scan_nr(self, uhandle, consumer):
>         self._scan_line('NR', uhandle, consumer.numerical_results,
>                         any_number=1)
> 
>     def _scan_cc(self, uhandle, consumer):
>         self._scan_line('CC', uhandle, consumer.comment, any_number=1)
> 
>     def _scan_dr(self, uhandle, consumer):
>         self._scan_line('DR', uhandle, consumer.database_reference,
>                         any_number=1)
> 
>     def _scan_3d(self, uhandle, consumer):
>         self._scan_line('3D', uhandle, consumer.pdb_reference,
>                         any_number=1)
> 
>     def _scan_do(self, uhandle, consumer):
>         self._scan_line('DO', uhandle, consumer.documentation, exactly_one=1)
> 
>     def _scan_terminator(self, uhandle, consumer):
>         self._scan_line('//', uhandle, consumer.terminator, exactly_one=1)
> 
>     _scan_fns = [
>         _scan_id,
>         _scan_ac,
>         _scan_dt,
>         _scan_de,
>         _scan_pa,
>         _scan_ma,
>         _scan_ru,
>         _scan_nr,
>         _scan_ma,   ## (lambrecht/dyoo) is this right?
>         _scan_nr,   ## (lambrecht/dyoo) is this right?
>         _scan_cc,
>         _scan_dr,
>         _scan_3d,
>         _scan_do,
>         _scan_terminator
>         ]
> 
> class _RecordConsumer(AbstractConsumer):
>     """Consumer that converts a Prosite record to a Record object.
> 
>     Members:
>     data    Record with Prosite data.
> 
>     """
>     def __init__(self):
>         self.data = None
> 
>     def start_record(self):
>         self.data = Record()
> 
>     def end_record(self):
>         self._clean_record(self.data)
> 
>     def identification(self, line):
>         cols = string.split(line)
>         if len(cols) != 3:
>             raise SyntaxError, "I don't understand identification line\n%s" % \
>                   line
>         self.data.name = self._chomp(cols[1])    # don't want ';'
>         self.data.type = self._chomp(cols[2])    # don't want '.'
> 
>     def accession(self, line):
>         cols = string.split(line)
>         if len(cols) != 2:
>             raise SyntaxError, "I don't understand accession line\n%s" % line
>         self.data.accession = self._chomp(cols[1])
> 
>     def date(self, line):
>         uprline = string.upper(line)
>         cols = string.split(uprline)
> 
>         # Release 15.0 contains both 'INFO UPDATE' and 'INF UPDATE'
>         if cols[2] != '(CREATED);' or \
>            cols[4] != '(DATA' or cols[5] != 'UPDATE);' or \
>            cols[7][:4] != '(INF' or cols[8] != 'UPDATE).':
>             raise SyntaxError, "I don't understand date line\n%s" % line
> 
>         self.data.created = cols[1]
>         self.data.data_update = cols[3]
>         self.data.info_update = cols[6]
> 
>     def description(self, line):
>         self.data.description = self._clean(line)
> 
>     def pattern(self, line):
>         self.data.pattern = self.data.pattern + self._clean(line)
> 
>     def matrix(self, line):
>         self.data.matrix.append(self._clean(line))
> 
>     def rule(self, line):
>         self.data.rules.append(self._clean(line))
> 
>     def numerical_results(self, line):
>         cols = string.split(self._clean(line), ';')
>         for col in cols:
>             if not col:
>                 continue
>             qual, data = map(string.lstrip, string.split(col, '='))
>             if qual == '/RELEASE':
>                 release, seqs = string.split(data, ',')
>                 self.data.nr_sp_release = release
>                 self.data.nr_sp_seqs = int(seqs)
>             elif qual == '/FALSE_NEG':
>                 self.data.nr_false_neg = int(data)
>             elif qual == '/PARTIAL':
>                 self.data.nr_partial = int(data)
>             ## (lambrecht/dyoo) added temporary fix for qual //MATRIX_TYPE in CC
>             elif qual == '/MATRIX_TYPE':
>                 pass
>             elif qual in ['/TOTAL', '/POSITIVE', '/UNKNOWN', '/FALSE_POS']:
>                 m = re.match(r'(\d+)\((\d+)\)', data)
>                 if not m:
>                     raise error, "Broken data %s in comment line\n%s" % \
>                           (repr(data), line)
>                 hits = tuple(map(int, m.groups()))
>                 if(qual == "/TOTAL"):
>                     self.data.nr_total = hits
>                 elif(qual == "/POSITIVE"):
>                     self.data.nr_positive = hits
>                 elif(qual == "/UNKNOWN"):
>                     self.data.nr_unknown = hits
>                 elif(qual == "/FALSE_POS"):
>                     self.data.nr_false_pos = hits
>             else:
>                 raise SyntaxError, "Unknown qual %s in comment line\n%s" % \
>                       (repr(qual), line)
> 
>     def comment(self, line):
>         cols = string.split(self._clean(line), ';')
>         for col in cols:
>             # DNAJ_2 in Release 15 has a non-standard comment line:
>             # CC   Automatic scaling using reversed database
>             # Throw it away.  (Should I keep it?)
>             if not col or col[:17] == 'Automatic scaling':
>                 continue
>             qual, data = map(string.lstrip, string.split(col, '='))
>             if qual in ('/MATRIX_TYPE', '/SCALING_DB', '/AUTHOR',
>                         '/FT_KEY', '/FT_DESC'):
>                 continue ## (lambrecht/dyoo) This is a temporary fix until we know what
>                          ## to do here
>             if qual == '/TAXO-RANGE':
>                 self.data.cc_taxo_range = data
>             elif qual == '/MAX-REPEAT':
>                 self.data.cc_max_repeat = data
>             elif qual == '/SITE':
>                 pos, desc = string.split(data, ',')
>                 self.data.cc_site = (int(pos), desc)
>             elif qual == '/SKIP-FLAG':
>                 self.data.cc_skip_flag = data
>             else:
>                 raise SyntaxError, "Unknown qual %s in comment line\n%s" % \
>                       (repr(qual), line)
> 
>     def database_reference(self, line):
>         refs = string.split(self._clean(line), ';')
>         for ref in refs:
>             if not ref:
>                 continue
>             acc, name, type = map(string.strip, string.split(ref, ','))
>             if type == 'T':
>                 self.data.dr_positive.append((acc, name))
>             elif type == 'F':
>                 self.data.dr_false_pos.append((acc, name))
>             elif type == 'N':
>                 self.data.dr_false_neg.append((acc, name))
>             elif type == 'P':
>                 self.data.dr_potential.append((acc, name))
>             elif type == '?':
>                 self.data.dr_unknown.append((acc, name))
>             else:
>                 raise SyntaxError, "I don't understand type flag %s" % type
> 
>     def pdb_reference(self, line):
>         cols = string.split(line)
>         for id in cols[1:]:   # get all but the '3D' col
>             self.data.pdb_structs.append(self._chomp(id))
> 
>     def documentation(self, line):
>         self.data.pdoc = self._chomp(self._clean(line))
> 
>     def terminator(self, line):
>         pass
> 
>     def _chomp(self, word, to_chomp='.,;'):
>         # Remove the punctuation at the end of a word.
>         if word[-1] in to_chomp:
>             return word[:-1]
>         return word
> 
>     def _clean(self, line, rstrip=1):
>         # Clean up a line.
>         if rstrip:
>             return string.rstrip(line[5:])
>         return line[5:]
> 
> def scan_sequence_expasy(seq=None, id=None, exclude_frequent=None):
>     """scan_sequence_expasy(seq=None, id=None, exclude_frequent=None) ->
>     list of PatternHit's
> 
>     Search a sequence for occurrences of Prosite patterns.  You can
>     specify either a sequence in seq or a SwissProt/trEMBL ID or accession
>     in id.  Only one of those should be given.  If exclude_frequent
>     is true, then the patterns with the high probability of occurring
>     will be excluded.
> 
>     """
>     if (seq and id) or not (seq or id):
>         raise ValueError, "Please specify either a sequence or an id"
>     handle = ExPASy.scanprosite1(seq, id, exclude_frequent)
>     return _extract_pattern_hits(handle)
> 
> def _extract_pattern_hits(handle):
>     """_extract_pattern_hits(handle) -> list of PatternHit's
> 
>     Extract hits from a web page.  Raises a ValueError if there
>     was an error in the query.
> 
>     """
>     class parser(sgmllib.SGMLParser):
>         def __init__(self):
>             sgmllib.SGMLParser.__init__(self)
>             self.hits = []
>             self.broken_message = 'Some error occurred'
>             self._in_pre = 0
>             self._current_hit = None
>             self._last_found = None   # Save state of parsing
>         def handle_data(self, data):
>             if string.find(data, 'try again') >= 0:
>                 self.broken_message = data
>                 return
>             elif data == 'illegal':
>                 self.broken_message = 'Sequence contains illegal characters'
>                 return
>             if not self._in_pre:
>                 return
>             elif not string.strip(data):
>                 return
>             if self._last_found is None and data[:4] == 'PDOC':
>                 self._current_hit.pdoc = data
>                 self._last_found = 'pdoc'
>             elif self._last_found == 'pdoc':
>                 if data[:2] != 'PS':
>                     raise SyntaxError, "Expected accession but got:\n%s" % data
>                 self._current_hit.accession = data
>                 self._last_found = 'accession'
>             elif self._last_found == 'accession':
>                 self._current_hit.name = data
>                 self._last_found = 'name'
>             elif self._last_found == 'name':
>                 self._current_hit.description = data
>                 self._last_found = 'description'
>             elif self._last_found == 'description':
>                 m = re.findall(r'(\d+)-(\d+) (\w+)', data)
>                 for start, end, seq in m:
>                     self._current_hit.matches.append(
>                         (int(start), int(end), seq))
> 
>         def do_hr(self, attrs):
>             #
<HR> inside a <PRE> section means a new hit.
>             if self._in_pre:
>                 self._current_hit = PatternHit()
>                 self.hits.append(self._current_hit)
>                 self._last_found = None
>         def start_pre(self, attrs):
>             self._in_pre = 1
>             self.broken_message = None   # Probably not broken
>         def end_pre(self):
>             self._in_pre = 0
>     p = parser()
>     p.feed(handle.read())
>     if p.broken_message:
>         raise ValueError, p.broken_message
>     return p.hits
> 
> 
>         
>     
> def index_file(filename, indexname, rec2key=None):
>     """index_file(filename, indexname, rec2key=None)
> 
>     Index a Prosite file.  filename is the name of the file.
>     indexname is the name of the dictionary.  rec2key is an
>     optional callback that takes a Record and generates a unique key
>     (e.g. the accession number) for the record.  If not specified,
>     the id name will be used.
> 
>     """
>     if not os.path.exists(filename):
>         raise ValueError, "%s does not exist" % filename
> 
>     index = Index.Index(indexname, truncate=1)
>     index[Dictionary._Dictionary__filename_key] = filename
>     
>     iter = Iterator(open(filename), parser=RecordParser())
>     while 1:
>         start = iter._uhandle.tell()
>         rec = iter.next()
>         length = iter._uhandle.tell() - start
>         
>         if rec is None:
>             break
>         if rec2key is not None:
>             key = rec2key(rec)
>         else:
>             key = rec.name
>             
>         if not key:
>             raise KeyError, "empty key was produced"
>         elif index.has_key(key):
>             raise KeyError, "duplicate key %s found" % key
> 
>         index[key] = start, length
> 
> def _extract_record(handle):
>     """_extract_record(handle) -> str
> 
>     Extract PROSITE data from a web page.  Raises a ValueError if no
>     data was found in the web page.
> 
>     """
>     # All the data appears between tags:
>     # <pre>ID   NIR_SIR; PATTERN.
>     # </pre>
> class parser(sgmllib.SGMLParser): > def __init__(self): > sgmllib.SGMLParser.__init__(self) > self._in_pre = 0 > self.data = [] > def handle_data(self, data): > if self._in_pre: > self.data.append(data) > def do_br(self, attrs): > if self._in_pre: > self.data.append('\n') > def start_pre(self, attrs): > self._in_pre = 1 > def end_pre(self): > self._in_pre = 0 > p = parser() > p.feed(handle.read()) > if not p.data: > raise ValueError, "No data found in web page." > return string.join(p.data, '') > -------------- next part -------------- # Copyright 1999 by Jeffrey Chang. All rights reserved. # This code is part of the Biopython distribution and governed by its # license. Please see the LICENSE file that should have been included # as part of this package. # Copyright 2000 by Jeffrey Chang. All rights reserved. # This code is part of the Biopython distribution and governed by its # license. Please see the LICENSE file that should have been included # as part of this package. """Prosite This module provides code to work with the prosite.dat file from Prosite. http://www.expasy.ch/prosite/ Tested with: Release 15.0, July 1998 Release 16.0, July 1999 Release 17.0, Dec 2001 Classes: Record Holds Prosite data. PatternHit Holds data from a hit against a Prosite pattern. Iterator Iterates over entries in a Prosite file. Dictionary Accesses a Prosite file using a dictionary interface. ExPASyDictionary Accesses Prosite records from ExPASy. RecordParser Parses a Prosite record into a Record object. _Scanner Scans Prosite-formatted data. _RecordConsumer Consumes Prosite data to a Record object. Functions: scan_sequence_expasy Scan a sequence for occurrences of Prosite patterns. index_file Index a Prosite file for a Dictionary. _extract_record Extract Prosite data from a web page. _extract_pattern_hits Extract Prosite patterns from a web page. 
""" __all__ = [ 'Pattern', 'Prodoc', ] from types import * import string import re import sgmllib from Bio import File from Bio import Index from Bio.ParserSupport import * from Bio.WWW import ExPASy from Bio.WWW import RequestLimiter class Record: """Holds information from a Prosite record. Members: name ID of the record. e.g. ADH_ZINC type Type of entry. e.g. PATTERN, MATRIX, or RULE accession e.g. PS00387 created Date the entry was created. (MMM-YYYY) data_update Date the 'primary' data was last updated. info_update Date data other than 'primary' data was last updated. pdoc ID of the PROSITE DOCumentation. description Free-format description. pattern The PROSITE pattern. See docs. matrix List of strings that describes a matrix entry. rules List of rule definitions. (strings) NUMERICAL RESULTS nr_sp_release SwissProt release. nr_sp_seqs Number of seqs in that release of Swiss-Prot. (int) nr_total Number of hits in Swiss-Prot. tuple of (hits, seqs) nr_positive True positives. tuple of (hits, seqs) nr_unknown Could be positives. tuple of (hits, seqs) nr_false_pos False positives. tuple of (hits, seqs) nr_false_neg False negatives. (int) nr_partial False negatives, because they are fragments. (int) COMMENTS cc_taxo_range Taxonomic range. See docs for format cc_max_repeat Maximum number of repetitions in a protein cc_site Interesting site. list of tuples (pattern pos, desc.) cc_skip_flag Can this entry be ignored? cc_matrix_type cc_scaling_db cc_author cc_ft_key cc_ft_desc DATA BANK REFERENCES - The following are all lists of tuples (swiss-prot accession, swiss-prot name) dr_positive dr_false_neg dr_false_pos dr_potential Potential hits, but fingerprint region not yet available. dr_unknown Could possibly belong pdb_structs List of PDB entries. 
""" def __init__(self): self.name = '' self.type = '' self.accession = '' self.created = '' self.data_update = '' self.info_update = '' self.pdoc = '' self.description = '' self.pattern = '' self.matrix = [] self.rules = [] self.nr_sp_release = '' self.nr_sp_seqs = '' self.nr_total = (None, None) self.nr_positive = (None, None) self.nr_unknown = (None, None) self.nr_false_pos = (None, None) self.nr_false_neg = None self.nr_partial = None self.cc_taxo_range = '' self.cc_max_repeat = '' self.cc_site = [] self.cc_skip_flag = '' self.dr_positive = [] self.dr_false_neg = [] self.dr_false_pos = [] self.dr_potential = [] self.dr_unknown = [] self.pdb_structs = [] class PatternHit: """Holds information from a hit against a Prosite pattern. Members: name ID of the record. e.g. ADH_ZINC accession e.g. PS00387 pdoc ID of the PROSITE DOCumentation. description Free-format description. matches List of tuples (start, end, sequence) where start and end are indexes of the match, and sequence is the sequence matched. """ def __init__(self): self.name = None self.accession = None self.pdoc = None self.description = None self.matches = [] def __str__(self): lines = [] lines.append("%s %s %s" % (self.accession, self.pdoc, self.name)) lines.append(self.description) lines.append('') if len(self.matches) > 1: lines.append("Number of matches: %s" % len(self.matches)) for i in range(len(self.matches)): start, end, seq = self.matches[i] range_str = "%d-%d" % (start, end) if len(self.matches) > 1: lines.append("%7d %10s %s" % (i+1, range_str, seq)) else: lines.append("%7s %10s %s" % (' ', range_str, seq)) return string.join(lines, '\n') class Iterator: """Returns one record at a time from a Prosite file. Methods: next Return the next record from the stream, or None. """ def __init__(self, handle, parser=None): """__init__(self, handle, parser=None) Create a new iterator. handle is a file-like object. parser is an optional Parser object to change the results into another form. 
If set to None, then the raw contents of the file will be returned. """ if type(handle) is not FileType and type(handle) is not InstanceType: raise ValueError, "I expected a file handle or file-like object" self._uhandle = File.UndoHandle(handle) self._parser = parser def next(self): """next(self) -> object Return the next Prosite record from the file. If no more records, return None. """ # Skip the copyright info, if it's the first record. line = self._uhandle.peekline() if line[:2] == 'CC': while 1: line = self._uhandle.readline() if not line: break if line[:2] == '//': break if line[:2] != 'CC': raise SyntaxError, \ "Oops, where's the copyright?" lines = [] while 1: line = self._uhandle.readline() if not line: break lines.append(line) if line[:2] == '//': break if not lines: return None data = string.join(lines, '') if self._parser is not None: return self._parser.parse(File.StringHandle(data)) return data class Dictionary: """Accesses a Prosite file using a dictionary interface. """ __filename_key = '__filename' def __init__(self, indexname, parser=None): """__init__(self, indexname, parser=None) Open a Prosite Dictionary. indexname is the name of the index for the dictionary. The index should have been created using the index_file function. parser is an optional Parser object to change the results into another form. If set to None, then the raw contents of the file will be returned. """ self._index = Index.Index(indexname) self._handle = open(self._index[Dictionary.__filename_key]) self._parser = parser def __len__(self): return len(self._index) def __getitem__(self, key): start, len = self._index[key] self._handle.seek(start) data = self._handle.read(len) if self._parser is not None: return self._parser.parse(File.StringHandle(data)) return data def __getattr__(self, name): return getattr(self._index, name) class ExPASyDictionary: """Access PROSITE at ExPASy using a read-only dictionary interface. 
""" def __init__(self, delay=5.0, parser=None): """__init__(self, delay=5.0, parser=None) Create a new Dictionary to access PROSITE. parser is an optional parser (e.g. Prosite.RecordParser) object to change the results into another form. If set to None, then the raw contents of the file will be returned. delay is the number of seconds to wait between each query. """ self.parser = parser self.limiter = RequestLimiter(delay) def __len__(self): raise NotImplementedError, "Prosite contains lots of entries" def clear(self): raise NotImplementedError, "This is a read-only dictionary" def __setitem__(self, key, item): raise NotImplementedError, "This is a read-only dictionary" def update(self): raise NotImplementedError, "This is a read-only dictionary" def copy(self): raise NotImplementedError, "You don't need to do this..." def keys(self): raise NotImplementedError, "You don't really want to do this..." def items(self): raise NotImplementedError, "You don't really want to do this..." def values(self): raise NotImplementedError, "You don't really want to do this..." def has_key(self, id): """has_key(self, id) -> bool""" try: self[id] except KeyError: return 0 return 1 def get(self, id, failobj=None): try: return self[id] except KeyError: return failobj raise "How did I get here?" def __getitem__(self, id): """__getitem__(self, id) -> object Return a Prosite entry. id is either the id or accession for the entry. Raises a KeyError if there's an error. """ # First, check to see if enough time has passed since my # last query. self.limiter.wait() try: handle = ExPASy.get_prosite_entry(id) except IOError: raise KeyError, id try: handle = File.StringHandle(_extract_record(handle)) except ValueError: raise KeyError, id if self.parser is not None: return self.parser.parse(handle) return handle.read() class RecordParser(AbstractParser): """Parses Prosite data into a Record object. 
""" def __init__(self): self._scanner = _Scanner() self._consumer = _RecordConsumer() def parse(self, handle): self._scanner.feed(handle, self._consumer) return self._consumer.data class _Scanner: """Scans Prosite-formatted data. Tested with: Release 15.0, July 1998 """ def feed(self, handle, consumer): """feed(self, handle, consumer) Feed in Prosite data for scanning. handle is a file-like object that contains prosite data. consumer is a Consumer object that will receive events as the report is scanned. """ if isinstance(handle, File.UndoHandle): uhandle = handle else: uhandle = File.UndoHandle(handle) while 1: line = uhandle.peekline() if not line: break elif is_blank_line(line): # Skip blank lines between records uhandle.readline() continue elif line[:2] == 'ID': self._scan_record(uhandle, consumer) elif line[:2] == 'CC': self._scan_copyrights(uhandle, consumer) else: raise SyntaxError, "There doesn't appear to be a record" def _scan_copyrights(self, uhandle, consumer): consumer.start_copyrights() self._scan_line('CC', uhandle, consumer.copyright, any_number=1) self._scan_terminator(uhandle, consumer) consumer.end_copyrights() def _scan_record(self, uhandle, consumer): consumer.start_record() for fn in self._scan_fns: fn(self, uhandle, consumer) # In Release 15.0, C_TYPE_LECTIN_1 has the DO line before # the 3D lines, instead of the other way around. # Thus, I'll give the 3D lines another chance after the DO lines # are finished. if fn is self._scan_do.im_func: self._scan_3d(uhandle, consumer) consumer.end_record() def _scan_line(self, line_type, uhandle, event_fn, exactly_one=None, one_or_more=None, any_number=None, up_to_one=None): # Callers must set exactly one of exactly_one, one_or_more, or # any_number to a true value. I do not explicitly check to # make sure this function is called correctly. # This does not guarantee any parameter safety, but I # like the readability. The other strategy I tried was have # parameters min_lines, max_lines. 
if exactly_one or one_or_more: read_and_call(uhandle, event_fn, start=line_type) if one_or_more or any_number: while 1: if not attempt_read_and_call(uhandle, event_fn, start=line_type): break if up_to_one: attempt_read_and_call(uhandle, event_fn, start=line_type) def _scan_id(self, uhandle, consumer): self._scan_line('ID', uhandle, consumer.identification, exactly_one=1) def _scan_ac(self, uhandle, consumer): self._scan_line('AC', uhandle, consumer.accession, exactly_one=1) def _scan_dt(self, uhandle, consumer): self._scan_line('DT', uhandle, consumer.date, exactly_one=1) def _scan_de(self, uhandle, consumer): self._scan_line('DE', uhandle, consumer.description, exactly_one=1) def _scan_pa(self, uhandle, consumer): self._scan_line('PA', uhandle, consumer.pattern, any_number=1) def _scan_ma(self, uhandle, consumer): self._scan_line('MA', uhandle, consumer.matrix, any_number=1) ## # ZN2_CY6_FUNGAL_2, DNAJ_2 in Release 15 ## # contain a CC line buried within an 'MA' line. Need to check ## # for that. 
## while 1: ## if not attempt_read_and_call(uhandle, consumer.matrix, start='MA'): ## line1 = uhandle.readline() ## line2 = uhandle.readline() ## uhandle.saveline(line2) ## uhandle.saveline(line1) ## if line1[:2] == 'CC' and line2[:2] == 'MA': ## read_and_call(uhandle, consumer.comment, start='CC') ## else: ## break def _scan_ru(self, uhandle, consumer): self._scan_line('RU', uhandle, consumer.rule, any_number=1) def _scan_nr(self, uhandle, consumer): self._scan_line('NR', uhandle, consumer.numerical_results, any_number=1) def _scan_cc(self, uhandle, consumer): self._scan_line('CC', uhandle, consumer.comment, any_number=1) def _scan_dr(self, uhandle, consumer): self._scan_line('DR', uhandle, consumer.database_reference, any_number=1) def _scan_3d(self, uhandle, consumer): self._scan_line('3D', uhandle, consumer.pdb_reference, any_number=1) def _scan_do(self, uhandle, consumer): self._scan_line('DO', uhandle, consumer.documentation, exactly_one=1) def _scan_terminator(self, uhandle, consumer): self._scan_line('//', uhandle, consumer.terminator, exactly_one=1) _scan_fns = [ _scan_id, _scan_ac, _scan_dt, _scan_de, _scan_pa, _scan_ma, _scan_ru, _scan_nr, _scan_cc, # This is a really dirty hack, and should be fixed properly at # some point. ZN2_CY6_FUNGAL_2, DNAJ_2 in Rel 15 and PS50309 # in Rel 17 have lines out of order. Thus, I have to rescan # these, which decreases performance. _scan_ma, _scan_nr, _scan_cc, _scan_dr, _scan_3d, _scan_do, _scan_terminator ] class _RecordConsumer(AbstractConsumer): """Consumer that converts a Prosite record to a Record object. Members: data Record with Prosite data. 
""" def __init__(self): self.data = None def start_record(self): self.data = Record() def end_record(self): self._clean_record(self.data) def identification(self, line): cols = string.split(line) if len(cols) != 3: raise SyntaxError, "I don't understand identification line\n%s" % \ line self.data.name = self._chomp(cols[1]) # don't want ';' self.data.type = self._chomp(cols[2]) # don't want '.' def accession(self, line): cols = string.split(line) if len(cols) != 2: raise SyntaxError, "I don't understand accession line\n%s" % line self.data.accession = self._chomp(cols[1]) def date(self, line): uprline = string.upper(line) cols = string.split(uprline) # Release 15.0 contains both 'INFO UPDATE' and 'INF UPDATE' if cols[2] != '(CREATED);' or \ cols[4] != '(DATA' or cols[5] != 'UPDATE);' or \ cols[7][:4] != '(INF' or cols[8] != 'UPDATE).': raise SyntaxError, "I don't understand date line\n%s" % line self.data.created = cols[1] self.data.data_update = cols[3] self.data.info_update = cols[6] def description(self, line): self.data.description = self._clean(line) def pattern(self, line): self.data.pattern = self.data.pattern + self._clean(line) def matrix(self, line): self.data.matrix.append(self._clean(line)) def rule(self, line): self.data.rules.append(self._clean(line)) def numerical_results(self, line): cols = string.split(self._clean(line), ';') for col in cols: if not col: continue qual, data = map(string.lstrip, string.split(col, '=')) if qual == '/RELEASE': release, seqs = string.split(data, ',') self.data.nr_sp_release = release self.data.nr_sp_seqs = int(seqs) elif qual == '/FALSE_NEG': self.data.nr_false_neg = int(data) elif qual == '/PARTIAL': self.data.nr_partial = int(data) elif qual in ['/TOTAL', '/POSITIVE', '/UNKNOWN', '/FALSE_POS']: m = re.match(r'(\d+)\((\d+)\)', data) if not m: raise error, "Broken data %s in comment line\n%s" % \ (repr(data), line) hits = tuple(map(int, m.groups())) if(qual == "/TOTAL"): self.data.nr_total = hits elif(qual == 
"/POSITIVE"): self.data.nr_positive = hits elif(qual == "/UNKNOWN"): self.data.nr_unknown = hits elif(qual == "/FALSE_POS"): self.data.nr_false_pos = hits else: raise SyntaxError, "Unknown qual %s in comment line\n%s" % \ (repr(qual), line) def comment(self, line): cols = string.split(self._clean(line), ';') for col in cols: # DNAJ_2 in Release 15 has a non-standard comment line: # CC Automatic scaling using reversed database # Throw it away. (Should I keep it?) if not col or col[:17] == 'Automatic scaling': continue qual, data = map(string.lstrip, string.split(col, '=')) if qual == '/TAXO-RANGE': self.data.cc_taxo_range = data elif qual == '/MAX-REPEAT': self.data.cc_max_repeat = data elif qual == '/SITE': pos, desc = string.split(data, ',') self.data.cc_site = (int(pos), desc) elif qual == '/SKIP-FLAG': self.data.cc_skip_flag = data elif qual == '/MATRIX_TYPE': self.data.cc_matrix_type = data elif qual == '/SCALING_DB': self.data.cc_scaling_db = data elif qual == '/AUTHOR': self.data.cc_author = data elif qual == '/FT_KEY': self.data.cc_ft_key = data elif qual == '/FT_DESC': self.data.cc_ft_desc = data else: raise SyntaxError, "Unknown qual %s in comment line\n%s" % \ (repr(qual), line) def database_reference(self, line): refs = string.split(self._clean(line), ';') for ref in refs: if not ref: continue acc, name, type = map(string.strip, string.split(ref, ',')) if type == 'T': self.data.dr_positive.append((acc, name)) elif type == 'F': self.data.dr_false_pos.append((acc, name)) elif type == 'N': self.data.dr_false_neg.append((acc, name)) elif type == 'P': self.data.dr_potential.append((acc, name)) elif type == '?': self.data.dr_unknown.append((acc, name)) else: raise SyntaxError, "I don't understand type flag %s" % type def pdb_reference(self, line): cols = string.split(line) for id in cols[1:]: # get all but the '3D' col self.data.pdb_structs.append(self._chomp(id)) def documentation(self, line): self.data.pdoc = self._chomp(self._clean(line)) def 
terminator(self, line):
        pass

    def _chomp(self, word, to_chomp='.,;'):
        # Remove the punctuation at the end of a word.
        if word[-1] in to_chomp:
            return word[:-1]
        return word

    def _clean(self, line, rstrip=1):
        # Clean up a line.
        if rstrip:
            return string.rstrip(line[5:])
        return line[5:]

def scan_sequence_expasy(seq=None, id=None, exclude_frequent=None):
    """scan_sequence_expasy(seq=None, id=None, exclude_frequent=None) ->
    list of PatternHit's

    Search a sequence for occurrences of Prosite patterns.  You can
    specify either a sequence in seq or a SwissProt/trEMBL ID or accession
    in id.  Only one of those should be given.  If exclude_frequent
    is true, then the patterns with the high probability of occurring
    will be excluded.

    """
    if (seq and id) or not (seq or id):
        raise ValueError, "Please specify either a sequence or an id"
    handle = ExPASy.scanprosite1(seq, id, exclude_frequent)
    return _extract_pattern_hits(handle)

def _extract_pattern_hits(handle):
    """_extract_pattern_hits(handle) -> list of PatternHit's

    Extract hits from a web page.  Raises a ValueError if there
    was an error in the query.

    """
    class parser(sgmllib.SGMLParser):
        def __init__(self):
            sgmllib.SGMLParser.__init__(self)
            self.hits = []
            self.broken_message = 'Some error occurred'
            self._in_pre = 0
            self._current_hit = None
            self._last_found = None   # Save state of parsing
        def handle_data(self, data):
            if string.find(data, 'try again') >= 0:
                self.broken_message = data
                return
            elif data == 'illegal':
                self.broken_message = 'Sequence contains illegal characters'
                return
            if not self._in_pre:
                return
            elif not string.strip(data):
                return
            if self._last_found is None and data[:4] == 'PDOC':
                self._current_hit.pdoc = data
                self._last_found = 'pdoc'
            elif self._last_found == 'pdoc':
                if data[:2] != 'PS':
                    raise SyntaxError, "Expected accession but got:\n%s" % data
                self._current_hit.accession = data
                self._last_found = 'accession'
            elif self._last_found == 'accession':
                self._current_hit.name = data
                self._last_found = 'name'
            elif self._last_found == 'name':
                self._current_hit.description = data
                self._last_found = 'description'
            elif self._last_found == 'description':
                m = re.findall(r'(\d+)-(\d+) (\w+)', data)
                for start, end, seq in m:
                    self._current_hit.matches.append(
                        (int(start), int(end), seq))

        def do_hr(self, attrs):
            # <HR> inside a <PRE> section means a new hit.
            if self._in_pre:
                self._current_hit = PatternHit()
                self.hits.append(self._current_hit)
                self._last_found = None
        def start_pre(self, attrs):
            self._in_pre = 1
            self.broken_message = None   # Probably not broken
        def end_pre(self):
            self._in_pre = 0
    p = parser()
    p.feed(handle.read())
    if p.broken_message:
        raise ValueError, p.broken_message
    return p.hits
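[Editor's note: the scraping pattern above — flip a flag on <PRE>, collect text runs, treat <HR> as a record separator — is easy to lose among the SGMLParser callbacks. Here is a minimal self-contained sketch of the same state-machine idea using html.parser, sgmllib's rough modern counterpart; the class name, chunk layout, and sample HTML are invented for illustration, not part of the Biopython code.]

```python
from html.parser import HTMLParser

class PreScraper(HTMLParser):
    """Collect text inside <pre> blocks; <hr> inside <pre> starts a new chunk."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []        # one list of text runs per <hr>-separated block
        self._in_pre = False
    def handle_starttag(self, tag, attrs):
        if tag == 'pre':
            self._in_pre = True
            self.chunks.append([])
        elif tag == 'hr' and self._in_pre:
            # Mirrors do_hr above: a rule inside <pre> means a new hit.
            self.chunks.append([])
    def handle_endtag(self, tag):
        if tag == 'pre':
            self._in_pre = False
    def handle_data(self, data):
        # Mirrors handle_data above: ignore text outside <pre> and blank runs.
        if self._in_pre and data.strip():
            self.chunks[-1].append(data.strip())

p = PreScraper()
p.feed("<html><pre>PDOC00064 PS00387<hr>PDOC00065 PS00388</pre></html>")
print([' '.join(c) for c in p.chunks])
```

[The flag-plus-state approach avoids parsing the surrounding page at all: only the `<pre>` payload is interpreted, which is why the real parser can tolerate arbitrary ExPASy page chrome.]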


        
    
def index_file(filename, indexname, rec2key=None):
    """index_file(filename, indexname, rec2key=None)

    Index a Prosite file.  filename is the name of the file.
    indexname is the name of the index to create.  rec2key is an
    optional callback that takes a Record and generates a unique key
    (e.g. the accession number) for the record.  If not specified,
    the record's name (the ID) will be used.

    """
    if not os.path.exists(filename):
        raise ValueError, "%s does not exist" % filename

    index = Index.Index(indexname, truncate=1)
    index[Dictionary._Dictionary__filename_key] = filename
    
    iter = Iterator(open(filename), parser=RecordParser())
    while 1:
        start = iter._uhandle.tell()
        rec = iter.next()
        length = iter._uhandle.tell() - start
        
        if rec is None:
            break
        if rec2key is not None:
            key = rec2key(rec)
        else:
            key = rec.name
            
        if not key:
            raise KeyError, "empty key was produced"
        elif index.has_key(key):
            raise KeyError, "duplicate key %s found" % key

        index[key] = start, length
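[Editor's note: the index built here is just a mapping from record key to a (start, length) byte span, which Dictionary.__getitem__ later uses to seek() and read() exactly one record. A minimal self-contained sketch of that scheme follows, with a plain dict standing in for Index.Index and two invented records; it is an illustration, not the Biopython API.]

```python
import io

def build_index(handle):
    # Map each record's ID to its (start, length) span in the stream.
    # Records are terminated by a '//' line, as in prosite.dat.
    index = {}
    while True:
        start = handle.tell()
        lines = []
        while True:
            line = handle.readline()
            if not line:
                break
            lines.append(line)
            if line.startswith('//'):
                break
        if not lines:
            break
        # 'ID   ADH_ZINC; PATTERN.' -> 'ADH_ZINC'
        key = lines[0].split()[1].rstrip(';')
        index[key] = (start, handle.tell() - start)
    return index

data = ("ID   ADH_ZINC; PATTERN.\nAC   PS00387;\n//\n"
        "ID   DNAJ_2; MATRIX.\nAC   PS50076;\n//\n")
handle = io.StringIO(data)
index = build_index(handle)

# Random access: seek to the stored offset and read exactly that many bytes.
start, length = index['ADH_ZINC']
handle.seek(start)
print(handle.read(length))
```

[Storing offsets instead of parsed records keeps the index tiny and lets the parser be swapped at lookup time, which is exactly why index_file records iter._uhandle.tell() before and after each record.]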

def _extract_record(handle):
    """_extract_record(handle) -> str

    Extract PROSITE data from a web page.  Raises a ValueError if no
    data was found in the web page.

    """
    # All the data appears between tags:
    # <pre>ID   NIR_SIR; PATTERN.
    # </pre>
class parser(sgmllib.SGMLParser): def __init__(self): sgmllib.SGMLParser.__init__(self) self._in_pre = 0 self.data = [] def handle_data(self, data): if self._in_pre: self.data.append(data) def do_br(self, attrs): if self._in_pre: self.data.append('\n') def start_pre(self, attrs): self._in_pre = 1 def end_pre(self): self._in_pre = 0 p = parser() p.feed(handle.read()) if not p.data: raise ValueError, "No data found in web page." return string.join(p.data, '') From mark at acoma.Stanford.EDU Thu Jan 24 17:48:39 2002 From: mark at acoma.Stanford.EDU (Mark Lambrecht) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] Prosite In-Reply-To: <20020124142801.D372@krusty.stanford.edu> Message-ID: Jeff, Everything works fine now. You saved my day : I needed the info in the prosite.dat file. Thanks, Mark On Thu, 24 Jan 2002, Jeffrey Chang wrote: > Yep, it looks like Release 17 from last month introduced some format > changes that broke the parser. I've updated the parser to handle the > new lines -- __init__.py is attached. Please try this out and let me > know how it works. Thanks for the report and the patch! > > Jeff > > > On Wed, Jan 23, 2002 at 12:36:51PM -0800, Mark Lambrecht wrote: > > Hi, > > > > Thanks for all the excellent Biopython code. > > I used the Prosite parser and it breaks on a number of CC and MA lines. > > Maybe there is a new version of the prosite.dat file ? > > We added some code to the Bio/Prosite/__init__.py , and commented it with > > ## (lambrecht/dyoo) > > Then everything works again but possibly doesn't use the information in > > these lines. > > I attached the __init__.py > > Could you take a look ? > > > > Thanks !! 
> > > > Mark > > > > > > -------------------------------------------------------------------------- > > Mark Lambrecht > > Postdoctoral Research Fellow > > The Arabidopsis Information Resource FAX: (650) 325-6857 > > Carnegie Institution of Washington Tel: (650) 325-1521 ext.397 > > Department of Plant Biology URL: http://arabidopsis.org/ > > 260 Panama St. > > Stanford, CA 94305 > > -------------------------------------------------------------------------- > > > # Copyright 1999 by Jeffrey Chang. All rights reserved. > > # This code is part of the Biopython distribution and governed by its > > # license. Please see the LICENSE file that should have been included > > # as part of this package. > > > > # Copyright 2000 by Jeffrey Chang. All rights reserved. > > # This code is part of the Biopython distribution and governed by its > > # license. Please see the LICENSE file that should have been included > > # as part of this package. > > > > """Prosite > > > > This module provides code to work with the prosite.dat file from > > Prosite. > > http://www.expasy.ch/prosite/ > > > > Tested with: > > Release 15.0, July 1998 > > Release 16.0, July 1999 > > > > > > Classes: > > Record Holds Prosite data. > > PatternHit Holds data from a hit against a Prosite pattern. > > Iterator Iterates over entries in a Prosite file. > > Dictionary Accesses a Prosite file using a dictionary interface. > > ExPASyDictionary Accesses Prosite records from ExPASy. > > RecordParser Parses a Prosite record into a Record object. > > > > _Scanner Scans Prosite-formatted data. > > _RecordConsumer Consumes Prosite data to a Record object. > > > > > > Functions: > > scan_sequence_expasy Scan a sequence for occurrences of Prosite patterns. > > index_file Index a Prosite file for a Dictionary. > > _extract_record Extract Prosite data from a web page. > > _extract_pattern_hits Extract Prosite patterns from a web page. 
> > > > """ > > __all__ = [ > > 'Pattern', > > 'Prodoc', > > ] > > from types import * > > import string > > import re > > import sgmllib > > from Bio import File > > from Bio import Index > > from Bio.ParserSupport import * > > from Bio.WWW import ExPASy > > from Bio.WWW import RequestLimiter > > > > class Record: > > """Holds information from a Prosite record. > > > > Members: > > name ID of the record. e.g. ADH_ZINC > > type Type of entry. e.g. PATTERN, MATRIX, or RULE > > accession e.g. PS00387 > > created Date the entry was created. (MMM-YYYY) > > data_update Date the 'primary' data was last updated. > > info_update Date data other than 'primary' data was last updated. > > pdoc ID of the PROSITE DOCumentation. > > > > description Free-format description. > > pattern The PROSITE pattern. See docs. > > matrix List of strings that describes a matrix entry. > > rules List of rule definitions. (strings) > > > > NUMERICAL RESULTS > > nr_sp_release SwissProt release. > > nr_sp_seqs Number of seqs in that release of Swiss-Prot. (int) > > nr_total Number of hits in Swiss-Prot. tuple of (hits, seqs) > > nr_positive True positives. tuple of (hits, seqs) > > nr_unknown Could be positives. tuple of (hits, seqs) > > nr_false_pos False positives. tuple of (hits, seqs) > > nr_false_neg False negatives. (int) > > nr_partial False negatives, because they are fragments. (int) > > > > COMMENTS > > cc_taxo_range Taxonomic range. See docs for format > > cc_max_repeat Maximum number of repetitions in a protein > > cc_site Interesting site. list of tuples (pattern pos, desc.) > > cc_skip_flag Can this entry be ignored? > > > > DATA BANK REFERENCES - The following are all > > lists of tuples (swiss-prot accession, > > swiss-prot name) > > dr_positive > > dr_false_neg > > dr_false_pos > > dr_potential Potential hits, but fingerprint region not yet available. > > dr_unknown Could possibly belong > > > > pdb_structs List of PDB entries. 
> > > > """ > > def __init__(self): > > self.name = '' > > self.type = '' > > self.accession = '' > > self.created = '' > > self.data_update = '' > > self.info_update = '' > > self.pdoc = '' > > > > self.description = '' > > self.pattern = '' > > self.matrix = [] > > self.rules = [] > > > > self.nr_sp_release = '' > > self.nr_sp_seqs = '' > > self.nr_total = (None, None) > > self.nr_positive = (None, None) > > self.nr_unknown = (None, None) > > self.nr_false_pos = (None, None) > > self.nr_false_neg = None > > self.nr_partial = None > > > > self.cc_taxo_range = '' > > self.cc_max_repeat = '' > > self.cc_site = [] > > self.cc_skip_flag = '' > > > > self.dr_positive = [] > > self.dr_false_neg = [] > > self.dr_false_pos = [] > > self.dr_potential = [] > > self.dr_unknown = [] > > > > self.pdb_structs = [] > > > > class PatternHit: > > """Holds information from a hit against a Prosite pattern. > > > > Members: > > name ID of the record. e.g. ADH_ZINC > > accession e.g. PS00387 > > pdoc ID of the PROSITE DOCumentation. > > description Free-format description. > > matches List of tuples (start, end, sequence) where > > start and end are indexes of the match, and sequence is > > the sequence matched. 
> > > > """ > > def __init__(self): > > self.name = None > > self.accession = None > > self.pdoc = None > > self.description = None > > self.matches = [] > > def __str__(self): > > lines = [] > > lines.append("%s %s %s" % (self.accession, self.pdoc, self.name)) > > lines.append(self.description) > > lines.append('') > > if len(self.matches) > 1: > > lines.append("Number of matches: %s" % len(self.matches)) > > for i in range(len(self.matches)): > > start, end, seq = self.matches[i] > > range_str = "%d-%d" % (start, end) > > if len(self.matches) > 1: > > lines.append("%7d %10s %s" % (i+1, range_str, seq)) > > else: > > lines.append("%7s %10s %s" % (' ', range_str, seq)) > > return string.join(lines, '\n') > > > > class Iterator: > > """Returns one record at a time from a Prosite file. > > > > Methods: > > next Return the next record from the stream, or None. > > > > """ > > def __init__(self, handle, parser=None): > > """__init__(self, handle, parser=None) > > > > Create a new iterator. handle is a file-like object. parser > > is an optional Parser object to change the results into another form. > > If set to None, then the raw contents of the file will be returned. > > > > """ > > if type(handle) is not FileType and type(handle) is not InstanceType: > > raise ValueError, "I expected a file handle or file-like object" > > self._uhandle = File.UndoHandle(handle) > > self._parser = parser > > > > def next(self): > > """next(self) -> object > > > > Return the next Prosite record from the file. If no more records, > > return None. > > > > """ > > # Skip the copyright info, if it's the first record. > > line = self._uhandle.peekline() > > if line[:2] == 'CC': > > while 1: > > line = self._uhandle.readline() > > if not line: > > break > > if line[:2] == '//': > > break > > if line[:2] != 'CC': > > raise SyntaxError, \ > > "Oops, where's the copyright?" 
> > > > lines = [] > > while 1: > > line = self._uhandle.readline() > > if not line: > > break > > lines.append(line) > > if line[:2] == '//': > > break > > > > if not lines: > > return None > > > > data = string.join(lines, '') > > if self._parser is not None: > > return self._parser.parse(File.StringHandle(data)) > > return data > > > > class Dictionary: > > """Accesses a Prosite file using a dictionary interface. > > > > """ > > __filename_key = '__filename' > > > > def __init__(self, indexname, parser=None): > > """__init__(self, indexname, parser=None) > > > > Open a Prosite Dictionary. indexname is the name of the > > index for the dictionary. The index should have been created > > using the index_file function. parser is an optional Parser > > object to change the results into another form. If set to None, > > then the raw contents of the file will be returned. > > > > """ > > self._index = Index.Index(indexname) > > self._handle = open(self._index[Dictionary.__filename_key]) > > self._parser = parser > > > > def __len__(self): > > return len(self._index) > > > > def __getitem__(self, key): > > start, len = self._index[key] > > self._handle.seek(start) > > data = self._handle.read(len) > > if self._parser is not None: > > return self._parser.parse(File.StringHandle(data)) > > return data > > > > def __getattr__(self, name): > > return getattr(self._index, name) > > > > class ExPASyDictionary: > > """Access PROSITE at ExPASy using a read-only dictionary interface. > > > > """ > > def __init__(self, delay=5.0, parser=None): > > """__init__(self, delay=5.0, parser=None) > > > > Create a new Dictionary to access PROSITE. parser is an optional > > parser (e.g. Prosite.RecordParser) object to change the results > > into another form. If set to None, then the raw contents of the > > file will be returned. delay is the number of seconds to wait > > between each query. 
> > > > """ > > self.parser = parser > > self.limiter = RequestLimiter(delay) > > > > def __len__(self): > > raise NotImplementedError, "Prosite contains lots of entries" > > def clear(self): > > raise NotImplementedError, "This is a read-only dictionary" > > def __setitem__(self, key, item): > > raise NotImplementedError, "This is a read-only dictionary" > > def update(self): > > raise NotImplementedError, "This is a read-only dictionary" > > def copy(self): > > raise NotImplementedError, "You don't need to do this..." > > def keys(self): > > raise NotImplementedError, "You don't really want to do this..." > > def items(self): > > raise NotImplementedError, "You don't really want to do this..." > > def values(self): > > raise NotImplementedError, "You don't really want to do this..." > > > > def has_key(self, id): > > """has_key(self, id) -> bool""" > > try: > > self[id] > > except KeyError: > > return 0 > > return 1 > > > > def get(self, id, failobj=None): > > try: > > return self[id] > > except KeyError: > > return failobj > > raise "How did I get here?" > > > > def __getitem__(self, id): > > """__getitem__(self, id) -> object > > > > Return a Prosite entry. id is either the id or accession > > for the entry. Raises a KeyError if there's an error. > > > > """ > > # First, check to see if enough time has passed since my > > # last query. > > self.limiter.wait() > > > > try: > > handle = ExPASy.get_prosite_entry(id) > > except IOError: > > raise KeyError, id > > try: > > handle = File.StringHandle(_extract_record(handle)) > > except ValueError: > > raise KeyError, id > > > > if self.parser is not None: > > return self.parser.parse(handle) > > return handle.read() > > > > class RecordParser(AbstractParser): > > """Parses Prosite data into a Record object. 
> > > > """ > > def __init__(self): > > self._scanner = _Scanner() > > self._consumer = _RecordConsumer() > > > > def parse(self, handle): > > self._scanner.feed(handle, self._consumer) > > return self._consumer.data > > > > class _Scanner: > > """Scans Prosite-formatted data. > > > > Tested with: > > Release 15.0, July 1998 > > > > """ > > def feed(self, handle, consumer): > > """feed(self, handle, consumer) > > > > Feed in Prosite data for scanning. handle is a file-like > > object that contains prosite data. consumer is a > > Consumer object that will receive events as the report is scanned. > > > > """ > > if isinstance(handle, File.UndoHandle): > > uhandle = handle > > else: > > uhandle = File.UndoHandle(handle) > > > > while 1: > > line = uhandle.peekline() > > if not line: > > break > > elif is_blank_line(line): > > # Skip blank lines between records > > uhandle.readline() > > continue > > elif line[:2] == 'ID': > > self._scan_record(uhandle, consumer) > > elif line[:2] == 'CC': > > self._scan_copyrights(uhandle, consumer) > > else: > > raise SyntaxError, "There doesn't appear to be a record" > > > > def _scan_copyrights(self, uhandle, consumer): > > consumer.start_copyrights() > > self._scan_line('CC', uhandle, consumer.copyright, any_number=1) > > self._scan_terminator(uhandle, consumer) > > consumer.end_copyrights() > > > > def _scan_record(self, uhandle, consumer): > > consumer.start_record() > > for fn in self._scan_fns: > > fn(self, uhandle, consumer) > > > > # In Release 15.0, C_TYPE_LECTIN_1 has the DO line before > > # the 3D lines, instead of the other way around. > > # Thus, I'll give the 3D lines another chance after the DO lines > > # are finished. 
> > if fn is self._scan_do.im_func: > > self._scan_3d(uhandle, consumer) > > consumer.end_record() > > > > def _scan_line(self, line_type, uhandle, event_fn, > > exactly_one=None, one_or_more=None, any_number=None, > > up_to_one=None): > > # Callers must set exactly one of exactly_one, one_or_more, or > > # any_number to a true value. I do not explicitly check to > > # make sure this function is called correctly. > > > > # This does not guarantee any parameter safety, but I > > # like the readability. The other strategy I tried was have > > # parameters min_lines, max_lines. > > > > if exactly_one or one_or_more: > > read_and_call(uhandle, event_fn, start=line_type) > > if one_or_more or any_number: > > while 1: > > if not attempt_read_and_call(uhandle, event_fn, > > start=line_type): > > break > > if up_to_one: > > attempt_read_and_call(uhandle, event_fn, start=line_type) > > > > def _scan_id(self, uhandle, consumer): > > self._scan_line('ID', uhandle, consumer.identification, exactly_one=1) > > > > def _scan_ac(self, uhandle, consumer): > > self._scan_line('AC', uhandle, consumer.accession, exactly_one=1) > > > > def _scan_dt(self, uhandle, consumer): > > self._scan_line('DT', uhandle, consumer.date, exactly_one=1) > > > > def _scan_de(self, uhandle, consumer): > > self._scan_line('DE', uhandle, consumer.description, exactly_one=1) > > > > def _scan_pa(self, uhandle, consumer): > > self._scan_line('PA', uhandle, consumer.pattern, any_number=1) > > > > def _scan_ma(self, uhandle, consumer): > > # ZN2_CY6_FUNGAL_2, DNAJ_2 in Release 15 > > # contain a CC line buried within an 'MA' line. Need to check > > # for that. 
> > while 1: > > if not attempt_read_and_call(uhandle, consumer.matrix, start='MA'): > > line1 = uhandle.readline() > > line2 = uhandle.readline() > > uhandle.saveline(line2) > > uhandle.saveline(line1) > > if line1[:2] == 'CC' and line2[:2] == 'MA': > > read_and_call(uhandle, consumer.comment, start='CC') > > else: > > break > > > > def _scan_ru(self, uhandle, consumer): > > self._scan_line('RU', uhandle, consumer.rule, any_number=1) > > > > def _scan_nr(self, uhandle, consumer): > > self._scan_line('NR', uhandle, consumer.numerical_results, > > any_number=1) > > > > def _scan_cc(self, uhandle, consumer): > > self._scan_line('CC', uhandle, consumer.comment, any_number=1) > > > > def _scan_dr(self, uhandle, consumer): > > self._scan_line('DR', uhandle, consumer.database_reference, > > any_number=1) > > > > def _scan_3d(self, uhandle, consumer): > > self._scan_line('3D', uhandle, consumer.pdb_reference, > > any_number=1) > > > > def _scan_do(self, uhandle, consumer): > > self._scan_line('DO', uhandle, consumer.documentation, exactly_one=1) > > > > def _scan_terminator(self, uhandle, consumer): > > self._scan_line('//', uhandle, consumer.terminator, exactly_one=1) > > > > _scan_fns = [ > > _scan_id, > > _scan_ac, > > _scan_dt, > > _scan_de, > > _scan_pa, > > _scan_ma, > > _scan_ru, > > _scan_nr, > > _scan_ma, ## (lambrecht/dyoo) is this right? > > _scan_nr, ## (lambrecht/dyoo) is this right? > > _scan_cc, > > _scan_dr, > > _scan_3d, > > _scan_do, > > _scan_terminator > > ] > > > > class _RecordConsumer(AbstractConsumer): > > """Consumer that converts a Prosite record to a Record object. > > > > Members: > > data Record with Prosite data. 
> > > > """ > > def __init__(self): > > self.data = None > > > > def start_record(self): > > self.data = Record() > > > > def end_record(self): > > self._clean_record(self.data) > > > > def identification(self, line): > > cols = string.split(line) > > if len(cols) != 3: > > raise SyntaxError, "I don't understand identification line\n%s" % \ > > line > > self.data.name = self._chomp(cols[1]) # don't want ';' > > self.data.type = self._chomp(cols[2]) # don't want '.' > > > > def accession(self, line): > > cols = string.split(line) > > if len(cols) != 2: > > raise SyntaxError, "I don't understand accession line\n%s" % line > > self.data.accession = self._chomp(cols[1]) > > > > def date(self, line): > > uprline = string.upper(line) > > cols = string.split(uprline) > > > > # Release 15.0 contains both 'INFO UPDATE' and 'INF UPDATE' > > if cols[2] != '(CREATED);' or \ > > cols[4] != '(DATA' or cols[5] != 'UPDATE);' or \ > > cols[7][:4] != '(INF' or cols[8] != 'UPDATE).': > > raise SyntaxError, "I don't understand date line\n%s" % line > > > > self.data.created = cols[1] > > self.data.data_update = cols[3] > > self.data.info_update = cols[6] > > > > def description(self, line): > > self.data.description = self._clean(line) > > > > def pattern(self, line): > > self.data.pattern = self.data.pattern + self._clean(line) > > > > def matrix(self, line): > > self.data.matrix.append(self._clean(line)) > > > > def rule(self, line): > > self.data.rules.append(self._clean(line)) > > > > def numerical_results(self, line): > > cols = string.split(self._clean(line), ';') > > for col in cols: > > if not col: > > continue > > qual, data = map(string.lstrip, string.split(col, '=')) > > if qual == '/RELEASE': > > release, seqs = string.split(data, ',') > > self.data.nr_sp_release = release > > self.data.nr_sp_seqs = int(seqs) > > elif qual == '/FALSE_NEG': > > self.data.nr_false_neg = int(data) > > elif qual == '/PARTIAL': > > self.data.nr_partial = int(data) > > ## (lambrecht/dyoo) added 
temporary fix for qual //MATRIX_TYPE in CC > > elif qual == '/MATRIX_TYPE': > > pass > > elif qual in ['/TOTAL', '/POSITIVE', '/UNKNOWN', '/FALSE_POS']: > > m = re.match(r'(\d+)\((\d+)\)', data) > > if not m: > > raise SyntaxError, "Broken data %s in comment line\n%s" % \ > > (repr(data), line) > > hits = tuple(map(int, m.groups())) > > if(qual == "/TOTAL"): > > self.data.nr_total = hits > > elif(qual == "/POSITIVE"): > > self.data.nr_positive = hits > > elif(qual == "/UNKNOWN"): > > self.data.nr_unknown = hits > > elif(qual == "/FALSE_POS"): > > self.data.nr_false_pos = hits > > else: > > raise SyntaxError, "Unknown qual %s in comment line\n%s" % \ > > (repr(qual), line) > > > > def comment(self, line): > > cols = string.split(self._clean(line), ';') > > for col in cols: > > # DNAJ_2 in Release 15 has a non-standard comment line: > > # CC Automatic scaling using reversed database > > # Throw it away. (Should I keep it?) > > if not col or col[:17] == 'Automatic scaling': > > continue > > qual, data = map(string.lstrip, string.split(col, '=')) > > if qual in ('/MATRIX_TYPE', '/SCALING_DB', '/AUTHOR', > > '/FT_KEY', '/FT_DESC'): > > continue ## (lambrecht/dyoo) This is a temporary fix until we know what > > ## to do here > > if qual == '/TAXO-RANGE': > > self.data.cc_taxo_range = data > > elif qual == '/MAX-REPEAT': > > self.data.cc_max_repeat = data > > elif qual == '/SITE': > > pos, desc = string.split(data, ',') > > self.data.cc_site.append((int(pos), desc)) > > elif qual == '/SKIP-FLAG': > > self.data.cc_skip_flag = data > > else: > > raise SyntaxError, "Unknown qual %s in comment line\n%s" % \ > > (repr(qual), line) > > > > def database_reference(self, line): > > refs = string.split(self._clean(line), ';') > > for ref in refs: > > if not ref: > > continue > > acc, name, type = map(string.strip, string.split(ref, ',')) > > if type == 'T': > > self.data.dr_positive.append((acc, name)) > > elif type == 'F': > > self.data.dr_false_pos.append((acc, name)) > > elif type == 'N': > 
> self.data.dr_false_neg.append((acc, name)) > > elif type == 'P': > > self.data.dr_potential.append((acc, name)) > > elif type == '?': > > self.data.dr_unknown.append((acc, name)) > > else: > > raise SyntaxError, "I don't understand type flag %s" % type > > > > def pdb_reference(self, line): > > cols = string.split(line) > > for id in cols[1:]: # get all but the '3D' col > > self.data.pdb_structs.append(self._chomp(id)) > > > > def documentation(self, line): > > self.data.pdoc = self._chomp(self._clean(line)) > > > > def terminator(self, line): > > pass > > > > def _chomp(self, word, to_chomp='.,;'): > > # Remove the punctuation at the end of a word. > > if word[-1] in to_chomp: > > return word[:-1] > > return word > > > > def _clean(self, line, rstrip=1): > > # Clean up a line. > > if rstrip: > > return string.rstrip(line[5:]) > > return line[5:] > > > > def scan_sequence_expasy(seq=None, id=None, exclude_frequent=None): > > """scan_sequence_expasy(seq=None, id=None, exclude_frequent=None) -> > > list of PatternHit's > > > > Search a sequence for occurrences of Prosite patterns. You can > > specify either a sequence in seq or a SwissProt/trEMBL ID or accession > > in id. Only one of those should be given. If exclude_frequent > > is true, then the patterns with the high probability of occurring > > will be excluded. > > > > """ > > if (seq and id) or not (seq or id): > > raise ValueError, "Please specify either a sequence or an id" > > handle = ExPASy.scanprosite1(seq, id, exclude_frequent) > > return _extract_pattern_hits(handle) > > > > def _extract_pattern_hits(handle): > > """_extract_pattern_hits(handle) -> list of PatternHit's > > > > Extract hits from a web page. Raises a ValueError if there > > was an error in the query. 
> > > > """ > > class parser(sgmllib.SGMLParser): > > def __init__(self): > > sgmllib.SGMLParser.__init__(self) > > self.hits = [] > > self.broken_message = 'Some error occurred' > > self._in_pre = 0 > > self._current_hit = None > > self._last_found = None # Save state of parsing > > def handle_data(self, data): > > if string.find(data, 'try again') >= 0: > > self.broken_message = data > > return > > elif data == 'illegal': > > self.broken_message = 'Sequence contains illegal characters' > > return > > if not self._in_pre: > > return > > elif not string.strip(data): > > return > > if self._last_found is None and data[:4] == 'PDOC': > > self._current_hit.pdoc = data > > self._last_found = 'pdoc' > > elif self._last_found == 'pdoc': > > if data[:2] != 'PS': > > raise SyntaxError, "Expected accession but got:\n%s" % data > > self._current_hit.accession = data > > self._last_found = 'accession' > > elif self._last_found == 'accession': > > self._current_hit.name = data > > self._last_found = 'name' > > elif self._last_found == 'name': > > self._current_hit.description = data > > self._last_found = 'description' > > elif self._last_found == 'description': > > m = re.findall(r'(\d+)-(\d+) (\w+)', data) > > for start, end, seq in m: > > self._current_hit.matches.append( > > (int(start), int(end), seq)) > > > > def do_hr(self, attrs): > > #
<hr> inside a <pre> section means a new hit.
> >             if self._in_pre:
> >                 self._current_hit = PatternHit()
> >                 self.hits.append(self._current_hit)
> >                 self._last_found = None
> >         def start_pre(self, attrs):
> >             self._in_pre = 1
> >             self.broken_message = None   # Probably not broken
> >         def end_pre(self):
> >             self._in_pre = 0
> >     p = parser()
> >     p.feed(handle.read())
> >     if p.broken_message:
> >         raise ValueError, p.broken_message
> >     return p.hits
> >
> >
> >
> >
> > def index_file(filename, indexname, rec2key=None):
> >     """index_file(filename, indexname, rec2key=None)
> >
> >     Index a Prosite file.  filename is the name of the file.
> >     indexname is the name of the dictionary.  rec2key is an
> >     optional callback that takes a Record and generates a unique key
> >     (e.g. the accession number) for the record.  If not specified,
> >     the id name will be used.
> >
> >     """
> >     if not os.path.exists(filename):
> >         raise ValueError, "%s does not exist" % filename
> >
> >     index = Index.Index(indexname, truncate=1)
> >     index[Dictionary._Dictionary__filename_key] = filename
> >
> >     iter = Iterator(open(filename), parser=RecordParser())
> >     while 1:
> >         start = iter._uhandle.tell()
> >         rec = iter.next()
> >         length = iter._uhandle.tell() - start
> >
> >         if rec is None:
> >             break
> >         if rec2key is not None:
> >             key = rec2key(rec)
> >         else:
> >             key = rec.name
> >
> >         if not key:
> >             raise KeyError, "empty key was produced"
> >         elif index.has_key(key):
> >             raise KeyError, "duplicate key %s found" % key
> >
> >         index[key] = start, length
> >
> > def _extract_record(handle):
> >     """_extract_record(handle) -> str
> >
> >     Extract PROSITE data from a web page.  Raises a ValueError if no
> >     data was found in the web page.
> >
> >     """
> >     # All the data appears between tags:
> >     # <pre>
> >     # ID   NIR_SIR; PATTERN.
> >     # </pre>
> > class parser(sgmllib.SGMLParser): > > def __init__(self): > > sgmllib.SGMLParser.__init__(self) > > self._in_pre = 0 > > self.data = [] > > def handle_data(self, data): > > if self._in_pre: > > self.data.append(data) > > def do_br(self, attrs): > > if self._in_pre: > > self.data.append('\n') > > def start_pre(self, attrs): > > self._in_pre = 1 > > def end_pre(self): > > self._in_pre = 0 > > p = parser() > > p.feed(handle.read()) > > if not p.data: > > raise ValueError, "No data found in web page." > > return string.join(p.data, '') > > > > -------------------------------------------------------------------------- Mark Lambrecht Postdoctoral Research Fellow The Arabidopsis Information Resource FAX: (650) 325-6857 Carnegie Institution of Washington Tel: (650) 325-1521 ext.397 Department of Plant Biology URL: http://arabidopsis.org/ 260 Panama St. Stanford, CA 94305 -------------------------------------------------------------------------- From johann at egenetics.com Fri Jan 25 03:11:29 2002 From: johann at egenetics.com (Johann Visagie) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] Re: [Zopymed] Hello fellow snakes !!! In-Reply-To: References: <3C4FC5BA.59361A92@fagmed.uit.no> Message-ID: <20020125081129.GA56371@fling.sanbi.ac.za> chrisf@fagmed.uit.no on 2002-01-24: > > Biopython zope objects complete with interface (Blast, .....) and output > (BioXML,....) > Bioperl zope objects complete with interface (T-coffee) and output > (BioXML, GAME, ...) > Web maintainers could just install the products and Voila. > Biopython ZClasses (and maybe BioPerl objects?) could be subclassed for > special uses through properties interface. > The output of one zope product could be fed into another allowing for > complex scripts. > A weird example. > From Swissprot choose trypsin->fasta->phylogeny->align each > group->consensus for each group->common restriction enzymes, ..... 
I think these are the sort of ideas that need to be shared with the Biopython (and possibly Bioperl) developers! :-) Jon Edwards on 2002-01-25 (Fri) at 00:35:01 -0000: > > For those of us not familiar with the BioInformatics field, could you give a > little more explanation of some of those terms? > [ snip ] > > Johann Visagie mentioned in an earlier post - > > "The concept of using Zope to build a set of "bio-web-widgets" on > top of Biopython has even been mooted at times." > > - is that the sort of thing you mean? Quite, yes. I should mention that it had been mooted mostly by me, and mentioned during the BioPython BoF at BOSC 2000. To my knowledge, no actual work - or even serious discussion - along these lines has yet been undertaken by anyone. At least not using Zope! > Would this be mainly for the > Bioinformatics community, or would it also be useful for other medical > fields? Mostly Bioinformatics, I would assume. > Please excuse my ignorance, I'm from a techie, not a medical background :-) Ditto. :-) -- V From biopython-bugs at bioperl.org Sat Jan 26 13:22:55 2002 From: biopython-bugs at bioperl.org (biopython-bugs@bioperl.org) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] Notification: incoming/55 Message-ID: <200201261822.g0QIMtA00439@pw600a.bioperl.org> JitterBug notification new message incoming/55 Message summary for PR#55 From: newgene@bigfoot.com Subject: GenBank parser problem? Date: Sat, 26 Jan 2002 13:22:55 -0500 0 replies 0 followups ====> ORIGINAL MESSAGE FOLLOWS <==== >From newgene@bigfoot.com Sat Jan 26 13:22:55 2002 Received: from localhost (localhost [127.0.0.1]) by pw600a.bioperl.org (8.11.2/8.11.2) with ESMTP id g0QIMtA00434 for ; Sat, 26 Jan 2002 13:22:55 -0500 Date: Sat, 26 Jan 2002 13:22:55 -0500 Message-Id: <200201261822.g0QIMtA00434@pw600a.bioperl.org> From: newgene@bigfoot.com To: biopython-bugs@bioperl.org Subject: GenBank parser problem? 
Full_Name: Chunlei Wu Module: Bio/GenBank Version: 1.00a4 OS: win2000 Submission from: pathg01-178.mdacc.tmc.edu (143.111.173.178) Python version: ActivePython 2.1.1 Symptom: >>> from Bio import GenBank >>> gi=GenBank.search_for("NM_007355")[0] >>> ncbi_dict=GenBank.NCBIDictionary(parser=GenBank.FeatureParser()) >>> record=ncbi_dict[gi] Traceback (most recent call last): File "<stdin>", line 1, in ? File "C:\Python21\Bio\GenBank\__init__.py", line 1555, in __getitem__ return self.parser.parse(handle) File "C:\Python21\Bio\GenBank\__init__.py", line 268, in parse self._scanner.feed(handle, self._consumer) File "C:\Python21\Bio\GenBank\__init__.py", line 1250, in feed self._parser.parseFile(handle) File "C:\Python21\Martel\Parser.py", line 230, in parseFile self.parseString(fileobj.read()) File "C:\Python21\Martel\Parser.py", line 258, in parseString self._err_handler.fatalError(result) File "C:\Python21\lib\xml\sax\handler.py", line 38, in fatalError raise exception ParserPositionException: error parsing at or beyond character 55 >>> Did GenBank change the format? Thanks. Chunlei From r.grenyer at ic.ac.uk Sun Jan 27 19:50:17 2002 From: r.grenyer at ic.ac.uk (Rich Grenyer) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] Notification: incoming/55 In-Reply-To: <200201261822.g0QIMtA00439@pw600a.bioperl.org> References: <200201261822.g0QIMtA00439@pw600a.bioperl.org> Message-ID: Found the same problem on Friday with BioPython1.00a4 on both a MacPython2.2 and a Linux Python2.1 installation. Rich >JitterBug notification > >new message incoming/55 > >Message summary for PR#55 > From: newgene@bigfoot.com > Subject: GenBank parser problem? 
> Date: Sat, 26 Jan 2002 13:22:55 -0500 > 0 replies 0 followups > >====> ORIGINAL MESSAGE FOLLOWS <==== > >>From newgene@bigfoot.com Sat Jan 26 13:22:55 2002 >Received: from localhost (localhost [127.0.0.1]) > by pw600a.bioperl.org (8.11.2/8.11.2) with ESMTP id g0QIMtA00434 > for ; Sat, 26 Jan 2002 >13:22:55 -0500 >Date: Sat, 26 Jan 2002 13:22:55 -0500 >Message-Id: <200201261822.g0QIMtA00434@pw600a.bioperl.org> >From: newgene@bigfoot.com >To: biopython-bugs@bioperl.org >Subject: GenBank parser problem? > >Full_Name: Chunlei Wu >Module: Bio/GenBank >Version: 1.00a4 >OS: win2000 >Submission from: pathg01-178.mdacc.tmc.edu (143.111.173.178) > > >Python version: ActivePython 2.1.1 > >Symptom: > >>>> from Bio import GenBank >>>> gi=GenBank.search_for("NM_007355")[0] >>>> ncbi_dict=GenBank.NCBIDictionary(parser=GenBank.FeatureParser()) >>>> record=ncbi_dict[gi] >Traceback (most recent call last): > File "<stdin>", line 1, in ? > File "C:\Python21\Bio\GenBank\__init__.py", line 1555, in __getitem__ > return self.parser.parse(handle) > File "C:\Python21\Bio\GenBank\__init__.py", line 268, in parse > self._scanner.feed(handle, self._consumer) > File "C:\Python21\Bio\GenBank\__init__.py", line 1250, in feed > self._parser.parseFile(handle) > File "C:\Python21\Martel\Parser.py", line 230, in parseFile > self.parseString(fileobj.read()) > File "C:\Python21\Martel\Parser.py", line 258, in parseString > self._err_handler.fatalError(result) > File "C:\Python21\lib\xml\sax\handler.py", line 38, in fatalError > raise exception >ParserPositionException: error parsing at or beyond character 55 >>>> > > >Did GenBank change the format? >Thanks. 
> >Chunlei > > > >_______________________________________________ >Biopython-dev mailing list >Biopython-dev@biopython.org >http://biopython.org/mailman/listinfo/biopython-dev -- ___________________________ Rich Grenyer Mammalian Evolution and Conservation Department of Biology and Biochemistry Imperial College at Silwood Park Sunningdale Berkshire SL5 7PY UNITED KINGDOM Tel: +00 44 (0)20 7594 2328 Fax: +00 44 (0)20 7594 2339 Mob: +00 44 (0)7967 632093 email: r.grenyer@ic.ac.uk WWW: http://www.bio.ic.ac.uk/evolve/people/rich ___________________________ From pewilkinson at informaxinc.com Mon Jan 28 11:15:06 2002 From: pewilkinson at informaxinc.com (Peter Wilkinson) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] Notification: incoming/55 In-Reply-To: <200201271704.g0RH4Lit020756@pw600a.bioperl.org> Message-ID: <000401c1a816$f43c8b80$c920a8c0@l001696w00> Yes, release 127 is different from 126 Peter W. > Message: 1 > Date: Sat, 26 Jan 2002 13:22:55 -0500 > From: biopython-bugs@bioperl.org > To: biopython-dev@biopython.org > Subject: [Biopython-dev] Notification: incoming/55 > > JitterBug notification > > new message incoming/55 > > Message summary for PR#55 > From: newgene@bigfoot.com > Subject: GenBank parser problem? > Date: Sat, 26 Jan 2002 13:22:55 -0500 > 0 replies 0 followups > > ====> ORIGINAL MESSAGE FOLLOWS <==== > > >From newgene@bigfoot.com Sat Jan 26 13:22:55 2002 > Received: from localhost (localhost [127.0.0.1]) > by pw600a.bioperl.org (8.11.2/8.11.2) with ESMTP id g0QIMtA00434 > for ; Sat, 26 Jan > 2002 13:22:55 -0500 > Date: Sat, 26 Jan 2002 13:22:55 -0500 > Message-Id: <200201261822.g0QIMtA00434@pw600a.bioperl.org> > From: newgene@bigfoot.com > To: biopython-bugs@bioperl.org > Subject: GenBank parser problem? 
> > Full_Name: Chunlei Wu > Module: Bio/GenBank > Version: 1.00a4 > OS: win2000 > Submission from: pathg01-178.mdacc.tmc.edu (143.111.173.178) > > > Python version: ActivePython 2.1.1 > > Symptom: > > >>> from Bio import GenBank > >>> gi=GenBank.search_for("NM_007355")[0] > >>> ncbi_dict=GenBank.NCBIDictionary(parser=GenBank.FeatureParser()) > >>> record=ncbi_dict[gi] > Traceback (most recent call last): > File "<stdin>", line 1, in ? > File "C:\Python21\Bio\GenBank\__init__.py", line 1555, in > __getitem__ > return self.parser.parse(handle) > File "C:\Python21\Bio\GenBank\__init__.py", line 268, in parse > self._scanner.feed(handle, self._consumer) > File "C:\Python21\Bio\GenBank\__init__.py", line 1250, in feed > self._parser.parseFile(handle) > File "C:\Python21\Martel\Parser.py", line 230, in parseFile > self.parseString(fileobj.read()) > File "C:\Python21\Martel\Parser.py", line 258, in parseString > self._err_handler.fatalError(result) > File "C:\Python21\lib\xml\sax\handler.py", line 38, in fatalError > raise exception > ParserPositionException: error parsing at or beyond character 55 > >>> > > > Did GenBank change the format? > Thanks. > > Chunlei > > > > > > --__--__-- > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev@biopython.org > http://biopython.org/mailman/listinfo/biopython-dev > > > End of Biopython-dev Digest From katel at worldpath.net Wed Jan 30 05:27:09 2002 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] ECell Message-ID: <000801c1a978$ad4f2760$010a0a0a@cadence.com> I just committed ECell. ECell passes my test on DOS, at least. It needs more documentation though, so I plan to add more of an explanation. 
Cayte From biopython-bugs at bioperl.org Wed Jan 30 09:12:55 2002 From: biopython-bugs at bioperl.org (biopython-bugs@bioperl.org) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] Notification: incoming/54 Message-ID: <200201301412.g0UECtit019709@pw600a.bioperl.org> JitterBug notification chapmanb moved PR#54 from incoming to trash Message summary for PR#54 From: Subject: toner cartridges Date: Tue, 20 Nov 2001 17:48:48 0 replies 0 followups From biopython-bugs at bioperl.org Wed Jan 30 09:13:53 2002 From: biopython-bugs at bioperl.org (biopython-bugs@bioperl.org) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] Notification: incoming/55 Message-ID: <200201301413.g0UEDrit019728@pw600a.bioperl.org> JitterBug notification chapmanb changed notes Message summary for PR#55 From: newgene@bigfoot.com Subject: GenBank parser problem? Date: Sat, 26 Jan 2002 13:22:55 -0500 0 replies 0 followups Notes: NCBI added a new "linear" word to the LOCUS line which broke the parser here. Fixed in revision 1.17 of genbank_format.py, and tests added for this case. 
====> ORIGINAL MESSAGE FOLLOWS <==== >From newgene@bigfoot.com Sat Jan 26 13:22:55 2002 Received: from localhost (localhost [127.0.0.1]) by pw600a.bioperl.org (8.11.2/8.11.2) with ESMTP id g0QIMtA00434 for ; Sat, 26 Jan 2002 13:22:55 -0500 Date: Sat, 26 Jan 2002 13:22:55 -0500 Message-Id: <200201261822.g0QIMtA00434@pw600a.bioperl.org> From: newgene@bigfoot.com To: biopython-bugs@bioperl.org Subject: GenBank parser problem? Full_Name: Chunlei Wu Module: Bio/GenBank Version: 1.00a4 OS: win2000 Submission from: pathg01-178.mdacc.tmc.edu (143.111.173.178) Python version: ActivePython 2.1.1 Symptom: >>> from Bio import GenBank >>> gi=GenBank.search_for("NM_007355")[0] >>> ncbi_dict=GenBank.NCBIDictionary(parser=GenBank.FeatureParser()) >>> record=ncbi_dict[gi] Traceback (most recent call last): File "", line 1, in ? File "C:\Python21\Bio\GenBank\__init__.py", line 1555, in __getitem__ return self.parser.parse(handle) File "C:\Python21\Bio\GenBank\__init__.py", line 268, in parse self._scanner.feed(handle, self._consumer) File "C:\Python21\Bio\GenBank\__init__.py", line 1250, in feed self._parser.parseFile(handle) File "C:\Python21\Martel\Parser.py", line 230, in parseFile self.parseString(fileobj.read()) File "C:\Python21\Martel\Parser.py", line 258, in parseString self._err_handler.fatalError(result) File "C:\Python21\lib\xml\sax\handler.py", line 38, in fatalError raise exception ParserPositionException: error parsing at or beyond character 55 >>> Did GenBank change the format? Thanks. Chunlei From biopython-bugs at bioperl.org Wed Jan 30 09:13:54 2002 From: biopython-bugs at bioperl.org (biopython-bugs@bioperl.org) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] Notification: incoming/55 Message-ID: <200201301413.g0UEDsit019730@pw600a.bioperl.org> JitterBug notification chapmanb moved PR#55 from incoming to fixed-bugs Message summary for PR#55 From: newgene@bigfoot.com Subject: GenBank parser problem? 
Date: Sat, 26 Jan 2002 13:22:55 -0500 0 replies 0 followups Notes: NCBI added a new "linear" word to the LOCUS line which broke the parser here. Fixed in revision 1.17 of genbank_format.py, and tests added for this case. 
Chunlei From chapmanb at arches.uga.edu Wed Jan 30 09:26:47 2002 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] GenBank parsing problem In-Reply-To: <200201261822.g0QIMtA00439@pw600a.bioperl.org> References: <200201261822.g0QIMtA00439@pw600a.bioperl.org> Message-ID: <20020130092647.C56210@ci350185-a.athen1.ga.home.com> Hi Chunlei; Thanks for reporting the problem (and thanks to others who verified it). > >>> from Bio import GenBank > >>> gi=GenBank.search_for("NM_007355")[0] > >>> ncbi_dict=GenBank.NCBIDictionary(parser=GenBank.FeatureParser()) > >>> record=ncbi_dict[gi] > Traceback (most recent call last): [...] > ParserPositionException: error parsing at or beyond character 55 > >>> > > Did GenBank change the format? Yup, it looks like they added a new "linear" word to the LOCUS line, to complement "circular" I guess: LOCUS AC091001 177066 bp DNA linear PRI 06-DEC-2001 Sorry, I'd tried to prepare for the new format changes, but hadn't realized this change was going to happen. The diff to Bio/GenBank/genbank_format.py is attached (fixes and tests for this case are also in CVS). I checked it out on a PRI download from NCBI, and it seems to be working for me. Thanks again for the report! I hope this fixes your problem. Please let me know if you have any questions. Brad -------------- next part -------------- Index: genbank_format.py =================================================================== RCS file: /home/repository/biopython/biopython/Bio/GenBank/genbank_format.py,v retrieving revision 1.16 retrieving revision 1.17 diff -c -r1.16 -r1.17 *** genbank_format.py 2002/01/05 22:09:58 1.16 --- genbank_format.py 2002/01/30 13:54:05 1.17 *************** *** 106,112 **** Martel.Opt(Martel.Alt(*residue_prefixes)) + Martel.Opt(Martel.Alt(*residue_types)) + Martel.Opt(Martel.Opt(blank_space) + ! 
Martel.Str("circular"))) date = Martel.Group("date", Martel.Re("[-\w]+")) --- 106,113 ---- Martel.Opt(Martel.Alt(*residue_prefixes)) + Martel.Opt(Martel.Alt(*residue_types)) + Martel.Opt(Martel.Opt(blank_space) + ! Martel.Alt(Martel.Str("circular"), ! Martel.Str("linear")))) date = Martel.Group("date", Martel.Re("[-\w]+")) From reillywu at yahoo.com Wed Jan 30 13:02:28 2002 From: reillywu at yahoo.com (Chunlei Wu) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] GenBank parsing problem In-Reply-To: <20020130092647.C56210@ci350185-a.athen1.ga.home.com> Message-ID: <20020130180228.41161.qmail@web20507.mail.yahoo.com> Hi Brad, Thank you for your fix, but it seems I cannot get the latest version of genbank_format.py from CVS. The current one is still the old revision, 1.16. Maybe there is some delay on the server? Chunlei --- Brad Chapman wrote: > Hi Chunlei; > Thanks for reporting the problem (and thanks to > others who verified it). > > > >>> from Bio import GenBank > > >>> gi=GenBank.search_for("NM_007355")[0] > > >>> > ncbi_dict=GenBank.NCBIDictionary(parser=GenBank.FeatureParser()) > > >>> record=ncbi_dict[gi] > > Traceback (most recent call last): > [...] > > ParserPositionException: error parsing at or > beyond character 55 > > >>> > > > > Did GenBank change the format? > > Yup, it looks like they added a new "linear" word to > the LOCUS line, to > complement "circular" I guess: > > LOCUS AC091001 177066 bp DNA > linear PRI 06-DEC-2001 > > Sorry, I'd tried to prepare for the new format > changes, but hadn't > realized this change was going to happen. The diff > to > Bio/GenBank/genbank_format.py is attached (fixes and > tests for this case > are also in CVS). I checked it out on a PRI download > from NCBI, and it > seems to be working for me. > > Thanks again for the report! I hope this fixes your > problem. Please let > me know if you have any questions. 
> Brad > > Index: genbank_format.py > =================================================================== > RCS file: > /home/repository/biopython/biopython/Bio/GenBank/genbank_format.py,v > retrieving revision 1.16 > retrieving revision 1.17 > diff -c -r1.16 -r1.17 > *** genbank_format.py 2002/01/05 22:09:58 1.16 > --- genbank_format.py 2002/01/30 13:54:05 1.17 > *************** > *** 106,112 **** > > Martel.Opt(Martel.Alt(*residue_prefixes)) + > > Martel.Opt(Martel.Alt(*residue_types)) + > > Martel.Opt(Martel.Opt(blank_space) + > ! > Martel.Str("circular"))) > > date = Martel.Group("date", > Martel.Re("[-\w]+")) > --- 106,113 ---- > > Martel.Opt(Martel.Alt(*residue_prefixes)) + > > Martel.Opt(Martel.Alt(*residue_types)) + > > Martel.Opt(Martel.Opt(blank_space) + > ! > Martel.Alt(Martel.Str("circular"), > ! > Martel.Str("linear")))) > > date = Martel.Group("date", > Martel.Re("[-\w]+")) > __________________________________________________ Do You Yahoo!? Great stuff seeking new owners in Yahoo! Auctions! http://auctions.yahoo.com From chapmanb at arches.uga.edu Wed Jan 30 14:39:17 2002 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] GenBank parsing problem In-Reply-To: <20020130180228.41161.qmail@web20507.mail.yahoo.com> References: <20020130092647.C56210@ci350185-a.athen1.ga.home.com> <20020130180228.41161.qmail@web20507.mail.yahoo.com> Message-ID: <20020130143917.A56849@ci350185-a.athen1.ga.home.com> Hi Chunlei; > Thank you for your fix. But it seems I can not get > the latest version of genbank_format.py from CVS. The > current one is still the old one ver. 1.16. Maybe some > delay of the server? Hmm, you're right. It still is 1.16. The anonymous CVS normally syncs up with the read/write access CVS in a few hours, so maybe something is wrong with anonymous CVS (most of the administrators are having fun in the sun in Arizona right now). 
Anyways, until the fix moves to anonymous CVS, you can grab the changed file from: http://www.bioinformatics.org/bradstuff/bp/genbank_format-1.17.py Sorry about the pain! Brad From reillywu at yahoo.com Wed Jan 30 16:53:09 2002 From: reillywu at yahoo.com (Chunlei Wu) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] GenBank parsing problem In-Reply-To: <20020130143917.A56849@ci350185-a.athen1.ga.home.com> Message-ID: <20020130215309.88303.qmail@web20507.mail.yahoo.com> Hi Brad, It works now on my computer, but there are two things I want to point out: 1. The current release of biopython 1.00a4 doesn't include "Std.py", which is needed by this new genbank_format.py. So I updated my biopython from CVS, including Martel, and updated genbank_format.py to 1.17. 2. The code works fine under the Python shell and the IDLE environment, but it does not work under PythonWin's IDE, which raised exactly the same error message. I can't figure out why; it was really strange and frustrating at first, but when I switched to the Python shell it worked fine. Anyway, I think this is probably a problem with PythonWin. I hope this experience helps other people who want to update genbank_format.py. Chunlei --- Brad Chapman wrote: > Hi Chunlei; > > > Thank you for your fix. But it seems I can not > get > > the latest version of genbank_format.py from CVS. > The > > current one is still the old one ver. 1.16. Maybe > some > > delay of the server? > > Hmm, you're right. It still is 1.16. The anonymous > CVS normally syncs up > with the read/write access CVS in a few hours, so > maybe something is > wrong with anonymous CVS (most of the administrators > are having fun in > the sun in Arizona right now). > > Anyways, until the fix moves to anonymous CVS, you > can grab the changed > file from: > > http://www.bioinformatics.org/bradstuff/bp/genbank_format-1.17.py > > Sorry about the pain! 
> Brad > _______________________________________________ > Biopython-dev mailing list > Biopython-dev@biopython.org > http://biopython.org/mailman/listinfo/biopython-dev __________________________________________________ Do You Yahoo!? Great stuff seeking new owners in Yahoo! Auctions! http://auctions.yahoo.com From biopython-bugs at bioperl.org Wed Jan 30 18:29:12 2002 From: biopython-bugs at bioperl.org (biopython-bugs@bioperl.org) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] Notification: incoming/56 Message-ID: <200201302329.g0UNTCit023916@pw600a.bioperl.org> JitterBug notification new message incoming/56 Message summary for PR#56 From: rree@ucdavis.edu Subject: GenBank parser Date: Wed, 30 Jan 2002 18:29:11 -0500 0 replies 0 followups ====> ORIGINAL MESSAGE FOLLOWS <==== >From rree@ucdavis.edu Wed Jan 30 18:29:12 2002 Received: from localhost (localhost [127.0.0.1]) by pw600a.bioperl.org (8.12.2/8.12.2) with ESMTP id g0UNTBit023912 for ; Wed, 30 Jan 2002 18:29:12 -0500 Date: Wed, 30 Jan 2002 18:29:11 -0500 Message-Id: <200201302329.g0UNTBit023912@pw600a.bioperl.org> From: rree@ucdavis.edu To: biopython-bugs@bioperl.org Subject: GenBank parser Full_Name: Rick Ree Module: GenBank/genbank_format.py Version: CVS 1.16 OS: Linux Submission from: loco.ucdavis.edu (169.237.66.27) Genbank parser was choking on the plant flat files from NCBI's ftp site -- on the LOCUS line of the record, the parser was expecting 'circular' where my file had 'linear'. Here is a diff that fixes the problem, but the formatting is all wonky 'cos of this HTML form, sorry. 
-Rick Index: genbank_format.py =================================================================== RCS file: /home/repository/biopython/biopython/Bio/GenBank/genbank_format.py,v retrieving revision 1.16 diff -u -r1.16 genbank_format.py --- genbank_format.py 5 Jan 2002 22:09:58 -0000 1.16 +++ genbank_format.py 30 Jan 2002 23:31:47 -0000 @@ -106,7 +106,10 @@ Martel.Opt(Martel.Alt(*residue_prefixes)) + Martel.Opt(Martel.Alt(*residue_types)) + Martel.Opt(Martel.Opt(blank_space) + - Martel.Str("circular"))) + Martel.Str("circular")) + + Martel.Opt(Martel.Opt(blank_space) + + Martel.Str("linear")) + ) date = Martel.Group("date", Martel.Re("[-\w]+")) From biopython-bugs at bioperl.org Wed Jan 30 19:46:32 2002 From: biopython-bugs at bioperl.org (biopython-bugs@bioperl.org) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] Notification: incoming/56 Message-ID: <200201310046.g0V0kWit024255@pw600a.bioperl.org> JitterBug notification chapmanb changed notes Message summary for PR#56 From: rree@ucdavis.edu Subject: GenBank parser Date: Wed, 30 Jan 2002 18:29:11 -0500 0 replies 0 followups Notes: Thanks Rick. I actually just fixed this bug this morning :-). Your fix is basically identical to mine. Thanks for the report; those sneaky fellas at NCBI got another change by me! 
From biopython-bugs at bioperl.org Wed Jan 30 19:46:32 2002 From: biopython-bugs at bioperl.org (biopython-bugs@bioperl.org) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] Notification: incoming/56 Message-ID: <200201310046.g0V0kWit024257@pw600a.bioperl.org> JitterBug notification chapmanb moved PR#56 from incoming to fixed-bugs Message summary for PR#56 From: rree@ucdavis.edu Subject: GenBank parser Date: Wed, 30 Jan 2002 18:29:11 -0500 0 replies 0 followups Notes: Thanks Rick. I actually just fixed this bug this morning :-). Your fix is basically identical to mine. Thanks for the report; those sneaky fellas at NCBI got another change by me!
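[Archive note] The LOCUS-line change discussed in this thread can be illustrated outside of Martel with a plain regular expression. This is a simplified, hypothetical sketch (the pattern and the field names are the annotator's, not Biopython's actual grammar); the point is that the topology token is optional and, after NCBI's format change, may be either "circular" or "linear":

```python
import re

# Hypothetical, simplified LOCUS-line pattern -- a sketch of the fix
# discussed above, not the real genbank_format.py grammar.  The
# topology group is optional and accepts "linear" as well as "circular".
LOCUS_RE = re.compile(
    r"^LOCUS\s+(?P<name>\S+)\s+(?P<length>\d+)\s+bp"
    r"\s+(?P<moltype>\S+)"
    r"(?:\s+(?P<topology>circular|linear))?"
    r"\s+(?P<division>[A-Z]{3})\s+(?P<date>[-\w]+)\s*$"
)

def parse_locus(line):
    """Return a dict of LOCUS fields, or None if the line doesn't match."""
    m = LOCUS_RE.match(line)
    return m.groupdict() if m else None
```

With this sketch, Brad's example line "LOCUS AC091001 177066 bp DNA linear PRI 06-DEC-2001" parses with topology "linear", while older records with "circular" or with no topology token at all still match, which is exactly the tolerance the Martel fix adds.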