From dalke at acm.org Fri Dec 1 05:06:00 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:54 2005
Subject: [Biopython-dev] Martel-0.4 available
Message-ID: <000901c05b7e$505616c0$efab323f@josiah>

I've placed a copy of the Martel-0.4 distribution at
  http://www.biopython.org/~dalke/Martel/Martel-0.4.tar.gz

The only real change between this and version 0.35 is the support
for different newline conventions.

  New regexp syntax - \R
    \R   means "\n|\r\n?"
    [\R] means "[\n\r]"

  New Expression Node - AnyEOL
    implements the \R test

  RecordReaders rewritten to use mxTextTools to find record begin
  and end characters rather than using readline/readlines.

  RecordReaders' __init__ and .remainder() pass around a lookahead
  buffer as a string rather than a list of lines.

  Parser.py appropriately modified.

There's also a very complete regression suite for the new readers
because the new code is prone to more subtle errors. (Hand-written
state tables combined with hand-written/emulated continuations don't
make for easy-to-write code.)

None of the format definitions have yet been modified to use the new
\R syntax. This is meant more as trial code to see if my solution to
the newline problem is appropriate. I think it is but would like
feedback.

Finally, I believe the API is now pretty solid for:
  - the regexp syntax
  - how to make a parser
  - how to make an iterator (would like a bit more feedback)
  - RecordReader protocol (would like feedback)

This means I think we can start trying Martel for real work, as the
changes will only be in the specific formats and without a global
impact. Assuming there are no bugs :)

BTW, I really hate being sick. Haven't been able to do much of
anything requiring sustained thought for the last 3 days :( Luckily,
I had mostly finished this up over the Thanksgiving weekend so it
didn't require much more work.
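As a rough illustration of the \R definition in the message above, here is a sketch using Python's standard re module. It is not Martel's own implementation, and the names EOL, EOL_CHAR and split_lines are made up for this example.

```python
import re

# Assumption: plain `re` patterns standing in for Martel's \R and [\R];
# Martel itself builds its expressions differently.
EOL = re.compile(r"\n|\r\n?")     # what \R is described as meaning
EOL_CHAR = re.compile(r"[\n\r]")  # what [\R] is described as meaning


def split_lines(text):
    """Split text on Unix (LF), DOS (CR LF) or old Mac (CR) line endings."""
    return EOL.split(text)


if __name__ == "__main__":
    mixed = "LOCUS one\nLOCUS two\r\nLOCUS three\rLOCUS four"
    print(split_lines(mixed))
    # CR LF is a single line ending for EOL but two characters for EOL_CHAR.
    print(len(EOL.findall("a\r\nb")), len(EOL_CHAR.findall("a\r\nb")))
```

The alternation matters for DOS files: the pattern consumes "\r\n" as one line ending, while the character class only ever matches one character at a time.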
                    Andrew
                    dalke@acm.org

From jchang at SMI.Stanford.EDU Fri Dec 1 19:44:01 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar 5 14:42:54 2005
Subject: [Biopython-dev] next release closer (?)
In-Reply-To: <14886.47131.653099.144288@taxus.athen1.ga.home.com>
Message-ID:

On Thu, 30 Nov 2000, Brad Chapman wrote:

> I would really like to have bugs sent to the dev list when they come
> in -- I just noticed a couple from Iddo that I should have dealt with
> (I think that is all fixed now, regardless), but didn't realize were
> there. Whadda you all think about this?

I thought that Jitterbug was configured to do this automatically, but
now that I think about it, I submitted a report that wasn't forwarded
to biopython-dev! Does anyone know how to administer Jitterbug to do
this?

Anyways, we do have a few bugs in the database now. Could developers
please check the database and try and knock a few of these off?

Jeff

From thomas at cbs.dtu.dk Sat Dec 2 09:46:47 2000
From: thomas at cbs.dtu.dk (thomas@cbs.dtu.dk)
Date: Sat Mar 5 14:42:54 2005
Subject: [Biopython-dev] xbbtools + CVS
Message-ID: <14889.2903.802767.427950@bb1.home>

Hej,

It was actually a long time since the last work on xbbtools - I don't
remember how to log in on the cvs server, or has anything changed (eg
my passwd) ?

I tried:
cvs -d pserver:thomas@cvs.biopython.org:/home/repository/biopython checkout biopython
cvs -d ext:thomas@cvs.biopython.org:/home/repository/biopython checkout biopython

and receive only:
Permission denied.
cvs [checkout aborted]: end of file from server (consult above messages if any)

Suggestions ?

(Is anybody working on modules for reading and writing different
sequence formats ?)

c ya
-thomas

Sicheritz Ponten Thomas E.  CBS, Department of Biotechnology
thomas@biopython.org        The Technical University of Denmark
CBS: +45 45 252485          Building 208, DK-2800 Lyngby
Fax +45 45 931585           http://www.cbs.dtu.dk/thomas/index.html

        De Chelonian Mobile ... The Turtle Moves ...
From chapmanb at arches.uga.edu Sat Dec 2 09:59:49 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar 5 14:42:54 2005
Subject: [Biopython-dev] xbbtools + CVS
In-Reply-To: <14889.2903.802767.427950@bb1.home>
References: <14889.2903.802767.427950@bb1.home>
Message-ID: <14889.3685.157030.758475@taxus.athen1.ga.home.com>

Hi Thomas;
Nice to hear from you!

[CVS]
> I tried:
> cvs -d pserver:thomas@cvs.biopython.org:/home/repository/biopython
> checkout biopython
> cvs -d ext:thomas@cvs.biopython.org:/home/repository/biopython
> checkout biopython

Hmmm, I think it should be:

cvs -d :ext:thomas@biopython.org:/home/repository/biopython co biopython

(a colon before the ext + biopython.org not cvs.biopython.org)

At least, that is equivalent to what I use. When using ext, make sure
you have the environmental variable CVS_RSH set to 'ssh' (or something
different if your ssh executable is different).

> and receive only:
> Permission denied.
> cvs [checkout aborted]: end of file from server (consult above messages if
> any)

Well, that's not a very useful error message from cvs :-).

> Suggestions ?

Let me know if this works (or at least gets a more helpful message
from CVS).

> (Is anybody working on modules for reading and writing different sequence
> formats ?)

Well, my project for the weekend is starting on a GenBank parser using
the new stuff from Martel. I've gotten started and hope (maybe) to
have something that people could test a little bit once the weekend
is finished. What kind of formats are you looking to parse?

Brad

From thomas at genome.cbs.dtu.dk Sat Dec 2 10:20:47 2000
From: thomas at genome.cbs.dtu.dk (Thomas Sicheritz-Ponten)
Date: Sat Mar 5 14:42:54 2005
Subject: [Biopython-dev] xbbtools + CVS
In-Reply-To: <14889.3685.157030.758475@taxus.athen1.ga.home.com>
Message-ID:

Hej Brad,

> Hmmm, I think it should be:
> cvs -d :ext:thomas@biopython.org:/home/repository/biopython co

thx - that worked !
>
> > (Is anybody working on modules for reading and writing different sequence
> > formats ?)
>
> Well, my project for the weekend is starting on a GenBank parser using
> the new stuff from Martel. I've gotten started and hope (maybe) to
> have something that people could test a little bit once the weekend
> is finished. What kind of formats are you looking to parse?

All of them :-) - I need it for my graphical sequence editor for
reading and writing the edited sequence in different formats.
(In biowish I included part of the readseq code as a shared c-library)

thx
-thomas

Sicheritz Ponten Thomas E.  CBS, Department of Biotechnology
thomas@biopython.org        The Technical University of Denmark
CBS: +45 45 252489          Building 208, DK-2800 Lyngby
Fax +45 45 931585           http://www.cbs.dtu.dk/thomas

        De Chelonian Mobile ... The Turtle Moves ...

From chapmanb at arches.uga.edu Mon Dec 4 22:32:59 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar 5 14:42:54 2005
Subject: [Biopython-dev] GenBank parser -- first go
Message-ID: <14892.25067.641885.658966@taxus.athen1.ga.home.com>

Hello all;
As promised, I spent this weekend getting together a GenBank parser,
which I hope is something that we could include in Biopython in the
future. What I've got so far is available from:

http://www.bioinformatics.org/bradstuff/bp/gb_parser-20001204.tar.gz

It has a nice distutils setup script and everything will install into
the Bio.PGML directory (PGML = Plant Genome Mapping Lab -> that's my
little subdirectory to keep things I work on separate from Biopython).

The parser uses Martel-0.4, so you'll need to have that installed to
use this. Making this would definitely not have been possible without
all of the cool things in Martel, so we all definitely have to give
Andrew another big pat on the back for his awesome tool :-).

It is, I hope, a full featured GenBank parser that parses things into
SeqFeature classes.
I'm hoping that these SeqFeature classes (or
something derived from them) will be something we can include in
Biopython as well. It would be really nice to have some "standard"
objects for features, to help us be more compatible with the Biocorba
and BioXML projects.

Anyways, the parser and seq features have the following exciting
features:

* fully parses out Feature tables. This includes support for sub
Features (ie. the exons of a CDS object).

* deals with 'the dreaded fuzziness' in locations. There should be
support for all of the types of fuzziness, but I've tried to not make
it much more difficult to access locations if you don't care about
fuzziness at all.

* parses into SeqRecord objects with Seq objects that are hopefully
AlphabetStrict in the proper manner.

I didn't write any docs on using these yet (I've got to get to work on
things for lab and school now :-), but the parsers work like other
Biopython parsers like Blast (ie. with Iterators and Parsers). There
are also a couple of example scripts to get things going.

I'm really looking for feedback in the following areas:

1. Does this code look decent? Anyone besides me want to see this in
Biopython?

2. Does this parser parse your favorite GenBank files? I've tested it
on a few things, but they are mostly plant sequences, since that's
what I've got around here. There is a script included in the tarball
"find_parser_problems.py", which will, if you run it on a GenBank
file, tell you what accession numbers, if any, cause parser
problems. If you could send me lists of accession numbers that break
it, it would really help to make sure it works in more cases.

3. Does the output you get have the same info as the initial GenBank
file (ie -- are there any ugly bugs)? I have another script included,
"check_output.py," which will spit out the parsed information to make
it possible to compare it with the initial GenBank file and see if I
screwed anything up.
I've hand checked a couple of files, but it would
really help to have other people debugging this as well.

4. What do people think about the SeqFeature classes? Like 'em? Hate
'em? Suggestions for improvement?

5. Can the code be speeded up/improved in any ways? Suggestions to
help me code better are always very welcome!

Thanks for listening and enjoy!

Brad

From jonathan.gilligan at vanderbilt.edu Tue Dec 5 00:04:23 2000
From: jonathan.gilligan at vanderbilt.edu (Jonathan M. Gilligan)
Date: Sat Mar 5 14:42:54 2005
Subject: [Biopython-dev] anon cvs access?
Message-ID: <5.0.1.4.0.20001204230020.022a9cd0@g.mail.vanderbilt.edu>

I cannot get anonymous cvs access to check out the biopython sources.
Here's a transcript.

>cvs -d :pserver:cvs@cvs.biopython.org:/home/repository/biopython login
(Logging in to cvs@cvs.biopython.org)
CVS password: ***
Fatal error, aborting.
cvs: no such user
cvs login: authorization failed: server cvs.biopython.org rejected access

>

(with "cvs" for the password, as indicated at http://cvs.biopython.org/).
Can anyone help me out here?

Thanks,
Jonathan

===========================================================================
Jonathan M. Gilligan

From jchang at SMI.Stanford.EDU Tue Dec 5 02:36:50 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar 5 14:42:54 2005
Subject: [Biopython-dev] GenBank parser -- first go
In-Reply-To: <14892.25067.641885.658966@taxus.athen1.ga.home.com>
Message-ID:

Hi Brad,

On Mon, 4 Dec 2000, Brad Chapman wrote:

> Hello all;
> As promised, I spent this weekend getting together a GenBank parser,
> which I hope is something that we could include in Biopython in the
> future. What I've got so far is available from:

Great! We need a Genbank parser.

> It is, I hope, a full featured GenBank parser that parses things into
> SeqFeature classes. I'm hoping that these SeqFeature classes (or
> something derived from them) will be something we can include in
> Biopython as well.
> It would be really nice to have some "standard"
> objects for features, to help us be more compatible with the Biocorba
> and BioXML projects.

Yes, I definitely agree with needing a general class. However, I've
been purposefully shying away from proposing a general framework for
annotations for two main reasons. First, it's a hard, unsolved problem
that we don't know how to do yet. If you look at the models for
biojava, bioperl, and game, you'll see that there are 3 different
partially compatible solutions. I suspect how you handle annotations
is going to depend on the purpose of the applications. (Though I
suppose "to store genbank annotations" is a reasonable purpose).

The second reason is that I like the idea of specific data structures
for each database. That way, people that really care about, say,
swissprot, will know how to retrieve the data from their favorite
field without having to muck around to see how it's getting coerced
into a one-size-fits-all framework. If you can only parse into a
general data structure, then, since I don't believe a single data
structure can hold all the types of information from every data base,
you're bound to lose data. I don't believe there's any general data
structure in existence that can handle the genbank location field.
It's described by a BNF grammar and requires a tree!

> Anyways, the parser and seq features have the following exciting
> features:
>
> * fully parses out Feature tables. This includes support for sub
> Features (ie. the exons of a CDS object).
>
> * deals with 'the dreaded fuzziness' in locations. There should be
> support for all of the types of fuzziness, but I've tried to not make
> it much more difficult to access locations if you don't care about
> fuzziness at all.

Do we need to deal with genbank functions like complement or order?

> * parses into SeqRecord objects with Seq objects that are hopefully
> AlphabetStrict in the proper manner.

I'm not sure that's a good thing for GenBank.
Does GenBank store the
alphabet for the sequence? What if the sequence doesn't strictly
follow the alphabet?

> I didn't write any docs on using these yet (I've got to get to work on
> things for lab and school now :-), but the parsers work like other
> Biopython parsers like Blast (ie. with Iterators and Parsers). There
> are also a couple of example scripts to get things going.
>
> I'm really looking for feedback in the following areas:
>
> 1. Does this code look decent? Anyone besides me want to see this in
> Biopython?

- There's a TaggingConsumer in Bio.ParserSupport. It looks like this
does something similar to _PrintConsumer. It's supposed to be used for
debugging purposes so that you know what's getting passed when. If
it's not appropriate, please let me know how to extend it so that it's
more generally useful.

> 2. Does this parser parse your favorite GenBank files? I've tested it
> on a few things, but they are mostly plant sequences, since that's
> what I've got around here. There is a script included in the tarball
> "find_parser_problems.py", which will, if you run it on a GenBank
> file, tell you what accession numbers, if any, cause parser
> problems. If you could send me lists of accession numbers that break
> it, it would really help to make sure it works in more cases.
>
> 3. Does the output you get have the same info as the initial GenBank
> file (ie -- are there any ugly bugs)? I have another script included,
> "check_output.py," which will spit out the parsed information to make
> it possible to compare it with the initial GenBank file and see if I
> screwed anything up. I've hand checked a couple of files, but it would
> really help to have other people debugging this as well.
>
> 4. What do people think about the SeqFeature classes? Like 'em? Hate
> 'em? Suggestions for improvement?

Could you put Bio/SeqFeature/SeqFeature.py code into Bio/SeqFeature.py?
It would prevent stuff like:
from Bio.SeqFeature import SeqFeature
or even worse,
from Bio.SeqFeature.SeqFeature import SeqFeature

> 5. Can the code be speeded up/improved in any ways? Suggestions to
> help me code better are always very welcome!

Thanks for doing this!

Jeff

From thomas at cbs.dtu.dk Tue Dec 5 02:53:58 2000
From: thomas at cbs.dtu.dk (thomas@cbs.dtu.dk)
Date: Sat Mar 5 14:42:54 2005
Subject: [Biopython-dev] Re: plans for next release
Message-ID: <14892.40726.356431.482782@bb1.home>

> thomas wrote
> > Ok - I just came back from egypt. Of course there is no need at all for
> > using posix.posixpath - thats still left from my novice days :-)
> > I fix that and try to remove all Pmw widgets (its easy to implement the
> > scrolled things in pure Tk)
> >
> Cayte wrote
>
> Egypt sounds like a fascinating place to visit! Late fall sounds like the
> right time, too, with cooler weather.

and nice snorkeling too :-)

>
> Have you ever considered wxPython? I have a tool, SeqGui.py, in wxPython,
> that's sort of like xbbtools.py I find it easier to work with than Tkinter.
> Back in May, we had a thread about Gui support. Its in the archives.

#################
# It seems that my reply didn't make it through sendmail :-( - I try to
# reconstruct ...
#########

The main reasons for my sticking to Tkinter are the fact that I have
used Tcl/Tk a lot before I discovered python - I have tons of Tk
snippets from my previous bioinformatic work (Biowish, GRS, XBbtools,
CapDB etc.) which are very easy to convert into shorter, cleaner and
more efficient python Tk code. Maybe the biggest advantage in using
Tkinter is the powerful Tk Canvas, as far as I know neither wxPython
nor Gtk python have anything close to the canvas widget.

>
> I'd like the gui to eventually support color highlighting of features, for
> example, regions of high consensus.
>

I don't know how this works in wxPython, but in Tkinter it is already
there from the beginning. Every line, rectangle etc.
you draw in the canvas is a unique object and gets an id. You can very
easily bind any event (e.g. MouseOver, DoubleClickButton1 etc.) to any
function. To highlight different genes or sequence regions is just to
group the corresponding ids and bind a color-change on a MouseOver
event.

e.g. my recently accepted paper about phylogenomics with python (NAR
nr2 2001) deals with the interactive display of all genes,
phylogenetic trees, blast results for a microbial genome (between 1000
and 5000 times 3).
I have no fancy webpage yet but you can check a screenshot of the
phylome of the Bacteria Thermotoga maritima
at http://www.cbs.dtu.dk/thomas/pyphy/pyphy.png
(Phylome = set of all phylogenetic trees for a genome.
color coding for the kingdom of the closest neighbor in the
phylogenetic tree: blue = Bacteria, yellow = Archaea, red = Eukarya)

Here the phylome map is an interactive display of all phylogenetic
trees and genes (colored lines in the circle), where each line/gene is
sensitive to mouse movement. A MouseOver event displays gene
information in the top Entry, Button1Click shows the phylogenetic
tree, Button3 shows a gene specific popupmenu for blastresults,
alignments etc.
Each gene can be a member of a metabolic pathway, where selecting a
pathway in the right listbox changes the width and the arrow shape of
each gene associated (canvas tag) with the pathway.

The advantage here is zooming, resizing, moving and event grabbing is
part of the canvas widget so we only need to redraw single objects.

I have never worked with wxPython - what is exactly the strength of
wxWindows ? I guess it is faster than Tkinter, are there any special
features not found in the rest of the GUI family ?

c ya
-thomas

Sicheritz Ponten Thomas E.  CBS, Department of Biotechnology
thomas@biopython.org        The Technical University of Denmark
CBS: +45 45 252485          Building 208, DK-2800 Lyngby
Fax +45 45 931585           http://www.cbs.dtu.dk/thomas/index.html

        De Chelonian Mobile ... The Turtle Moves ...
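The canvas behaviour described in the message above can be sketched with a few lines of Tkinter. This is only an illustration, not code from xbbtools or the phylome viewer: the gene list, the tag naming scheme and the names GENES, KINGDOM_COLORS, gene_tags and demo are invented for the example, and it uses the Python 3 module name tkinter.

```python
# Hypothetical data: (gene name, kingdom of closest neighbor, pathway).
GENES = [
    ("geneA", "Bacteria", "glycolysis"),
    ("geneB", "Archaea", "glycolysis"),
    ("geneC", "Eukarya", "tca"),
]

# Color coding as given in the message.
KINGDOM_COLORS = {"Bacteria": "blue", "Archaea": "yellow", "Eukarya": "red"}


def gene_tags(name, pathway):
    """Canvas tags for one gene: a shared tag, a per-gene tag, a pathway tag."""
    return ("gene", "gene:" + name, "pathway:" + pathway)


def demo():
    """Open a window with one canvas line per gene (needs a display)."""
    import tkinter as tk  # imported here so the helpers above work without Tk

    root = tk.Tk()
    info = tk.Label(root, text="move the mouse over a gene")
    info.pack()
    canvas = tk.Canvas(root, width=300, height=120)
    canvas.pack()

    for i, (name, kingdom, pathway) in enumerate(GENES):
        x = 50 + 100 * i
        canvas.create_line(x, 20, x, 100, width=4,
                           fill=KINGDOM_COLORS[kingdom],
                           tags=gene_tags(name, pathway))

    def show_gene(event):
        # "current" is the canvas item under the mouse pointer.
        item = canvas.find_withtag("current")[0]
        name = [t for t in canvas.gettags(item) if t.startswith("gene:")][0]
        info.config(text=name[len("gene:"):])

    # One binding on the shared tag covers every gene line.
    canvas.tag_bind("gene", "<Enter>", show_gene)
    # Selecting a pathway: change width and arrow shape of all its genes.
    canvas.itemconfigure("pathway:glycolysis", width=8, arrow=tk.LAST)

    root.mainloop()
```

Calling demo() opens the window. The point of the sketch is that grouping is done purely with tags, so a whole pathway is restyled by one itemconfigure call and nothing has to be redrawn by hand.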
From dag at sonsorol.org Tue Dec 5 07:46:25 2000
From: dag at sonsorol.org (chris dagdigian)
Date: Sat Mar 5 14:42:54 2005
Subject: [Biopython-dev] anon cvs access?
In-Reply-To:
References: <5.0.1.4.0.20001204230020.022a9cd0@g.mail.vanderbilt.edu>
Message-ID: <5.0.2.1.0.20001205074258.00a8ed40@fedayi.sonsorol.org>

Hey folks,

I happened to break anonymous CVS as a side effect of installing the
web cvsview CGIs on Sunday afternoon. Sorry about that.

As partial repayment for the inconvenience, the biopython repository
is now browsable at
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/?cvsroot=biopython

Anon cvs should be working again, I just had to rebuild the CVS passwd
and readers file. Drop me a line if anyone continues to have any
troubles.

Regards,
Chris

At 11:39 PM 12/4/00 -0800, Jeffrey Chang wrote:
>There's been some reports of it failing for the other projects as
>well. I'm forwarding your email to Chris Dagdigian to see if he knows
>what's going on.
>
>Jeff
>
>
>On Mon, 4 Dec 2000, Jonathan M. Gilligan wrote:
>
> > I cannot get anonymous cvs access to check out the biopython sources.
> > Here's a transcript.
> >
> > >cvs -d :pserver:cvs@cvs.biopython.org:/home/repository/biopython login
> > (Logging in to cvs@cvs.biopython.org)
> > CVS password: ***
> > Fatal error, aborting.
> > cvs: no such user
> > cvs login: authorization failed: server cvs.biopython.org rejected access
> >
> > >
> > (with "cvs" for the password, as indicated at http://cvs.biopython.org/).
> > Can anyone help me out here?

From bcohen at cs.sunysb.edu Tue Dec 5 15:39:53 2000
From: bcohen at cs.sunysb.edu (barry cohen)
Date: Sat Mar 5 14:42:54 2005
Subject: [Biopython-dev] grammar for .ffn files
In-Reply-To: <200012051704.MAA20834@pw600a.bioperl.org>
Message-ID:

Is there an official document specifying the syntax for the defline of
a .ffn file?
barry cohen

From katel at worldpath.net Wed Dec 6 01:40:50 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:42:54 2005
Subject: [Biopython-dev] Re: plans for next release
References: <14892.40726.356431.482782@bb1.home>
Message-ID: <001d01c05f4f$7a51a500$010a0a0a@cadence.com>

> and nice snorkeling too :-)
>
>
Do you have pictures of Egypt to post on the web?

> > Have you ever considered wxPython? I have a tool, SeqGui.py, in wxPython,
> > that's sort of like xbbtools.py I find it easier to work with than Tkinter.
> > Back in May, we had a thread about Gui support. Its in the archives.
> >
> The main reasons for my sticking to Tkinter are the fact that I have used
> Tcl/Tk a lot before I discovered python - I have tons of Tk snippets from
> my previous bioinformatic work (Biowish, GRS, XBbtools, CapDB etc.) which
> are very easy to convert into shorter, cleaner and more efficient python Tk
> code. Maybe the biggest advantage in using Tkinter is the powerful Tk
> Canvas, as far as I know neither wxPython nor Gtk python have anything close
> to the canvas widget.
>
I think the Windows version is a wrapper around the Windows Gui and
that wxPython attempts to provide equivalent functionality in Linux.

> >
> > I'd like the gui to eventually support color highlighting of features, for
> > example, regions of high consensus.
> >
>
> I don't know how this works in wxPython, but in Tkinter it is already
> there from the beginning. Every line, rectangle etc. you draw in the
> canvas is a unique object and gets an id. You can very easily bind any event
> (e.g. MouseOver, DoubleClickButton1 etc.) to any function. To highlight
> different genes or sequence regions is just to group the corresponding ids and
> bind a color-change on a MouseOver event.
> e.g. my recently accepted paper about phylogenomics with python (NAR nr2
> 2001) deals with the interactive display of all genes, phylogenetic
> trees, blast results for a microbial genome (between 1000 and 5000 times
> 3).
> I have no fancy webpage yet but you can check a screenshot of the
> phylome of the Bacteria Thermotoga maritima
> at http://www.cbs.dtu.dk/thomas/pyphy/pyphy.png
> (Phylome = set of all phylogenetic trees for a genome.
> color coding for the kingdom of the closest neighbor in the phylogenetic
> tree: blue = Bacteria, yellow = Archaea, red = Eukarya)
>
> Here the phylome map is an interactive display of all phylogenetic trees
> and genes (colored lines in the circle), where each line/gene is sensitive
> to mouse movement. A MouseOver event displays gene information in the top
> Entry, Button1Click shows the phylogenetic tree, Button3 shows a gene
> specific popupmenu for blastresults, alignments etc.
> Each gene can be a member of a metabolic pathway, where selecting a pathway
> in the right listbox changes the width and the arrow shape of each gene
> associated (canvas tag) with the pathway.
>
> The advantage here is zooming, resizing, moving and event grabbing is part
> of the canvas widget so we only need to redraw single objects.
>
Does it support colorization with enough flexibility to support
research on the fly, as in this scenario?

USER STORY:
Ed Enzyme is doing some detective work on an alignment. First he
highlights the start and stop codons in red and green. Then Ed zooms
in on an interesting sequence. He first highlights the hydrophilic
regions in magenta. Then Ed backtracks and highlights the acidic
regions.

>
> I have never worked with wxPython - what is exactly the strength of
> wxWindows ? I guess it is faster than Tkinter, are there any special
> features not found in the rest of the GUI family ?
>
>
I found it was easier to work with. With wxPython I could write more
code in the same time and with fewer problems, like panels that don't
quite line up.
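The first step of the user story above (marking start and stop codons) needs only a list of spans and colors, whichever toolkit ends up drawing them. A minimal sketch, assuming the DNA is a plain string and only one reading frame is scanned; the function name codon_highlights is made up here, and mapping start to red and stop to green is a guess at what the story intends.

```python
START_CODONS = {"ATG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codon_highlights(seq, frame=0):
    """Return (start, end, color) spans for start/stop codons in one frame."""
    seq = seq.upper()
    spans = []
    # Step through the sequence three bases at a time from the frame offset.
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in START_CODONS:
            spans.append((i, i + 3, "red"))    # assumed color for starts
        elif codon in STOP_CODONS:
            spans.append((i, i + 3, "green"))  # assumed color for stops
    return spans


if __name__ == "__main__":
    print(codon_highlights("atgaaatga"))
```

Keeping the spans separate from the drawing code means the same list could drive either a Tk canvas or a wxPython widget, and later highlights (hydrophilic, acidic) would just be more spans with other colors.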
Cayte

From katel at worldpath.net Wed Dec 6 03:28:59 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:42:54 2005
Subject: [Biopython-dev] GenBank parser -- first go
References:
Message-ID: <003901c05f5e$9622c700$010a0a0a@cadence.com>

----- Original Message -----
From: "Jeffrey Chang"
To: "Brad Chapman"
Cc:
Sent: Monday, December 04, 2000 11:36 PM
Subject: Re: [Biopython-dev] GenBank parser -- first go

> Hi Brad,
>
> On Mon, 4 Dec 2000, Brad Chapman wrote:
>
> > Hello all;
> > As promised, I spent this weekend getting together a GenBank parser,
> > which I hope is something that we could include in Biopython in the
> > future. What I've got so far is available from:
>
Does it strip html tags? When I ran check_output.py, it produced this
output.

C:\gb_parser-20001204\Scripts>python check_output.py nutmeg.htm
Traceback (most recent call last):
  File "check_output.py", line 25, in ?
    iterator = GenBank.Iterator(handle, parser)
  File "c:\biopyt~1.90d\Bio\PGML\GenBank\GenBank.py", line 57, in __init__
    self._reader = RecordReader.StartsWith(handle, "LOCUS")
  File "c:\biopyt~1.90d\Martel\RecordReader.py", line 130, in __init__
    self.tagtable)
  File "c:\biopyt~1.90d\Martel\RecordReader.py", line 89, in _find_begin_positions
    raise ReaderError("invalid format starting with %s" % repr(text[:50]))
Martel.RecordReader.ReaderError: invalid format starting with '
Message-ID:

Awesome! Thanks for doing this, Chris. I've put links to it from the
biopython web pages.

jeff

On Tue, 5 Dec 2000, chris dagdigian wrote:

>
> Hey folks,
>
> I happened to break anonymous CVS as a side effect of installing the web
> cvsview CGIs on Sunday afternoon. Sorry about that.
>
> As partial repayment for the inconvenience, the biopython repository is now
> browsable at
> http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/?cvsroot=biopython
>
> Anon cvs should be working again, I just had to rebuild the CVS passwd and
> readers file. Drop me a line if anyone continues to have any troubles.
>
> Regards,
> Chris
>
>
> At 11:39 PM 12/4/00 -0800, Jeffrey Chang wrote:
> >There's been some reports of it failing for the other projects as
> >well. I'm forwarding your email to Chris Dagdigian to see if he knows
> >what's going on.
> >
> >Jeff
> >
> >
> >On Mon, 4 Dec 2000, Jonathan M. Gilligan wrote:
> >
> > > I cannot get anonymous cvs access to check out the biopython sources.
> > > Here's a transcript.
> > >
> > > >cvs -d :pserver:cvs@cvs.biopython.org:/home/repository/biopython login
> > > (Logging in to cvs@cvs.biopython.org)
> > > CVS password: ***
> > > Fatal error, aborting.
> > > cvs: no such user
> > > cvs login: authorization failed: server cvs.biopython.org rejected access
> > >
> > > >
> > > (with "cvs" for the password, as indicated at http://cvs.biopython.org/).
> > > Can anyone help me out here?
>
>

From chapmanb at arches.uga.edu Wed Dec 6 02:21:38 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar 5 14:42:54 2005
Subject: [Biopython-dev] GenBank parser -- first go
In-Reply-To:
References: <14892.25067.641885.658966@taxus.athen1.ga.home.com>
Message-ID: <14893.59650.80411.133478@taxus.athen1.ga.home.com>

Hey Jeff,

Thanks a lot for taking a look at the code!

[SeqFeature classes]
> Yes, I definitely agree with needing a general class. However, I've been
> purposefully shying away from proposing a general framework for
> annotations for two main reasons. First, it's a hard, unsolved problem
> that we don't know how to do yet. If you look at the models for biojava,
> bioperl, and game, you'll see that there are 3 different partially
> compatible solutions. I suspect how you handle annotations is going to
> depend on the purpose of the applications. (Though I suppose "to store
> genbank annotations" is a reasonable purpose).

Agreed. I think our chances of getting it perfect are pretty slim :-).
However, I think it would really help writing "applications that use
Biopython" to have some kind of general class to work off of (even if
it is imperfect). It is just too much work to have to support Genbank
record classes and EMBL record classes and whatever record classes. I
imagine this would be especially painful for writing user interfaces.

I'm not sure if my SeqFeature classes are the best thing ever, but
they are just meant as a bit of a start. I am definitely willing to
throw out/amend lots of what I wrote if people have got ideas for
changing them to be better.

> The second reason is that
> I like the idea of specific data structures for each database. That way,
> people that really care about, say, swissprot, will know how to retrieve
> the data from their favorite field without having to muck around to see
> how it's getting coerced into a one-size-fits-all framework.

Also agreed. I'll work on a GenBank specific Record class as well, as
I think these make it much easier for people who just want to parse
out GenBank.

> If you can
> only parse into a general data structure, then, since I don't believe a
> single data structure can hold all the types of information from every
> data base, you're bound to lose data. I don't believe there's any general
> data structure in existence that can handle the genbank location
> field. It's described by a BNF grammar and requires a tree!

Very true :-). When putting the "GenBank specific stuff" into the
SeqFeature classes I ended up dumping a lot of it into dictionaries
(like Andrew's annotations dictionary in the SeqRecord class). A
GenBank specific record would definitely hold the information in a lot
more readily accessible format.

> Do we need to deal with genbank functions like complement or order?

I'm trying to deal with them (although I forgot to do order! Thanks
for mentioning it).
I'm dealing with them in the following way:

complement - I mark the feature as being on the opposite strand (I'm
using a 1, 0, -1 scale like BioCorba -- so -1 is the opposite strand).

join - The top level feature has a location from the start to the end
of the join. The feature then has sub SeqFeatures (also borrowed from
Biocorba) which are the individual exons (or whatever) in the
join. Right now if the top level feature is a CDS, then the sub
Features are labelled as type CDS_span. I think I'll change this to
CDS_join to make it clear they are part of a join.

order - I'll treat this like join, except call the sub features
CDS_order.

It should also be able to deal with nested locations, like
complement(join(location,location)).

Does this all sound reasonable?

[Alphabets for GenBank]
> I'm not sure that's a good thing for GenBank. Does GenBank store the
> alphabet for the sequence? What if the sequence doesn't strictly follow
> the alphabet?

Well, GenBank doesn't really store the alphabet (it does give a base
count for common bases (AGTC) but then specifies anything else as
"other" which isn't very useful for our purposes). What I do is
remember the type from the GenBank file (DNA, RNA, PROTEIN) and then
give the sequence this alphabet. I use the ambiguous DNA and RNA
alphabets so this should cover any letters in the sequence
(hopefully). I'm not sure if this is ideal, but at least it associates
the type with the sequence. Suggestions about how to be more strict
are welcome on this.

> - There's a TaggingConsumer in Bio.ParserSupport. It looks like this does
> something similar to _PrintConsumer. It's supposed to be used for
> debugging purposes so that you know what's getting passed when. If it's
> not appropriate, please let me know how to extend it so that it's more
> generally useful.

Oh sorry, I meant to document that. TaggingConsumer is great.
I just used the PrintConsumer as I was coding this so that I would make sure I added all of the necessary callbacks and didn't forget any information. This was more for my use in coding then for anything else. Once I was done building the parser, I just copy/pasted it and used it to build the Feature consumer. I don't think it is worthwhile to actually include in a final version, but I saved it because I'll probably need to copy and paste it again to write a RecordConsumer. But anyways, it was just a coding tool -- for later debugging TaggingConsumer is great for me. PrintConsumer was just my way to reduce the number of bugs in my code :-). > Could you put Bio/SeqFeature/SeqFeature.py code into Bio/SeqFeature.py? > It would prevent stuff like: > from Bio.SeqFeature import SeqFeature > or even worse, > from Bio.SeqFeature.SeqFeature import SeqFeature Well, I was going to recommend to use it like this: from Bio import SeqFeature my_feature = SeqFeature.SeqFeature.SeqFeature() :-) Seriously, this is indeed very ugly. Another possible solution would be to put all of the Features in a directory called Feature or Features (instead of SeqFeature) so then the imports would look like: from Bio.Feature import SeqFeature Either way is fine, though (or I'm very open to additional suggestions), so whatever you think. Thanks again for taking a look at this. I'll try to produce another version based on this (and any future comments) for next week. Brad From chapmanb at arches.uga.edu Wed Dec 6 02:31:36 2000 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:42:55 2005 Subject: [Biopython-dev] GenBank parser -- first go In-Reply-To: <003901c05f5e$9622c700$010a0a0a@cadence.com> References: <003901c05f5e$9622c700$010a0a0a@cadence.com> Message-ID: <14893.60248.442490.847078@taxus.athen1.ga.home.com> Hi Cayte; Thanks for trying this out! [GenBank parser] > Does it strip html tags? 
When I ran checkoutput.py, it produced this > output: [the parser doesn't like html] > The problem with conversions to text is that Netscape and Explorer and > probably others use different algorithms and produce different text output. GenBank is a flat file format, like FASTA, so all of the html markup that NCBI or whoever puts in is just arbitrary to "beautify" it for the web. You should be able to get the text GenBank version of any record without having to do a "save as text" on an html page. On the NCBI page, there is a Text button at the top of a list of records that will give you the flat-file text version of a record you searched for using Entrez. You can then save this as text, and it'll be consistent between browsers. Once you get this the parser should be happier with the file :-). Let me know if this doesn't help. Brad From dalke at acm.org Wed Dec 6 03:10:36 2000 From: dalke at acm.org (Andrew Dalke) Date: Sat Mar 5 14:42:55 2005 Subject: [Biopython-dev] GenBank parser -- first go Message-ID: <001401c05f5c$056e0460$95ac323f@josiah> Brad: >I spent this weekend getting together a GenBank parser, >which I hope is something that we could include in Biopython in the >future. Wow! I'm glad that people other than me can use it. I've been working on it for so long now that I don't have a good idea of what it means to come into it from scratch. >we all definately have to give >Andrew another big pat on the back for his awesome tool :-). Thank you. >5. Can the code be speeded up/improved in any ways? Suggestions to >help me code better are always very welcome! I look over it and it seems quite good. I do have some comments, which I've included here. (Most of the points are relevant to Python and Martel programming, so I've sent it to the list instead of just you directly.) You use cur_record = iterator.next() while cur_record: ... cur_record = iterator.next() The standard Python idiom is while 1: cur_record = iterator.next() if not cur_record: break ... 
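The idiom above can be tried without Martel. What follows is only a minimal sketch in present-day Python. It assumes a hypothetical RecordIterator class whose next() returns None once the records run out, as the loop above implies. The class and its data are illustrative stand-ins, not part of Martel or Biopython.

```python
class RecordIterator:
    """Hypothetical stand-in: next() returns a record, or None when exhausted."""

    def __init__(self, records):
        self._records = list(records)
        self._pos = 0

    def next(self):
        if self._pos >= len(self._records):
            return None
        record = self._records[self._pos]
        self._pos += 1
        return record


def collect(iterator):
    """Drain the iterator using the 'while 1 ... break' idiom."""
    seen = []
    while 1:
        cur_record = iterator.next()
        if not cur_record:
            break
        seen.append(cur_record)
    return seen
```

The call to next() appears once rather than twice, so the loop body cannot drift out of step with a separate priming read.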
Instead of

    indent_space = Martel.RepN(Martel.Str(" "), 2)

it is better to do (note: two spaces)

    indent_space = Martel.Str("  ")

They give the same result, but "  " gets checked with a single test
while two " "s are checked with two tests. This is even more
appropriate with

    qualifier_space = Martel.RepN(Martel.Str(" "), 21)

(Actually, there is a Martel.optimize module which contains the
function 'merge_strings'. It should merge " "," " into "  " but it
isn't automatically used.)

As an aside, I can see I've been focused on regexps compared to you.
You have

    blank_space = Martel.Rep1(Martel.Str(" "))

where I would use

    blank_space = Martel.Re(" +")

There's no difference in implementation - the decision on which one to
use is based on readability/usability. I (sadly?) know a lot about
regexps so I probably find them more usable :)

Case in point, I probably would have written the LOCUS definition with
more regexps.

    def choice(tag, words):
        # one alternative per allowed word
        exp_list = map(Martel.Str, words)
        return Martel.Group(tag, Martel.Alt(*exp_list))

    residue_types = choice("residue_type",
                           ["DNA", "RNA", "mRNA", "PROTEIN"])

    data_file_division = choice("data_file_division",
        ["PRI", "ROD", "MAM", "VRT", "INV", "PLN", "BCT", "RNA", "VRL",
         "PHG", "SYN", "UNA", "EST", "PAT", "STS", "GSS", "HTG"])

    date = Martel.Group("date",
                        Martel.Re("(?P\d+)-(?P[A-Z]+)-(?P\d+)"))

    locus_line = Martel.Re("LOCUS +(?P\w+) +(?P\d+) bp +") + \
                 residue_types + blank_space + data_file_division + \
                 blank_space + date + Martel.AnyEol()

Interestingly, I didn't use a regexp for residue_types like you do.
That's because I worry about people looking at a list of strings and
thinking they can add arbitrary text - forgetting about escaping regex
characters. Skipping ahead, the 'valid_f_keys' might someday include
characters like '+' and '.', so you really should use a function like
the one I gave above.

Hmmm. I've been pickier than you about ignoring the leading
whitespace.
For example, you have

    # definition line
    # DEFINITION  Genomic sequence for Arabidopsis thaliana BAC T25K16 from
    #             chromosome I, complete sequence.
    definition = Martel.Group("definition",
                              Martel.Rep1(blank_space + Martel.ToEol()))
    definition_line = Martel.Group("definition_line",
                                   Martel.Str("DEFINITION") + definition)

By comparison, I would have enforced that the text be folded to start
on column 13 using

    definition_line = Martel.Group("definition_line",
        Martel.Str("DEFINITION  ") + Martel.ToEol("definition") + \
        Martel.Rep(Martel.Str("            ") + Martel.ToEol("definition")))

Here's a justification for this. It's already common practice with
GenBank files to have subitems indented under the major item. For
example,

    SOURCE      thale cress.
      ORGANISM  Arabidopsis thaliana

Suppose some day the powers that be add a subitem to the DEFINITION

    DEFINITION  Genomic sequence for Arabidopsis thaliana BAC T25K16 from
                chromosome I, complete sequence.
      BLAH      Abcd Ef Ghijkl.

I consider it a good thing for the parser to break at this point,
rather than include the "BLAH Abcd Ef Ghijkl." text as part of the
definition.

It seems the form where a "LABEL" is followed by text that starts on
column 13 and may fold over multiple lines is pretty common. If that's
the case, you can make things simpler by using a function to make the
definition.

    INDENT = 12

    def make_line(label, line_name, data_name):
        assert len(label) < INDENT, "label text too long"
        # label padded out to column 13, then the first line of data
        first_line = Martel.Str(label + " " * (INDENT - len(label))) + \
                     Martel.ToEol(data_name)
        # folded lines: exactly INDENT spaces, then more data
        other_lines = Martel.Str(" " * INDENT) + \
                      Martel.ToEol(data_name)
        return Martel.Group(line_name,
                            first_line + Martel.Rep(other_lines))

    definition_line = make_line("DEFINITION", "definition_line",
                                "definition")
    accession_line = make_line("ACCESSION", "accession_line",
                               "accession")

(If you were really trusting you could use just the line label and
string.lower it to get the data name, and add "_line" to that to get
the line name. I'm usually more explicit than that.)
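The column-13 folding rule can also be illustrated with the standard re module instead of Martel. This is a rough sketch in present-day Python, not the Martel definition itself. make_line_pattern and parse_line are hypothetical helper names, and ".*\n" stands in for ToEol on "\n"-terminated text.

```python
import re

INDENT = 12  # data starts on column 13


def make_line_pattern(label):
    """Regex for a LABEL line whose text may fold onto 12-space-indented lines."""
    assert len(label) < INDENT, "label text too long"
    first = re.escape(label.ljust(INDENT)) + r".*\n"
    other = " " * INDENT + r".*\n"
    return re.compile("(?:" + first + ")(?:" + other + ")*")


def parse_line(label, text):
    """Return (list of data strings, characters consumed), or None on no match."""
    match = make_line_pattern(label).match(text)
    if match is None:
        return None
    lines = match.group(0).splitlines()
    # Drop the label or the indentation from each matched line.
    return [line[INDENT:] for line in lines], match.end()
```

Given the DEFINITION example above with a "  BLAH" subitem on the third line, the match stops after the second line, so the subitem is left for whatever comes next in the format instead of being swallowed into the definition.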
\w includes \d so you don't need to do [\w\d] for the nid For that matter, [\d]+ is the same as \d+ (as in the gi) I'm still undecided about the AtBeginning and AtEnd commands. (You use the former in the version_line definition.) I don't know if they should test for beginning/end of line or beginning/end of input text. In fact, I would rather not have them at all, since they don't work well with the RecordReader idea, where the text is broken up into parts. There's no real need for AtBeginning here so you can remove it. With the function definition above, you can also replace keywords_line = make_line("KEYWORDS", "keywords_line", "keywords") source_line = make_line("SOURCE", "source_line", "source") Out of curiosity, why do arab1.gb and cor6_6.gb have different ORGANISM lines? SOURCE thale cress. ORGANISM Arabidopsis thaliana Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; euphyllophytes; Spermatophyta; Magnoliophyta; eudicotyledons; Rosidae; Capparales; Brassicaceae; Arabidopsis. SOURCE thale cress. ORGANISM Arabidopsis thaliana Eukaryota; Viridiplantae; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; Rosidae; eurosids II; Brassicales; Brassicaceae; Arabidopsis. The first has "Streptophyta", "euphyllophytes" and "Capparales". The second has "core eudicots", "eurosids II" and "Brassicales". Martel includes a helper function called 'Integer' which can simplify some of your definitions, as with reference_num = Martel.Integer("reference_num") pubmed_id = Martel.Integer("pubmed_id") Here's another preference of mine. You have sequence = Martel.Group("sequence", Martel.Re("[\w ]+")) where I would do sequence = Martel.ToEol("sequence") The difference is that you only accept a-zA-Z0-9_ and the space character. It doesn't accept "-" or "*", which I can see possibly getting into the data. Since there's no need to validate the characters, might as well just consume anything you find. 
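The difference between the two sequence definitions shows up on a line containing a gap or stop character. Here is a small sketch with the standard re module (present-day Python, not Martel). The consumed helper and the sample line are invented for illustration.

```python
import re

strict = re.compile(r"[\w ]+")   # only letters, digits, underscore and space
lenient = re.compile(r"[^\n]*")  # anything up to the end of the line


def consumed(pattern, line):
    """Return the prefix of `line` that `pattern` matches from position 0."""
    match = pattern.match(line)
    return match.group(0) if match else ""


sample = "acgt-acgt* ttga"
strict_part = consumed(strict, sample)    # stops at the first "-"
lenient_part = consumed(lenient, sample)  # takes the whole line
```

With the strict pattern the rest of the line would be left unparsed and the record would fail further on, which is the situation the ToEol form avoids.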
(Note to self - need to use \R for the ToEol definition.)

Looking at GenBank.py, you use the record definition thusly:

    parser = genbank_format.record.make_parser(debug_level = debug)
    parser.setContentHandler(_EventGenerator(consumer))
    parser.setErrorHandler(handler.ErrorHandler())
    parser.parseFile(handle)

You really should cache the created parser. It can take quite some
time to generate. During my PIR tests, about 98% of the time in some
of the tests was spent doing generation rather than parsing.

I think you do it because you want to allow different debug levels.
You can still support that in a couple of ways. Here's one:

    class _Scanner:
        def __init__(self):
            # one parser per debug level, built on first use
            self._cached_parsers = {}

        def feed(self, handle, consumer, debug = 0):
            parser = self._cached_parsers.get(debug)
            if parser is None:
                parser = self._cached_parsers[debug] = \
                    genbank_format.record.make_parser(debug_level = debug)
            parser.setContentHandler(_EventGenerator(consumer))
            parser.setErrorHandler(handler.ErrorHandler())
            parser.parseFile(handle)

Here's another which is tuned for smaller memory use and the
assumption that you almost never change debug levels.

    class _Scanner:
        def __init__(self):
            # only the most recently used parser is kept
            self._cached_parser = None
            self._cached_debug = None

        def feed(self, handle, consumer, debug = 0):
            if self._cached_debug == debug:
                parser = self._cached_parser
            else:
                parser = self._cached_parser = \
                    genbank_format.record.make_parser(debug_level = debug)
                self._cached_debug = debug
            parser.setContentHandler(_EventGenerator(consumer))
            parser.setErrorHandler(handler.ErrorHandler())
            parser.parseFile(handle)

You have a list of tags you are interested in receiving. Martel has a
function to create a new expression tree, based on the original, which
only sends back the events you are interested in receiving.
    expression = genbank_format.record
    expression = Martel.select_names(expression, interest_tags)
    parser = expression.make_parser()

BTW, this was implemented because Python's function call overhead is
pretty large so I wanted a way to reduce the number of calls if I knew
an event wasn't needed.

Replace

    fun_to_call = eval("self._consumer" + "." + name)

with

    fun_to_call = getattr(self._consumer, name)

You should also do

    fun_to_call(info_to_pass)

instead of

    apply(fun_to_call, (info_to_pass,))

Here's a cute implementation for _PrintConsumer (untested)

    # Lets you define new labels if you want them
    event_name_converter = {
        "start_feature_table": "Starting feature table",
        "record_end": "End of Record!",
    }

    class _PrintWrapper:
        def __init__(self, name):
            self.name = name
        def __call__(self, content):
            print "%s: %s" % (self.name, content)

    class _PrintConsumer:
        def __init__(self):
            self.data = 'blah'
        def __getattr__(self, name):
            # don't pretend to implement special methods
            if name[:2] == "__":
                raise AttributeError, name
            return _PrintWrapper(event_name_converter.get(name, name))

                    Andrew
                    dalke@acm.org

From dalke at acm.org Wed Dec 6 03:12:29 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:55 2005
Subject: [Biopython-dev] bug in Martel-0.4
Message-ID: <001501c05f5c$47ecb8e0$95ac323f@josiah>

There's a bug in Martel-0.4 and earlier versions. Suppose you have

    ([<>][ABC])+[<>]?

and want to match it against ][ABC]. The "][ABC]. The parser tries to
match the final "<" against [<>][ABC] and should fail then try to
match the "<" against [<>]? .

The bug was that it would match the "<" against the [<>] in [<>][ABC]
and fail at that point. It gives an assertion error about "l" being
greater than "r".

Here's the patch. The only consequence should be a small hit in
performance.
Index: Generate.py =================================================================== RCS file: /home/dalke/cvsroot/Martel/Generate.py,v retrieving revision 1.18 retrieving revision 1.19 diff -r1.18 -r1.19 271c271 < result.append( (None, TT.SubTable, tuple(tagtable)) ) --- > result.append( (">ignore", TT.Table, tuple(tagtable)) ) 275c275 < result.append( (None, TT.SubTable, tuple(tagtable), --- > result.append( (">ignore", TT.Table, tuple(tagtable), (Okay, there are other bugs, but this is one which is part of the core code and is hard to figure out or work around.) Andrew From dalke at acm.org Wed Dec 6 03:16:16 2000 From: dalke at acm.org (Andrew Dalke) Date: Sat Mar 5 14:42:55 2005 Subject: [Biopython-dev] PIR parsing Message-ID: <001601c05f5c$cf21c260$95ac323f@josiah> I've written a much more complete PIR CODATA parser which works with the latest PIR release (Release 66.00, September 30, 2000). I tested it against pir1.dat and pir3.dat. The PIR format is somewhat nasty, but not as bad as I thought it would be. It's like several other formats in that long fields fold over to the next (indented) lines. The only major problem was that the folded lines themselves can contain multiple elements, like FEATURE 2-105 #product cytochrome c #status experimental #label MAT\ or in XML with some extra newlines ... :) 2-105 #product cytochrome c #status experimental #label Some of the fields don't have the #elements at all, but the implementation is pretty strict and it checks that words inside of the text field do not start with a '#'. That check makes the pattern quite gnarly but is needed to ensure I'm not missing an element by accident. The new module is (temporarily) at http://www.biopython.org/~dalke/PIR_3_0.py . It should work fine, but it hasn't been tested against a real need (like generating HTML or data structures) so will likely changed as those needs are resolved. 
Also, the indentation level has changed from release 65 so it probably
won't work with anything other than the most recent version.

Some things to do for the future:
  o rewrite to clean things up, now that the format is known
      (some of the definitions are scaffolding to explore the format)
  o choose better names
  o parse more of the format
      - identify parts of the journal references
      - make each component accessible in a semi-colon delimited list

BTW, the callback overhead for this format is about a factor of 4 more
than the parsing part. The PIR format intermingles sequence letters
and markup about the residue - one letter of one then one letter of
the other. So every sequence character creates three function calls!
(begin, character and end.)

                    Andrew
                    dalke@acm.org

From dalke at acm.org Wed Dec 6 03:39:45 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:55 2005
Subject: [Biopython-dev] GenBank parser -- first go
Message-ID: <001e01c05f60$16d48720$95ac323f@josiah>

Jeff:
>> I don't believe there's any general
>> data structure in existence that can handle the genbank location
>> field. It's described by a BNF grammar and requires a tree!

Treated as a parsing problem, this cannot be done with regular
expressions. When something like that occurs, it should be fine to
leave it as an opaque block of text, which is parsed elsewhere.

John Aycock wrote a really nice context-free parser in pure Python
called SPARK. http://www.csr.uvic.ca/~aycock/python/ Easier to use.
(Which means it is *much* easier to use than lex/yacc.)

Brad:
>I use the ambiguous DNA and RNA
>alphabets so this should cover any letters in the sequence
>(hopefully). I'm not sure if this is ideal, but at least it associates
>the type with the sequence. Suggestions about how to be more strict
>are welcome on this.

You could be more strict by being less strict. There's a
ProteinAlphabet, DNAAlphabet and RNAAlphabet as part of the
Bio.Alphabet module. You can't really do anything with them.
All they say is that sequence contains a single letter of alphabet containing protein, dna or rna residues. It doesn't attempt to define what those letter means. Jeff: >> - There's a TaggingConsumer in Bio.ParserSupport. Oops! You can see I haven't read that bit of code. I included something pretty much like that in my earlier reply to Brad. Andrew dalke@acm.org From chapmanb at arches.uga.edu Wed Dec 6 03:50:55 2000 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:42:55 2005 Subject: [Biopython-dev] Martel-0.4 available In-Reply-To: <000901c05b7e$505616c0$efab323f@josiah> References: <000901c05b7e$505616c0$efab323f@josiah> Message-ID: <14893.65007.146784.658490@taxus.athen1.ga.home.com> Hi Andrew; Sorry I haven't had a chance to comment on new Martel features yet -- I have a bit of feedback in the areas you mentioned based on working with it for writing the GenBank parser. > New regexp syntax - \R > \R means "\n|\r\n?" > [\R] means "[\n\r]" > > New Expression Node - AnyEOL > implements the \R test In general, the \R syntax worked great for me. I'm not a regexp purist or anything, so I have no issues with adding this. The new feature of being able to handle any kind of line feed is very nice. One thing that I ended up doing was not using the AnyEOL test at all, and instead only using the \R syntax. As I starting using it I realized why it was so nice to be able to embed the \R inside of any regular expression, so I ended up only using \R to be consistent (so I used Martel.Re("\R") to detect end of lines. Just thought I would mention it if it helpful to you. But in general, \R seems great by me. I also thought it would be nice if the RecordReader would accept \R as a newline as well, so you could do something like RecordRecorder.EndsWith(handle, "//\R"). Even further along these lines, it would have been nice to be able to set the end with an arbitrary regular expression. 
For GenBank, I would have wanted "//[\R]+" (okay, I would have to escape those //'s, but I'm not sure how many /s that would leave me with :-), so that the end would be // plus an arbitrary number of newlines. I ran into problems with files like the biojava genbank test file, where there are a bunch of linefeeds at the end of the file, but this could be a problem with a file of cut'n'pasted records that had differing amounts of linebreaks. I was able to get around this for GenBank by using StartsWith(handle, "LOCUS"), but just thought I would mention the thought. > RecordReaders rewritten to use mxTextTools to find record > begin and end characters rather than using readline/readlines. I have a quick question about mxTextTools importing -- you are now importing with: from mx import TextTools When did it get a mx meta-directory? Is this a new version or anything fancy? It was no big deal, I was just curious. > - how to make an iterator (would like a bit more feedback) (pausing to read your other mails right now... thanks for the feedback!) One thing that I didn't use is a Martel based iterator -- I just stuck with the type of iterator that Jeff uses in other Biopython parsers but used the RecordReader to implement it. I'm not sure if it could be done in a better way with a Martel iterator... BTW, the debug_level = 2 option on the parser is incredibly nice. It really helps get at why a parse is failing and makes it much easier to correct the problem. I probably would still be pulling my hair out trying to regexp right without this. Thanks! Brad From dalke at acm.org Wed Dec 6 04:53:23 2000 From: dalke at acm.org (Andrew Dalke) Date: Sat Mar 5 14:42:55 2005 Subject: [Biopython-dev] Martel-0.4 available Message-ID: <006101c05f6a$607c39e0$95ac323f@josiah> Brad: >One thing >that I ended up doing was not using the AnyEOL test at all, and >instead only using the \R syntax. Admittedly, the exising AnyEol uses the old "\n" test so won't work on non-UNIX platforms. 
Still, AnyEol() should be just as good as using Re(r"\R"). (Partially because you really should be using a raw quoted string - it works because Python's normal strings currently don't do anything with \R.) >I also thought it would be nice if the RecordReader would accept \R as >a newline as well, so you could do something like >RecordRecorder.EndsWith(handle, "//\R"). Some of the readers allow a trailing "\n". This gets interpreted to mean \R. That changes the definition of "\n", which is probably a bad idea. I used "\n" because it's one character and not as likely to be confused with other characters. It shouldn't be too hard to change to use \R instead. > Even further along these >lines, it would have been nice to be able to set the end with an >arbitrary regular expression. Indeed, that would be a final goal for Martel. I can't do it. If I could then your delimiter would be the Genbank record definition itself and there would be no need for a RecordReader. The problem is that I can't tell when mxTextTools reaches the end of the string. I would like it to ask "I've parsed this data, got any more before I call it the end of input?". All I know now is that the parse failed, but it could be because the text was in the wrong format or it needed more data to finish the check. I could keep on making the string larger and larger, but when would I stop? BTW, that "make the string larger and larger" is what I do with the StartsWith and EndsWith. That only works because I know exactly the contents of the string so I know the failure conditions, and because the record sizes are usually a lot smaller than the lookahead buffer so I don't have the N**2 case of appending strings and retesting. >I ran into problems with >files like the biojava genbank test file, where there are a bunch of >linefeeds at the end of the file, but this could be a problem with a >file of cut'n'pasted records that had differing amounts of >linebreaks. 
If you use the HeaderFooter parser, you have an empty header and a footer which matches "\R*". See the PIR example which allows a trailing \\\ . When it reads past the final /// it will try to parse the newlines as a record. That will fail, so it passes the text off to the footer parser. Another nice thing about the Record Parsers - if there's an error when processing a record, it's an 'error' but not a 'fatalError'. It can recover by processing the next record. >I have a quick question about mxTextTools importing -- you are now >importing with: > >from mx import TextTools > >When did it get a mx meta-directory? Is this a new version or anything >fancy? It was no big deal, I was just curious. Oops, didn't realize I was doing that. I'm using a prerelease version of mxTextTools 1.2 which changes the organization. I really should use just TextTools. (1.2 has backwards compatible support for that.) >One thing that I didn't use is a Martel based iterator -- I just stuck >with the type of iterator that Jeff uses in other Biopython parsers >but used the RecordReader to implement it. I'm not sure if it could be >done in a better way with a Martel iterator... Depends on the needs. From what I saw of your adapter, it was pretty straight match between the two. >BTW, the debug_level = 2 option on the parser is incredibly nice. It >really helps get at why a parse is failing and makes it much easier to >correct the problem. I probably would still be pulling my hair out >trying to regexp right without this. Thanks! I agree. I was working on the PIR parser and having the correct byte position (debug_level = 1) was wonderful. Then when I got really confused, I upped it to 2 to get an idea of what it was attempting to parse. 
Andrew dalke@acm.org From katel at worldpath.net Wed Dec 6 23:57:49 2000 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:42:55 2005 Subject: [Biopython-dev] GenBank parser -- first go References: <003901c05f5e$9622c700$010a0a0a@cadence.com> <14893.60248.442490.847078@taxus.athen1.ga.home.com> Message-ID: <003601c0600a$42b56ee0$010a0a0a@cadence.com> ----- Original Message ----- From: "Brad Chapman" To: "Cayte" Cc: Sent: Tuesday, December 05, 2000 11:31 PM Subject: Re: [Biopython-dev] GenBank parser -- first go > > You should be able to get the text GenBank version of any record > without having to do a "save as text" on an html page. On the NCBI > page, there is a Text button at the top of a list of records that > will give you the flat-file text version of a record you searched > for using Entrez. You can then save this as text, and it'll be > consistent between browsers. > > Once you get this the parser should be happier with the file :-). > Its happier with the text file. The problem now is ye olde machine independent line-feed. The features and annotations run way to the right with some embedded octal 012s. My system is Win98. Its probably fine on Unix and Linux. Cayte From dalke at acm.org Wed Dec 6 22:50:25 2000 From: dalke at acm.org (Andrew Dalke) Date: Sat Mar 5 14:42:55 2005 Subject: [Biopython-dev] GenBank parser -- first go Message-ID: <013301c06000$d81fc0c0$95ac323f@josiah> Cayte: > Its happier with the text file. The problem now is ye olde machine > independent line-feed. The features and annotations run way to the > right with some embedded octal 012s That's my doings, I'm afraid. I didn't change a couple of the definitions to use the new \R syntax. One is the 'ToEol()' command, which Brad uses in his code. 
The fix should be to change Martel/__init__.py from

    if name is None:
        return Re(".*\n")
    else:
        return Group(name, Re(".*")) + Str("\n")

to

    if name is None:
        return Re(r"[^\R]*\R")
    else:
        return Group(name, Re(r"[^\R]*")) + Re(r"\R")

but I haven't tested it to make sure that's correct.

                    Andrew

From katel at worldpath.net Thu Dec 7 02:03:20 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:42:55 2005
Subject: [Biopython-dev] Eol
References: <006101c05f6a$607c39e0$95ac323f@josiah>
Message-ID: <005801c0601b$c8e97360$010a0a0a@cadence.com>

Should the last line of text have an implicit Eol? This test assumes
it should, but the test failed. A test that's identical, except that
the target text ends with a newline, passed.

    exp1 = Martel.ToEol()
    exp2 = Martel.ToEol()
    exp3 = Martel.ToEol()
    expression = exp1 + exp2 + exp3
    print expression
    tagtable, want_flg = Martel.Generate.generate( expression )
    success = tag( "abcdefghij\nOPQRSTUVWXY\n0123456789", tagtable )[ 0 ]
    return ( self.assert_condition( success == 1, "Failed" ) )

                    Cayte

From dalke at acm.org Wed Dec 6 23:24:58 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:55 2005
Subject: [Biopython-dev] Eol
Message-ID: <013c01c06005$a9db5300$95ac323f@josiah>

> Should the last line of text have an implicit Eol? This test
>assumes it should, but the test failed. A test that's identical,
>except that the target text ends with a newline, passed.

The expression:

> exp1 = Martel.ToEol()
> exp2 = Martel.ToEol()
> exp3 = Martel.ToEol()
> expression = exp1 + exp2 + exp3

requires a final newline. It's possible to write an expression which
doesn't need that, as with

    exp3 = Martel.Re(r"[^\R]*\R?")

As written, it is hard in Martel to make the ToEol expression
automatically recognize that a final newline is not needed. It could
be written as

    [^\R]*(\R|$)

assuming that $ was changed to mean "end of text" rather than end of
line as I believe it does now. (I mentioned yesterday that I don't
like the ^ and $ assertions.)
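The effect of a missing final newline can be reproduced with the standard re module. This is a sketch in present-day Python, not Martel code. EOL is an assumed stand-in for Martel's \R (re has no \R of its own), and to_eol and matches_all are hypothetical helpers.

```python
import re

EOL = r"(?:\r\n?|\n)"   # assumed equivalent of Martel's \R, i.e. "\n|\r\n?"
NOT_EOL = r"[^\r\n]*"   # assumed equivalent of [^\R]*


def to_eol(final_newline_optional=False):
    """Regex text for one line; optionally allow the newline to be absent."""
    if final_newline_optional:
        return NOT_EOL + EOL + "?"
    return NOT_EOL + EOL


def matches_all(pattern_text, text):
    """True if the pattern consumes the whole of `text`."""
    return re.fullmatch(pattern_text, text) is not None


# Three strict lines, versus two strict lines plus a relaxed last line.
strict = to_eol() * 3
relaxed = to_eol() * 2 + to_eol(final_newline_optional=True)
```

On the three-line test string above, strict needs the trailing newline while relaxed accepts the text with or without it, which mirrors the [^\R]*\R? form of exp3.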
Instead, it is easier (not necessarily better!) if the format author defines the last line to have an optional \R. Still, complications arise from interactions with the record readers. They read a record at a time and pass the string over to the parser. The '$' will match at the end of that string even though in the full format (non-record reader based) it would not have matched. After a bit of thought I realize that's a knee-jerk reaction. That isn't a big concern since there are similar problems already. For example, if the record parser uses "(.|\n)*" it will read up to the end of the record, but in the full format would read the whole file. Another solution is to have a specialzed ToEol (either a new function or an optional argument) which generates the "\R?" form. Finally, I don't think this is much of an issue for real formats. All the ones I've tested so far have a final newline, although I don't expect that to always be the case. In addition, the last line is usually well defined so a ToEol (special or otherwise) isn't needed. Eg, it can be defined with Re(r"///\R?") or Re(r"END\R?"). I'll point out that the record readers are designed so that a final newline is not needed for the record. Thus, any problems with a missing newline should be completely handleable by an appropriate format definition. Andrew dalke@acm.org From jchang at SMI.Stanford.EDU Thu Dec 7 19:24:23 2000 From: jchang at SMI.Stanford.EDU (Jeffrey Chang) Date: Sat Mar 5 14:42:55 2005 Subject: [Biopython-dev] GenBank parser -- first go In-Reply-To: <14893.59650.80411.133478@taxus.athen1.ga.home.com> Message-ID: > [SeqFeature classes] > > Yes, I definitely agree with needing a general class. However, I've been > > purposefully shying away from proposing a general framework for > > annotations for two main reasons. > > [Blah blah] > Agreed. I think our chances of getting it perfect are pretty slim > :-). 
However, I think it would really help writing "applications that > use Biopython" to have some kind of general class to work off of (even > if it is imperfect). It is just too much work to have to support > Genbank record classes and EMBL record classes and whatever record > classes. That sounds reasonable. Yes, having specific classes for every format is a lot of work. It's fine to map directly into a general class, since bioperl shows that it's still useful for people. > I'm not sure if my SeqFeature classes are the best thing ever, but it > is just meant as a bit of a start. I am definately willing to throw > out/ammend lots of what I wrote if people have got ideas for changing > them to be better. A good test of its generality is to see whether you can map the data from other classes, e.g. Fasta.Record, SwissProt.SProt.Record, or Medline.Record into it. > from Bio.Feature import SeqFeature > > Either way is fine, though (or I'm very open to additional > suggestions), so whatever you think. Are there going to be other Features not applied to sequences, such as StructFeature? I don't think there should be a separate package for Feature. The SeqFeature stuff should be close to the Seq class. Jeff From katel at worldpath.net Fri Dec 8 00:21:12 2000 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:42:55 2005 Subject: [Biopython-dev] GenBank parser -- first go References: <003901c05f5e$9622c700$010a0a0a@cadence.com> <14893.60248.442490.847078@taxus.athen1.ga.home.com> <003601c0600a$42b56ee0$010a0a0a@cadence.com> Message-ID: <003401c060d6$af732280$010a0a0a@cadence.com> > ----- Original Message ----- > From: "Brad Chapman" > To: "Cayte" > Cc: > Sent: Tuesday, December 05, 2000 11:31 PM > Subject: Re: [Biopython-dev] GenBank parser -- first go > > > > > > You should be able to get the text GenBank version of any record > > without having to do a "save as text" on an html page. 
On the NCBI
> > page, there is a Text button at the top of a list of records that
> > will give you the flat-file text version of a record you searched
> > for using Entrez. You can then save this as text, and it'll be
> > consistent between browsers.
> >
>
This should be fine for the first go. For some later go, I think we should strip the xml/html. If there are multiple ways of manually converting to text, you can just about guarantee all of them will be used sooner or later. As much as possible, manual editing should be replaced with automated enhancements.

There are some difficulties with conversion to text, because html/xml isn't tied to the newline mechanism. It can position lines any way it likes with any kind of fonts.

Genbank may be one step away from a flat file, but that's not true of all databases. Rebase and Gobase are examples.

Cayte

From dalke at acm.org Fri Dec 8 04:04:36 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:55 2005
Subject: [Biopython-dev] GenBank parser -- first go
Message-ID: <002501c060f5$e4b205a0$9cac323f@josiah>

Jeff:
>> I don't believe there's any general
>> data structure in existence that can handle the genbank location
>> field. It's described by a BNF grammar and requires a tree!

Me:
>Speaking as a parsing problem, this cannot be done with regular
>expression. When something like that occurs, it should be fine
>to leave it as an opaque block of text, which is parsed elsewhere.
>
>John Aycock wrote a really nice context-free parser in pure
>Python called SPARK. http://www.csr.uvic.ca/~aycock/python/
>Easier to use. (Which means it is *much* easier to use than
>lex/yacc.)

And here's a first run at a SPARK-based parser for the location part of the feature table.
BTW, the documentation at
http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html
contains several errors that I could tell

*** If a location is between 102 and 110 inclusive, do you use "(102.110)" as the example has, or "102.110" as given in the BNF?

  base_position ::= <integer> | <low_base_bound> | <high_base_bound> | <two_base_bound>
  two_base_bound ::= <base_position>.<base_position>

*** Example 5.4 Plasmid has

  CDS  join(complement(567..795)complement(21..349))

which ignores the comma

  CDS  join(complement(567..795),complement(21..349))
                               ^^^

*** There is an example showing "J00194:(100..202)" which also does not agree with the BNF description. From looking at some real data, it seems the documentation should say "J00194:100..202".

The BNF says

  symbol ::= | |

where

  symbol_character ::= | | | _ | - | ' | *
  letter ::= |

This means 'AA' can be parsed as

  | | | | "A" "A"

or

  | | | | "A" "A"

so it's an ambiguous definition.

*** Additionally, symbol_character needs to allow '.' to agree with real-life data (see the regression tests for the text). Instead, I just redefined

  symbol ::= Re("[A-Za-z0-9_'*-][A-Za-z0-9_'*.]*")

(note the "." in the second []).

Anyway, the grammar is attached for anyone wishing to take it further. Enjoy!

Andrew

-------------- next part --------------
# First pass at a parser for the location fields of a feature table.
# Everything likely to change.

# Based on the DDBJ/EMBL/GenBank Feature Table Definition Version 2.2
# Dec 15 1999 available from EBI, but the documentation is not
# completely internally consistent much less agree with real-life
# examples. Conflicts resolved to agree with real examples.
# Uses John Aycock's SPARK for parsing
from spark import GenericScanner, GenericParser

# a list of strings to test
test_data = (
    "467",
    "23..400",
    "join(544..589,688..1032)",
    "1..1000",
    "<345..500",
    "<1..888",
    "(102.110)",
    "(23.45)..600",
    "(122.133)..(204.221)",
    "123^124",
    "145^177",
    "join(12..78,134..202)",
    "complement(join(2691..4571,4918..5163))",
    "join(complement(4918..5163),complement(2691..4571))",
    "complement(34..(122.126))",
    # The doc example allows "J00194:(100..202)" but not the BNF
    "J00194:100..202",
    "1..1509",
    "<1..9",
    "join(10..567,789..1320)",
    "join(54..567,789..1254)",
    "10..567",
    "join(complement(<1..799),complement(5080..5120))",
    "complement(1697..2512)",
    "complement(4170..4829)",
    # added a comma from the documentation
    "join(complement(567..795),complement(21..349))",
    "join(2004..2195,3..20)",
    "<1..>336",
    "394..>402",
    # a few examples from hum1
    "join(AB001090.1:1669..1713)",
    "join(AB001090.1:1669..1713,AB001091.1:85..196)",
    "join(AB001090.1:1669..1713,AB001091.1:85..196,AB001092.1:40..248,AB001093.1:96..212,AB001094.1:71..223,AB001095.1:87..231,AB001096.1:33..211,AB001097.1:35..175,AB001098.1:213..395,AB001099.1:56..309,AB001100.1:54..196,AB001101.1:171..404,AB001102.1:160..378,210..217)",
    "join(9106..9239,9843..9993,11889..11960,16575..16650)",
    "join(<1..109,620..>674)",
    "join(AB003599.1:<61..315,AB003599.1:587..874,47..325,425..>556)",
    "join(<85..194,296..458,547..>653)",
    )

class Token:
    def __init__(self, type):
        self.type = type
    def __cmp__(self, other):
        return cmp(self.type, other)
    def __repr__(self):
        return "Tokens(%r)" % (self.type,)

# "38"
class Integer:
    type = "integer"
    def __init__(self, val):
        self.val = val
    def __cmp__(self, other):
        return cmp(self.type, other)
    def __str__(self):
        return str(self.val)
    def __repr__(self):
        return "Integer(%s)" % self.val

# From the BNF definition, this isn't needed. Does that mean
# that bases can be referred to with negative numbers?
class UnsignedInteger(Integer):
    type = "unsigned_integer"
    def __repr__(self):
        return "UnsignedInteger(%s)" % self.val

class Symbol:
    type = "symbol"
    def __init__(self, name):
        self.name = name
    def __cmp__(self, other):
        return cmp(self.type, other)
    def __str__(self):
        return str(self.name)
    def __repr__(self):
        return "Symbol(%s)" % repr(self.name)

# ">38" -- The BNF says ">" is for the lower bound.. seems wrong to me
class LowBound:
    def __init__(self, base):
        self.base = base
    def __repr__(self):
        return "LowBound(%r)" % self.base

# "<38"
class HighBound:
    def __init__(self, base):
        self.base = base
    def __repr__(self):
        return "HighBound(%r)" % self.base

# 12.34
class TwoBound:
    def __init__(self, low, high):
        self.low = low
        self.high = high
    def __repr__(self):
        return "TwoBound(%r, %r)" % (self.low, self.high)

# 12^34
class Between:
    def __init__(self, low, high):
        self.low = low
        self.high = high
    def __repr__(self):
        return "Between(%r, %r)" % (self.low, self.high)

# 12..34
class Range:
    def __init__(self, low, high):
        self.low = low
        self.high = high
    def __repr__(self):
        return "Range(%r, %r)" % (self.low, self.high)

class Function:
    def __init__(self, name, args):
        self.name = name
        self.args = args
    def __repr__(self):
        return "Function(%r, %r)" % (self.name, self.args)

class AbsoluteLocation:
    def __init__(self, path, local_location):
        self.path = path
        self.local_location = local_location
    def __repr__(self):
        return "AbsoluteLocation(%r, %r)" % (self.path, self.local_location)

class Path:
    def __init__(self, database, accession):
        self.database = database
        self.accession = accession
    def __repr__(self):
        return "Path(%r, %r)" % (self.database, self.accession)

class FeatureName:
    def __init__(self, path, label):
        self.path = path
        self.label = label
    def __repr__(self):
        return "FeatureName(%r, %r)" % (self.path, self.label)

class LocationScanner(GenericScanner):
    def __init__(self):
        GenericScanner.__init__(self)

    def tokenize(self, input):
        self.rv = []
        GenericScanner.tokenize(self, input)
        return self.rv

    def t_double_colon(self, input):
        r" :: "
        self.rv.append(Token("double_colon"))

    def t_double_dot(self, input):
        r" \.\. "
        self.rv.append(Token("double_dot"))

    def t_dot(self, input):
        r" \.(?!\.) "
        self.rv.append(Token("dot"))

    def t_caret(self, input):
        r" \^ "
        self.rv.append(Token("caret"))

    def t_comma(self, input):
        r" \, "
        self.rv.append(Token("comma"))

    def t_integer(self, input):
        r" -?[0-9]+ "
        self.rv.append(Integer(int(input)))

    def t_unsigned_integer(self, input):
        r" [0-9]+ "
        self.rv.append(UnsignedInteger(int(input)))

    def t_colon(self, input):
        r" :(?!:) "
        self.rv.append(Token("colon"))

    def t_open_paren(self, input):
        r" \( "
        self.rv.append(Token("open_paren"))

    def t_close_paren(self, input):
        r" \) "
        self.rv.append(Token("close_paren"))

    def t_symbol(self, input):
        r" [A-Za-z0-9_'*-][A-Za-z0-9_'*.-]* "  # Needed an extra '.'
        self.rv.append(Symbol(input))

    def t_less_than(self, input):
        r" < "
        self.rv.append(Token("less_than"))

    def t_greater_than(self, input):
        r" > "
        self.rv.append(Token("greater_than"))

    # punctuation .. hmm, isn't needed for location
    # r''' [ !#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~] '''

class LocationParser(GenericParser):
    def __init__(self, start='location'):
        GenericParser.__init__(self, start)
        self.begin_pos = 0

    def p_location(self, args):
        """
        location ::= absolute_location
        location ::= feature_name
        location ::= function
        """
        return args[0]

    def p_function(self, args):
        """
        function ::= functional_operator open_paren location_list close_paren
        """
        return Function(args[0].name, args[2])

    def p_absolute_location(self, args):
        """
        absolute_location ::= local_location
        absolute_location ::= path colon local_location
        """
        if len(args) == 1:
            return AbsoluteLocation(None, args[-1])
        return AbsoluteLocation(args[0], args[-1])

    def p_path(self, args):
        """
        path ::= database double_colon primary_accession
        path ::= primary_accession
        """
        if len(args) == 3:
            return Path(args[0], args[2])
        return Path(None, args[0])

    def p_feature_name(self, args):
        """
        feature_name ::= path colon feature_label
        feature_name ::= feature_label
        """
        if len(args) == 3:
            return FeatureName(args[0], args[2])
        return FeatureName(None, args[0])

    def p_feature_label(self, args):
        """
        label ::= symbol
        """
        return args[0].name

    def p_local_location(self, args):
        """
        local_location ::= base_position
        local_location ::= between_position
        local_location ::= base_range
        """
        return args[0]

    def p_location_list(self, args):
        """
        location_list ::= location
        location_list ::= location_list comma location
        """
        if len(args) == 1:
            return args
        return args[0] + [args[2]]

    def p_functional_operator(self, args):
        """
        functional_operator ::= symbol
        """
        return args[0]

    def p_base_position(self, args):
        """
        base_position ::= integer
        base_position ::= low_base_bound
        base_position ::= high_base_bound
        base_position ::= two_base_bound
        """
        return args[0]

    def p_low_base_bound(self, args):
        """
        low_base_bound ::= greater_than integer
        """
        return LowBound(args[1])

    def p_high_base_bound(self, args):
        """
        high_base_bound ::= less_than integer
        """
        return HighBound(args[1])

    def p_two_base_bound(self, args):
        """
        two_base_bound ::= open_paren base_position dot base_position close_paren
        """
        # main example doesn't have parens but others do.. (?)
        return TwoBound(args[1], args[3])

    def p_between_position(self, args):
        """
        between_position ::= base_position caret base_position
        """
        return Between(args[0], args[2])

    def p_base_range(self, args):
        """
        base_range ::= base_position double_dot base_position
        """
        return Range(args[0], args[2])

    def p_database(self, args):
        """
        database ::= symbol
        """
        return args[0].name

    def p_primary_accession(self, args):
        """
        primary_accession ::= symbol
        """
        return args[0].name

def scan(input):
    scanner = LocationScanner()
    return scanner.tokenize(input)

def parse(tokens):
    #print "I have", tokens
    parser = LocationParser()
    return parser.parse(tokens)

if __name__ == "__main__":
    for s in test_data:
        print "--> Trying", s
        print repr(parse(scan(s)))

From dag at sonsorol.org Fri Dec 8 14:18:44 2000
From: dag at sonsorol.org (Chris Dagdigian)
Date: Sat Mar 5 14:42:55 2005
Subject: [Biopython-dev] Changes to the wiki-enabled portions of our website(s)
Message-ID: <5.0.2.1.0.20001208135218.00a99ec0@fedayi.sonsorol.org>

Hi folks,

Over the past few months there have been several incidents where people have abused the collaborative editing features contained within the wiki-enabled portions of the Open Bio websites (bioperl.org, biojava.org, biocorba.org, bioxml.org and biopython.org).

The most recent incident happened within the last 24 hours when someone deleted and/or attempted to change the bioperl wiki docs that outlined our release 07 roadmap and module checklist.

Although we have enough logs & audit data to start tracking these people down we haven't bothered - simple web vandals are not worth our time. The CVS integration within Wiki makes it easy to roll back the malicious deletions & changes whenever we detect them.
Special thanks are owed to Jason Stajich who wrote some behind-the-scenes scripts that automate the rebuild/recover process. The problem has now become one of administrative time and effort -- we have better things to do than monitor our wiki constantly. At the same time the obvious benefits of having anyone within our projects be able to create and update web content make it essential to keep the system around. Hence a compromise (and a bit of a social experiment): We are making the assumption that the web vandals are just random surfers who chanced on our site and could not resist the temptation of web links that say "edit this page" and "delete this page". We are hoping that they are not also subscribers who are reading our mailing lists :) So-- I have now password protected the "edit" and "delete" portions of all the various Open Bio project wiki sites. The 'experiment' is that this email is going to disclose the username and password so that all of you can continue to help improve and update our web content. We are hoping that this semi-public password will be enough to keep our site safe from the casual sort of mischief. Wiki edit/delete access info: ====================== username: wiki password: wicked Our backup plan if this experiment fails is to change the password and reveal it only to people who ask for it. I'm hoping that we will not have to take this step as it will have the effect of slowing down our content creation and updating progress. 
Regards, Chris (and all the Open Bio admin folks) Chris Dagdigian -- Blackstone Technology Group (Work ) dagdigian@computefarm.com (Home) dag@sonsorol.org (Web ) http://ComputeFarm.com, http://open-bio.org, http://sonsorol.org (More ) Full contact info and schedule -- http://sonsorol.org/dag/contact.html From dalke at acm.org Fri Dec 8 18:17:21 2000 From: dalke at acm.org (Andrew Dalke) Date: Sat Mar 5 14:42:55 2005 Subject: [Biopython-dev] Fw: VMD Python binaries available for testing Message-ID: <010d01c0616d$8d4961a0$9cac323f@josiah> This is a molecular visualization program I used to work on several years ago. It was all Tcl based, but now they are adding python support. Anyone interested in checking it out? Andrew -----Original Message----- From: Justin Gullingsrud To: vmd-l@ks.uiuc.edu Date: Friday, December 08, 2000 1:10 PM Subject: VMD Python binaries available for testing >A new feature in VMD 1.6 will be the addition of an embedded Python >interpreter in VMD, with the ability to run scripts, import existing >modules, and control VMD. The Python interpreter co-exists with the >Tcl interpreter which is also part of VMD; you can use either >interpreter, or both, and switch between them. > >Features > >Nearly all the VMD Tcl functions will have functional Python analogues >when VMD 1.6 is released. Support for the Tkinter GUI module will >also be provided. Complete documentation for the available Python >commands can be found in the User's Guide. VMD 1.6 will use Python >2.0. All of the Python modules for VMD will work without installing >Python on your system; of course, if you do have the Python libraries, >you can tell VMD where to find them and incorporate into your VMD >scripts. Again, see the documentation for more information. > >Installation > >Binaries for IRIX6, Linux-Mesa, and Linux-DRI are now available from the >TB ftp site, ftp.ks.uiuc.edu, in the directory pub/mdscope/vmd/python/. >The binaries are a beta version of VMD, not a final release. 
Installation
>proceeds in exactly the same way as previous versions. You may need to set
>an environment variable to direct the VMD Python interpreter to the location
>of your Python libraries; e.g.
>   setenv PYTHONPATH /usr/local/lib/python2.0
>     - or -
>   setenv PYTHONPATH /home/justin/vmd/Python-2.0/lib_LINUX/lib/python2.0
>
>or
>   setenv PYTHONHOME /usr/local
>     - or -
>   setenv PYTHONHOME /home/justin/vmd/Python-2.0/lib_LINUX
>
>For PYTHONPATH, use the location of the actual python libraries. For
>PYTHONHOME, use the directory in which python was installed (the prefix
>directory in the configure script).
>
>
>Try it out and let us know how it goes!
>
>Justin
>
>--
>
>Justin Gullingsrud 3111 Beckman Institute
>H: (217) 384-4220 I got a million ideas that I ain't even rocked yet...
>W: (217) 244-8946 -- Mike D

From dalke at acm.org Sat Dec 9 01:55:08 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:55 2005
Subject: [Biopython-dev] PIR parsing
Message-ID: <003701c061ad$2194a4c0$1fac323f@josiah>

Me:
>I've written a much more complete PIR CODATA parser which works with
>the latest PIR release (Release 66.00, September 30, 2000). I tested
>it against pir1.dat and pir3.dat.

I'm testing it against pir2.dat, which is 394,221,543 bytes uncompressed and 174,756 records. I'm doing the run on the bioperl.org machine since it has more disk space available than my laptop. The parser parses about 3 or 4 records per second (sshd takes 1/2 the CPU!).

I've processed 15% of the records and found only two problems in my parser. Both are my fault because I made too strong an assumption of the format.

BTW, the format definition at http://pir.georgetown.edu/pirwww/otherinfo/doc/co2.pdf is wrong in many of the details - probably because it is 6 years old.
Andrew

From dalke at acm.org Sat Dec 9 02:29:11 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:55 2005
Subject: [Biopython-dev] PIR parsing
Message-ID: <004b01c061b1$bac740e0$1fac323f@josiah>

Forgot to ask,

What is the point of having both the "ref" and "dat" format in PIR?

ref format example:

>P1;I52708
ELAV-like neuronal protein 1, truncated splice form - human
N;Alternate names: Drosophila ELAV(embryonic lethal, abnormal vision)-like 4; Hu antigen D; paraneoplastic encephalomyelitis antigen
C;Species: Homo sapiens (man)

dat format example:

ENTRY I52708 #type complete
TITLE ELAV-like neuronal protein 1, truncated splice form - human
ALTERNATE_NAMES Drosophila ELAV(embryonic lethal, abnormal vision)-like 4; Hu antigen D; paraneoplastic encephalomyelitis antigen
ORGANISM #formal_name Homo sapiens #common_name man

As far as I can tell, the ref format is easier to machine parse than the dat one, and is more compact. The dat format is easier for a human to scan. Also, the dat format contains the sequence information while the ref one does not.

Can anyone here provide me with some background?

Andrew

From jchang at SMI.Stanford.EDU Sat Dec 9 23:15:02 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar 5 14:42:55 2005
Subject: [Biopython-dev] checked in code for GenBank access
Message-ID:

Hello everybody,

Since we've got a GenBank parser in the works (thanks Brad!), I've checked in some code to search and retrieve records from GenBank. It's in Bio/GenBank. We'll also put the parser in there soon.
Jeff

From katel at worldpath.net Sun Dec 10 23:25:42 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:42:55 2005
Subject: [Biopython-dev] multiline location
References: <002501c060f5$e4b205a0$9cac323f@josiah>
Message-ID: <003501c0632a$6dd00720$010a0a0a@cadence.com>

When I fed this multiline to parse_location.py

"""complement(join(8811..8995,9120..10082,10181..10291,
10608..10852,10996..11147,11461..11559))
"""

It reported this error message

Syntax error at or near `Tokens('comma')' token

The Trying -> print s line displayed

-> Trying complement(join(8811..8995,9120..10082,10181..10291,
10608..10852,10996..11147,11461..11559))
<345..500

It was reading past the closing triple quote.

Cayte

From katel at worldpath.net Sun Dec 10 23:30:52 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:42:55 2005
Subject: [Biopython-dev] multiline location
References: <002501c060f5$e4b205a0$9cac323f@josiah> <003501c0632a$6dd00720$010a0a0a@cadence.com>
Message-ID: <003f01c0632b$25ff3640$010a0a0a@cadence.com>

----- Original Message -----
From: "Cayte"
To: "Andrew Dalke" ;
Sent: Sunday, December 10, 2000 8:25 PM
Subject: [Biopython-dev] multiline location

> When I fed this multiline to parse_location.py
>
> """complement(join(8811..8995,9120..10082,10181..10291,
> 10608..10852,10996..11147,11461..11559))
> """
> It reported this error message
> Syntax error at or near `Tokens('comma')' token
>
> The Trying -> print s line displayed
> -> Trying complement(join(8811..8995,9120..10082,10181..10291,
> 10608..10852,10996..11147,11461..11559))
> <345..500
>
> It was reading past the closing triple quote.
>
> Cayte
>
The test also failed when I used the backslash-linefeed-backslash format instead of the triple quote.
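
A guess at what may have happened here, sketched in present-day Python: if the multi-line string was added to test_data without a trailing comma, Python joins it with the next string literal, which would explain the "<345..500" showing up in the "Trying" output. The modified script was not posted, so the missing comma is an assumption, and normalize_location is a made-up helper name. Since the scanner as posted defines no rule for whitespace, stripping it before scanning looks like the simplest workaround.

```python
# Assumed reproduction: no comma after the triple-quoted literal, so the
# two adjacent string literals are concatenated into a single tuple entry.
without_comma = (
    """complement(join(8811..8995,9120..10082,10181..10291,
10608..10852,10996..11147,11461..11559))
"""
    "<345..500",
)

# With the comma in place the tuple has two separate entries.
with_comma = (
    """complement(join(8811..8995,9120..10082,10181..10291,
10608..10852,10996..11147,11461..11559))
""",
    "<345..500",
)


def normalize_location(text):
    """Hypothetical helper: drop all whitespace so a wrapped location is one line."""
    return "".join(text.split())


print(len(without_comma))   # 1
print(len(with_comma))      # 2
print(normalize_location(with_comma[0]))
```

The normalized string could then be handed to scan() and parse() from the posted script.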
cayte From edwin.steele at eBioinformatics.com Mon Dec 11 00:30:52 2000 From: edwin.steele at eBioinformatics.com (Edwin Steele) Date: Sat Mar 5 14:42:55 2005 Subject: [Biopython-dev] PIR parsing In-Reply-To: <004b01c061b1$bac740e0$1fac323f@josiah> Message-ID: <002801c06333$86adc170$bd2aa8c0@au.int.enbio.com> Andrew, > What is the point of having both the "ref" and "dat" format > in PIR. [snip] > As far as I can tell, the ref format is easier to machine parse > than the dat one, and is more compact. The dat format is easier > for a human to scan. Also, the dat format contains the sequence > information while the ref one does not. > > Can anyone here provide to me some background? seq is usually derived from dat so that blast databases (or anything else that requires fasta formatted sequences) can be made. I understand that ref is a trimmed down dat without sequence data so you can save some space by not keeping the partially redundant dat. I don't know for sure, but the more compact format might be another measure along those lines. Perhaps, though they're competing with the OWL database for the most obfuscated database format ;) Cheers, Edwin. ------------------------------------------------------------------------------- Edwin Steele QA Manager, eBioinformatics. http://www.ebioinformatics.com email: edwin.steele@eBioinformatics.com Bay 16/104, Australian Technology Park ph: +61 (2) 9209-4765 Eveleigh 1430, NSW, Australia. From edwin.steele at eBioinformatics.com Mon Dec 11 01:09:09 2000 From: edwin.steele at eBioinformatics.com (Edwin Steele) Date: Sat Mar 5 14:42:55 2005 Subject: [Biopython-dev] GenBank parser -- first go In-Reply-To: <001401c05f5c$056e0460$95ac323f@josiah> Message-ID: <002b01c06338$e049df70$bd2aa8c0@au.int.enbio.com> Brad, > Here's a justification for this. It's already common practice with > GenBank files to have subitems indented under the major item. For > example, > > SOURCE thale cress. 
> ORGANISM Arabidopsis thaliana There are a few caveats that come up with indenting that I've come across. Save the feature table, there used to be only one level of subitem. The new PUBMED tag breaks this paradigm: REFERENCE 1 (bases 1 to 675) AUTHORS Sant,V.J., Sainani,M.N., Sami-Subbu,R., Ranjekar,P.K. and Gupta,V.S. TITLE Ty1-copia retrotransposon-like elements in chickpea genome: their identification, distribution and use for diversity analysis JOURNAL Gene 257 (1), 157-166 (2000) PUBMED 11054578 It's indented three spaces instead of two... Brad, this will mean your indent_space definition will break (or pick up unnecessary stuff). Also, it's not fair to assume that the initial indenting is two spaces. In some of the larger entries like LMFLCHR12 that is about 2000000 bp long, the seven figures in the origin section causes there to be a one character indent instead of the normal two character minimum. ORIGIN 1 TCAGTTTGTG CGGGGTGTGC ATATGCATGT GCATGCATAC ATGCACATAC ACATATATAC ... 2287441 GCGTCACGTG GCGACGTCGA GGCCCGCAGC TTCTATTTTT TTT // However, I don't think this will break anything in the parser, but is something to be remembered if you become more strict... Cheers, Edwin. ------------------------------------------------------------------------------- Edwin Steele QA Manager, eBioinformatics. http://www.ebioinformatics.com email: edwin.steele@eBioinformatics.com Bay 16/104, Australian Technology Park ph: +61 (2) 9209-4765 Eveleigh 1430, NSW, Australia. 
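
The indentation caveats above suggest classifying a header line by a range of indents rather than by an exact count. The sketch below is only an illustration in present-day Python, not Brad's indent_space definition: the function names are invented, the 12-column keyword field is an assumption about the GenBank flat-file layout, and the spacing in the sample lines is reconstructed because the archive did not preserve it. The sequence lines of the ORIGIN section would need separate handling.

```python
def split_indent(line):
    """Return (number of leading spaces, remaining text) for one line."""
    line = line.rstrip("\n")
    stripped = line.lstrip(" ")
    return len(line) - len(stripped), stripped


def classify(line, data_column=12):
    """Classify a header line by how far it is indented.

    data_column is an assumed width of the keyword field; anything
    indented at least that far is treated as a continuation line.
    """
    indent, rest = split_indent(line)
    if not rest:
        return "blank"
    if indent == 0:
        return "item"
    if indent < data_column:
        return "subitem"
    return "continuation"


# Sample lines with assumed spacing (two-space and three-space subitems).
samples = [
    "SOURCE      thale cress.",
    "  ORGANISM  Arabidopsis thaliana",
    "   PUBMED   11054578",
    "            identification, distribution and use for diversity analysis",
]
for sample in samples:
    print(classify(sample), repr(sample))
```

With this test both the two-space ORGANISM line and the three-space PUBMED line come out as subitems.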
From dalke at acm.org Mon Dec 11 15:55:12 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:55 2005
Subject: [Biopython-dev] GenBank parser -- first go
Message-ID: <003f01c063b4$abbc95a0$c2ab323f@josiah>

I was playing around with a different way to handle the FEATURES section and came across this example in IRO125195:

FEATURES             Location/Qualifiers
     source          1..1326
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="21"
                     /clone="IMAGE cDNA clone 125195"
                     /clone_lib="Soares fetal liver spleen 1NFLS"
                     /note="contains Alu repeat; likely to be be derived from
                     unprocessed nuclear RNA or genomic DNA; encodes putative
                     exons identical to FTCD; formimino transferase
                     cyclodeaminase; formimino transferase (EC 2.1.2.5)
                     /formimino tetrahydro folate cyclodeaminase (EC 4.3.1.4)"

See the "/formimino"? I had thought that any line starting with a '/' was a new qualifier, but it looks like you really do have to parse the quotes as you go to tell when you are done. While the quoted quote checking (double the "s) is doable with a regular expression, it gets pretty complicated.

Andrew

From dalke at acm.org Mon Dec 11 17:03:38 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:55 2005
Subject: [Biopython-dev] Martel performance
Message-ID: <004001c063be$3808a2c0$c2ab323f@josiah>

I'm finding my PIR and the GenBank parser (the last rather modified from Brad's because I was trying to be more strict on whitespace) to be pretty slow. The PIR parser only parses 43K of text per second while the GenBank one is but 6.6K/second.

Compare that to the SwissProt parser where I was parsing the whole file in 20 minutes, which is about 200K per second.

These tests were done on different machines, but there's only about a factor of 2 performance difference between them. (Comparison done by running my genbank regression test on my Intel laptop and on the bioperl.org Alpha machine, which is where the PIR and GenBank tests are run.
My laptop is faster although sshd on bioperl takes 50% of the CPU.)

I can only think of a few reasons which might cause this:

1) Martel is intrinsically slow - but see sprot as a counter example

2) These two files use indented whitespace for continuations and to indicate subitems. Almost every time you get to the end of a line it needs to test if the next line is a continuation. In most cases it isn't, so about 1/4 of the file is read twice. But that's not a factor of 20.

3) Brad has a list of possible feature key names and a list of qualifiers. Odds are you have to scan 1/2 the list before finding a matching name. This again causes some duplicate checks, but only in the features section and I just can't see another factor of two out of that.

4) The regexp to allow folding with the whitespace indentation is something like:

    indicator + \
        Group("tag", text) + \
        Rep(space_indent + Group("tag", text))

This can make for some very large regular expressions. GenBank, when expressed as a string, is about 6K long and the generated tag table itself is hard to guess, but it's roughly 100K while PIR is about 600K. These are state transition tables so perhaps I'm losing cache coherency because most of my jumps are too large.

I don't know what effect sshd has on the overall bioperl.org performance. It only has 72K of RSS so I can't see how there's a bad context swap hit. I can't find any equivalent on Linux to IRIX's 'osview' or 'gr_osview', which is what I usually used to look at this sort of overhead. Any pointers?

5) I'm using the same RecordReader for SWISS-PROT and GenBank (EndsWith) so that shouldn't be a problem. However, in the first I think I was using the reader directly while with GenBank I'm going through the HeaderFooter parser. There might be some difference there, but I can't think of what that might be.

6) Memory use

I'm using gbpri8 as my test case. The first entry, HUAF001549, is about 260K long with 202K bases.
This causes my format definition to take up 50MB (!) of memory according to top, so roughly a 200-fold expansion. My test with SWISS-PROT and MDL's .mol files only needed a factor of about 6 as I recall. I don't know why so much memory is needed for GenBank and I didn't look at PIR's use to compare.

As an aside, Edwin Steele points out that LMFLCHR12 has 2Mbases so is about an order of magnitude larger. Well, RAM is cheap.

Without Martel running, bioperl.org's 'free' says:

             total       used       free     shared    buffers     cached
Mem:        126568     121048       5520      58568       4544      77624
-/+ buffers/cache:      38880      87688
Swap:       208760      23760     185000

When I run the test, it says:

             total       used       free     shared    buffers     cached
Mem:        126568     123056       3512      57656       3688      29760
-/+ buffers/cache:      89608      36960
Swap:       208760      23704     185056

Compare that to top's

  PID USER     PRI  NI  SIZE  RSS SHARE STAT  LIB %CPU %MEM   TIME COMMAND
 7930 dalke     17   0 53240  51M  1824 R     51M 49.8 21.0   5:43 python

As I read it, all of the memory is being used, but 77MB was used for cache. When the python job started, that moved out, giving 47MB to python, so it's all running in main memory. Only about 56K more of swap is being used, so there isn't a lot of page swapping going on.

I've ordered a new disk for my laptop and more memory. That will give me a chance to test everything on a dedicated machine. Hopefully the problem is simply context switch overhead with the sshd2 and http sessions on bioperl.org.

I've put off doing real work for too long so I won't have time to look at this for a couple of weeks. If anyone wants to work out what the problem is using the latest code, it's on biopython.org in /tmp/dalke/gb/Martel . It's now in the tedious part of timing and profiling. (One approach might be to take a section of a file, duplicate it a lot of times, and measure how the times and memory use change as a function of size.)

Hmm. There is another difference between the GenBank format and the others. I'm using the \R construct for newline detection.
Perhaps there's some unexpected performance hit there, though I can't see what that would be.

Andrew
dalke@acm.org

From dalke at acm.org Mon Dec 11 17:19:04 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:55 2005
Subject: [Biopython-dev] Martel performance
Message-ID: <008c01c063c0$5fff2040$c2ab323f@josiah>

P.S: By looking at the output as it parses, it's easy to tell that some of the records are processed quite quickly while others take a long time. That should provide some hint as to where the performance hit comes in.

Andrew

From edwin.steele at eBioinformatics.com Mon Dec 11 22:43:06 2000
From: edwin.steele at eBioinformatics.com (Edwin Steele)
Date: Sat Mar 5 14:42:55 2005
Subject: [Biopython-dev] GenBank parser -- first go
In-Reply-To: <003f01c063b4$abbc95a0$c2ab323f@josiah>
Message-ID: <002601c063ed$a332d370$bd2aa8c0@au.int.enbio.com>

Andrew,

> See the "/formimino"? I had thought that any line starting
> with a '/' was a new qualifier, but it looks like you really do
> have to parse the quotes as you go to tell when you are done.
> While the quoted quote checking (double the "s) is doable with
> a regular expression, it gets pretty complicated.

I've found this too. A good test for a new qualifier is that the line starts with a '/' and either:

- has an even no. of quotes and ends with a '"' or
- has an odd no. of quotes and does not end with a '"' or
- has no quotes at all.

Erk.

Cheers,
Edwin.

-------------------------------------------------------------------------------
Edwin Steele QA Manager, eBioinformatics.
http://www.ebioinformatics.com email: edwin.steele@eBioinformatics.com
Bay 16/104, Australian Technology Park ph: +61 (2) 9209-4765
Eveleigh 1430, NSW, Australia.
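
Edwin's rule of thumb can be written down as a small predicate. This is a sketch in present-day Python, not code from Brad's parser or from Martel, and the function name is invented for illustration. Like the rule itself it is a heuristic: a continuation line inside a quoted value that starts with '/' and contains no quote character at all would still be taken for a new qualifier. The /codon_start line is added here only as an example of an unquoted qualifier; the other lines come from the IRO125195 record quoted earlier.

```python
def looks_like_new_qualifier(line):
    """Apply the three-way quote test to one feature-table line."""
    text = line.strip()
    if not text.startswith("/"):
        return False
    quotes = text.count('"')
    ends_with_quote = text.endswith('"')
    if quotes == 0:
        # e.g. /codon_start=1
        return True
    if quotes % 2 == 0:
        # value opened and closed on this line
        return ends_with_quote
    # odd count: a value that opens here and continues on later lines
    return not ends_with_quote


lines = [
    '/organism="Homo sapiens"',
    '/note="contains Alu repeat; likely to be be derived from',
    '/formimino tetrahydro folate cyclodeaminase (EC 4.3.1.4)"',
    '/codon_start=1',
]
for line in lines:
    print(looks_like_new_qualifier(line), line)
```

Only the "/formimino" line is rejected, which is the behaviour the rule is after. Doubled quotes inside a value add two to the count, so they do not change its parity.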
From katel at worldpath.net Wed Dec 13 03:10:08 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:42:55 2005
Subject: [Biopython-dev] Unigene parsers
References: <004001c063be$3808a2c0$c2ab323f@josiah>
Message-ID: <005001c064dc$1faba820$010a0a0a@cadence.com>

To write a UniGene parser, several issues need to be resolved. The UniGene page is structured with major keys and subkeys. Each major key is on a line by itself and is in all caps, but several subkeys can be placed on a single line. Each subkey is separated from its value by a colon.

One problem is that the records vary in which keys they contain. I ran into this with Gobase. It required calls to routines with tests like

    start = string.find( text, field )
    if( start == -1 ):
        return ''

Calls to useless routines could waste a lot of CPU time. Would it be cleaner to read the major keys into a temporary dictionary and then consume the ones that are present and check that all the necessary keys are present?

A second problem is that since there can be several subkeys on a line, with only white space separating the value from the next key, multiword keys or values can be ambiguous. You can make guesses but there's no guaranteed way to disambiguate the subkey/value pairs.

A third issue is that the record only displays the first ten sequences of the cluster. How do we deal with information that is spread over several web pages?

Cayte

From jchang at SMI.Stanford.EDU Wed Dec 13 18:19:10 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar 5 14:42:55 2005
Subject: [Biopython-dev] questions for next release
Message-ID:

Hello everybody,

The plan was to try and get out a relatively quick release with Martel & mxTextTools bundled in. There are a few things we need to work out as this is happening:

- Andrew, is Martel under source code control? Do you want to develop it as part of the biopython CVS (and release), or do you want it to be a dependency that's bundled and installed together?
- Are we going to use/bundle mxTextTools 1.2? - setup.py now accepts earlier versions (<0.8?) of distutils. Should we require the version that comes with Python 2.0? This would simplify the script, I think. - Any objections to moving more code into __init__.py? For example, the code in Prosite/Prosite.py would be moved to Prosite/__init__.py. This would definitely BREAK CODE, but the fix would be trivial. If this does happen, does anyone know how to move code between files without losing the CVS logs of the changes? - Should we check in Brad's new GenBank code? - ... and Brad's SeqFeature classes? - Andrew, I've submitted a bug report (more of a feature request) in Jitterbug about making the regression tests indifferent to EOL conventions. This would be nice if people are developing and testing on different platforms, which breaks the tests. Could you look at it and let me know what you think? - Anyone good with Distutils and think they can get Martel and mxTextTools to install with biopython? :) Jeff From chapmanb at arches.uga.edu Thu Dec 14 05:24:49 2000 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:42:55 2005 Subject: [Biopython-dev] questions for next release In-Reply-To: References: Message-ID: <14904.40945.699446.764013@taxus.athen1.ga.home.com> Hey Jeff; > The plan was to try and get out a relatively quick release with Martel & > mxTextTools bundled in. Release early, release often -- sounds good! > - setup.py now accepts earlier versions (<0.8?) of distutils. Should we > require the version that comes with Python 2.0? This would simplify the > script, I think. I think we should do this -- we can detect an old version and just tell people to upgrade. They need to upgrade if they are using such an old version :-). > - Any objections to moving more code into __init__.py? For example, the > code in Prosite/Prosite.py would be moved to Prosite/__init__.py. This > would definitely BREAK CODE, but the fix would be trivial.
This is okay by me, although I don't really think it's necessary. I don't find it that annoying to import the double Prosites (or whatever) but that is just me. The only sort of objection is that sometimes people don't look for actual code in __init__.py files (I know I didn't at first when I was using python), so it could make it more confusing to browse the code if you are new to python. But you know lots more about python coding and style than I, so if you prefer it, I'm not going to stop ya :-) > If this does > happen, does anyone know how to move code between files without losing the > CVS logs of the changes? I'm not enough of a CVS expert to know this -- maybe Ewan, the master o' CVS (and everything else :-), would be a good person to ask? > - Should we check in Brad's new GenBank code? > > - ... and Brad's SeqFeature classes? I hope to have a new version of these after this weekend, with suggestions from everyone included. Whether or not to include 'em is up to everyone else though. I do plan to do a GenBank specific record class, so if people don't like the SeqFeature classes, we can just include the GenBank specific stuff. > - Anyone good with Distutils and think they can get Martel and mxTextTools > to install with biopython? :) If we can get Martel and mxTextTools to install with distutils, then I think I could try this. The last version of mxTextTools that I could find doesn't use distutils, but I might be missing a newer version that does. I thought I saw MAL asking lots of questions about distutils on the SIG mailing list... I guess doing this would be a matter of: o Doing a test import mxTextTools and Martel. o If they can't be imported -- fetch them from an ftp site and unpack them. o Run setup.py on these modules and install them, then go back to the regular installation. 
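The first step in that list, the test import, could look something like the sketch below. This is only an illustration under stated assumptions: it uses `importlib` from current Python rather than anything available in Python 2.0, the helper name `find_missing` is made up, and the module names a real setup script would pass in (for example "Martel", or whatever name mxTextTools installs under) are not confirmed by this thread. The fetching and installing steps are deliberately left out.

```python
import importlib


def find_missing(module_names):
    """Return the names from module_names that cannot be imported.

    Hypothetical helper: a setup script could call this before the
    regular installation and then decide whether to fetch, bundle,
    or just print a message for each missing dependency.
    """
    missing = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            # Not installed (or not importable); record it and move on.
            missing.append(name)
    return missing
```

A caller would then do something like `for name in find_missing(["Martel"]): print("please install", name)`. Note that this only answers "is it there?"; it says nothing about which version is installed.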
I think the newly formed catalog-sig (http://python.org/sigs/catalog-sig/) is interested in getting something like this going generally, but I'm not sure at all about the status of any kind of implementation. Brad From jchang at SMI.Stanford.EDU Thu Dec 14 18:37:53 2000 From: jchang at SMI.Stanford.EDU (Jeffrey Chang) Date: Sat Mar 5 14:42:55 2005 Subject: [Biopython-dev] questions for next release In-Reply-To: <14904.40945.699446.764013@taxus.athen1.ga.home.com> Message-ID: > > - setup.py now accepts earlier versions (<0.8?) of distutils. Should we > > require the version that comes with Python 2.0? This would simplify the > > script, I think. > > I think we should do this -- we can detect an old version and just > tell people to upgrade. They need to upgrade if they are using such an > old version :-). OK. > > - Any objections to moving more code into __init__.py? For example, the > > code in Prosite/Prosite.py would be moved to Prosite/__init__.py. This > > would definitely BREAK CODE, but the fix would be trivial. > > This is okay by me, although I don't really think it's necessary. I > don't find it that annoying to import the double Prosites (or > whatever) but that is just me. Yeah, it's mostly a cosmetic change. Plus, it seems to match how things are done in other packages, e.g. Martel, xml. > > - Should we check in Brad's new GenBank code? > > > > - ... and Brad's SeqFeature classes? > > I hope to have a new version of these after this weekend, with > suggestions from everyone included. Whether or not to include 'em is > up to everyone else though. I do plan to do a GenBank specific record > class, so if people don't like the SeqFeature classes, we can just > include the GenBank specific stuff. Alright. Let's plan on including everything, unless anyone has strenuous objections. Let's see what happens... Are you also including Andrew's location parser?
> > - Anyone good with Distutils and think they can get Martel and mxTextTools > > to install with biopython? :) > > If we can get Martel and mxTextTools to install with distutils, then I > think I could try this. The last version of mxTextTools that I could > find doesn't use distutils, but I might be missing a newer version > that does. I thought I saw MAL asking lots of questions about > distutils on the SIG mailing list... Hmmm, if he's working on disutils-ing the package, then we shouldn't duplicate that work. Andrew, do you know anything about this? Do you mind sending him a quick email to see whether it's going to happen? > I guess doing this would be a matter of: > > o Doing a test import mxTextTools and Martel. > > o If they can't be imported -- fetch them from an ftp site and unpack > them. How to handle version differences? Jeff From katel at worldpath.net Fri Dec 15 01:22:33 2000 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:42:55 2005 Subject: [Biopython-dev] Unigene parsers( cintinued ) References: <004001c063be$3808a2c0$c2ab323f@josiah> <005001c064dc$1faba820$010a0a0a@cadence.com> Message-ID: <004f01c0665f$69d580e0$010a0a0a@cadence.com> ----- Original Message ----- From: "Cayte" To: Sent: Wednesday, December 13, 2000 12:10 AM Subject: [Biopython-dev] Unigene parsers > A third issue is that the record only displays the first ten sequences of > the cluster. How do we deal with information that is spread over several > web pages? > I think the www scrips need to search for the correct link and pull in the information. The 10 sequence limit only makes sense in the GUI, not the way the user is likely to use our scripts. Other parsers may need this capability too. Cayte From dalke at acm.org Sat Dec 16 21:12:21 2000 From: dalke at acm.org (Andrew Dalke) Date: Sat Mar 5 14:42:55 2005 Subject: [Biopython-dev] Re: Martel performance Message-ID: <018f01c067ce$cc29f6c0$afab323f@josiah> Short version. I found the source of the slowdown. 
I'll revert my change which caused the problem, but that reintroduces a behavioral problem I don't like. Unfortunately, it's a behaviour which is pretty inherent in Martel and I don't see an easy fix, so it will have to stay in until someone better than I figures out a good solution. Me: >I'm finding my PIR and the GenBank parser (the last rather modified >from Brad's because I was trying to be more strict on whitespace) >to be pretty slow. The PIR parser only parses 43K of text per >second while the GenBank one is but 6.6K/second. Compare that >to the SwissProt parser where I was parsing the whole file >in 20 minutes, which is about 200K per second. >I can only think of a few reasons which might cause this: Ha! There's another which was the real answer. The MaxRepeat expression (used for "*" repeats) had a bug where it doesn't fully allow backtracking. For example, suppose you have ([A-C][X-Z])*[A-F] and match it against "BZA". The bug in the code was that it would match B against "[A-C]" then Z against "[X-Z]". It would next try the A against "[A-C]" which would match so try matching [X-Z], which fails since there is no other character. The bug was that it wouldn't backtrack against a *partial* match, so it wouldn't try to see if [A-F] matches. This was because I was using if max_count == sre_parse.MAXREPEAT: result.append( (None, TT.SubTable, tuple(tagtable), +1, 0)) which expands the current taglist instead of creating a new subtaglist. (Meaning matches were added to the current list as they were found, rather than building a sublist and merging the result only if the whole list matches.) The fix I did was to replace the SubTable with a Table, and use a fake tag name to tell mxTextTools to append the matches upon success. if max_count == sre_parse.MAXREPEAT: #result.append( (">ignore", TT.Table, tuple(tagtable), +1, 0)) The consequence is that every repeat creates a new list with the tag ">ignore" associated with it. 
This explains the memory use and performance. Eg, consider ".*\n". This is converted to something like (?P.)*\n which means matching a line of text of length N + "\n" creates N sublists - one for each character. When I took that fix out of Martel, the performance, which was about 2 records per second, went up to 54 records per second. My test set, gbpri8, is 96MB and can be parsed in 525 seconds, or 187K/second. This is equivalent to what I was getting for parsing SWISS-PROT. That was actually the clue for how I found the problem. I was showing off Martel yesterday and noticed the SWISS-PROT parser was a lot slower than it used to be. That indicated that the shift in performance was not anything to do with the machine or the specific file format but with some change in Martel. I mulled about it enough that this morning, when I was trying to sleep in, I ended up instead thinking about what could be the cause. Luckily, I remembered what I was thinking about when I finally did wake up :) Going back to the topic, it actually points out a problem in Martel in that it isn't a true regular expression engine. Once a full match occurs it doesn't consider other alternatives. Consider something which parses some of the feature keys for GenBank. It may have something like ... |prim_transcript|primer|primer_bind|protomer| ... Suppose you have the key "primer_bind". In that case, "primer" will match (because that's the start of the word). So next it tries to match the spaces after the key and that fails, because '_bind' isn't a space. A real regular expression engine would backtrack, throw the 'primer' match away and try again. Martel doesn't do that. Once it does a match for a given grouping, it stays matched. Higher level matches may discard submatches which is why Martel appears to do backtracking. The workaround for this '|' problem is to put the larger patterns first, so place "primer_bind" before "primer".
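To make the '|' ordering issue concrete, here is a toy model of the behaviour described above. It is not Martel or mxTextTools code; the function `match_key_then_space` is a hypothetical stand-in that commits to the first alternative matching at the start of the text and then requires a space, with no backtracking into the alternation. Python's own `re` module is used alongside it as the example of an engine that does backtrack.

```python
import re


def match_key_then_space(alternatives, text):
    """Toy committed-choice matcher: first matching alternative wins.

    Returns the matched key if it is followed by a space, else None.
    Once an alternative matches, the later ones are never tried.
    """
    for alt in alternatives:
        if text.startswith(alt):
            rest = text[len(alt):]
            return alt if rest.startswith(" ") else None
    return None


line = "primer_bind 1..10"

# Shorter key listed first: "primer" matches, "_bind" is not a space, so it fails.
short_first = match_key_then_space(["primer", "primer_bind"], line)

# Workaround from the message: list the longer pattern first.
long_first = match_key_then_space(["primer_bind", "primer"], line)

# A backtracking engine recovers on its own, whatever the order.
backtracking = re.match(r"(primer|primer_bind) ", line)
```

Here `short_first` is None, `long_first` is "primer_bind", and the `re` match succeeds with "primer_bind" in group 1 even though "primer" is listed first.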
Similarly, there is a workaround to the "*" problem by putting the subpattern in an explicit group rather than my solution of always putting in an implicit '>ignore' group. As another example of the problem, suppose you want to match "\s+\n". Martel will fail because \s+ consumes the final "\n" so there is no additional text to match the \n after the \s+. Again, a standard regexp engine will throw away the \s match against \n and try again. Martel does not. The workaround is to do something like " *\n". I don't really like these workaround solutions because they require people to be more aware of the differences between Martel's regular expressions and the standard ones. I haven't been clever enough to figure out a good solution using mxTextTools. On the other hand, I haven't been greatly concerned with it because: - Martel's behaviour is a subset of standard regular expressions, so if Martel matches so will the standard one; - I figure someone cleverer than I may contribute a good solution; - At some point the whole evaluation engine may be replaced by a C extension which can be made portable to Perl, Tcl, Java, etc; - I've still been prototyping to see what's useful and what isn't; - and of course, I know how to do the workarounds :) I've been reading a bit of the regular expression and pattern matching literature. There are a lot of terms to describe the types of regular expression languages. For example, SGML DTDs are "1-unambiguous" because it only needs to look ahead a single tag to determine the next step in the DTD. There's also "deterministic" regular expressions. I've decided I really need to talk to someone who knows the field... Andrew From dalke at acm.org Sat Dec 16 21:40:30 2000 From: dalke at acm.org (Andrew Dalke) Date: Sat Mar 5 14:42:55 2005 Subject: [Biopython-dev] questions for next release Message-ID: <01a801c067d2$b9b7d6c0$afab323f@josiah> Jeff: >- Andrew, is Martel under source code control? Yes, CVS.
> Do you want to develop it as part of the biopython CVS (and > release), or do you want it to be a dependency that's bundled > and installed together? I would rather it be the latter. I expect people will want to use Martel independent of the other biopython code. I can move the development repository to biopython.org. My concerns with that are two-fold. First, I haven't figured out how to connect to my ISP from under Linux, so I don't have a direct connection to the rest of the world. That makes it hard to talk to CVS. Second, supposing an update to a newer distribution fixes my problem, I do most of my work on my laptop which isn't always connected to the rest of the world. I habitually CVS commit a lot more frequently than I connect and I worry about how that will affect my development habits. >- Are we going to use/bundle mxTextTools 1.2? Martel should work fine with 1.1.1 or 1.2, so the first is not a concern. I'll ask Marc-Andre about his release plans. >- setup.py now accepts earlier versions (<0.8?) of distutils. > Should we require the version that comes with Python 2.0? Yes. We have other dependencies now on 2.0 than just setup.py >- Any objections to moving more code into __init__.py? For > example, the code in Prosite/Prosite.py would be moved to > Prosite/__init__.py. This would definitely BREAK CODE, but > the fix would be trivial. I have no problems, but I think I'm the one who introduced using __init__.py to biopython so I'm not the best of sources. Brad correctly pointed out that some people don't know about that use so may get somewhat confused about it. As I recall, others here and elsewhere have had that problem so it shouldn't be ignored. On the other hand, I have had problems with another library which had a module of the form "X.X" (like Prosite.Prosite). In that case I needed to get elements from X and from X.X. 
That has to be done with import X.X import X a = X.X.a b = X.b The "import X.X" is needed to load X.X then the "import X" is needed to bring the top-level module into the local namespace. What this means is if you have Prosite/Prosite.py then do not put anything into Prosite/__init__.py and vice versa. > If this does happen, does anyone know how to move code > between files without losing the CVS logs of the changes? I don't know. Also, is there any way to import the CVS logs of Martel? >- Should we check in Brad's new GenBank code? I think Brad and I still need to do a bit more work on the parser definition. Neither his original code nor my modified version pass the "fully parses an NCBI file" although it's getting pretty close. A related question, and one which was raised earlier, is, where should the format definitions be located in biopython? There are also database specific builders (which convert the format definitions to a database specific data structure) and generic builders (eg, which make a generic data structure but possibly discarding some data). >- Andrew, I've submitted a bug report (more of a feature request) >in Jitterbug about making the regression tests indifferent to >EOL conventions. This would be nice if people are developing >and testing on different platforms, which breaks the tests. >Could you look at it and let me know what you think? Umm, I don't see it. There are none assigned to me ... Oh! with br_regrtest. Sorry, I thought you were talking about Martel. Some of it's regression tests are also newline specific. Okay, it shouldn't be too hard. Could either see what changes are in the 2.0 distribution or replace the line reader with something which understands the different styles. Jeff in a reply to Brad: > Are you also including Andrew's location parser? Remember, that parser hasn't been seriously tested. Also, including it requires inclusion of SPARK. That's not hard because it's a single, pure-python file. 
I think it should be included because of its general usefulness and because it isn't a real distribution in its own right. Andrew dalke@acm.org From jchang at SMI.Stanford.EDU Mon Dec 18 14:51:10 2000 From: jchang at SMI.Stanford.EDU (Jeffrey Chang) Date: Sat Mar 5 14:42:55 2005 Subject: [Biopython-dev] questions for next release In-Reply-To: <01a801c067d2$b9b7d6c0$afab323f@josiah> Message-ID: On Sat, 16 Dec 2000, Andrew Dalke wrote: > Jeff: > > Do you want to develop it as part of the biopython CVS (and > > release), or do you want it to be a dependency that's bundled > > and installed together? > > I would rather it be the latter. I expect people will want > to use Martel independent of the other biopython code. Sounds reasonable. One thing, though: Martel currently ships with a few formats for databases. I do want the ones used for biopython to be CVS'd in the biopython repository, so that developers with read/write access to biopython can work on the formats. I don't think biopython should depend on format definitions in Martel. > I can move the development repository to biopython.org. My concerns > with that are two-fold. First, I haven't figured out how to connect > to my ISP from under Linux, so I don't have a direct connection to the > rest of the world. That makes it hard to talk to CVS. Either way doesn't make a difference to me. > >- Are we going to use/bundle mxTextTools 1.2? > > Martel should work fine with 1.1.1 or 1.2, so the first is > not a concern. I'll ask Marc-Andre about his release plans. Thanks! > >- setup.py now accepts earlier versions (<0.8?) of distutils. > > Should we require the version that comes with Python 2.0? > > Yes. We have other dependencies now on 2.0 than just setup.py OK. > >- Any objections to moving more code into __init__.py? For > > example, the code in Prosite/Prosite.py would be moved to > > Prosite/__init__.py. This would definitely BREAK CODE, but > > the fix would be trivial. 
> > I have no problems, but I think I'm the one who introduced > using __init__.py to biopython so I'm not the best of sources. > > Brad correctly pointed out that some people don't know about > that use so may get somewhat confused about it. As I recall, > others here and elsewhere have had that problem so it shouldn't > be ignored. Yeah, definitely. I still overlook __init__.py when looking for code. What can be done about this? Documentation? > On the other hand, I have had problems with another library > which had a module of the form "X.X" (like Prosite.Prosite). Oh, I see what you're getting at. That's definitely bad. > What this means is if you have Prosite/Prosite.py then do not > put anything into Prosite/__init__.py and vice versa. Yep. I'll interpret that as evidence to move stuff into __init__.py. :) > > If this does happen, does anyone know how to move code > > between files without losing the CVS logs of the changes? > > I don't know. Also, is there any way to import the CVS logs > of Martel? I suspect both solutions will require some surgery on the CVS repository and RCS files. > >- Should we check in Brad's new GenBank code? > > I think Brad and I still need to do a bit more work on the > parser definition. Neither his original code nor my modified > version pass the "fully parses an NCBI file" although it's > getting pretty close. Is this a "no" vote, then? Rebuttals? > A related question, and one which was raised earlier, is, > where should the format definitions be located in biopython? > > There are also database specific builders (which convert the > format definitions to a database specific data structure) and > generic builders (eg, which make a generic data structure > but possibly discarding some data). There's two places they can go. First, you can put each one in the package in which it belongs. That means, the fasta format would go in Bio/Fasta, genbank in Bio/GenBank, swissprot in Bio/SwissProt, etc. 
This would be consistent with the current design, and it would be clear where to look for the format. Second, we can have a formats package (could be called something else), where we put all the Martel stuff. This would make it easier to check to see what formats exist, which could be helpful for SeqIO-type functionality. All you'd have to do is scan the directory and suck up all the formats in there. The other way, we'd have to specify them manually. Any votes? Comments? > >- Andrew, I've submitted a bug report (more of a feature request) > >in Jitterbug about making the regression tests indifferent to > >EOL conventions. This would be nice if people are developing > >and testing on different platforms, which breaks the tests. > >Could you look at it and let me know what you think? > > Umm, I don't see it. There are none assigned to me ... Oh! > with br_regrtest. Sorry, I thought you were talking about > Martel. Some of it's regression tests are also newline > specific. Okay, it shouldn't be too hard. Could either see > what changes are in the 2.0 distribution or replace the line > reader with something which understands the different styles. Great! This will be nice, because some of the regression tests are breaking on differing newline conventions. Different styles can occur within the same file. > Jeff in a reply to Brad: > > Are you also including Andrew's location parser? > > Remember, that parser hasn't been seriously tested. Also, > including it requires inclusion of SPARK. That's not hard > because it's a single, pure-python file. I think it should > be included because of its general usefulness and because > it isn't a real distribution in its own right. Good, we'll include it, as well as SPARK, then. Judging from the recent traffic on the bioperl list, this is a feature we should have. 
Jeff From jchang at SMI.Stanford.EDU Mon Dec 18 14:58:21 2000 From: jchang at SMI.Stanford.EDU (Jeffrey Chang) Date: Sat Mar 5 14:42:55 2005 Subject: [Biopython-dev] external dependencies... Message-ID: How are we going to handle them? Right now, we have: Numeric Martel mxTextTools (with Martel) SPARK - Should we auto-detect whether they have it already installed? - How do we handle version differences? - Should we bundle these in the distribution or download them as needed? - How much help should we provide the user in installing them? Completely automatic installation, or just gentle error messages and URL's? - Should we maintain copies of these at biopython.org? - Is there something going on in catalog-sig or somewhere else that can help us right now? Jeff From dalke at acm.org Mon Dec 18 15:43:28 2000 From: dalke at acm.org (Andrew Dalke) Date: Sat Mar 5 14:42:55 2005 Subject: [Biopython-dev] questions for next release Message-ID: <00ab01c06935$b5799080$b0ac323f@josiah> Me: >> I think Brad and I still need to do a bit more work on the >> parser definition. Neither his original code nor my modified >> version pass the "fully parses an NCBI file" although it's >> getting pretty close. Jeff: >Is this a "no" vote, then? Rebuttals? Upon consideration, no, it is not a no vote. Go ahead and include it, but with the proviso that it is still in flux. Ditto for the location parser. Andrew From jchang at SMI.Stanford.EDU Mon Dec 18 17:38:06 2000 From: jchang at SMI.Stanford.EDU (Jeffrey Chang) Date: Sat Mar 5 14:42:55 2005 Subject: [Biopython-dev] questions for next release In-Reply-To: <00ab01c06935$b5799080$b0ac323f@josiah> Message-ID: Alright. We'll plan on including it in the next release, which is pre-alpha. Then, we'll pound on it, and other things, for 1.0. Jeff On Mon, 18 Dec 2000, Andrew Dalke wrote: > Me: > >> I think Brad and I still need to do a bit more work on the > >> parser definition. 
Neither his original code nor my modified > >> version pass the "fully parses an NCBI file" although it's > >> getting pretty close. > > Jeff: > >Is this a "no" vote, then? Rebuttals? > > Upon consideration, no, it is not a no vote. Go ahead and > include it, but with the proviso that it is still in flux. > > Ditto for the location parser. > > Andrew > > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev@biopython.org > http://biopython.org/mailman/listinfo/biopython-dev > From dalke at acm.org Tue Dec 19 13:17:41 2000 From: dalke at acm.org (Andrew Dalke) Date: Sat Mar 5 14:42:55 2005 Subject: [Biopython-dev] questions for next release Message-ID: <000e01c069e7$fcdede00$28ac323f@josiah> Jeff >> >- Are we going to use/bundle mxTextTools 1.2? Me: >> Martel should work fine with 1.1.1 or 1.2, so the first is >> not a concern. I'll ask Marc-Andre about his release plans. M-A Lemburg: > I wouldn't mind if you intergrate mxTextTools in your distro. > My plans are to release all mx extensions using distutils > and a new packaging strategy sometime in January next year. > I will distutil the mx packages in three or more distributions > (base, crypto, commercial) to enable dependencies between > the packages. mxTextTools will be in the base version which > will be open source as before only with a more Python 2.0 > like license. Andrew dalke@acm.org From chapmanb at arches.uga.edu Wed Dec 20 10:11:49 2000 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:42:55 2005 Subject: [Biopython-dev] Second go at GenBank parser Message-ID: <14912.52277.659316.598153@taxus.athen1.ga.home.com> Hello all; I've got together a second tarball of the GenBank parser that we've been working on. You can grab it from: http://www.bioinformatics.org/bradstuff/bp/gb_parser-20001222.tar.gz I think this is a huge improvement from the first, mostly due to the many many helpful comments from everyone here. 
I really appreciated everyone's comments and interest, and I think that we've fixed/worked on all of the points that people raised. I'll try to respond to some specific mails later today. Sorry to not be able to respond to everything in a timely manner. I guess if I only have time to write or code, it is better to be coding :-). Anyways, the new version has the following new and oh-so-incredibly-exciting features: o Much better Martel syntax for parsing things. This is almost entirely due to Andrew -- who sent me lots of nice comments and good tips, and even wrote up his own syntax which I could borrow from. Tons of the new syntax is taken from Andrew's stuff, so he deserves a huge pat on the back for this :-). o Tested on a bunch of different downloads from the ncbi genbank directory, so the syntax is much more "battle tested" than the last and handles lots more cases, including the dreaded "fake /" cases (found some more hideous ones like that in a bacterial dataset). GenBank, wow, what a headache! o I integrated Andrew's SPARK based location parser, and now use it to parse the locations. spark.py is included in the tarball, but we need to still figure out how we want to do it in Biopython (once the GenBank parser is up to snuff). Another big thanks to Andrew for providing the location parser! I integrated this first before doing all the testing, so it has been through a workout over here. I found one case it didn't handle (when you have a "between" location by itself without parentheses, like '6.27') and made the small fix for this. Otherwise it performed great! o Coded up a Record class for GenBank record and added a parser and consumer that parse GenBank data into it. o Miscellaneous bug fixes that popped up (hopefully I squashed more than I introduced :-). o Better testing -- again thanks to Andrew. Have I mentioned yet that he is my personal hero? If people have time to download and test this and give me their feedback I would really appreciate it.
I only want to get it into Biopython if people feel it is up to par (don't want to bring down the good name of Biopython :-). I'm especially interested in feedback on the following points: o I would really like to hear about anything that causes errors in any of the parsers (or my code!). o Naming of modules -- right now my naming sucks (the "supplementary" feature classes, like Location.py and Reference.py are in a module called 'FeatureInfo', for instance. yeck.), so if people have good ideas for how to name things I'll definitely take 'em. I'm also not sure where a good place for spark.py to live in Biopython is (BTW, I think we should include it :-). Finally, I noticed Jeff put his snazzy code in GenBank/__init__.py -- Should my GenBank.py go into __init__.py? Should it be named something else? o Data transfer -- is everything being transferred okay? Am I messing anything up/losing data? People hand checking different records for me would be very very helpful. o HTML -- Cayte expressed concerns about parsing GenBank files with a bunch o' HTML stuck in them. In my opinion it isn't really worth worrying about this because it is so easy to get the text flat files -- do lots of people think I should work on html support, or do they agree with me? Thanks again for everyone's feedback on the first version! Brad From jchang at SMI.Stanford.EDU Wed Dec 20 19:23:21 2000 From: jchang at SMI.Stanford.EDU (Jeffrey Chang) Date: Sat Mar 5 14:42:55 2005 Subject: [Biopython-dev] Second go at GenBank parser In-Reply-To: <14912.52277.659316.598153@taxus.athen1.ga.home.com> Message-ID: Hi Brad, This is great! You've filled two gaping holes in biopython functionality. Please check these in, as I'm sure people will want to start using the code.
> o Tested on a bunch of different downloads from the ncbi genbank > directory, so the syntax is much more "battle tested" then the last > and handles lots more cases, including the dreaded "fake /" cases > (found some more hideous ones like that in a bacterial > dataset). GenBank, wow, what a headache! Good. GenBank is notoriously hard to deal with, and I suspect work on the format will be ongoing. > o I integrated Andrew's SPARK based location parser, and now use it to > parse the locations. spark.py is included in the tarball, but we need > to still figure out how we want to do it in Biopython Yep, definitely a good thing. Using SPARK is the right way to go. > o Coded up a Record class for GenBank record and added a parser and > consumer that parse GenBank data into it. Thanks! > I only want to get it into Biopython if people feel it is up to par > (don't want to bring down the good name of Biopython :-). Heh. From what I gather, it's runnable. Let's get this out the door so people can start using it, and hopefully give good comments and (even better) patches. > o Naming of modules -- right now my naming sucks (the "supplimentary" > feature classes, like Location.py and Reference.py are in a module > called 'FeatureInfo', for instance. yeck.), so if people have good > ideas for how to name things I'll definately take 'em. Are these meant to be used with SeqFeatures? If so, how about just SeqFeature.Location and SeqFeature.Reference? > I'm also not sure where a good place for spark.py to live in Biopython > is (BTW, I think we should include it :-). Where you have it now seems as good a place as any (without the PGML). Including it is fine with me. > Finally, I noticed Jeff put his snazzy code in GenBank/__init__.py -- > Should my GenBank.py go into __init__.py? Yes. GenBank is a good name for it, and as per Andrew's earlier email, we should avoid having code in both GenBank/__init__.py and GenBank/GenBank.py. 
> o HTML -- Cayte expressed concerns about parsing GenBank files with a > bunch o' HTML stuck in them. In my opinion it isn't really worth > worrying about this because it is so easy to get the text flat files > -- do lots of people think I should work on html support, or do they > agree with me? Are the HTML-formatted files different? Does it work if you just strip the HTML tags? I guess for HTML-formatted data from GenBank, it would be nice to handle, but very low priority. HTML-formatted data from other sources, no. If someone needs that functionality, they can submit the patches! :) Jeff From katel at worldpath.net Thu Dec 21 04:50:48 2000 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:42:56 2005 Subject: [Biopython-dev] Second go at GenBank parser References: <14912.52277.659316.598153@taxus.athen1.ga.home.com> Message-ID: <001901c06b33$800b4420$010a0a0a@cadence.com> I ran several files, tonight, on my win98 machine. There is still a pesky newline problem that shows up only in the feature section. If the feature contains a translation, just the first line appears, followed by backslash - 0-1-2. The translation feature is the only multiline subfeature I've seen so far. I can attach the files if you wish. I haven't seen the newlines in the other sections, this time. Apparently, they've been removed. The lines of output are long, but this is not a problem, because the user can break the lines up easily in his script. Cayte