From dalke at acm.org  Fri Dec  1 05:06:00 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] Martel-0.4 available
Message-ID: <000901c05b7e$505616c0$efab323f@josiah>

I've placed a copy of the Martel-0.4 distribution at
  http://www.biopython.org/~dalke/Martel/Martel-0.4.tar.gz

The only real change between this and version 0.35 is the
support for different newline conventions.

  New regexp syntax - \R
     \R    means "\n|\r\n?"
     [\R]  means "[\n\r]"

  New Expression Node - AnyEOL
     implements the \R test

  RecordReaders rewritten to use mxTextTools to find record
begin and end characters rather than using readline/readlines.

  RecordReaders' __init__ and .remainder() pass around a
lookahead buffer as a string rather than a list of lines.
Parser.py appropriately modified.
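
A rough illustration of what the two new forms stand for. This sketch uses
Python's standard re module rather than Martel itself, so the names ANY_EOL
and EOL_CHARS are assumptions made up for the example, not Martel's API:

```python
import re

# \R outside a character class: one complete line ending of any convention.
ANY_EOL = re.compile(r"\n|\r\n?")
# [\R] inside a character class: any single end-of-line character.
EOL_CHARS = re.compile(r"[\n\r]")

text = "unix\nwindows\r\nold mac\rlast"

# Splitting on ANY_EOL treats "\r\n" as one line ending, not two.
lines = ANY_EOL.split(text)
print(lines)  # ['unix', 'windows', 'old mac', 'last']

# The character-class form sees "\r\n" as two separate characters.
print(len(ANY_EOL.findall(text)), len(EOL_CHARS.findall(text)))  # 3 4
```

The difference matters when counting or skipping lines: the alternation
consumes a whole line ending, while the character class matches one
character at a time.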

There's also a very complete regression suite for the new readers
because the new code is prone to more subtle errors.  (Hand-written
state tables combined with hand-written/emulated continuations
don't make for easy-to-write code.)

None of the format definitions have yet been modified to use the
new \R syntax.  This is meant more as trial code to see if
my solution to the newline problem is appropriate.  I think it
is but would like feedback.

Finally, I believe the API is now pretty solid for:
  - the regexp syntax 
  - how to make a parser
  - how to make an iterator (would like a bit more feedback)
  - RecordReader protocol (would like feedback)

This means I think we can start trying Martel for real work,
as the changes will only be in the specific formats and without
a global impact.

Assuming there are no bugs :)

BTW, I really hate being sick.  Haven't been able to do much
of anything requiring sustained thought for the last 3 days :(
Luckily, I had mostly finished this up over the Thanksgiving
weekend so it didn't require much more work.

                    Andrew
                    dalke@acm.org


From jchang at SMI.Stanford.EDU  Fri Dec  1 19:44:01 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] next release closer (?)
In-Reply-To: <14886.47131.653099.144288@taxus.athen1.ga.home.com>
Message-ID: <Pine.GSO.4.21.0012011630380.7940-100000@riboweb.Stanford.EDU>

On Thu, 30 Nov 2000, Brad Chapman wrote:

> I would really like to have bugs sent to the dev list when they come
> in -- I just noticed a couple from Iddo that I should have dealt with
> (I think that is all fixed now, regardless), but didn't realize were
> there. Whadda you all think about this?


I thought that Jitterbug was configured to do this automatically, but now
that I think about it, I submitted a report that wasn't forwarded to
biopython-dev!  Does anyone know how to administer Jitterbug to do this?

Anyways, we do have a few bugs in the database now.  Could developers
please check the database and try and knock a few of these off?

Jeff


From thomas at cbs.dtu.dk  Sat Dec  2 09:46:47 2000
From: thomas at cbs.dtu.dk (thomas@cbs.dtu.dk)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] xbbtools + CVS
Message-ID: <14889.2903.802767.427950@bb1.home>

Hej,

It has actually been a long time since I last worked on xbbtools - I don't
remember how to log in to the cvs server, or has anything changed (e.g. my passwd) ?

I tried:
cvs -d pserver:thomas@cvs.biopython.org:/home/repository/biopython checkout biopython
cvs -d ext:thomas@cvs.biopython.org:/home/repository/biopython checkout biopython

and receive only:
Permission denied.
cvs [checkout aborted]: end of file from server (consult above messages if any)

Suggestions ?

(Is anybody working on modules for reading and writing different sequence
formats ?)

c ya
-thomas

Sicheritz Ponten Thomas E.  CBS, Department of Biotechnology
thomas@biopython.org        The Technical University of Denmark
CBS:  +45 45 252485         Building 208, DK-2800 Lyngby
Fax   +45 45 931585         http://www.cbs.dtu.dk/thomas/index.html

        De Chelonian Mobile ... The Turtle Moves ...

From chapmanb at arches.uga.edu  Sat Dec  2 09:59:49 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] xbbtools + CVS
In-Reply-To: <14889.2903.802767.427950@bb1.home>
References: <14889.2903.802767.427950@bb1.home>
Message-ID: <14889.3685.157030.758475@taxus.athen1.ga.home.com>

Hi Thomas;
Nice to hear from you!

[CVS]
> I tried:
> cvs -d pserver:thomas@cvs.biopython.org:/home/repository/biopython 
> checkout biopython
> cvs -d ext:thomas@cvs.biopython.org:/home/repository/biopython 
> checkout biopython

Hmmm, I think it should be:

cvs -d :ext:thomas@biopython.org:/home/repository/biopython co biopython

(a colon before the ext + biopython.org not cvs.biopython.org)

At least, that is equivalent to what I use. When using ext, make sure
you have the environment variable CVS_RSH set to 'ssh' (or something
different if your ssh executable is different).

> and receive only:
> Permission denied.
> cvs [checkout aborted]: end of file from server (consult above messages if
> any)

Well, that's not a very useful error message from cvs :-).

> Suggestions ?

Let me know if this works (or at least gets a more helpful message
from CVS).

> (Is anybody working on modules for reading and writing different sequence
> formats ?)

Well, my project for the weekend is starting on a GenBank parser using 
the new stuff from Martel. I've gotten started and hope (maybe) to
have something that people could test a little bit once the weekend 
is finished. What kind of formats are you looking to parse?

Brad


From thomas at genome.cbs.dtu.dk  Sat Dec  2 10:20:47 2000
From: thomas at genome.cbs.dtu.dk (Thomas Sicheritz-Ponten)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] xbbtools + CVS
In-Reply-To: <14889.3685.157030.758475@taxus.athen1.ga.home.com>
Message-ID: <Pine.SGI.3.95.1001202161338.3483367A-100000@genome.cbs.dtu.dk>

Hej Brad,

> Hmmm, I think it should be:

> cvs -d :ext:thomas@biopython.org:/home/repository/biopython co

thx - that worked !

> 
> > (Is anybody working on modules for reading and writing different sequence
> > formats ?)
> 
> Well, my project for the weekend is starting on a GenBank parser using 
> the new stuff from Martel. I've gotten started and hope (maybe) to
> have something that people could test a little bit once the weekend 
> is finished. What kind of formats are you looking to parse?

All of them :-) - I need it for my graphical sequence editor for reading and 
writing the edited sequence in different formats.
(In biowish I included part of the readseq code as a shared c-library)

thx
-thomas

Sicheritz Ponten Thomas E.  CBS, Department of Biotechnology
thomas@biopython.org        The Technical University of Denmark
CBS:  +45 45 252489         Building 208, DK-2800 Lyngby
Fax   +45 45 931585         http://www.cbs.dtu.dk/thomas

	De Chelonian Mobile ... The Turtle Moves ...


From chapmanb at arches.uga.edu  Mon Dec  4 22:32:59 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] GenBank parser -- first go
Message-ID: <14892.25067.641885.658966@taxus.athen1.ga.home.com>

Hello all;
As promised, I spent this weekend getting together a GenBank parser,
which I hope is something that we could include in Biopython in the
future. What I've got so far is available from:

http://www.bioinformatics.org/bradstuff/bp/gb_parser-20001204.tar.gz

It has a nice distutils setup script and everything will install into
Bio.PGML directory (PGML = Plant Genome Mapping Lab -> that's my
little subdirectory to keep things I work on separate from Biopython).
The parser uses Martel-0.4, so you'll need to have that installed to
use this. Making this would definitely not have been possible without
all of the cool things in Martel, so we all definitely have to give
Andrew another big pat on the back for his awesome tool :-).

It is, I hope, a full featured GenBank parser that parses things into
SeqFeature classes. I'm hoping that these SeqFeature classes (or
something derived from them) will be something we can include in
Biopython as well. It would be really nice to have some "standard"
objects for features, to help us be more compatible with the Biocorba
and  BioXML projects. Anyways, the parser and seq features have 
the following exciting features:

* fully parses out Feature tables. This includes support for sub
Features (ie. the exons of a CDS object).

* deals with 'the dreaded fuzziness' in locations. There should be
support for all of the types of fuzziness, but I've tried to not make
it much more difficult to access locations if you don't care about 
fuzziness at all.

* parses into SeqRecord objects with Seq objects that are hopefully
AlphabetStrict in the proper manner.

I didn't write any docs on using these yet (I've got to get to work on 
things for lab and school now :-), but the parsers work like other
Biopython parsers like Blast (ie. with Iterators and Parsers). There
are also a couple of example scripts to get things going.

I'm really looking for feedback in the following areas:

1. Does this code look decent? Anyone besides me want to see this in
Biopython? 

2. Does this parser parse your favorite GenBank files? I've tested it
on a few things, but they are mostly plant sequences, since that's
what I've got around here. There is a script included in the tarball
"find_parser_problems.py", which will, if you run it on a GenBank
file, tell you what accession numbers, if any, cause parser
problems. If you could send me lists of accession numbers that break
it, it would really help to make sure it works in more cases.

3. Does the output you get have the same info as the initial GenBank
file (ie -- are there any ugly bugs)? I have another script included,
"check_output.py," which will spit out the parsed information to make
it possible to compare it with the initial GenBank file and see if I
screwed anything up. I've hand checked a couple of files, but it would 
really help to have other people debugging this as well.

4. What do people think about the SeqFeature classes? Like 'em? Hate
'em? Suggestions for improvement?

5. Can the code be speeded up/improved in any ways? Suggestions to
help me code better are always very welcome!

Thanks for listening and enjoy!

Brad


From jonathan.gilligan at vanderbilt.edu  Tue Dec  5 00:04:23 2000
From: jonathan.gilligan at vanderbilt.edu (Jonathan M. Gilligan)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] anon cvs access?
Message-ID: <5.0.1.4.0.20001204230020.022a9cd0@g.mail.vanderbilt.edu>

I cannot get anonymous cvs access to check out the biopython sources. 
Here's a transcript.

 >cvs -d :pserver:cvs@cvs.biopython.org:/home/repository/biopython login
(Logging in to cvs@cvs.biopython.org)
CVS password: ***
Fatal error, aborting.
cvs: no such user
cvs login: authorization failed: server cvs.biopython.org rejected access
 >

(with "cvs" for the password, as indicated at http://cvs.biopython.org/). 
Can anyone help me out here?

Thanks,
Jonathan
===========================================================================
Jonathan M. Gilligan                     <jonathan.gilligan@vanderbilt.edu>


From jchang at SMI.Stanford.EDU  Tue Dec  5 02:36:50 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] GenBank parser -- first go
In-Reply-To: <14892.25067.641885.658966@taxus.athen1.ga.home.com>
Message-ID: <Pine.GSO.4.21.0012042311190.16832-100000@riboweb.Stanford.EDU>

Hi Brad,

On Mon, 4 Dec 2000, Brad Chapman wrote:

> Hello all;
> As promised, I spent this weekend getting together a GenBank parser,
> which I hope is something that we could include in Biopython in the
> future. What I've got so far is available from:

Great!  We need a Genbank parser.

> It is, I hope, a full featured GenBank parser that parses things into
> SeqFeature classes. I'm hoping that these SeqFeature classes (or
> something derived from them) will be something we can include in
> Biopython as well. It would be really nice to have some "standard"
> objects for features, to help us be more compatible with the Biocorba
> and  BioXML projects.

Yes, I definitely agree with needing a general class.  However, I've been
purposefully shying away from proposing a general framework for
annotations for two main reasons.  First, it's a hard, unsolved problem
that we don't know how to do yet.  If you look at the models for biojava,
bioperl, and game, you'll see that there are 3 different partially
compatible solutions.  I suspect how you handle annotations is going to
depend on the purpose of the applications.  (Though I suppose "to store
genbank annotations" is a reasonable purpose).  The second reason is that
I like the idea of specific data structures for each database.  That way,
people that really care about, say, swissprot, will know how to retrieve
the data from their favorite field without having to muck around to see
how it's getting coerced into a one-size-fits-all framework.  If you can
only parse into a general data structure, then, since I don't believe a
single data structure can hold all the types of information from every
data base, you're bound to lose data.  I don't believe there's any general
data structure in existence that can handle the genbank location
field.  It's described by a BNF grammar and requires a tree!


> Anyways, the parser and seq features have the following exciting
> features:
> 
> * fully parses out Feature tables. This includes support for sub
> Features (ie. the exons of a CDS object).
> 
> * deals with 'the dreaded fuzziness' in locations. There should be
> support for all of the types of fuzziness, but I've tried to not make
> it much more difficult to access locations if you don't care about 
> fuzziness at all.

Do we need to deal with genbank functions like complement or order?

> * parses into SeqRecord objects with Seq objects that are hopefully
> AlphabetStrict in the proper manner.

I'm not sure that's a good thing for GenBank.  Does GenBank store the
alphabet for the sequence?  What if the sequence doesn't strictly follow
the alphabet?

> I didn't write any docs on using these yet (I've got to get to work on 
> things for lab and school now :-), but the parsers work like other
> Biopython parsers like Blast (ie. with Iterators and Parsers). There
> are also a couple of example scripts to get things going.
> 
> I'm really looking for feedback in the following areas:
> 
> 1. Does this code look decent? Anyone besides me want to see this in
> Biopython? 

- There's a TaggingConsumer in Bio.ParserSupport.  It looks like this does
something similar to _PrintConsumer.  It's supposed to be used for
debugging purposes so that you know what's getting passed when.  If it's
not appropriate, please let me know how to extend it so that it's more
generally useful.


> 2. Does this parser parse your favorite GenBank files? I've tested it
> on a few things, but they are mostly plant sequences, since that's
> what I've got around here. There is a script included in the tarball
> "find_parser_problems.py", which will, if you run it on a GenBank
> file, tell you what accession numbers, if any, cause parser
> problems. If you could send me lists of accession numbers that break
> it, it would really help to make sure it works in more cases.
> 
> 3. Does the output you get have the same info as the initial GenBank
> file (ie -- are there any ugly bugs)? I have another script included,
> "check_output.py," which will spit out the parsed information to make
> it possible to compare it with the initial GenBank file and see if I
> screwed anything up. I've hand checked a couple of files, but it would 
> really help to have other people debugging this as well.
> 
> 4. What do people think about the SeqFeature classes? Like 'em? Hate
> 'em? Suggestions for improvement?

Could you put Bio/SeqFeature/SeqFeature.py code into Bio/SeqFeature.py?  
It would prevent stuff like:
from Bio.SeqFeature import SeqFeature
or even worse,
from Bio.SeqFeature.SeqFeature import SeqFeature

> 5. Can the code be speeded up/improved in any ways? Suggestions to
> help me code better are always very welcome!

Thanks for doing this!

Jeff


From thomas at cbs.dtu.dk  Tue Dec  5 02:53:58 2000
From: thomas at cbs.dtu.dk (thomas@cbs.dtu.dk)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] Re: plans for next release
Message-ID: <14892.40726.356431.482782@bb1.home>

> thomas wrote
> > Ok - I just came back from egypt. Of course there is no need at all for
> > using posix.posixpath - thats still left from my novice days :-)
> > I fix that and try to remove all Pmw widgets (its easy to implement the
> > scrolled things in pure Tk)
> >

> Cayte wrote
> 
>   Egypt sounds like a fascinating place to visit!  Late fall sounds like the
> right time, too, with cooler weather.

and nice snorkeling too :-)
> 
>   Have you ever considered wxPython?  I have a tool, SeqGui.py, in wxPython,
> that's sort of like xbbtools.py  I find it easier to work with than Tkinter.
> Back in May, we had a thread about Gui support. Its in the archives.

#################
# It seems that my reply didn't make it through sendmail :-( - I try to
# reconstruct ...
#########

The main reason I'm sticking to Tkinter is that I used
Tcl/Tk a lot before I discovered python - I have tons of Tk snippets from
my previous bioinformatic work (Biowish, GRS, XBbtools, CapDB etc.) which
are very easy to convert into shorter, cleaner and more efficient python Tk
code. Maybe the biggest advantage in using Tkinter is the powerful Tk
Canvas; as far as I know neither wxPython nor Gtk python has anything close
to the canvas widget.

> 
>   I'd like the gui to eventually support color highlighting of features, for
> example, regions of high consensus.
> 

I don't know how this works in wxPython, but in Tkinter it has been
there from the beginning. Every line, rectangle etc. you draw in the
canvas is a unique object and gets an id. You can very easily bind any event
(e.g. MouseOver, DoubleClickButton1 etc.) to any function. To highlight
different genes or sequence regions, you just group the corresponding ids and
bind a color change to a MouseOver event.
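
A minimal sketch of that grouping idea, assuming a modern Python where the
module is spelled tkinter (it was Tkinter at the time). The helper name
highlight_on_hover and the tags "geneA"/"geneB" are hypothetical; only
Canvas.tag_bind, Canvas.itemconfigure and Canvas.create_line are real Tk
calls, and Tk's name for a mouse-over event is <Enter>:

```python
def highlight_on_hover(canvas, tag, normal="black", active="red"):
    """Recolour every canvas item carrying `tag` while the mouse is over one."""
    # <Enter>/<Leave> fire when the pointer moves onto/off any tagged item.
    canvas.tag_bind(tag, "<Enter>",
                    lambda event: canvas.itemconfigure(tag, fill=active))
    canvas.tag_bind(tag, "<Leave>",
                    lambda event: canvas.itemconfigure(tag, fill=normal))


def demo():
    """Open a small window; needs a display, so it is not run automatically."""
    import tkinter as tk  # imported here so the helper has no GUI dependency

    root = tk.Tk()
    canvas = tk.Canvas(root, width=300, height=100)
    canvas.pack()
    # Two segments share the tag "geneA", so they highlight together.
    canvas.create_line(20, 50, 120, 50, width=6, fill="black", tags=("geneA",))
    canvas.create_line(140, 50, 200, 50, width=6, fill="black", tags=("geneA",))
    canvas.create_line(220, 50, 280, 50, width=6, fill="black", tags=("geneB",))
    for tag in ("geneA", "geneB"):
        highlight_on_hover(canvas, tag)
    root.mainloop()
```

Calling demo() shows three line segments; moving the mouse over either
"geneA" segment recolours both, because the binding and the recolouring
are done per tag rather than per item id.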

e.g. my recently accepted paper about phylogenomics with python (NAR nr2
2001) deals with the interactive display of all genes, phylogenetic
trees, blast results for a microbial genome (between 1000 and 5000 times
3). 
I have no fancy webpage yet but you can check a screenshot of the
phylome of the bacterium Thermotoga maritima
at http://www.cbs.dtu.dk/thomas/pyphy/pyphy.png
(Phylome = set of all phylogenetic trees for a genome. 
 color coding for the kingdom of the closest neighbor in the phylogenetic
 tree: blue = Bacteria, yellow = Archaea, red = Eukarya)

Here the phylome map is an interactive display of all phylogenetic trees
and genes (colored lines in the circle), where each line/gene is sensitive
to mouse movement. A MouseOver event displays gene information in the top
Entry, Button1Click shows the phylogenetic tree, Button3 shows a gene
specific popupmenu for blastresults, alignments etc.
Each gene can be a member of a metabolic pathway, where selecting a pathway
in the right listbox changes the width and the arrow shape of each gene
associated (canvas tag) with the pathway.

The advantage here is zooming, resizing, moving and event grabbing is part
of the canvas widget so we only need to redraw single objects.

I have never worked with wxPython - what is exactly the strength of
wxWindows ? I guess it is faster than Tkinter, are there any special
features not found in the rest of the GUI family ?


c ya
-thomas

Sicheritz Ponten Thomas E.  CBS, Department of Biotechnology
thomas@biopython.org        The Technical University of Denmark
CBS:  +45 45 252485         Building 208, DK-2800 Lyngby
Fax   +45 45 931585         http://www.cbs.dtu.dk/thomas/index.html

        De Chelonian Mobile ... The Turtle Moves ...

From dag at sonsorol.org  Tue Dec  5 07:46:25 2000
From: dag at sonsorol.org (chris dagdigian)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] anon cvs access?
In-Reply-To: <Pine.GSO.4.21.0012042338190.16937-100000@riboweb.Stanford.EDU>
References: <5.0.1.4.0.20001204230020.022a9cd0@g.mail.vanderbilt.edu>
Message-ID: <5.0.2.1.0.20001205074258.00a8ed40@fedayi.sonsorol.org>

Hey folks,

I happened to break anonymous CVS as a side effect of installing the web 
cvsview CGIs on Sunday afternoon. Sorry about that.

As partial repayment for the inconvenience, the biopython repository is now 
browsable at
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/?cvsroot=biopython

Anon cvs should be working again, I just had to rebuild the CVS passwd and 
readers file. Drop me a line if anyone continues to have any troubles.

Regards,
Chris


At 11:39 PM 12/4/00 -0800, Jeffrey Chang wrote:
>There's been some reports of it failing for the other projects as
>well.  I'm forwarding your email to Chris Dagdigian to see if he knows
>what's going on.
>
>Jeff
>
>
>On Mon, 4 Dec 2000, Jonathan M. Gilligan wrote:
>
> > I cannot get anonymous cvs access to check out the biopython sources.
> > Here's a transcript.
> >
> >  >cvs -d :pserver:cvs@cvs.biopython.org:/home/repository/biopython login
> > (Logging in to cvs@cvs.biopython.org)
> > CVS password: ***
> > Fatal error, aborting.
> > cvs: no such user
> > cvs login: authorization failed: server cvs.biopython.org rejected access
> >  >
> >
> > (with "cvs" for the password, as indicated at http://cvs.biopython.org/).
> > Can anyone help me out here?


From bcohen at cs.sunysb.edu  Tue Dec  5 15:39:53 2000
From: bcohen at cs.sunysb.edu (barry cohen)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] grammar for .ffn files
In-Reply-To: <200012051704.MAA20834@pw600a.bioperl.org>
Message-ID: <Pine.SOL.4.21.0012051538490.28379-100000@sbskiena>

Is there an official document specifying the
syntax for the defline of a .ffn file?

barry cohen


From katel at worldpath.net  Wed Dec  6 01:40:50 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] Re: plans for next release
References: <14892.40726.356431.482782@bb1.home>
Message-ID: <001d01c05f4f$7a51a500$010a0a0a@cadence.com>

> and nice snorkeling too :-)
> >

   Do you have pictures of Egypt to post on the web?

> >   Have you ever considered wxPython?  I have a tool, SeqGui.py, in wxPython,
> > that's sort of like xbbtools.py  I find it easier to work with than Tkinter.
> > Back in May, we had a thread about Gui support. Its in the archives.
>
>
> The main reasons for my sticking to Tkinter are the fact that I have used
> Tcl/Tk a lot before I discovered python - I have tons of Tk snippets from
> my previous bioinformatic work (Biowish, GRS, XBbtools, CapDB etc.) which
> is very easy to convert into shorter, cleaner and more efficient python Tk
> code. Maybe the biggest advantage in using Tkinter is the powerful Tk
> Canvas, as far as I know neither wxPython or Gtk python have anything close
> to the canvas widget.
>

  I think the Windows version is a wrapper around the Windows Gui and that
wxPython attempts to provide equivalent functionality in Linux.
> >
> >   I'd like the gui to eventually support color highlighting of features, for
> > example, regions of high consensus.
> >
>
> I don't know how this works in wxPython, but in Tkinter it is already
> there from the beginning. Every line, rectangle etc. you draw in the
> canvas is an unique object and gets an id. You can very easy bind any event
> (e.g. MouseOver, DoubleClickButton1 etc.) to any function. To highlight
> different genes or sequence regions is just to group the according id's and
> bind a color-change on a MouseOver event.

> e.g. my recently accepted paper about phylogenomics with python (NAR nr2
> 2001) deals with the interactive display of all genes, phylogenetic
> trees, blast results for a microbial genome (between 1000 and 5000 times
> 3).
> I have no fancy webpage yet but you can check a screenshot of the
> phylome of the Bacteria Thermotoga maritima
> at http://www.cbs.dtu.dk/thomas/pyphy/pyphy.png
> (Phylome = set of all phylogenetic trees for a genome.
>  color coding for the kingdom of the closest neighbor in the phylogenetic
>  tree: blue = Bacteria, yellow = Archaea, red = Eukarya)
>
> Here the phylome map is an interactive display of all phylogenetic trees
> and genes (colored lines in the circle), where each line/gene is sensitive
> to mouse movement. A MouseOver event displays gene information in the top
> Entry, Button1Click shows the phylogenetic tree, Button3 shows a gene
> specific popupmenu for blastresults, alignments etc.
> Each gene can be a member of a metabolic pathway, where selecting a pathway
> in the right listbox changes the width and the arrow shape of each gene
> associated (canvas tag) with the pathway.
>
> The advantage here is zooming, resizing, moving and event grabbing is part
> of the canvas widget so we only need to redraw single objects.
>
Does it support colorization with enough flexibility to support research on
the fly, as in this scenario?

USER STORY:
   Ed Enzyme is doing some detective work on an alignment.  First he
highlights the start and stop codons in red and green.  Then Ed zooms in on
an interesting sequence.  He first highlights the hydrophilic regions in
magenta.  Then Ed backtracks and highlights the acidic regions.
>

> I have never worked with wxPython - what is exactly the strength of
> wxWindows ? I guess it is faster than Tkinter, are there any special
> features not found in the rest of the GUI family ?
>
>

  I found it was easier to work with.  With wxPython I could write more code
in the same time and with fewer problems, like panels that don't quite line up.


Cayte


From katel at worldpath.net  Wed Dec  6 03:28:59 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] GenBank parser -- first go
References: <Pine.GSO.4.21.0012042311190.16832-100000@riboweb.Stanford.EDU>
Message-ID: <003901c05f5e$9622c700$010a0a0a@cadence.com>

----- Original Message -----
From: "Jeffrey Chang" <jchang@SMI.Stanford.EDU>
To: "Brad Chapman" <chapmanb@arches.uga.edu>
Cc: <biopython-dev@biopython.org>
Sent: Monday, December 04, 2000 11:36 PM
Subject: Re: [Biopython-dev] GenBank parser -- first go


> Hi Brad,
>
> On Mon, 4 Dec 2000, Brad Chapman wrote:
>
> > Hello all;
> > As promised, I spent this weekend getting together a GenBank parser,
> > which I hope is something that we could include in Biopython in the
> > future. What I've got so far is available from:
>
   Does it strip html tags?  When I ran check_output.py, it produced this
output.


C:\gb_parser-20001204\Scripts>python check_output.py nutmeg.htm
Traceback (most recent call last):
  File "check_output.py", line 25, in ?
    iterator = GenBank.Iterator(handle, parser)
  File "c:\biopyt~1.90d\Bio\PGML\GenBank\GenBank.py", line 57, in __init__
    self._reader = RecordReader.StartsWith(handle, "LOCUS")
  File "c:\biopyt~1.90d\Martel\RecordReader.py", line 130, in __init__
    self.tagtable)
  File "c:\biopyt~1.90d\Martel\RecordReader.py", line 89, in _find_begin_positions
    raise ReaderError("invalid format starting with %s" % repr(text[:50]))
Martel.RecordReader.ReaderError: invalid format starting with '<!DOCTYPE HTML PU

  The problem with conversions to text is that Netscape and Explorer and
probably others use different algorithms and produce different text output.
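
The traceback shows the reader being built with
RecordReader.StartsWith(handle, "LOCUS"), so a saved web page that begins
with HTML markup is rejected before any record is read. A minimal sketch of
a pre-check a script could run first; the function name looks_like_genbank
is hypothetical and is not part of Martel or the parser tarball:

```python
import io


def looks_like_genbank(handle, tag="LOCUS"):
    """Report whether the first non-blank line of `handle` starts with `tag`.

    The handle must be seekable; its position is restored afterwards so the
    real parser can still read from the beginning.
    """
    position = handle.tell()
    try:
        while True:
            line = handle.readline()
            if not line:          # end of file, nothing but blank lines
                return False
            if line.strip():      # first line with real content
                return line.startswith(tag)
    finally:
        handle.seek(position)


html_page = io.StringIO("<!DOCTYPE HTML PUBLIC>\n<html>...</html>\n")
flat_file = io.StringIO("\nLOCUS       EXAMPLE\n")  # made-up record header
print(looks_like_genbank(html_page), looks_like_genbank(flat_file))  # False True
```

This only detects the problem; it does not attempt to strip the tags, which
would run into the browser-dependent text conversion mentioned above.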

                                                                    Cayte


From jchang at SMI.Stanford.EDU  Wed Dec  6 01:08:08 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] anon cvs access?
In-Reply-To: <5.0.2.1.0.20001205074258.00a8ed40@fedayi.sonsorol.org>
Message-ID: <Pine.GSO.4.21.0012052207360.18298-100000@riboweb.Stanford.EDU>

Awesome!  Thanks for doing this, Chris.

I've put links to it from the biopython web pages.

jeff


On Tue, 5 Dec 2000, chris dagdigian wrote:

> 
> Hey folks,
> 
> I happened to break anonymous CVS as a side effect of installing the web 
> cvsview CGIs on Sunday afternoon. Sorry about that.
> 
> As partial repayment for the inconvenience, the biopython repository is now 
> browsable at
> http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/?cvsroot=biopython
> 
> Anon cvs should be working again, I just had to rebuild the CVS passwd and 
> readers file. Drop me a line if anyone continues to have any troubles.
> 
> Regards,
> Chris
> 
> 
> At 11:39 PM 12/4/00 -0800, Jeffrey Chang wrote:
> >There's been some reports of it failing for the other projects as
> >well.  I'm forwarding your email to Chris Dagdigian to see if he knows
> >what's going on.
> >
> >Jeff
> >
> >
> >On Mon, 4 Dec 2000, Jonathan M. Gilligan wrote:
> >
> > > I cannot get anonymous cvs access to check out the biopython sources.
> > > Here's a transcript.
> > >
> > >  >cvs -d :pserver:cvs@cvs.biopython.org:/home/repository/biopython login
> > > (Logging in to cvs@cvs.biopython.org)
> > > CVS password: ***
> > > Fatal error, aborting.
> > > cvs: no such user
> > > cvs login: authorization failed: server cvs.biopython.org rejected access
> > >  >
> > >
> > > (with "cvs" for the password, as indicated at http://cvs.biopython.org/).
> > > Can anyone help me out here?
> 
> 


From chapmanb at arches.uga.edu  Wed Dec  6 02:21:38 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] GenBank parser -- first go
In-Reply-To: <Pine.GSO.4.21.0012042311190.16832-100000@riboweb.Stanford.EDU>
References: <14892.25067.641885.658966@taxus.athen1.ga.home.com>
	<Pine.GSO.4.21.0012042311190.16832-100000@riboweb.Stanford.EDU>
Message-ID: <14893.59650.80411.133478@taxus.athen1.ga.home.com>

Hey Jeff,
Thanks a lot for taking a look at the code!

[SeqFeature classes]
> Yes, I definitely agree with needing a general class.  However, I've been
> purposefully shying away from proposing a general framework for
> annotations for two main reasons.  First, it's a hard, unsolved problem
> that we don't know how to do yet.  If you look at the models for biojava,
> bioperl, and game, you'll see that there are 3 different partially
> compatible solutions.  I suspect how you handle annotations is going to
> depend on the purpose of the applications.  (Though I suppose "to store
> genbank annotations" is a reasonable purpose).  

Agreed. I think our chances of getting it perfect are pretty slim
:-). However, I think it would really help writing "applications that
use Biopython" to have some kind of general class to work off of (even 
if it is imperfect). It is just too much work to have to support
Genbank record classes and EMBL record classes and whatever record
classes. I imagine this would be especially painful for writing user
interfaces. I'm not sure if my SeqFeature classes are the best thing
ever, but it is just meant as a bit of a start. I am definitely
willing to throw out/amend lots of what I wrote if people have got
ideas for changing them to be better.

> The second reason is that
> I like the idea of specific data structures for each database.  That way,
> people that really care about, say, swissprot, will know how to retrieve
> the data from their favorite field without having to muck around to see
> how it's getting coerced into a one-size-fits-all framework.  

Also agreed. I'll work on a GenBank specific Record class as well, as
I think these make it much easier for people who just want to parse
out GenBank.

> If you can
> only parse into a general data structure, then, since I don't believe a
> single data structure can hold all the types of information from every
> data base, you're bound to lose data.  I don't believe there's any general
> data structure in existence that can handle the genbank location
> field.  It's described by a BNF grammar and requires a tree!

Very true :-). When putting the "GenBank specific stuff" into the
SeqFeature classes I ended up dumping a lot of it into dictionaries
(like Andrew's annotations dictionary in the SeqRecord class). A
GenBank specific record would definitely hold the information in a much
more readily accessible format.

> Do we need to deal with genbank functions like complement or order?

I'm trying to deal with them (although I forgot to do order! Thanks
for mentioning it). I'm dealing with them in the following way:

complement - I mark the feature as being on the opposite strand (I'm
using a 1, 0, -1 scale like BioCorba -- so -1 is the opposite strand).

join - The top level feature has a location from the start to the end
of the join. The feature then has sub SeqFeatures (also borrowed from
Biocorba) which are the individual exons (or whatever) in the
join. Right now if the top level feature is a CDS, then the sub
Features are labelled as type CDS_span. I think I'll change this to
CDS_join to make it clear they are part of a join.

order - I'll treat this like join, except call the sub features
CDS_order. 

It should also be able to deal with nested locations, like
complement(join(location,location)).

Does this all sound reasonable?
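
A minimal sketch of the scheme described above, in modern Python rather
than the actual SeqFeature code. The names Feature, split_top_level and
parse_location are hypothetical, and only plain "start..end" ranges are
handled; real GenBank locations have more forms than this.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    type: str
    start: int
    end: int
    strand: int = 1  # 1 = given strand, -1 = opposite strand
    sub_features: List["Feature"] = field(default_factory=list)


def split_top_level(text):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse_location(text, feature_type, strand=1):
    text = text.strip()
    # complement(...) flips the strand of everything inside it
    if text.startswith("complement(") and text.endswith(")"):
        inner = text[len("complement("):-1]
        return parse_location(inner, feature_type, -strand)
    # join(...) and order(...) become a top level feature with sub features
    for op in ("join", "order"):
        prefix = op + "("
        if text.startswith(prefix) and text.endswith(")"):
            sub_type = "%s_%s" % (feature_type, op)
            subs = [parse_location(part, sub_type, strand)
                    for part in split_top_level(text[len(prefix):-1])]
            return Feature(feature_type,
                           min(s.start for s in subs),
                           max(s.end for s in subs),
                           strand, subs)
    match = re.fullmatch(r"(\d+)\.\.(\d+)", text)
    if match is None:
        raise ValueError("unsupported location: %r" % text)
    return Feature(feature_type, int(match.group(1)),
                   int(match.group(2)), strand)


cds = parse_location("complement(join(1..10,20..30))", "CDS")
print(cds.start, cds.end, cds.strand)
print([(s.type, s.start, s.end) for s in cds.sub_features])
```

The top level feature spans the whole join, and each piece is kept as a
sub feature typed CDS_join (or CDS_order), on the same strand as its
parent.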

[Alphabets for GenBank]
> I'm not sure that's a good thing for GenBank.  Does GenBank store the
> alphabet for the sequence?  What if the sequence doesn't strictly follow
> the alphabet?

Well, GenBank doesn't really store the alphabet (it does give a base
count for common bases (AGTC) but then specifies anything else as
"other" which isn't very useful for our purposes). What I do is
remember the type from the GenBank file (DNA, RNA, PROTEIN) and then
give the sequence this alphabet. I use the ambiguous DNA and RNA
alphabets so this should cover any letters in the sequence
(hopefully). I'm not sure if this is ideal, but at least it associates 
the type with the sequence. Suggestions about how to be more strict
are welcome on this.

> - There's a TaggingConsumer in Bio.ParserSupport.  It looks like this does
> something similar to _PrintConsumer.  It's supposed to be used for
> debugging purposes so that you know what's getting passed when.  If it's
> not appropriate, please let me know how to extend it so that it's more
> generally useful.

Oh sorry, I meant to document that. TaggingConsumer is great. I just
used the PrintConsumer as I was coding this so that I would make sure
I added all of the necessary callbacks and didn't forget any
information. This was more for my use in coding than for anything
else. Once I was done building the parser, I just copy/pasted it and
used it to build the Feature consumer. I don't think it is worthwhile
to actually include in a final version, but I saved it because I'll
probably need to copy and paste it again to write a
RecordConsumer. But anyways, it was just a coding tool -- for later
debugging TaggingConsumer is great for me. PrintConsumer was just my
way to reduce the number of bugs in my code :-).

> Could you put Bio/SeqFeature/SeqFeature.py code into Bio/SeqFeature.py?  
> It would prevent stuff like:
> from Bio.SeqFeature import SeqFeature
> or even worse,
> from Bio.SeqFeature.SeqFeature import SeqFeature

Well, I was going to recommend to use it like this:

from Bio import SeqFeature
my_feature = SeqFeature.SeqFeature.SeqFeature()

:-) Seriously, this is indeed very ugly. Another possible solution
would be to put all of the Features in a directory called Feature or
Features (instead of SeqFeature) so then the imports would look like:

from Bio.Feature import SeqFeature

Either way is fine, though (or I'm very open to additional
suggestions), so whatever you think.

Thanks again for taking a look at this. I'll try to produce another
version based on this (and any future comments) for next week.

Brad


From chapmanb at arches.uga.edu  Wed Dec  6 02:31:36 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] GenBank parser -- first go
In-Reply-To: <003901c05f5e$9622c700$010a0a0a@cadence.com>
References: <Pine.GSO.4.21.0012042311190.16832-100000@riboweb.Stanford.EDU>
	<003901c05f5e$9622c700$010a0a0a@cadence.com>
Message-ID: <14893.60248.442490.847078@taxus.athen1.ga.home.com>

Hi Cayte;
Thanks for trying this out!

[GenBank parser]
>    Does it strip html tags?  When I ran checkoutput.py, it produced this
> output:

[the parser doesn't like html]

>   The problem with conversions to text is that Netscape and Explorer and
> probably others use different algorithms and produce different text output.

GenBank is a flat file format, like FASTA, so all of the html markup 
that NCBI or whoever puts in is just arbitrary to "beautify" it for
the web. 

You should be able to get the text GenBank version of any record
without having to do a "save as text" on an html page. On the NCBI
page, there is a Text button at the top of a list of records that 
will give you the flat-file text version of a record you searched 
for using Entrez. You can then save this as text, and it'll be 
consistent between browsers. 

Once you get this the parser should be happier with the file :-).

Let me know if this doesn't help.

Brad


From dalke at acm.org  Wed Dec  6 03:10:36 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] GenBank parser -- first go
Message-ID: <001401c05f5c$056e0460$95ac323f@josiah>

Brad:
>I spent this weekend getting together a GenBank parser,
>which I hope is something that we could include in Biopython in the
>future.

Wow!  I'm glad that people other than me can use it.  I've been
working on it for so long now that I don't have a good idea of what it
means to come into it from scratch.

>we all definitely have to give
>Andrew another big pat on the back for his awesome tool :-).

Thank you.

>5. Can the code be speeded up/improved in any ways? Suggestions to
>help me code better are always very welcome!

I looked it over and it seems quite good.  I do have some comments,
which I've included here.  (Most of the points are relevant to
Python and Martel programming, so I've sent it to the list instead
of just you directly.)


You use

cur_record = iterator.next()
while cur_record:
  ...
  cur_record = iterator.next()

The standard Python idiom is

while 1:
  cur_record = iterator.next()
  if not cur_record:
    break
  ...


Instead of
  indent_space = Martel.RepN(Martel.Str(" "), 2)

it is better to do (note: two spaces)
  indent_space = Martel.Str("  ")

They give the same result, but "  " gets checked with a single
test while two " "s is done with two tests.  This is even more
appropriate with
  qualifier_space = Martel.RepN(Martel.Str(" "), 21)

(Actually, there is a Martel.optimize module which contains the
function 'merge_strings'.  It should merge " "," " into "  " but it
isn't automatically used.)


As an aside, I can see I've been focused on regexps compared to
you.  You have
  blank_space = Martel.Rep1(Martel.Str(" "))
where I would use
  blank_space = Martel.Re(" +")
There's no difference in implementation - the decision on which one to
use is based on readability/usability.  I (sadly?) know a lot about
regexps so I probably find them more usable :)

Case in point, I probably would have written the LOCUS definition with
more regexps.

def choice(tag, words):
  # Alt matches any one of the literal strings
  exp_list = map(Martel.Str, words)
  return Martel.Group(tag, Martel.Alt(*exp_list))

residue_types = choice("residue_type",
                       ["DNA", "RNA", "mRNA", "PROTEIN"])

data_file_division = choice("data_file_division",
              ["PRI", "ROD", "MAM", "VRT", "INV", "PLN", "BCT", "RNA",
               "VRL", "PHG", "SYN", "UNA", "EST", "PAT", "STS", "GSS",
               "HTG"])

date = Martel.Group("date",
     Martel.Re(r"(?P<day>\d+)-(?P<month>[A-Z]+)-(?P<year>\d+)"))

locus_line = Martel.Re(r"LOCUS +(?P<locus>\w+) +(?P<size>\d+) bp +") + \
             residue_types + blank_space + data_file_division + \
             blank_space + date + Martel.AnyEol()

Interestingly, I didn't use a regexp for residue_types like you do.
That's because I worry about people looking at a list of strings and
thinking they can add arbitrary text - forgetting about escaping
regex characters.

Skipping ahead, the 'valid_f_keys' might someday include characters
like '+' and '.', so you really should use a function like the one I
gave above.
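
To illustrate the escaping worry with the standard re module instead of
Martel (a sketch only; this choice() returns a pattern string and is not
the Martel function above, and the key names are made up):

```python
import re


def choice(tag, words):
    """Build a named group matching exactly one of the literal words."""
    # Longest words first, so a word is never shadowed by its own prefix
    ordered = sorted(words, key=len, reverse=True)
    alternatives = "|".join(re.escape(word) for word in ordered)
    return "(?P<%s>%s)" % (tag, alternatives)


# Hypothetical keys containing regexp metacharacters
keys = ["CDS", "misc.feature", "exon+"]
safe = re.compile(choice("feature_key", keys))
naive = re.compile("(?P<feature_key>%s)" % "|".join(keys))

print(safe.fullmatch("misc.feature") is not None)   # the literal key
print(safe.fullmatch("miscXfeature") is not None)   # rejected when escaped
print(naive.fullmatch("miscXfeature") is not None)  # accepted by accident
```

Without escaping, the "." quietly matches any character and the "+"
turns into a repeat, so the pattern accepts text nobody intended.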


Hmmm.  I've been pickier than you about ignoring the leading
whitespace.  For example, you have


# definition line
# DEFINITION  Genomic sequence for Arabidopsis thaliana BAC T25K16 from
#             chromosome I, complete sequence.

definition = Martel.Group("definition",
                          Martel.Rep1(blank_space +
                                      Martel.ToEol()))

definition_line = Martel.Group("definition_line",
                               Martel.Str("DEFINITION") +
                               definition)

By comparison, I would enforce that the text be folded to start on
column 13 using

definition_line = Martel.Group("definition_line",
             Martel.Str("DEFINITION  ") + Martel.ToEol("definition") +
  Martel.Rep(Martel.Str("            ") + Martel.ToEol("definition")))

Here's a justification for this.  It's already common practice with
GenBank files to have subitems indented under the major item.  For
example,

SOURCE      thale cress.
  ORGANISM  Arabidopsis thaliana

Suppose some day the powers that be add a subitem to the DEFINITION:

DEFINITION  Genomic sequence for Arabidopsis thaliana BAC T25K16 from
            chromosome I, complete sequence.
  BLAH      Abcd Ef Ghijkl.

I consider it a good thing for the parser to break at this point,
rather than include the "BLAH      Abcd Ef Ghijkl." text as part of
the definition.

It seems the form of
  "LABEL"    text starts on column 13
             and may fold over multiple
             lines

is pretty common.  If that's the case, you can make things simpler by
using a function to make the definition.

INDENT = 12
def make_line(label, line_name, data_name):
    assert len(label) < INDENT, "label text too long"
    first_line = Martel.Str(label + " " * (INDENT - len(label))) + \
                 Martel.ToEol(data_name)
    other_lines = Martel.Str(" " * INDENT) + \
                  Martel.ToEol(data_name)
    return Martel.Group(line_name, first_line + Martel.Rep(other_lines))


definition_line = make_line("DEFINITION", "definition_line",
                            "definition")

accession_line = make_line("ACCESSION", "accession_line", "accession")


(If you were really trusting you could use just the line label and
string.lower it to get the data name, and add "_line" to that to get
the line name.  I'm usually more explicit than that.)
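
The same folding rule can be tried out without Martel. This is a rough
sketch using the standard re module; make_line here returns a compiled
pattern and read_folded is a hypothetical helper, neither is Martel API.

```python
import re

INDENT = 12


def make_line(label):
    """Pattern for 'LABEL' padded to column 13 plus folded continuation lines."""
    assert len(label) < INDENT, "label text too long"
    first_line = re.escape(label.ljust(INDENT)) + r"[^\n]*\n"
    other_lines = " " * INDENT + r"[^\n]*\n"
    return re.compile("(?:%s)(?:%s)*" % (first_line, other_lines))


def read_folded(label, text):
    """Return the data of each folded line, or None if the label is absent."""
    match = make_line(label).match(text)
    if match is None:
        return None
    # Drop the label or the indentation from the front of every line
    return [line[INDENT:] for line in match.group(0).splitlines()]


sample = (
    "DEFINITION  Genomic sequence for Arabidopsis thaliana BAC T25K16 from\n"
    + " " * INDENT + "chromosome I, complete sequence.\n"
    + "  BLAH      Abcd Ef Ghijkl.\n"
)
print(read_folded("DEFINITION", sample))
```

The match stops before the "  BLAH" line because that line does not start
with twelve spaces, which is the strictness argued for above.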

\w includes \d so you don't need to do [\w\d] for the nid

For that matter, [\d]+ is the same as \d+ (as in the gi)


I'm still undecided about the AtBeginning and AtEnd commands.  (You
use the former in the version_line definition.)  I don't know if they
should test for beginning/end of line or beginning/end of input text.
In fact, I would rather not have them at all, since they don't work
well with the RecordReader idea, where the text is broken up into
parts.  There's no real need for AtBeginning here so you can remove
it.

With the function definition above, you can also write

  keywords_line = make_line("KEYWORDS", "keywords_line", "keywords")
  source_line = make_line("SOURCE", "source_line", "source")


Out of curiosity, why do arab1.gb and cor6_6.gb have different
ORGANISM lines?

SOURCE      thale cress.
  ORGANISM  Arabidopsis thaliana
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            euphyllophytes; Spermatophyta; Magnoliophyta; eudicotyledons;
            Rosidae; Capparales; Brassicaceae; Arabidopsis.

SOURCE      thale cress.
  ORGANISM  Arabidopsis thaliana
            Eukaryota; Viridiplantae; Embryophyta; Tracheophyta; Spermatophyta;
            Magnoliophyta; eudicotyledons; core eudicots; Rosidae; eurosids II;
            Brassicales; Brassicaceae; Arabidopsis.

The first has "Streptophyta", "euphyllophytes" and "Capparales".

The second has "core eudicots", "eurosids II" and "Brassicales".


Martel includes a helper function called 'Integer' which can simplify
some of your definitions, as with

reference_num = Martel.Integer("reference_num")
pubmed_id = Martel.Integer("pubmed_id")


Here's another preference of mine.  You have

  sequence = Martel.Group("sequence",
                          Martel.Re("[\w ]+"))

where I would do
  sequence = Martel.ToEol("sequence")

The difference is that you only accept a-zA-Z0-9_ and the space
character.  It doesn't accept "-" or "*", which I can see possibly
getting into the data.  Since there's no need to validate the
characters, might as well just consume anything you find.

(Note to self - need to use \R for the ToEol definition.)


Looking at GenBank.py, you use the record definition thusly:

        parser = genbank_format.record.make_parser(debug_level = debug)
        parser.setContentHandler(_EventGenerator(consumer))
        parser.setErrorHandler(handler.ErrorHandler())

        parser.parseFile(handle)

You really should cache the created parser.  It can take quite some
time to generate.  During my PIR tests, about 98% of time of some of
the tests were spent doing generation rather than parsing.

I think you do it because you want to allow different debug levels.
You can still support that in a couple of ways.  Here's one:

class _Scanner:
    def __init__(self):
        self._cached_parsers = {}
    def feed(self, handle, consumer, debug = 0):
        parser = self._cached_parsers.get(debug)
        if parser is None:
            parser = self._cached_parsers[debug] = \
                genbank_format.record.make_parser(debug_level = debug)

        parser.setContentHandler(_EventGenerator(consumer))
        parser.setErrorHandler(handler.ErrorHandler())
        parser.parseFile(handle)

Here's another which is tuned for smaller memory use and the
assumption that you almost never change debug levels.

class _Scanner:
    def __init__(self):
        self._cached_parser = None
        self._cached_debug = None
    def feed(self, handle, consumer, debug = 0):
        if self._cached_debug == debug:
            parser = self._cached_parser
        else:
            parser = self._cached_parser = \
                genbank_format.record.make_parser(debug_level = debug)
            self._cached_debug = debug

        parser.setContentHandler(_EventGenerator(consumer))
        parser.setErrorHandler(handler.ErrorHandler())
        parser.parseFile(handle)
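
The caching idea on its own, stripped of Martel, might look like the
following sketch. CachingScanner and the build_parser callable are
stand-ins invented for the illustration; in the real code build_parser
would be the expensive make_parser call.

```python
class CachingScanner:
    """Keeps one parser per debug level so each is generated only once."""

    def __init__(self, build_parser):
        self._build_parser = build_parser  # expensive factory: debug -> parser
        self._cached_parsers = {}
        self.builds = 0                    # counts how often we really built one

    def get_parser(self, debug=0):
        parser = self._cached_parsers.get(debug)
        if parser is None:
            self.builds += 1
            parser = self._cached_parsers[debug] = self._build_parser(debug)
        return parser


# A dummy factory stands in for the slow parser generation step
scanner = CachingScanner(lambda debug: {"debug_level": debug})
first = scanner.get_parser()
second = scanner.get_parser()
verbose = scanner.get_parser(debug=2)
print(first is second, scanner.builds)
```

Repeated calls with the same debug level reuse the parser, so the
generation cost is paid once per level rather than once per record.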

You have a list of tags you are interested in receiving.  Martel has a
function to create a new expression tree based on the original but
sending back only the events you are interested in receiving.

   expression = genbank_format.record
   expression = Martel.select_names(expression, interest_tags)
   parser = expression.make_parser()

BTW, this was implemented because Python's function call overhead is
pretty large so I wanted a way to reduce the number of calls if I knew
an event wasn't needed.

Replace
         fun_to_call = eval("self._consumer" + "." + name)
with
         fun_to_call = getattr(self._consumer, name)


You should also do
         fun_to_call(info_to_pass)
instead of
         apply(fun_to_call, (info_to_pass,))


Here's a cute implementation for _PrintConsumer (untested)

# Lets you define new labels if you want them
event_name_converter = {
  "start_feature_table": "Starting feature table",
  "record_end": "End of Record!",
}

class _PrintWrapper:
  def __init__(self, name):
    self.name = name
  def __call__(self, content):
    print "%s: %s" % (self.name, content)

class _PrintConsumer:
  def __init__(self):
    self.data = 'blah'
  def __getattr__(self, name):
    if name[:2] == "__":
      raise AttributeError, name
    return _PrintWrapper(event_name_converter.get(name, name))
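
For anyone trying the same trick on a current Python, here is a rough
equivalent that records events instead of printing them, which makes it
easy to check. RecordingConsumer and dispatch are hypothetical names,
not part of Biopython.

```python
class RecordingConsumer:
    """Answers any callback name and collects (label, content) pairs."""

    def __init__(self, labels=None):
        self.labels = labels or {}
        self.events = []

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails
        if name.startswith("__"):
            raise AttributeError(name)
        label = self.labels.get(name, name)

        def callback(content):
            self.events.append((label, content))

        return callback


def dispatch(consumer, name, content):
    """Look the callback up by name with getattr, no eval needed."""
    getattr(consumer, name)(content)


consumer = RecordingConsumer({"record_end": "End of Record!"})
dispatch(consumer, "locus", "ATHBAC")
dispatch(consumer, "record_end", "")
print(consumer.events)
```

The guard on double-underscore names matters: without it, protocols that
probe for special methods would get a callback back instead of an
AttributeError.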


                    Andrew
                    dalke@acm.org


From dalke at acm.org  Wed Dec  6 03:12:29 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] bug in Martel-0.4
Message-ID: <001501c05f5c$47ecb8e0$95ac323f@josiah>

There's a bug in Martel-0.4 and earlier versions.

Suppose you have  ([<>][ABC])+[<>]?
and want to match it against

   <A<B<

The "<A" matches the first [<>][ABC].  The "<B" matches
the second [<>][ABC].  The parser tries to match the final
"<" against [<>][ABC] and should fail then try to match
the "<" against [<>]? .

The bug was that it would match the "<" against the [<>] in
[<>][ABC] and fail at that point.  It gives an assertion error
about "l" being greater than "r".
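
For comparison, Python's own re module backtracks as expected on this
input. This small check is not Martel code; it only shows the behaviour
the patched generator is meant to reproduce.

```python
import re

pattern = re.compile(r"([<>][ABC])+[<>]?")
match = pattern.fullmatch("<A<B<")

# The third pass through the group fails after consuming "<", so the
# engine backs up and lets the optional [<>]? take the final character.
print(match is not None)
print(match.group(1))  # the last successful repetition of the group
```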

Here's the patch.  The only consequence should be a small hit
in performance.

Index: Generate.py
===================================================================
RCS file: /home/dalke/cvsroot/Martel/Generate.py,v
retrieving revision 1.18
retrieving revision 1.19
diff -r1.18 -r1.19
271c271
<         result.append( (None, TT.SubTable, tuple(tagtable)) )
---
>         result.append( (">ignore", TT.Table, tuple(tagtable)) )
275c275
<         result.append( (None, TT.SubTable, tuple(tagtable),
---
>         result.append( (">ignore", TT.Table, tuple(tagtable),


(Okay, there are other bugs, but this is one which is part
of the core code and is hard to figure out or work around.)

                    Andrew


From dalke at acm.org  Wed Dec  6 03:16:16 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] PIR parsing
Message-ID: <001601c05f5c$cf21c260$95ac323f@josiah>

I've written a much more complete PIR CODATA parser which works with
the latest PIR release (Release 66.00, September 30, 2000).  I tested
it against pir1.dat and pir3.dat.

The PIR format is somewhat nasty, but not as bad as I thought it would
be.  It's like several other formats in that long fields fold over to
the next (indented) lines.  The only major problem was that the folded
lines themselves can contain multiple elements, like

FEATURE
   2-105               #product cytochrome c #status experimental #label
                       MAT\

 or in XML with some extra newlines ... :)

   <feature_range><begin_pos>2</begin_pos>-<end_pos>105</end_pos>
</feature_range>               #product <product>cytochrome c
</product> #status <status>experimental</status> #label
                       <label>MAT\</label>


Some of the fields don't have the #elements at all, but the
implementation is pretty strict and it checks that words inside of the
text field do not start with a '#'.  That check makes the pattern
quite gnarly but is needed to ensure I'm not missing an element by
accident.

The new module is (temporarily) at
    http://www.biopython.org/~dalke/PIR_3_0.py  .
It should work fine, but it hasn't been tested against a real need
(like generating HTML or data structures) so it will likely change
as those needs are resolved.  Also, the indentation level has
changed from release 65 so it probably won't work with anything
other than the most recent version.

Some things to do for the future:
  o rewrite to clean things up, now that the format is known (some
       of the definitions are scaffolding to explore the format)
  o choose better names
  o parse more of the format
      - identify parts of the journal references
      - make each component accessible in a semi-colon delimited list

BTW, the callback overhead for this format is about a factor of
4 more than the parsing part.  The PIR format intermingles sequence
letters and markup about the residue - one letter of one then one
letter of the other.  So every sequence character creates three
function calls!  (begin, character and end.)

                    Andrew
                    dalke@acm.org


From dalke at acm.org  Wed Dec  6 03:39:45 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] GenBank parser -- first go
Message-ID: <001e01c05f60$16d48720$95ac323f@josiah>

Jeff:
>> I don't believe there's any general
>> data structure in existance that can handle the genbank location
>> field.  It's describe by a BNF grammar and requires a tree!

Viewed as a parsing problem, this cannot be done with regular
expressions.  When something like that occurs, it should be fine
to leave it as an opaque block of text, which is parsed elsewhere.

John Aycock wrote a really nice context-free parser in pure
Python called SPARK.  http://www.csr.uvic.ca/~aycock/python/
Easier to use.  (Which means it is *much* easier to use than
lex/yacc.)

Brad:
>I use the ambiguous DNA and RNA
>alphabets so this should cover any letters in the sequence
>(hopefully). I'm not sure if this is ideal, but at least it associates 
>the type with the sequence. Suggestions about how to be more strict
>are welcome on this.

You could be more strict by being less strict.  There's a
ProteinAlphabet, DNAAlphabet and RNAAlphabet as part of the
Bio.Alphabet module.

You can't really do anything with them.  All they say is that the
sequence uses a single-letter alphabet of protein, DNA or RNA
residues.  They don't attempt to define what those letters
mean.

Jeff:
>> - There's a TaggingConsumer in Bio.ParserSupport.

Oops!  You can see I haven't read that bit of code.  I included
something pretty much like that in my earlier reply to Brad.

                    Andrew
                    dalke@acm.org


From chapmanb at arches.uga.edu  Wed Dec  6 03:50:55 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] Martel-0.4 available
In-Reply-To: <000901c05b7e$505616c0$efab323f@josiah>
References: <000901c05b7e$505616c0$efab323f@josiah>
Message-ID: <14893.65007.146784.658490@taxus.athen1.ga.home.com>

Hi Andrew;
Sorry I haven't had a chance to comment on new Martel features yet
-- I have a bit of feedback in the areas you mentioned based on 
working with it for writing the GenBank parser.

>   New regexp syntax - \R
>      \R    means "\n|\r\n?"
>      [\R]  means "[\n\r]"
> 
>   New Expression Node - AnyEOL
>      implements the \R test

In general, the \R syntax worked great for me. I'm not a regexp purist 
or anything, so I have no issues with adding this. The new feature of
being able to handle any kind of line feed is very nice. One thing
that I ended up doing was not using the AnyEOL test at all, and
instead only using the \R syntax. As I started using it I realized
why it was so nice to be able to embed the \R inside of any regular
expression, so I ended up only using \R to be consistent (so I used
Martel.Re("\R") to detect end of lines). Just thought I would mention
it if it is helpful to you. But in general, \R seems great by me.

I also thought it would be nice if the RecordReader would accept \R as 
a newline as well, so you could do something like
RecordReader.EndsWith(handle, "//\R"). Even further along these
lines, it would have been nice to be able to set the end with an
arbitrary regular expression. For GenBank, I would have wanted
"//[\R]+" (okay, I would have to escape those //'s, but I'm not sure
how many /s that would leave me with :-), so that  the end would be
// plus an arbitrary number of newlines. I ran into problems with
files like the biojava genbank test file, where there are a bunch of
linefeeds at the end of the file, but this could be a problem with a
file of cut'n'pasted records that had differing amounts of
linebreaks. I was able to get around this for GenBank by using
StartsWith(handle, "LOCUS"), but just thought I would mention the thought.
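
To make the wish concrete, here is a sketch of the splitting I have in
mind, written with the standard re module since the RecordReader can't
do this. EOL spells out what \R stands for; split_records is a made-up
helper that works on a whole string held in memory.

```python
import re

# What \R means in Martel, written out for the standard re module
EOL = r"(?:\n|\r\n?)"

# "//" at the start of a line, followed by one or more line endings
RECORD_END = re.compile(r"(?:\A|(?<=[\n\r]))//" + EOL + "+")


def split_records(text):
    """Split text into records, swallowing extra blank lines after each '//'."""
    records, start = [], 0
    for match in RECORD_END.finditer(text):
        records.append(text[start:match.start()] + "//")
        start = match.end()
    tail = text[start:]
    if tail.strip():
        records.append(tail)  # an unterminated final record
    return records


text = "LOCUS A\r\n//\r\n\r\n\r\nLOCUS B\n//\n\n"
print(split_records(text))
```

The differing numbers of line breaks after each "//" no longer matter,
whatever newline convention the file uses.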

>   RecordReaders rewritten to use mxTextTools to find record
> begin and end characters rather than using readline/readlines.

I have a quick question about mxTextTools importing -- you are now
importing with:

from mx import TextTools

When did it get a mx meta-directory? Is this a new version or anything 
fancy? It was no big deal, I was just curious.

>   - how to make an iterator (would like a bit more feedback)

(pausing to read your other mails right now... thanks for the
feedback!)

One thing that I didn't use is a Martel based iterator -- I just stuck 
with the type of iterator that Jeff uses in other Biopython parsers 
but used the RecordReader to implement it. I'm not sure if it could be 
done in a better way with a Martel iterator...

BTW, the debug_level = 2 option on the parser is incredibly nice. It
really helps get at why a parse is failing and makes it much easier to 
correct the problem. I probably would still be pulling my hair out
trying to get the regexps right without this. Thanks!

Brad


From dalke at acm.org  Wed Dec  6 04:53:23 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] Martel-0.4 available
Message-ID: <006101c05f6a$607c39e0$95ac323f@josiah>

Brad:
>One thing
>that I ended up doing was not using the AnyEOL test at all, and
>instead only using the \R syntax.

Admittedly, the existing AnyEol uses the old "\n" test so
won't work on non-UNIX platforms.  Still, AnyEol() should
be just as good as using Re(r"\R").  (Partially because you
really should be using a raw quoted string - it works because
Python's normal strings currently don't do anything with \R.)

>I also thought it would be nice if the RecordReader would accept \R as 
>a newline as well, so you could do something like
>RecordReader.EndsWith(handle, "//\R").

Some of the readers allow a trailing "\n".  This gets interpreted
to mean \R.  That changes the definition of "\n", which is probably
a bad idea.  I used "\n" because it's one character and not as
likely to be confused with other characters.  It shouldn't be
too hard to change to use \R instead.

> Even further along these
>lines, it would have been nice to be able to set the end with an
>arbitrary regular expression.

Indeed, that would be a final goal for Martel.  I can't do it.
If I could then your delimiter would be the Genbank record
definition itself and there would be no need for a RecordReader.

The problem is that I can't tell when mxTextTools reaches the
end of the string.  I would like it to ask "I've parsed this
data, got any more before I call it the end of input?".  All
I know now is that the parse failed, but it could be because
the text was in the wrong format or it needed more data to
finish the check.  I could keep on making the string larger
and larger, but when would I stop?

BTW, that "make the string larger and larger" is what I do
with the StartsWith and EndsWith.  That only works because I
know exactly the contents of the string so I know the failure
conditions, and because the record sizes are usually a lot
smaller than the lookahead buffer so I don't have the N**2
case of appending strings and retesting.
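
In outline, the lookahead approach works something like the sketch
below. This is not the Martel RecordReader code; read_records, the fixed
"\n//\n" terminator and the chunk size are assumptions made to keep the
illustration short.

```python
import io


def read_records(handle, terminator="\n//\n", chunk_size=4096):
    """Yield records ending in terminator, growing a lookahead buffer as needed."""
    lookahead = ""
    search_from = 0
    while True:
        end = lookahead.find(terminator, search_from)
        if end >= 0:
            end += len(terminator)
            yield lookahead[:end]
            lookahead = lookahead[end:]
            search_from = 0
            continue
        chunk = handle.read(chunk_size)
        if not chunk:
            break
        # Only rescan the part where a terminator could straddle the old
        # buffer and the new chunk, to avoid retesting the whole string
        search_from = max(0, len(lookahead) - len(terminator) + 1)
        lookahead += chunk
    if lookahead:
        yield lookahead  # the remainder: text after the last terminator


handle = io.StringIO("A\n//\nBB\n//\nrest")
print(list(read_records(handle, chunk_size=3)))
```

Because the terminator is a known fixed string, a failed search always
means "need more data" and never "wrong format", which is exactly what
an arbitrary regular expression can't tell you.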

>I ran into problems with
>files like the biojava genbank test file, where there are a bunch of
>linefeeds at the end of the file, but this could be a problem with a
>file of cut'n'pasted records that had differing amounts of
>linebreaks.

If you use the HeaderFooter parser, you have an empty header
and a footer which matches "\R*".  See the PIR example which
allows a trailing \\\ .

When it reads past the final /// it will try to parse the
newlines as a record.  That will fail, so it passes the text
off to the footer parser.

Another nice thing about the Record Parsers - if there's an error
when processing a record, it's an 'error' but not a 'fatalError'.
It can recover by processing the next record.

>I have a quick question about mxTextTools importing -- you are now
>importing with:
>
>from mx import TextTools
>
>When did it get a mx meta-directory? Is this a new version or anything 
>fancy? It was no big deal, I was just curious.

Oops, didn't realize I was doing that.  I'm using a prerelease
version of mxTextTools 1.2 which changes the organization.  I
really should use just TextTools.  (1.2 has backwards compatible
support for that.)

>One thing that I didn't use is a Martel based iterator -- I just stuck 
>with the type of iterator that Jeff uses in other Biopython parsers 
>but used the RecordReader to implement it. I'm not sure if it could be 
>done in a better way with a Martel iterator...

Depends on the needs.  From what I saw of your adapter, it was
a pretty straight match between the two.

>BTW, the debug_level = 2 option on the parser is incredibly nice. It
>really helps get at why a parse is failing and makes it much easier to 
>correct the problem. I probably would still be pulling my hair out
>trying to get the regexps right without this. Thanks!

I agree.  I was working on the PIR parser and having the correct
byte position (debug_level = 1) was wonderful.  Then when I got
really confused, I upped it to 2 to get an idea of what it was
attempting to parse.

                    Andrew
                    dalke@acm.org


From katel at worldpath.net  Wed Dec  6 23:57:49 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] GenBank parser -- first go
References: <Pine.GSO.4.21.0012042311190.16832-100000@riboweb.Stanford.EDU><003901c05f5e$9622c700$010a0a0a@cadence.com> <14893.60248.442490.847078@taxus.athen1.ga.home.com>
Message-ID: <003601c0600a$42b56ee0$010a0a0a@cadence.com>

----- Original Message -----
From: "Brad Chapman" <chapmanb@arches.uga.edu>
To: "Cayte" <katel@worldpath.net>
Cc: <biopython-dev@biopython.org>
Sent: Tuesday, December 05, 2000 11:31 PM
Subject: Re: [Biopython-dev] GenBank parser -- first go


>
> You should be able to get the text GenBank version of any record
> without having to do a "save as text" on an html page. On the NCBI
> page, there is a Text button at the top of a list of records that
> will give you the flat-file text version of a record you searched
> for using Entrez. You can then save this as text, and it'll be
> consistent between browsers.
>
> Once you get this the parser should be happier with the file :-).
>
  It's happier with the text file.  The problem now is ye olde machine
independent line-feed.  The features and annotations run way to the right
with some embedded octal 012s.  My system is Win98.  It's probably fine on
Unix and Linux.

                                                           Cayte


From dalke at acm.org  Wed Dec  6 22:50:25 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] GenBank parser -- first go
Message-ID: <013301c06000$d81fc0c0$95ac323f@josiah>

Cayte:
> It's happier with the text file.  The problem now is ye olde machine
> independent line-feed.  The features and annotations run way to the
> right with some embedded octal 012s

That's my doing, I'm afraid.  I didn't change a couple of the
definitions to use the new \R syntax.  One is the 'ToEol()'
command, which Brad uses in his code.

The fix should be to change Martel/__init__.py from

    if name is None:
        return Re(".*\n")
    else:
        return Group(name, Re(".*")) + Str("\n")

to

    if name is None:
        return Re(r"[^\R]*\R")
    else:
        return Group(name, Re(r"[^\R]*")) + Re(r"\R")

but I haven't tested it to make sure that's correct.
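
The intended behaviour can at least be checked outside Martel. The
following sketch uses the standard re module, where \R has to be spelled
out; to_eol is a hypothetical stand-in that returns a pattern string,
not the Martel function.

```python
import re

# What \R stands for: "\n", "\r\n" or a bare "\r"
EOL = r"(?:\n|\r\n?)"


def to_eol(name=None):
    """Pattern for the rest of a line, up to and including any line ending."""
    if name is None:
        return r"[^\n\r]*" + EOL
    return r"(?P<%s>[^\n\r]*)" % name + EOL


old = re.compile(r"(?P<line>.*)\n")
new = re.compile(to_eol("line"))

# The old definition keeps the "\r" of a DOS line ending in the data
print(repr(old.match("LOCUS\r\n").group("line")))
# The new one leaves it out for all three conventions
print(new.findall("a\r\nb\rc\n"))
```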

                    Andrew


From katel at worldpath.net  Thu Dec  7 02:03:20 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] Eol
References: <006101c05f6a$607c39e0$95ac323f@josiah>
Message-ID: <005801c0601b$c8e97360$010a0a0a@cadence.com>

  Should the last line of text have an implicit Eol?  This test assumes it
should, but the test failed.  A test that's identical, except that the
target text ends with a newline, passed.


        exp1 = Martel.ToEol()
        exp2 = Martel.ToEol()
        exp3 = Martel.ToEol()
        expression = exp1 + exp2 + exp3
        print expression
        tagtable, want_flg = Martel.Generate.generate( expression )
        success = tag( "abcdefghij\nOPQRSTUVWXY\n0123456789", tagtable )[ 0 ]
        return ( self.assert_condition( success == 1, "Failed" ) )


                     Cayte


From dalke at acm.org  Wed Dec  6 23:24:58 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] Eol
Message-ID: <013c01c06005$a9db5300$95ac323f@josiah>


>  Should the last line of text have an implicit Eol?  This test
>assumes it should, but the test failed.  A test that's identical,
>except that the target text ends with a newline, passed.

The expression:
>        exp1 = Martel.ToEol()
>        exp2 = Martel.ToEol()
>        exp3 = Martel.ToEol()
>        expression = exp1 + exp2 + exp3

requires a final newline.  It's possible to write an expression
which doesn't need that, as with

  exp3 = Martel.Re(r"[^\R]*\R?")

As written, it is hard in Martel to make the ToEol expression
automatically recognize that a final newline is not needed.  It
could be written as
    [^\R]*(\R|$)
assuming that $ was changed to mean "end of text" rather than
end of line as I believe it does now.  (I mentioned yesterday
that I don't like the ^ and $ assertions.)

Instead, it is easier (not necessarily better!) if the format
author defines the last line to have an optional \R.
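
The difference between the two definitions can be tried out with the
standard `re` module instead of Martel.  This is only a sketch under
that substitution: \R is spelled out by hand, `\Z` is Python's "end of
text" assertion, and the names LINE, LAST_LINE, strict and lenient are
invented for the example.

```python
import re

EOL = r"(?:\n|\r\n?)"                  # Martel's \R, spelled out for `re`
LINE = r"[^\n\r]*" + EOL               # like ToEol(): newline required
LAST_LINE = r"[^\n\r]*" + EOL + "?"    # final line: newline optional

# Three lines, all of which must end in a newline.
strict = re.compile(("(?:%s){3}" % LINE) + r"\Z")
# Two ordinary lines followed by a last line that may lack the newline.
lenient = re.compile(("(?:%s){2}%s" % (LINE, LAST_LINE)) + r"\Z")

text = "abcdefghij\nOPQRSTUVWXY\n0123456789"   # no trailing newline

print(strict.match(text) is not None)          # False
print(lenient.match(text) is not None)         # True
print(strict.match(text + "\n") is not None)   # True
```

The lenient form accepts the text both with and without the trailing
newline, which is the behaviour the format author gets by making the
final \R optional.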

Still, complications arise from interactions with the record
readers.  They read a record at a time and pass the string
over to the parser.  The '$' will match at the end of that
string even though in the full format (non-record reader based)
it would not have matched.

After a bit of thought I realize that's a knee-jerk reaction.
That isn't a big concern since there are similar problems
already.  For example, if the record parser uses "(.|\n)*" it
will read up to the end of the record, but in the full format
would read the whole file.

Another solution is to have a specialized ToEol (either a
new function or an optional argument) which generates the
"\R?" form.

Finally, I don't think this is much of an issue for real
formats.  All the ones I've tested so far have a final newline,
although I don't expect that to always be the case.  In
addition, the last line is usually well defined so a ToEol
(special or otherwise) isn't needed.  Eg, it can be defined
with Re(r"///\R?") or Re(r"END\R?").

I'll point out that the record readers are designed so that
a final newline is not needed for the record.  Thus, any
problems with a missing newline should be completely handleable
by an appropriate format definition.

                    Andrew
                    dalke@acm.org


From jchang at SMI.Stanford.EDU  Thu Dec  7 19:24:23 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] GenBank parser -- first go
In-Reply-To: <14893.59650.80411.133478@taxus.athen1.ga.home.com>
Message-ID: <Pine.GSO.4.21.0012071608210.21689-100000@riboweb.Stanford.EDU>

> [SeqFeature classes]
> > Yes, I definitely agree with needing a general class.  However, I've been
> > purposefully shying away from proposing a general framework for
> > annotations for two main reasons.
> > [Blah blah]

> Agreed. I think our chances of getting it perfect are pretty slim
> :-). However, I think it would really help writing "applications that
> use Biopython" to have some kind of general class to work off of (even 
> if it is imperfect). It is just too much work to have to support
> Genbank record classes and EMBL record classes and whatever record
> classes.

That sounds reasonable.  Yes, having specific classes for every format is
a lot of work.  It's fine to map directly into a general class, since
bioperl shows that it's still useful for people.


> I'm not sure if my SeqFeature classes are the best thing ever, but it
> is just meant as a bit of a start. I am definitely willing to throw
> out/amend lots of what I wrote if people have got ideas for changing
> them to be better.

A good test of its generality is to see whether you can map the data from
other classes, e.g. Fasta.Record, SwissProt.SProt.Record, or
Medline.Record into it.

> from Bio.Feature import SeqFeature
> 
> Either way is fine, though (or I'm very open to additional
> suggestions), so whatever you think.

Are there going to be other Features not applied to sequences, such as
StructFeature?  

I don't think there should be a separate package for Feature.  The
SeqFeature stuff should be close to the Seq class.

Jeff


From katel at worldpath.net  Fri Dec  8 00:21:12 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] GenBank parser -- first go
References: <Pine.GSO.4.21.0012042311190.16832-100000@riboweb.Stanford.EDU><003901c05f5e$9622c700$010a0a0a@cadence.com> <14893.60248.442490.847078@taxus.athen1.ga.home.com> <003601c0600a$42b56ee0$010a0a0a@cadence.com>
Message-ID: <003401c060d6$af732280$010a0a0a@cadence.com>

> ----- Original Message -----
> From: "Brad Chapman" <chapmanb@arches.uga.edu>
> To: "Cayte" <katel@worldpath.net>
> Cc: <biopython-dev@biopython.org>
> Sent: Tuesday, December 05, 2000 11:31 PM
> Subject: Re: [Biopython-dev] GenBank parser -- first go
>
>
> >
> > You should be able to get the text GenBank version of any record
> > without having to do a "save as text" on an html page. On the NCBI
> > page, there is a Text button at the top of a list of records that
> > will give you the flat-file text version of a record you searched
> > for using Entrez. You can then save this as text, and it'll be
> > consistent between browsers.
> >
>
   This should be fine for the first go.  For some later go, I think we
should strip the xml/html.  If there are multiple ways of manually
converting to text, you can just about guarantee all of them will be used
sooner or later.  As much as possible, manual editing should be replaced
with automated enhancements.

  There are some difficulties with conversion to text, because html/xml
isn't tied to the newline mechanism.  It can position lines any way it likes
with any kind of fonts.  GenBank may be one step away from a flat file, but
that's not true of all databases.  Rebase and Gobase are examples.

                   Cayte


From dalke at acm.org  Fri Dec  8 04:04:36 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] GenBank parser -- first go
Message-ID: <002501c060f5$e4b205a0$9cac323f@josiah>

Jeff:
>> I don't believe there's any general
>> data structure in existence that can handle the genbank location
>> field.  It's described by a BNF grammar and requires a tree!

Me:
>Speaking as a parsing problem, this cannot be done with a regular
>expression.  When something like that occurs, it should be fine
>to leave it as an opaque block of text, which is parsed elsewhere.
>
>John Aycock wrote a really nice context-free parser in pure
>Python called SPARK.  http://www.csr.uvic.ca/~aycock/python/
>Easier to use.  (Which means it is *much* easier to use than
>lex/yacc.)

And here's a first run at a SPARK-based parser for the location
part of the feature table.  BTW, the documentation at
http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html
contains several errors, as far as I can tell:

  ***

If a location is between 102 and 110 inclusive, do you use
"(102.110)"  as the example has, or "102.110" as given in the
BNF?

base_position ::= <integer> | <low_base_bound> | <high_base_bound> |
 <two_base_bound> 

two_base_bound ::= <base_position>.<base_position>

  ***

Example 5.4 Plasmid has
CDS             join(complement(567..795)complement(21..349))

which ignores the comma

CDS             join(complement(567..795),complement(21..349))
                                        ^^^

  ***

There is an example showing "J00194:(100..202)" which also
does not agree with the BNF description.  From looking at
some real data, it seems the documentation should say
"J00194:100..202".

The BNF says
  symbol  ::= <letter> | <symbol><symbol_character> | 
              <symbol_character><symbol>

where
  symbol_character ::= <up_case_letter> | <low_case_letter> |
              <digit> | _ | - | ' | *

  letter ::= <up_case_letter> | <low_case_letter> 


This means 'AA' can be parsed as

     <symbol><symbol_character>
       |            |
    <letter>    <up_case_letter>
       |            |
      "A"          "A"
or
     <symbol_character><symbol>
       |                 |
 <up_case_letter>     <letter>
       |                 |
      "A"               "A"

so it's an ambiguous definition.
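
The ambiguity can also be checked mechanically by counting parse trees.
The sketch below is not taken from the attached parser; it simply
transcribes the three <symbol> productions quoted above into a
memoized counter, and count_symbol_parses is a name made up for the
example.

```python
from functools import lru_cache
import string

LETTERS = set(string.ascii_letters)
SYMBOL_CHARS = set(string.ascii_letters + string.digits + "_-'*")

def count_symbol_parses(text):
    """Count the distinct parse trees deriving `text` from <symbol>."""
    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of ways text[i:j] can be derived from <symbol>.
        n = 0
        if j - i == 1 and text[i] in LETTERS:
            n += 1                      # symbol ::= <letter>
        if j - i >= 2:
            if text[j - 1] in SYMBOL_CHARS:
                n += count(i, j - 1)    # symbol ::= <symbol><symbol_character>
            if text[i] in SYMBOL_CHARS:
                n += count(i + 1, j)    # symbol ::= <symbol_character><symbol>
        return n
    return count(0, len(text))

print(count_symbol_parses("A"))    # 1
print(count_symbol_parses("AA"))   # 2
print(count_symbol_parses("AAA"))  # 4
```

Any count above 1 means the grammar is ambiguous for that string, and
the number of trees doubles with each additional letter.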

  ***

Additionally,  symbol_character needs to allow '.' to agree
with real-life data (see the regression tests for the text).
Instead, I just redefined
  symbol  ::= Re("[A-Za-z0-9_'*-][A-Za-z0-9_'*.-]*")
(note the "." in the second []).


Anyway, the grammar is attached for anyone wishing to take it
further.  Enjoy!

                    Andrew


-------------- next part --------------
# First pass at a parser for the location fields of a feature table.
# Everything likely to change.

# Based on the DDBJ/EMBL/GenBank Feature Table Definition Version 2.2
# Dec 15 1999 available from EBI, but the documentation is not
# completely internally consistent much less agree with real-life
# examples.  Conflicts resolved to agree with real examples.

# Uses John Aycock's SPARK for parsing
from spark import GenericScanner, GenericParser

# a list of strings to test
test_data = (
    "467",
    "23..400",
    "join(544..589,688..1032)",
    "1..1000",
    "<345..500",
    "<1..888",
    "(102.110)",
    "(23.45)..600",
    "(122.133)..(204.221)",
    "123^124",
    "145^177",
    "join(12..78,134..202)",
    "complement(join(2691..4571,4918..5163))",
    "join(complement(4918..5163),complement(2691..4571))",
    "complement(34..(122.126))",
    # The doc example allows "J00194:(100..202)" but not the BNF
    "J00194:100..202",
    "1..1509",
    "<1..9",
    "join(10..567,789..1320)",
    "join(54..567,789..1254)",
    "10..567",
    "join(complement(<1..799),complement(5080..5120))",
    "complement(1697..2512)",
    "complement(4170..4829)",
    # added a comma from the documentation
    "join(complement(567..795),complement(21..349))",
    "join(2004..2195,3..20)",
    "<1..>336",
    "394..>402",

    # a few examples from hum1
    "join(AB001090.1:1669..1713)",
    "join(AB001090.1:1669..1713,AB001091.1:85..196)",
    "join(AB001090.1:1669..1713,AB001091.1:85..196,AB001092.1:40..248,AB001093.1:96..212,AB001094.1:71..223,AB001095.1:87..231,AB001096.1:33..211,AB001097.1:35..175,AB001098.1:213..395,AB001099.1:56..309,AB001100.1:54..196,AB001101.1:171..404,AB001102.1:160..378,210..217)",
    "join(9106..9239,9843..9993,11889..11960,16575..16650)",
    "join(<1..109,620..>674)",
    "join(AB003599.1:<61..315,AB003599.1:587..874,47..325,425..>556)",
    "join(<85..194,296..458,547..>653)",
    )

class Token:
    def __init__(self, type):
        self.type = type
    def __cmp__(self, other):
        return cmp(self.type, other)
    def __repr__(self):
        return "Tokens(%r)" % (self.type,)


# "38"
class Integer:
    type = "integer"
    def __init__(self, val):
        self.val = val
    def __cmp__(self, other):
        return cmp(self.type, other)
    def __str__(self):
        return str(self.val)
    def __repr__(self):
        return "Integer(%s)" % self.val

# From the BNF definition, this isn't needed.  Does that mean
# that bases can be referred to with negative numbers?
class UnsignedInteger(Integer):
    type = "unsigned_integer"
    def __repr__(self):
        return "UnsignedInteger(%s)" % self.val

class Symbol:
    type = "symbol"
    def __init__(self, name):
        self.name = name
    def __cmp__(self, other):
        return cmp(self.type, other)
    def __str__(self):
        return str(self.name)
    def __repr__(self):
        return "Symbol(%s)" % repr(self.name)


# ">38"  -- The BNF says ">" is for the lower bound.. seems wrong to me
class LowBound:
    def __init__(self, base):
        self.base = base
    def __repr__(self):
        return "LowBound(%r)" % self.base

# "<38"
class HighBound:
    def __init__(self, base):
        self.base = base
    def __repr__(self):
        return "HighBound(%r)" % self.base

# 12.34
class TwoBound:
    def __init__(self, low, high):
        self.low = low
        self.high = high
    def __repr__(self):
        return "TwoBound(%r, %r)" % (self.low, self.high)

# 12^34
class Between:
    def __init__(self, low, high):
        self.low = low
        self.high = high
    def __repr__(self):
        return "Between(%r, %r)" % (self.low, self.high)

# 12..34
class Range:
    def __init__(self, low, high):
        self.low = low
        self.high = high
    def __repr__(self):
        return "Range(%r, %r)" % (self.low, self.high)

class Function:
    def __init__(self, name, args):
        self.name = name
        self.args = args
    def __repr__(self):
        return "Function(%r, %r)" % (self.name, self.args)

class AbsoluteLocation:
    def __init__(self, path, local_location):
        self.path = path
        self.local_location = local_location
    def __repr__(self):
        return "AbsoluteLocation(%r, %r)" % (self.path, self.local_location)

class Path:
    def __init__(self, database, accession):
        self.database = database
        self.accession = accession
    def __repr__(self):
        return "Path(%r, %r)" % (self.database, self.accession)

class FeatureName:
    def __init__(self, path, label):
        self.path = path
        self.label = label
    def __repr__(self):
        return "FeatureName(%r, %r)" % (self.path, self.label)
    

class LocationScanner(GenericScanner):
    def __init__(self):
        GenericScanner.__init__(self)

    def tokenize(self, input):
        self.rv = []
        GenericScanner.tokenize(self, input)
        return self.rv

    def t_double_colon(self, input):
        r" :: "
        self.rv.append(Token("double_colon"))
    def t_double_dot(self, input):
        r" \.\. "
        self.rv.append(Token("double_dot"))
    def t_dot(self, input):
        r" \.(?!\.) "
        self.rv.append(Token("dot"))
    def t_caret(self, input):
        r" \^ "
        self.rv.append(Token("caret"))
    def t_comma(self, input):
        r" \, "
        self.rv.append(Token("comma"))
    def t_integer(self, input):
        r" -?[0-9]+ "
        self.rv.append(Integer(int(input)))
    def t_unsigned_integer(self, input):
        r" [0-9]+ "
        self.rv.append(UnsignedInteger(int(input)))
    def t_colon(self, input):
        r" :(?!:) "
        self.rv.append(Token("colon"))
    def t_open_paren(self, input):
        r" \( "
        self.rv.append(Token("open_paren"))
    def t_close_paren(self, input):
        r" \) "
        self.rv.append(Token("close_paren"))
    def t_symbol(self, input):
        r" [A-Za-z0-9_'*-][A-Za-z0-9_'*.-]* "
        # Needed an extra '.'
        self.rv.append(Symbol(input))
    def t_less_than(self, input):
        r" < "
        self.rv.append(Token("less_than"))
    def t_greater_than(self, input):
        r" > "
        self.rv.append(Token("greater_than"))

# punctuation .. hmm, isn't needed for location
#        r''' [ !#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~] '''


class LocationParser(GenericParser):
    def __init__(self, start='location'):
        GenericParser.__init__(self, start)
        self.begin_pos = 0

    def p_location(self, args):
        """
        location ::= absolute_location
        location ::= feature_name
        location ::= function
        """
        return args[0]
    
    def p_function(self, args):
        """
        function ::= functional_operator open_paren location_list close_paren
        """
        return Function(args[0].name, args[2])
    
    def p_absolute_location(self, args):
        """
        absolute_location ::= local_location
        absolute_location ::= path colon local_location
        """
        if len(args) == 1:
            return AbsoluteLocation(None, args[-1])
        return AbsoluteLocation(args[0], args[-1])
    
    def p_path(self, args):
        """
        path ::= database double_colon primary_accession
        path ::= primary_accession
        """
        if len(args) == 3:
            return Path(args[0], args[2])
        return Path(None, args[0])
    
    def p_feature_name(self, args):
        """
        feature_name ::= path colon feature_label
        feature_name ::= feature_label
        """
        if len(args) == 3:
            return FeatureName(args[0], args[2])
        return FeatureName(None, args[0])

    def p_feature_label(self, args):
        """
        feature_label ::= symbol
        """
        return args[0].name

    def p_local_location(self, args):
        """
        local_location ::= base_position
        local_location ::= between_position
        local_location ::= base_range
        """
        return args[0]
    def p_location_list(self, args):
        """
        location_list ::= location
        location_list ::= location_list comma location
        """
        if len(args) == 1:
            return args
        return args[0] + [args[2]]

    def p_functional_operator(self, args):
        """
        functional_operator ::= symbol
        """
        return args[0]

    def p_base_position(self, args):
        """
        base_position ::= integer
        base_position ::= low_base_bound
        base_position ::= high_base_bound
        base_position ::= two_base_bound
        """
        return args[0]

    def p_low_base_bound(self, args):
        """
        low_base_bound ::= greater_than integer
        """
        return LowBound(args[1])

    def p_high_base_bound(self, args):
        """
        high_base_bound ::= less_than integer
        """
        return HighBound(args[1])

    def p_two_base_bound(self, args):
        """
        two_base_bound ::= open_paren base_position dot base_position close_paren
        """
        # main example doesn't have parens but others do.. (?)
        return TwoBound(args[1], args[3])
    
    def p_between_position(self, args):
        """
        between_position ::= base_position caret base_position
        """
        return Between(args[0], args[2])

    def p_base_range(self, args):
        """
        base_range ::= base_position double_dot base_position
        """
        return Range(args[0], args[2])
        
    def p_database(self, args):
        """
        database ::= symbol
        """
        return args[0].name

    def p_primary_accession(self, args):
        """
        primary_accession ::= symbol
        """
        return args[0].name

def scan(input):
    scanner = LocationScanner()
    return scanner.tokenize(input)

def parse(tokens):
    #print "I have", tokens
    parser = LocationParser()
    return parser.parse(tokens)

if __name__ == "__main__":
    for s in test_data:
        print "--> Trying", s
        print repr(parse(scan(s)))

From dag at sonsorol.org  Fri Dec  8 14:18:44 2000
From: dag at sonsorol.org (Chris Dagdigian)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] Changes to the wiki-enabled portions of our website(s)
Message-ID: <5.0.2.1.0.20001208135218.00a99ec0@fedayi.sonsorol.org>

Hi folks,

Over the past few months there have been several incidents where people 
have abused the collaborative editing features contained within the 
wiki-enabled portions of the Open Bio websites (bioperl.org, biojava.org, 
biocorba.org, bioxml.org and biopython.org).

The most recent incident happened within the last 24 hours when someone deleted 
and/or attempted to change the bioperl wiki docs that outlined our release 
07 roadmap and module checklist.

Although we have enough logs & audit data to start tracking these people 
down we haven't bothered - simple web vandals are not worth our time. The 
CVS integration within Wiki makes it easy to roll back the malicious 
deletions & changes whenever we detect them. Special thanks are  owed to 
Jason Stajich who wrote some behind-the-scenes scripts that automate the 
rebuild/recover process.

The problem has now become one of  administrative time and effort -- we 
have better things to do than monitor our wiki constantly. At the same time 
the obvious benefits of  having anyone within our projects be able to 
create and update web content make it essential to keep the system around.

Hence a compromise (and a bit of a social experiment):

We are making the assumption that the web vandals are just random surfers 
who chanced on our site and could not resist the temptation of web links 
that say "edit this page" and "delete this page".  We are hoping that they 
are not also subscribers who are reading our mailing lists :)

So-- I have now password protected the "edit" and "delete" portions of all 
the various Open Bio project wiki sites.

The 'experiment' is that  this email is going to disclose the username and 
password so that all of you can continue to help improve and update our web 
content. We are hoping that this semi-public password will be enough to 
keep our site safe from the casual sort of mischief.

Wiki edit/delete access info:
======================
username: wiki
password: wicked


Our backup plan if this experiment fails is to change the password and 
reveal it only to people who ask for it. I'm hoping that we will not have 
to take this step as it will have the effect of slowing down our content 
creation and updating progress.

Regards,
Chris (and all the Open Bio admin folks)


Chris Dagdigian -- Blackstone Technology Group
(Work  ) dagdigian@computefarm.com  (Home) dag@sonsorol.org
(Web   ) http://ComputeFarm.com, http://open-bio.org, http://sonsorol.org
(More  ) Full contact info and schedule -- http://sonsorol.org/dag/contact.html


From dalke at acm.org  Fri Dec  8 18:17:21 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] Fw: VMD Python binaries available for testing
Message-ID: <010d01c0616d$8d4961a0$9cac323f@josiah>

This is a molecular visualization program I used to work on
several years ago.  It was all Tcl based, but now they are
adding python support.  Anyone interested in checking it out?

                    Andrew

-----Original Message-----
From: Justin Gullingsrud <justin@ks.uiuc.edu>
To: vmd-l@ks.uiuc.edu <vmd-l@ks.uiuc.edu>
Date: Friday, December 08, 2000 1:10 PM
Subject: VMD Python binaries available for testing


>A new feature in VMD 1.6 will be the addition of an embedded Python
>interpreter in VMD, with the ability to run scripts, import existing
>modules, and control VMD. The Python interpreter co-exists with the
>Tcl interpreter which is also part of VMD; you can use either
>interpreter, or both, and switch between them.
>
>Features
>
>Nearly all the VMD Tcl functions will have functional Python analogues
>when VMD 1.6 is released. Support for the Tkinter GUI module will
>also be provided. Complete documentation for the available Python
>commands can be found in the User's Guide. VMD 1.6 will use Python
>2.0. All of the Python modules for VMD will work without installing
>Python on your system; of course, if you do have the Python libraries,
>you can tell VMD where to find them and incorporate them into your VMD
>scripts. Again, see the documentation for more information.
>
>Installation
>
>Binaries for IRIX6, Linux-Mesa, and Linux-DRI are now available from the
>TB ftp site, ftp.ks.uiuc.edu, in the directory pub/mdscope/vmd/python/.
>The binaries are a beta version of VMD, not a final release.  Installation
>proceeds in exactly the same way as previous versions.  You may need to set
>an environment variable to direct the VMD Python interpreter to the location
>of your Python libraries; e.g.
> setenv PYTHONPATH /usr/local/lib/python2.0
> - or -
> setenv PYTHONPATH /home/justin/vmd/Python-2.0/lib_LINUX/lib/python2.0
>
>or
> setenv PYTHONHOME /usr/local
> - or -
> setenv PYTHONHOME /home/justin/vmd/Python-2.0/lib_LINUX
>
>For PYTHONPATH, use the location of the actual python libraries.  For
>PYTHONHOME, use the directory in which python was installed (the prefix
>directory in the configure script).
>
>
>Try it out and let us know how it goes!
>
>Justin
>
>--
>
>Justin Gullingsrud      3111 Beckman Institute
>H: (217) 384-4220       I got a million ideas that I ain't even rocked yet...
>W: (217) 244-8946       -- Mike D


From dalke at acm.org  Sat Dec  9 01:55:08 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] PIR parsing
Message-ID: <003701c061ad$2194a4c0$1fac323f@josiah>

Me:
>I've written a much more complete PIR CODATA parser which works with
>the latest PIR release (Release 66.00, September 30, 2000).  I tested
>it against pir1.dat and pir3.dat.

I'm testing it against pir2.dat, which is 394,221,543 bytes
uncompressed and 174,756 records.  I'm doing the run on
the bioperl.org machine since it has more disk space available
than my laptop.  The parser parses about 3 or 4 records per
second (sshd takes 1/2 the CPU!).
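
For a sense of scale, the numbers above work out as follows.  This is
plain arithmetic on the figures quoted in this message; the function
name is made up, and the rate is assumed to stay constant over the
whole file.

```python
TOTAL_BYTES = 394221543     # size of pir2.dat, uncompressed
TOTAL_RECORDS = 174756      # number of records in pir2.dat

def hours_to_parse(records, records_per_second):
    """Wall-clock hours needed at a constant parsing rate."""
    return records / records_per_second / 3600.0

print("average record size: %.0f bytes" % (TOTAL_BYTES / TOTAL_RECORDS))
for rate in (3, 4):
    print("%d records/s: %.1f hours" % (rate, hours_to_parse(TOTAL_RECORDS, rate)))
```

At 3 to 4 records per second the full run takes somewhere between
roughly 12 and 16 hours.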

I've processed 15% of the records and found only two problems
in my parser.  Both are my fault because I made too strong an
assumption about the format.

BTW, the format definition at 
   http://pir.georgetown.edu/pirwww/otherinfo/doc/co2.pdf
is wrong in many of the details - probably because it is
6 years old.

                    Andrew


From dalke at acm.org  Sat Dec  9 02:29:11 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] PIR parsing
Message-ID: <004b01c061b1$bac740e0$1fac323f@josiah>

Forgot to ask,

  What is the point of having both the "ref" and "dat" formats
in PIR?

ref format example:

>P1;I52708
ELAV-like neuronal protein 1, truncated splice form - human
N;Alternate names: Drosophila ELAV(embryonic lethal, abnormal vision)-like 4; Hu antigen D; paraneoplastic encephalomyelitis antigen
C;Species: Homo sapiens (man)

dat format example:

ENTRY           I52708  #type complete
TITLE           ELAV-like neuronal protein 1, truncated splice form - human
ALTERNATE_NAMES Drosophila ELAV(embryonic lethal, abnormal vision)-like 4;
                Hu antigen D; paraneoplastic encephalomyelitis antigen
ORGANISM        #formal_name Homo sapiens #common_name man


As far as I can tell, the ref format is easier to machine parse
than the dat one, and is more compact.  The dat format is easier
for a human to scan.  Also, the dat format contains the sequence
information while the ref one does not.
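
To illustrate the "easier to machine parse" point, here is a minimal
sketch that splits a ref-style entry into fields.  It is based only on
the four-line example above and is not taken from any PIR
specification: the assumed layout is a ">type;identifier" header, a
title line, then "code;key: value" lines, each on a single line.  The
function name and the dictionary keys are invented for the example.

```python
def parse_ref_entry(lines):
    """Split one ref-style entry (a list of lines) into its parts."""
    header, title = lines[0], lines[1]
    # ">P1;I52708" -> ("P1", "I52708")
    entry_type, entry_id = header[1:].split(";", 1)
    fields = []
    for line in lines[2:]:
        code, rest = line.split(";", 1)       # "N" / "Alternate names: ..."
        key, value = rest.split(":", 1)
        fields.append((code, key.strip(), value.strip()))
    return {"type": entry_type, "id": entry_id,
            "title": title, "fields": fields}

entry = parse_ref_entry([
    ">P1;I52708",
    "ELAV-like neuronal protein 1, truncated splice form - human",
    "N;Alternate names: Drosophila ELAV(embryonic lethal, abnormal "
    "vision)-like 4; Hu antigen D; paraneoplastic encephalomyelitis antigen",
    "C;Species: Homo sapiens (man)",
])
print(entry["id"], entry["fields"][1])
```

Two `split` calls per line are enough here, whereas the dat layout
needs column positions and continuation-line handling.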

Can anyone here give me some background?

                    Andrew


From jchang at SMI.Stanford.EDU  Sat Dec  9 23:15:02 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] checked in code for GenBank access
Message-ID: <Pine.GSO.4.21.0012092013450.17525-100000@taiyang>

Hello everybody,

Since we've got a GenBank parser in the works (thanks Brad!), I've checked
in some code to search and retrieve records from GenBank.  It's in
Bio/GenBank.  We'll also put the parser in there soon.

Jeff


From katel at worldpath.net  Sun Dec 10 23:25:42 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] multiline location
References: <002501c060f5$e4b205a0$9cac323f@josiah>
Message-ID: <003501c0632a$6dd00720$010a0a0a@cadence.com>

  When I fed this multiline to parse_location.py

    """complement(join(8811..8995,9120..10082,10181..10291,
    10608..10852,10996..11147,11461..11559))
    """
 It reported this error message
Syntax error at or near `Tokens('comma')' token

 The Trying -> print s line displayed
-> Trying complement(join(8811..8995,9120..10082,10181..10291,
    10608..10852,10996..11147,11461..11559))
    <345..500

  It was reading past the closing triple quote.
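
One plausible cause, offered as an assumption rather than a diagnosis:
the scanner in the posted script has no rule for whitespace, so the
line break and the indentation after "10291," are not something it
knows how to tokenize.  A possible workaround is to strip all
whitespace before scanning.  The helper name below is made up, and it
assumes whitespace is never significant inside a location string.

```python
def normalize_location(text):
    """Remove all whitespace, including line breaks, from a location string."""
    return "".join(text.split())

multiline = """complement(join(8811..8995,9120..10082,10181..10291,
    10608..10852,10996..11147,11461..11559))
    """

print(normalize_location(multiline))
```

The result is a single-line location that can be handed to scan() in
the same way as the entries in test_data.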

                          Cayte


From katel at worldpath.net  Sun Dec 10 23:30:52 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] multiline location
References: <002501c060f5$e4b205a0$9cac323f@josiah> <003501c0632a$6dd00720$010a0a0a@cadence.com>
Message-ID: <003f01c0632b$25ff3640$010a0a0a@cadence.com>

----- Original Message -----
From: "Cayte" <katel@worldpath.net>
To: "Andrew Dalke" <dalke@acm.org>; <biopython-dev@biopython.org>
Sent: Sunday, December 10, 2000 8:25 PM
Subject: [Biopython-dev] multiline location


>   When I fed this multiline to parse_location.py
>
>     """complement(join(8811..8995,9120..10082,10181..10291,
>     10608..10852,10996..11147,11461..11559))
>     """
>  It reported this error message
> Syntax error at or near `Tokens('comma')' token
>
>  The Trying -> print s line displayed
> -> Trying complement(join(8811..8995,9120..10082,10181..10291,
>     10608..10852,10996..11147,11461..11559))
>     <345..500
>
>   It was reading past the closing triple quote.
>
>                           Cayte
>
   The test also failed when I used the backslash-linefeed-backslash format
instead of the triple quote.

          cayte


From edwin.steele at eBioinformatics.com  Mon Dec 11 00:30:52 2000
From: edwin.steele at eBioinformatics.com (Edwin Steele)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] PIR parsing
In-Reply-To: <004b01c061b1$bac740e0$1fac323f@josiah>
Message-ID: <002801c06333$86adc170$bd2aa8c0@au.int.enbio.com>

Andrew,

>   What is the point of having both the "ref" and "dat" format
> in PIR.
[snip]
> As far as I can tell, the ref format is easier to machine parse
> than the dat one, and is more compact.  The dat format is easier
> for a human to scan.  Also, the dat format contains the sequence
> information while the ref one does not.
>
> Can anyone here provide to me some background?

seq is usually derived from dat so that blast databases (or anything
else that requires fasta formatted sequences) can be made. I understand
that ref is a trimmed down dat without sequence data so you can save some
space by not keeping the partially redundant dat. I don't know for sure,
but the more compact format might be another measure along those lines.

Perhaps, though, they're competing with the OWL database for the
 most obfuscated database format ;)

Cheers,
Edwin.
-------------------------------------------------------------------------------
Edwin Steele
QA Manager, eBioinformatics.             http://www.ebioinformatics.com
email: edwin.steele@eBioinformatics.com  Bay 16/104, Australian Technology Park
ph: +61 (2) 9209-4765                    Eveleigh 1430, NSW, Australia.


From edwin.steele at eBioinformatics.com  Mon Dec 11 01:09:09 2000
From: edwin.steele at eBioinformatics.com (Edwin Steele)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] GenBank parser -- first go
In-Reply-To: <001401c05f5c$056e0460$95ac323f@josiah>
Message-ID: <002b01c06338$e049df70$bd2aa8c0@au.int.enbio.com>

Brad,

> Here's a justification for this.  It's already common practice with
> GenBank files to have subitems indented under the major item.  For
> example,
>
> SOURCE      thale cress.
>   ORGANISM  Arabidopsis thaliana

There are a few caveats that come up with indenting that I've come
across. Save the feature table, there used to be only one level of
subitem. The new PUBMED tag breaks this paradigm:

REFERENCE   1  (bases 1 to 675)
  AUTHORS   Sant,V.J., Sainani,M.N., Sami-Subbu,R., Ranjekar,P.K. and
            Gupta,V.S.
  TITLE     Ty1-copia retrotransposon-like elements in chickpea genome: their
            identification, distribution and use for diversity analysis
  JOURNAL   Gene 257 (1), 157-166 (2000)
   PUBMED   11054578

It's indented three spaces instead of two...

Brad, this will mean your indent_space definition will break (or pick
up unnecessary stuff).
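
One way to stay independent of the exact indent, sketched with the
standard `re` module rather than Martel: accept any number of leading
spaces before an upper-case keyword, and use the column where values
start to tell keyword lines from continuation lines.  The value column
of 12 is read off the example above and is an assumption, as are the
names used here.

```python
import re

KEYWORD_LINE = re.compile(r"^( *)([A-Z]+) +(.*)$")
VALUE_COLUMN = 12   # assumed: values begin after 12 characters

def split_keyword_line(line):
    """Return (indent, keyword, value), or None for a continuation line."""
    m = KEYWORD_LINE.match(line)
    if m is None or len(m.group(1)) >= VALUE_COLUMN:
        return None
    return len(m.group(1)), m.group(2), m.group(3)

for line in ("REFERENCE   1  (bases 1 to 675)",
             "  JOURNAL   Gene 257 (1), 157-166 (2000)",
             "   PUBMED   11054578",
             "            Gupta,V.S."):
    print(split_keyword_line(line))
```

The indent comes back as data (0, 2 or 3 here), so the caller can
decide how to nest PUBMED instead of the pattern failing on it.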

Also, it's not fair to assume that the initial indenting is two spaces.
In some of the larger entries like LMFLCHR12 that is about 2000000 bp
long, the seven figures in the origin section cause there to be a one
character indent instead of the normal two character minimum.

ORIGIN
       1  TCAGTTTGTG CGGGGTGTGC ATATGCATGT GCATGCATAC ATGCACATAC ACATATATAC
...
 2287441  GCGTCACGTG GCGACGTCGA GGCCCGCAGC TTCTATTTTT TTT
//

However, I don't think this will break anything in the parser, but is
something to be remembered if you become more strict...
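
A minimal sketch of patterns that tolerate the indentation cases
above, written with Python's standard re module rather than Martel.
The names SUBITEM and ORIGIN_LINE and the one-to-three-space range
are assumptions for illustration; this is not Brad's indent_space
definition.

```python
import re

# Assumed for illustration: subitem keywords are upper case and indented
# by one to three spaces (two for ORGANISM/AUTHORS, three for PUBMED).
SUBITEM = re.compile(r"^ {1,3}([A-Z]+) +(.*)$")

# Sequence lines: any amount of leading whitespace, a base count, then bases.
ORIGIN_LINE = re.compile(r"^\s*(\d+)\s+([A-Za-z ]+)$")


def parse_subitem(line):
    """Return (keyword, value) for a subitem line, or None."""
    m = SUBITEM.match(line)
    return (m.group(1), m.group(2)) if m else None


def parse_origin_line(line):
    """Return (start position, bases with the spaces removed), or None."""
    m = ORIGIN_LINE.match(line)
    if m is None:
        return None
    return int(m.group(1)), m.group(2).replace(" ", "")
```

Continuation lines (twelve spaces of indent) and top-level keywords
(no indent) both return None from parse_subitem, so they would need
their own handling.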

Cheers,
Edwin.
-------------------------------------------------------------------------------
Edwin Steele
QA Manager, eBioinformatics.             http://www.ebioinformatics.com
email: edwin.steele@eBioinformatics.com  Bay 16/104, Australian Technology Park
ph: +61 (2) 9209-4765                    Eveleigh 1430, NSW, Australia.


From dalke at acm.org  Mon Dec 11 15:55:12 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] GenBank parser -- first go
Message-ID: <003f01c063b4$abbc95a0$c2ab323f@josiah>

I was playing around with a different way to handle
the FEATURES section and came across this example
in IRO125195:

FEATURES             Location/Qualifiers
     source          1..1326
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="21"
                     /clone="IMAGE cDNA clone 125195"
                     /clone_lib="Soares fetal liver spleen 1NFLS"
                     /note="contains Alu repeat; likely to be be derived from
                     unprocessed nuclear RNA or genomic DNA; encodes putative
                     exons identical to FTCD; formimino transferase
                     cyclodeaminase; formimino transferase (EC 2.1.2.5)
                     /formimino tetrahydro folate cyclodeaminase (EC 4.3.1.4)"


See the "/formimino"?  I had thought that any line starting
with a '/' was a new qualifier, but it looks like you really do
have to parse the quotes as you go to tell when you are done.
While the quoted quote checking (double the "s) is doable with
a regular expression, it gets pretty complicated.
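
A rough sketch of the quote-aware approach, using Python's standard
re module rather than Martel.  The pattern and the function name are
assumptions for illustration.  It reads a quoted value up to its
closing quote, treating "" as an escaped quote, so a '/' at the start
of a continuation line stays inside the value.

```python
import re

# /name="value" where a literal quote inside the value is written as "".
# [^"] also matches newlines, so a quoted value may span several lines.
QUALIFIER = re.compile(r'/(\w+)=("(?:[^"]|"")*"|[^\s"]+)')


def qualifiers(feature_text):
    """Return (name, value) pairs, with doubled quotes collapsed."""
    found = []
    for m in QUALIFIER.finditer(feature_text):
        name, value = m.group(1), m.group(2)
        if value.startswith('"'):
            # Drop the enclosing quotes and undo the "" escaping.
            value = value[1:-1].replace('""', '"')
        found.append((name, value))
    return found
```

The returned values keep their embedded newlines and indentation;
unfolding them is left out of the sketch.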

                    Andrew


From dalke at acm.org  Mon Dec 11 17:03:38 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] Martel performance
Message-ID: <004001c063be$3808a2c0$c2ab323f@josiah>

I'm finding my PIR and the GenBank parser (the last rather modified
from Brad's because I was trying to be more strict on whitespace)
to be pretty slow.  The PIR parser only parses 43K of text per
second while the GenBank one is but 6.6K/second.  Compare that
to the SwissProt parser where I was parsing the whole file
in 20 minutes, which is about 200K per second.

These tests were done on different machines, but there's only
about a factor of 2 performance difference between them.
(Comparison done by running my genbank regression test on
my Intel laptop and on the bioperl.org Alpha machine, which
is where the PIR and GenBank tests are run.  My laptop is
faster although sshd on bioperl takes 50% of the CPU.)

I can only think of a few reasons which might cause this:

  1) Martel is intrinsically slow - but see sprot as a counterexample

  2) These two files use indented whitespace for continuations and
to indicate subitems. Almost every time you get to the end of a
line it needs to test if the next line is a continuation.  In most
cases it isn't, so about 1/4 of the file is read twice.  But that's
not a factor of 20.

  3) Brad has a list of possible feature key names and a list of
qualifiers.   Odds are you have to scan 1/2 the list before
finding a matching name.  This again causes some duplicate
checks, but only in the features section and I just can't see
another factor of two out of that.

  4) The regexp to allow folding with the whitespace indentation
is something like:
  indicator + \
   Group("tag", text) + \
   Rep(space_indent + Group("tag", text))

This can make for some very large regular expressions.  GenBank,
when expressed as a string, is about 6K long.  The size of the
generated tag table is hard to guess, but it's roughly 100K, while
PIR's is about 600K.  These are state transition tables so perhaps
I'm losing cache coherency because most of my jumps are too
large.  I don't know what effect sshd has on the overall
bioperl.org performance.  It only has 72K of RSS so I can't
see how there's a bad context swap hit.

  I can't find any equivalent on Linux to IRIX's 'osview'
or 'gr_osview', which is what I usually used to look at this
sort of overhead.  Any pointers?

  5) I'm using the same RecordReader for SWISS-PROT and
GenBank (EndsWith) so that shouldn't be a problem.  However,
in the first I think I was using the reader directly while
with GenBank I'm going through the HeaderFooter parser.
There might be some difference there, but I can't think of
what that might be.

  6) Memory use

I'm using gbpri8 as my test case.  The first entry, HUAF001549,
is about 260K long with 202K bases.  This causes my format
definition to take up 50MB (!) of memory according to top,
so a 20-fold expansion.  My test with SWISS-PROT and MDL's .mol
files only needed a factor of about 6 as I recall.  I don't
know why so much memory is needed for GenBank and I didn't
look at PIR's use to compare.
  As an aside, Edwin Steele points out that LMFLCHR12 has
2Mbases so is about an order of magnitude larger.  Well, RAM
is cheap.

Without Martel running, bioperl.org's 'free' says:
          total       used     free   shared  buffers   cached
Mem:     126568     121048     5520    58568     4544    77624
-/+ buffers/cache:   38880    87688
Swap:    208760      23760   185000

When I run the test, it says:
           total       used     free   shared  buffers   cached
Mem:      126568     123056     3512    57656     3688    29760
-/+ buffers/cache:    89608    36960
Swap:     208760      23704   185056

Compare that to top's
 PID USER  PRI  NI  SIZE  RSS SHARE STAT  LIB %CPU %MEM  TIME COMMAND
7930 dalke  17   0 53240  51M  1824 R     51M 49.8 21.0  5:43 python

As I read it, all of the memory is being used, but 77MB was
used for cache.  When the python job started, about 47MB of that
was moved out and given to python, so it's all running in main
memory.  Swap use changed by only about 56K, so there
isn't a lot of page swapping going on.

I've ordered a new disk for my laptop and more memory.  That
will give me a chance to test everything on a dedicated machine.
Hopefully the problem is simply context switch overhead with
the sshd2 and http sessions on bioperl.org.

I've put off doing real work for too long so I won't have time
to look at this for a couple of weeks.   If anyone wants to work
out what the problem is using the latest code, it's on
biopython.org in /tmp/dalke/gb/Martel .  It's now in the tedious
part of timing and profiling.  (One approach might be to take
a section of a file, duplicate it a lot of times, and measure
how the times and memory use change as a function of size.)
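
A sketch of that duplicate-and-measure approach in present-day
Python.  The function name and the parse argument are placeholders;
any callable taking the text will do.  One caveat: tracemalloc only
sees memory allocated through Python's allocator, so memory taken
directly by a C extension may not show up in the peak figure.

```python
import time
import tracemalloc


def measure_scaling(parse, record, counts=(1, 2, 4, 8)):
    """Run parse() on `record` repeated n times for each n in counts.

    Returns a list of (characters, seconds, peak_bytes) tuples.
    """
    results = []
    for n in counts:
        text = record * n
        tracemalloc.start()
        start = time.perf_counter()
        parse(text)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        results.append((len(text), elapsed, peak))
    return results
```

If time or memory grows faster than the input size, that points at
something worse than linear in the format definition or the engine.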

Hmm.  There is another difference between the GenBank format
and the others.  I'm using the \R construct for newline detection.
Perhaps there's some unexpected performance hit there, though I
can't see what that would be.

                    Andrew
                    dalke@acm.org


From dalke at acm.org  Mon Dec 11 17:19:04 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] Martel performance
Message-ID: <008c01c063c0$5fff2040$c2ab323f@josiah>

P.S:
  By looking at the output as it parses, it's easy to
tell that some of the records are processed quite quickly
while others take a long time.  That should provide some
hint as to where the performance hit comes in.

                    Andrew


From edwin.steele at eBioinformatics.com  Mon Dec 11 22:43:06 2000
From: edwin.steele at eBioinformatics.com (Edwin Steele)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] GenBank parser -- first go
In-Reply-To: <003f01c063b4$abbc95a0$c2ab323f@josiah>
Message-ID: <002601c063ed$a332d370$bd2aa8c0@au.int.enbio.com>

Andrew,

> See the "/formimino"?  I had thought that any line starting
> with a '/' was a new qualifier, but it looks like you really do
> have to parse the quotes as you go to tell when you are done.
> While the quoted quote checking (double the "s) is doable with
> a regular expression, it gets pretty complicated.

I've found this too.
A good test for a new qualifier is that the line starts with a '/' and either:
- has an even no. of quotes and ends with a '"', or
- has an odd no. of quotes and does not end with a '"', or
- has no quotes at all.
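
The three rules above can be sketched in Python as follows.  The
function name is made up for illustration, and the check looks at
one line at a time with surrounding whitespace stripped.

```python
def looks_like_new_qualifier(line):
    """Apply the starts-with-'/' and quote-count rules to a single line."""
    text = line.strip()
    if not text.startswith("/"):
        return False
    quotes = text.count('"')
    ends_with_quote = text.endswith('"')
    if quotes == 0:
        return True                 # no quotes at all
    if quotes % 2 == 0:
        return ends_with_quote      # even count: must end with a quote
    return not ends_with_quote      # odd count: must not end with a quote
```

On the IRO125195 example the "/formimino ..." line has one quote and
ends with it, so it is rejected as a new qualifier.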

Erk.

Cheers,
Edwin.
-------------------------------------------------------------------------------
Edwin Steele
QA Manager, eBioinformatics.             http://www.ebioinformatics.com
email: edwin.steele@eBioinformatics.com  Bay 16/104, Australian Technology Park
ph: +61 (2) 9209-4765                    Eveleigh 1430, NSW, Australia.


From katel at worldpath.net  Wed Dec 13 03:10:08 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] Unigene parsers
References: <004001c063be$3808a2c0$c2ab323f@josiah>
Message-ID: <005001c064dc$1faba820$010a0a0a@cadence.com>

  To write a UniGene parser,  several issues need to be resolved.  The
UniGene page is structured with major keys and subkeys.  Each major key is
on a line by itself and is in all caps, but several subkeys can be placed on
a single line.  Each subkey is separated from its value by a colon.

  One problem is that the records vary in which keys they contain.  I ran
into this with Gobase.  It required calls to routines with tests like

        start = string.find( text, field )
        if( start == -1 ):
            return ''

  Calls to useless routines could waste a lot of CPU time.

  Would it be cleaner to read the major keys into a temporary dictionary and
then consume the ones that are present and check that all the necessary keys
are present?
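
  A rough sketch of the temporary-dictionary idea in Python.  The
function names are made up, and it assumes a major key is a single
all-caps word on its own line, which real records may not honour.

```python
def split_major_keys(lines):
    """Group lines under the most recent all-caps, single-word key line."""
    sections = {}
    current = None
    for line in lines:
        text = line.strip()
        if text and text.isupper() and " " not in text:
            current = text
            sections[current] = []
        elif current is not None and text:
            sections[current].append(text)
    return sections


def consume(sections, required, optional=()):
    """Pop known keys out of `sections`; raise if a required key is absent."""
    missing = [key for key in required if key not in sections]
    if missing:
        raise ValueError("missing required keys: %s" % ", ".join(missing))
    found = {key: sections.pop(key) for key in required}
    for key in optional:
        if key in sections:
            found[key] = sections.pop(key)
    return found
```

  Each key is looked up once, absent optional keys cost nothing, and
whatever is left in the dictionary afterwards is a key the parser
did not expect.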

  A second problem is that since there can be several subkeys on a line,
with only white space separating the value  from the next key, multiword
keys or values can be ambiguous.  You can make guesses but there's no
guaranteed way to disambiguate the subkey/value pairs.

 A third issue is that the record only displays the first ten sequences of
the cluster.  How do we deal with information that is spread over  several
web pages?

                                                Cayte


From jchang at SMI.Stanford.EDU  Wed Dec 13 18:19:10 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] questions for next release
Message-ID: <Pine.GSO.4.21.0012071558490.21689-100000@riboweb.Stanford.EDU>

Hello everybody,

The plan was to try and get out a relatively quick release with Martel &
mxTextTools bundled in.  There's a few things we need to work out as this
is happening:

- Andrew, is Martel under source code control?  Do you want to develop it
as part of the biopython CVS (and release), or do you want it to be a
dependency that's bundled and installed together?

- Are we going to use/bundle mxTextTools 1.2?

- setup.py now accepts earlier versions (<0.8?) of distutils.  Should we
require the version that comes with Python 2.0?  This would simplify the
script, I think.

- Any objections to moving more code into __init__.py?  For example, the
code in Prosite/Prosite.py would be moved to Prosite/__init__.py.  This
would definitely BREAK CODE, but the fix would be trivial.  If this does
happen, does anyone know how to move code between files without losing the
CVS logs of the changes?

- Should we check in Brad's new GenBank code?

- ... and Brad's SeqFeature classes?

- Andrew, I've submitted a bug report (more of a feature request) in
Jitterbug about making the regression tests indifferent to EOL
conventions.  This would be nice if people are developing and testing on
different platforms, which breaks the tests.  Could you look at it and let
me know what you think?

- Anyone good with Distutils and think they can get Martel and mxTextTools
to install with biopython?  :)

Jeff


From chapmanb at arches.uga.edu  Thu Dec 14 05:24:49 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] questions for next release
In-Reply-To: <Pine.GSO.4.21.0012071558490.21689-100000@riboweb.Stanford.EDU>
References: <Pine.GSO.4.21.0012071558490.21689-100000@riboweb.Stanford.EDU>
Message-ID: <14904.40945.699446.764013@taxus.athen1.ga.home.com>

Hey Jeff;

> The plan was to try and get out a relatively quick release with Martel &
> mxTextTools bundled in.  

Release early, release often -- sounds good!

> - setup.py now accepts earlier versions (<0.8?) of distutils.  Should we
> require the version that comes with Python 2.0?  This would simplify the
> script, I think.

I think we should do this -- we can detect an old version and just
tell people to upgrade. They need to upgrade if they are using such an
old version :-).

> - Any objections to moving more code into __init__.py?  For example, the
> code in Prosite/Prosite.py would be moved to Prosite/__init__.py.  This
> would definitely BREAK CODE, but the fix would be trivial. 

This is okay by me, although I don't really think it's necessary. I
don't find it that annoying to import the double Prosites (or
whatever) but that is just me. 

The only sort of objection is that sometimes people don't look for
actual code in __init__.py files (I know I didn't at first when I was
using python), so it could make it more confusing to browse the code
if you are new to python.

But you know lots more about python coding and style than I, so if you 
prefer it, I'm not going to stop ya :-)

>  If this does
> happen, does anyone know how to move code between files without losing the
> CVS logs of the changes?

I'm not enough of a CVS expert to know this -- maybe Ewan, the master
o' CVS (and everything else :-), would be a good person to ask?

> - Should we check in Brad's new GenBank code?
> 
> - ... and Brad's SeqFeature classes?

I hope to have a new version of these after this weekend, with
suggestions from everyone included. Whether or not to include 'em is
up to everyone else though. I do plan to do a GenBank specific record
class, so if people don't like the SeqFeature classes, we can just
include the GenBank specific stuff.

> - Anyone good with Distutils and think they can get Martel and mxTextTools
> to install with biopython?  :)

If we can get Martel and mxTextTools to install with distutils, then I 
think I could try this. The last version of mxTextTools that I could
find doesn't use distutils, but I might be missing a newer version
that does. I thought I saw MAL asking lots of questions about
distutils on the SIG mailing list...

I guess doing this would be a matter of:

o Doing a test import mxTextTools and Martel.

o If they can't be imported -- fetch them from an ftp site and unpack 
them.

o Run setup.py on these modules and install them, then go back 
to the regular installation.
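
The first step, the test import, might look like this in present-day
Python.  The function name is made up, and the fetch and setup.py
steps are left out because they depend on where the packages live.

```python
import importlib


def missing_modules(names):
    """Return the entries of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing
```

A setup script could call it with the import names of Martel and
mxTextTools (which names those are is an assumption to check against
the installed packages) and only fetch what comes back in the list.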

I think the newly formed catalog-sig
(http://python.org/sigs/catalog-sig/) is interested in getting
something like this going generally, but I'm not sure at all about the 
status of any kind of implementation.

Brad


From jchang at SMI.Stanford.EDU  Thu Dec 14 18:37:53 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] questions for next release
In-Reply-To: <14904.40945.699446.764013@taxus.athen1.ga.home.com>
Message-ID: <Pine.GSO.4.21.0012141533060.15826-100000@riboweb.Stanford.EDU>

> > - setup.py now accepts earlier versions (<0.8?) of distutils.  Should we
> > require the version that comes with Python 2.0?  This would simplify the
> > script, I think.
> 
> I think we should do this -- we can detect an old version and just
> tell people to upgrade. They need to upgrade if they are using such an
> old version :-).

OK.


> > - Any objections to moving more code into __init__.py?  For example, the
> > code in Prosite/Prosite.py would be moved to Prosite/__init__.py.  This
> > would definitely BREAK CODE, but the fix would be trivial. 
> 
> This is okay by me, although I don't really think it's necessary. I
> don't find it that annoying to import the double Prosites (or
> whatever) but that is just me. 

Yeah, it's mostly a cosmetic change.  Plus, it seems to match how things
are done in other packages, e.g. Martel, xml.


> > - Should we check in Brad's new GenBank code?
> > 
> > - ... and Brad's SeqFeature classes?
> 
> I hope to have a new version of these after this weekend, with
> suggestions from everyone included. Whether or not to include 'em is
> up to everyone else though. I do plan to do a GenBank specific record
> class, so if people don't like the SeqFeature classes, we can just
> include the GenBank specific stuff.

Alright.  Let's plan on including everything, unless anyone has strenuous
objections.  Let's see what happens...

Are you also including Andrew's location parser?


> > - Anyone good with Distutils and think they can get Martel and mxTextTools
> > to install with biopython?  :)
> 
> If we can get Martel and mxTextTools to install with distutils, then I 
> think I could try this. The last version of mxTextTools that I could
> find doesn't use distutils, but I might be missing a newer version
> that does. I thought I saw MAL asking lots of questions about
> distutils on the SIG mailing list...

Hmmm, if he's working on distutils-ing the package, then we shouldn't
duplicate that work.

Andrew, do you know anything about this?  Do you mind sending him a quick
email to see whether it's going to happen?


> I guess doing this would be a matter of:
> 
> o Doing a test import mxTextTools and Martel.
> 
> o If they can't be imported -- fetch them from an ftp site and unpack 
> them.

How to handle version differences?

Jeff


From katel at worldpath.net  Fri Dec 15 01:22:33 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] Unigene parsers( cintinued )
References: <004001c063be$3808a2c0$c2ab323f@josiah> <005001c064dc$1faba820$010a0a0a@cadence.com>
Message-ID: <004f01c0665f$69d580e0$010a0a0a@cadence.com>

----- Original Message -----
From: "Cayte" <katel@worldpath.net>
To: <biopython-dev@biopython.org>
Sent: Wednesday, December 13, 2000 12:10 AM
Subject: [Biopython-dev] Unigene parsers


>  A third issue is that the record only displays the first ten sequences of
> the cluster.  How do we deal with information that is spread over  several
> web pages?
>
  I think the www scripts need to search for the correct link and pull in the
information.  The 10 sequence limit only makes sense in the GUI, not the way
the user is likely to use our scripts.  Other parsers may need this
capability too.

                                  Cayte


From dalke at acm.org  Sat Dec 16 21:12:21 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] Re: Martel performance
Message-ID: <018f01c067ce$cc29f6c0$afab323f@josiah>

Short version.  I found the source of the slowdown.  I'll
revert my change which caused the problem, but that reintroduces
a behavioral problem I don't like.  Unfortunately, it's a
behaviour which is pretty inherent in Martel and I don't see
an easy fix, so it will have to stay in until someone better
than I figures out a good solution.

Me:
>I'm finding my PIR and the GenBank parser (the last rather modified
>from Brad's because I was trying to be more strict on whitespace)
>to be pretty slow.  The PIR parser only parses 43K of text per
>second while the GenBank one is but 6.6K/second.  Compare that
>to the SwissProt parser where I was parsing the whole file
>in 20 minutes, which is about 200K per second.

>I can only think of a few reasons which might cause this:

Ha!  There's another which was the real answer.

The MaxRepeat expression (used for "*" repeats) had a
bug where it doesn't fully allow backtracking.  For
example, suppose you have

  ([A-C][X-Z])*[A-F]

and match it against "BZA".  The bug in the code was that
it would match B against "[A-C]" then Z against "[X-Z]".
It would next try the A against "[A-C]" which would match
so try matching [X-Z], which fails since there is no other
character.

The bug was that it wouldn't backtrack against a *partial*
match, so it wouldn't try to see if [A-F] matches.  This
was because I was using 

    if max_count == sre_parse.MAXREPEAT:
        result.append( (None, TT.SubTable, tuple(tagtable),
                                  +1, 0))

which expands the current taglist instead of creating a new
subtaglist.  (Meaning matches were added to the current list
as they were found, rather than building a sublist and
merging the result only if the whole list matches.)
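
For reference, this is how a backtracking engine treats the same
example.  Python's standard re module is used as the yardstick here,
not Martel, and the helper name is made up.

```python
import re

PATTERN = re.compile(r"([A-C][X-Z])*[A-F]")


def full_match(text):
    """True if the whole of `text` matches PATTERN."""
    return PATTERN.fullmatch(text) is not None
```

full_match("BZA") is true: the engine abandons the partial second
repetition that started with "A" and lets [A-F] take the "A".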

The fix I did was to replace the SubTable with a Table, and
use a fake tag name to tell mxTextTools to append the matches
upon success.

    if max_count == sre_parse.MAXREPEAT:
        result.append( (">ignore", TT.Table, tuple(tagtable),
                                  +1, 0))

The consequence is that every repeat creates a new list with
the tag ">ignore" associated with it.  This explains the
memory use and performance.

Eg, consider ".*\n".  This is converted to something like
  (?P<ignore>.)*\n
which means matching a line of text of length N + "\n" creates
N sublists - one for each character.

When I took that fix out of Martel, the performance, which
was about 2 records per second, went up to 54 records per
second.  My test set, gbpri8, is 96MB and can be parsed in
525 seconds, or 187K/second.  This is equivalent to what I
was getting for parsing SWISS-PROT.

That was actually the clue for how I found the problem.  I was
showing off Martel yesterday and noticed the SWISS-PROT parser
was a lot slower than it used to be.  That indicated that the
shift in performance was not anything to do with the machine
or the specific file format but with some change in Martel.
I mulled about it enough that this morning, when I
was trying to sleep in, I ended up instead thinking about what
could be the cause.  Luckily, I remembered what I was thinking
about when I finally did wake up :)

Going back to the topic, it actually points out a problem
in Martel in that it isn't a true regular expression engine.
Once a full match occurs it doesn't consider other alternatives.
Consider something which parses some of the feature keys for
GenBank.  It may have something like

  ... |prim_transcript|primer|primer_bind|protomer| ...

Suppose you have the key "primer_bind".  In that case, "primer"
will match (because that's the start of the word).  So next it
tries to match the spaces after the key and that fails, because
'_bind' isn't a space.  A real regular expression engine would
backtrack, throw the 'primer' match away and try again.
Martel doesn't do that.  Once it does a match for a given
grouping, it stays matched.  Higher level matches may discard
submatches which is why Martel appears to do backtracking.

The workaround for this '|' problem is to put the larger patterns
first, so place "primer_bind" before "primer".  Similarly, there
is a workaround to the "*" problem by putting the subpattern
in an explicit group rather than my solution of always putting
in an implicit '>ignore' group.
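
A toy comparison of the two behaviours for the '|' case, in plain
Python.  committed_choice is a made-up stand-in that commits to the
first alternative matching as a prefix; it only imitates the
behaviour described above and is not how Martel is implemented.

```python
import re


def backtracking_choice(alternatives, text):
    """Standard regex: alternation followed by a space, with backtracking."""
    pattern = "(%s) " % "|".join(re.escape(a) for a in alternatives)
    m = re.match(pattern, text)
    return m.group(1) if m else None


def committed_choice(alternatives, text):
    """Commit to the first alternative that matches; never retry later ones."""
    for alt in alternatives:
        if text.startswith(alt):
            # The choice is final even if the following space is missing.
            return alt if text.startswith(" ", len(alt)) else None
    return None


KEYS = ["prim_transcript", "primer", "primer_bind", "protomer"]
LINE = "primer_bind     123..456"
```

With KEYS in the order given, backtracking_choice finds
"primer_bind" while committed_choice returns None.  Sorting the
alternatives longest first, for instance with
sorted(KEYS, key=len, reverse=True), lets the committed version
succeed as well.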

As another example of the problem, suppose you want to match
"\s+\n".  Martel will fail because \s+ consumes the final "\n"
so there is no additional text to match the \n after the \s+.
Again, a standard regexp engine will throw away the \s match
against \n and try again.  Martel does not.  The workaround
is to do something like " *\n".
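
The same kind of toy for the repeat case.  greedy_then_literal is a
made-up helper that consumes a character class greedily, gives
nothing back, and then requires a literal; it imitates the behaviour
described above and is not Martel code.  Note that it requires at
least one character of the class, unlike the "*" in the workaround.

```python
import re


def greedy_then_literal(char_class, literal, text):
    """Consume `char_class` greedily with no give-back, then need `literal`."""
    run = re.match("(?:%s)+" % char_class, text)
    if run is None:
        return False
    return text.startswith(literal, run.end())
```

For the text "   \n", greedy_then_literal(r"\s", "\n", ...) fails
because the run of \s has already eaten the newline, while using a
plain space as the class succeeds.  Python's own re.match(r"\s+\n",
...) matches the same text because it backtracks.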

I don't really like these workaround solutions because they
require people to be more aware of the differences between
Martel's regular expressions and the standard ones.  I haven't
been clever enough to figure out a good solution using
mxTextTools.  On the other hand, I haven't been greatly concerned
with it because:
  - Martel's behaviour is a subset of standard regular expressions,
so if Martel matches so will the standard one;
  - I figure someone cleverer than I may contribute a good solution;
  - At some point the whole evaluation engine may be replaced
by a C extension which can be made portable to Perl, Tcl, Java, etc;
  - I've still been prototyping to see what's useful and what isn't;
  - and of course, I know how to do the workarounds :)

I've been reading a bit of the regular expression and pattern
matching literature.  There are a lot of terms to describe the
types of regular expression languages.  For example, SGML DTDs
are "1-unambiguous" because it only needs to look ahead a
single tag to determine the next step in the DTD.  There's
also "deterministic" regular expressions.  I've decided I really
need to talk to someone who knows the field...

                    Andrew


From dalke at acm.org  Sat Dec 16 21:40:30 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] questions for next release
Message-ID: <01a801c067d2$b9b7d6c0$afab323f@josiah>

Jeff:
>- Andrew, is Martel under source code control?

Yes, CVS.

> Do you want to develop it as part of the biopython CVS (and
> release), or do you want it to be a dependency that's bundled
> and installed together?

I would rather it be the latter.  I expect people will want
to use Martel independent of the other biopython code.  I can
move the development repository to biopython.org.  My concerns
with that are two-fold.  First, I haven't figured out how to
connect to my ISP from under Linux, so I don't have a direct
connection to the rest of the world.  That makes it hard to
talk to CVS.

Second, supposing an update to a newer distribution fixes
my problem, I do most of my work on my laptop which isn't
always connected to the rest of the world.  I habitually
CVS commit a lot more frequently than I connect and I worry
about how that will affect my development habits.

>- Are we going to use/bundle mxTextTools 1.2?

Martel should work fine with 1.1.1 or 1.2, so the first is
not a concern.  I'll ask Marc-Andre about his release plans.

>- setup.py now accepts earlier versions (<0.8?) of distutils.
> Should we require the version that comes with Python 2.0?

Yes.  We have other dependencies now on 2.0 than just setup.py

>- Any objections to moving more code into __init__.py?  For
> example, the code in Prosite/Prosite.py would be moved to
> Prosite/__init__.py.  This would definitely BREAK CODE, but
> the fix would be trivial.

I have no problems, but I think I'm the one who introduced
using __init__.py to biopython so I'm not the best of sources.

Brad correctly pointed out that some people don't know about
that use so may get somewhat confused about it.  As I recall,
others here and elsewhere have had that problem so it shouldn't
be ignored.

On the other hand, I have had problems with another library
which had a module of the form "X.X" (like Prosite.Prosite).
In that case I needed to get elements from X and from X.X.
That has to be done with

import X.X
import X

a = X.X.a
b = X.b

The "import X.X" is needed to load X.X then the "import X"
is needed to bring the top-level module into the local namespace.
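
A small check of which names a dotted import binds, using the
standard library's xml.dom as a stand-in for X.X.  The function name
is made up.  In current Python the dotted form already binds the
top-level name, so the separate "import X" is harmless but may not be
required; whether older interpreters behaved the same way is not
verified here.

```python
import xml.dom  # ensures the stand-in package is importable


def names_bound_by_dotted_import():
    """Run 'import xml.dom' in an empty namespace and list what it binds."""
    namespace = {}
    exec("import xml.dom", namespace)
    return sorted(k for k in namespace if not k.startswith("__"))
```

The only name bound is "xml"; the submodule is then reached as an
attribute, xml.dom.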

What this means is if you have Prosite/Prosite.py then do not
put anything into Prosite/__init__.py and vice versa.

>  If this does happen, does anyone know how to move code
> between files without losing the CVS logs of the changes?

I don't know.  Also, is there any way to import the CVS logs
of Martel?

>- Should we check in Brad's new GenBank code?

I think Brad and I still need to do a bit more work on the
parser definition.  Neither his original code nor my modified
version passes the "fully parses an NCBI file" test, although it's
getting pretty close.

A related question, and one which was raised earlier, is,
where should the format definitions be located in biopython?
There are also database specific builders (which convert the
format definitions to a database specific data structure) and
generic builders (eg, which make a generic data structure
but possibly discarding some data).

>- Andrew, I've submitted a bug report (more of a feature request)
>in Jitterbug about making the regression tests indifferent to
>EOL conventions.  This would be nice if people are developing
>and testing on different platforms, which breaks the tests.
>Could you look at it and let me know what you think?

Umm, I don't see it.  There are none assigned to me ... Oh!
with br_regrtest.  Sorry, I thought you were talking about
Martel.  Some of its regression tests are also newline
specific.  Okay, it shouldn't be too hard.  Could either see
what changes are in the 2.0 distribution or replace the line
reader with something which understands the different styles.
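
A sketch of the second option: compare expected and actual output
line by line after splitting on any of the three newline styles.
The names are made up, and the pattern uses the same alternatives as
Martel's \R ("\n|\r\n?").

```python
import re

# "\r\n", a lone "\r", or "\n".
EOL = re.compile(r"\r\n?|\n")


def same_ignoring_eol(expected, actual):
    """Compare two texts line by line, ignoring the newline convention."""
    return EOL.split(expected) == EOL.split(actual)
```

Blank lines still count, so "a\n\nb" and "a\nb" compare as different.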

Jeff in a reply to Brad:
> Are you also including Andrew's location parser?

Remember, that parser hasn't been seriously tested.  Also,
including it requires inclusion of SPARK.  That's not hard
because it's a single, pure-python file.  I think it should
be included because of its general usefulness and because
it isn't a real distribution in its own right.

                    Andrew
                    dalke@acm.org


From jchang at SMI.Stanford.EDU  Mon Dec 18 14:51:10 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] questions for next release
In-Reply-To: <01a801c067d2$b9b7d6c0$afab323f@josiah>
Message-ID: <Pine.GSO.4.21.0012181122360.25412-100000@riboweb.Stanford.EDU>

On Sat, 16 Dec 2000, Andrew Dalke wrote:

> Jeff:
> > Do you want to develop it as part of the biopython CVS (and
> > release), or do you want it to be a dependency that's bundled
> > and installed together?
> 
> I would rather it be the latter.  I expect people will want
> to use Martel independent of the other biopython code.

Sounds reasonable.  One thing, though: Martel currently ships with a few
formats for databases.  I do want the ones used for biopython to be CVS'd
in the biopython repository, so that developers with read/write access to
biopython can work on the formats.  I don't think biopython should depend
on format definitions in Martel.


> I can move the development repository to biopython.org.  My concerns
> with that are two-fold.  First, I haven't figured out how to connect
> to my ISP from under Linux, so I don't have a direct connection to the
> rest of the world.  That makes it hard to talk to CVS.

Either way doesn't make a difference to me.


> >- Are we going to use/bundle mxTextTools 1.2?
> 
> Martel should work fine with 1.1.1 or 1.2, so the first is
> not a concern.  I'll ask Marc-Andre about his release plans.

Thanks!


> >- setup.py now accepts earlier versions (<0.8?) of distutils.
> > Should we require the version that comes with Python 2.0?
> 
> Yes.  We have other dependencies now on 2.0 than just setup.py

OK.


> >- Any objections to moving more code into __init__.py?  For
> > example, the code in Prosite/Prosite.py would be moved to
> > Prosite/__init__.py.  This would definitely BREAK CODE, but
> > the fix would be trivial.
> 
> I have no problems, but I think I'm the one who introduced
> using __init__.py to biopython so I'm not the best of sources.
> 
> Brad correctly pointed out that some people don't know about
> that use so may get somewhat confused about it.  As I recall,
> others here and elsewhere have had that problem so it shouldn't
> be ignored.

Yeah, definitely.  I still overlook __init__.py when looking for code.  
What can be done about this?  Documentation?


> On the other hand, I have had problems with another library
> which had a module of the form "X.X" (like Prosite.Prosite).

Oh, I see what you're getting at.  That's definitely bad.

> What this means is if you have Prosite/Prosite.py then do not
> put anything into Prosite/__init__.py and vice versa.

Yep.  I'll interpret that as evidence to move stuff into __init__.py.  :)


> >  If this does happen, does anyone know how to move code
> > between files without losing the CVS logs of the changes?
> 
> I don't know.  Also, is there any way to import the CVS logs
> of Martel?

I suspect both solutions will require some surgery on the CVS repository
and RCS files.


> >- Should we check in Brad's new GenBank code?
> 
> I think Brad and I still need to do a bit more work on the
> parser definition.  Neither his original code nor my modified
> version pass the "fully parses an NCBI file" although it's
> getting pretty close.

Is this a "no" vote, then?  Rebuttals?


> A related question, and one which was raised earlier, is,
> where should the format definitions be located in biopython?
> 
> There are also database specific builders (which convert the
> format definitions to a database specific data structure) and
> generic builders (eg, which make a generic data structure
> but possibly discarding some data).

There are two places they can go.  First, you can put each one in the
package in which it belongs.  That means, the fasta format would go in
Bio/Fasta, genbank in Bio/GenBank, swissprot in Bio/SwissProt, etc.  This
would be consistent with the current design, and it would be clear where
to look for the format.

Second, we can have a formats package (could be called something else),
where we put all the Martel stuff.  This would make it easier to check to
see what formats exist, which could be helpful for SeqIO-type
functionality.  All you'd have to do is scan the directory and suck up all
the formats in there.  The other way, we'd have to specify them manually.
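
The "scan the directory" idea for the second option could look roughly
like the sketch below.  It is written for present-day Python 3, and the
directory name "formats", the module names and the list_formats helper
are all assumptions made for illustration; none of them exist in
Biopython or Martel as far as this thread says.

```python
import pkgutil
import tempfile
from pathlib import Path


def list_formats(package_dir):
    """Return the sorted names of the modules found directly in package_dir."""
    return sorted(
        info.name
        for info in pkgutil.iter_modules([str(package_dir)])
        if not info.name.startswith("_")  # skip private helper modules
    )


with tempfile.TemporaryDirectory() as tmp:
    # Build a stand-in for a hypothetical formats package.
    formats_dir = Path(tmp) / "formats"
    formats_dir.mkdir()
    for name in ("__init__", "fasta", "genbank", "swissprot"):
        (formats_dir / f"{name}.py").write_text("# placeholder definition\n")
    found = list_formats(formats_dir)

print(found)
```

With the first option (one format per package) there is no single
directory to scan, so an equivalent list would have to be maintained
by hand.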

Any votes?  Comments?


> >- Andrew, I've submitted a bug report (more of a feature request)
> >in Jitterbug about making the regression tests indifferent to
> >EOL conventions.  This would be nice if people are developing
> >and testing on different platforms, which breaks the tests.
> >Could you look at it and let me know what you think?
> 
> Umm, I don't see it.  There are none assigned to me ... Oh!
> with br_regrtest.  Sorry, I thought you were talking about
> Martel.  Some of its regression tests are also newline
> specific.  Okay, it shouldn't be too hard.  Could either see
> what changes are in the 2.0 distribution or replace the line
> reader with something which understands the different styles.

Great!  This will be nice, because some of the regression tests are
breaking on differing newline conventions.  Different styles can occur
within the same file.
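
One way to make a comparison indifferent to EOL conventions, including
mixed styles in a single file, is to normalize both sides before
comparing.  The sketch below is present-day Python 3 and is only an
illustration; normalize_newlines and same_output are hypothetical
helpers, not part of br_regrtest or Martel.

```python
import re

# Match "\r\n" first so a DOS line ending is not counted as two breaks.
_EOL = re.compile(r"\r\n|\r|\n")


def normalize_newlines(text):
    """Rewrite Unix, DOS and old Mac line endings as a single '\\n'."""
    return _EOL.sub("\n", text)


def same_output(expected, actual):
    """Compare two outputs while ignoring which EOL style each one uses."""
    return normalize_newlines(expected) == normalize_newlines(actual)


mixed = "LOCUS\r\nDEFINITION\rACCESSION\n"
print(repr(normalize_newlines(mixed)))
print(same_output(mixed, "LOCUS\nDEFINITION\nACCESSION\n"))
```

A regular expression is used rather than str.splitlines() because
splitlines() also breaks on characters such as form feed, which could
hide real differences in test output.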


> Jeff in a reply to Brad:
> > Are you also including Andrew's location parser?
> 
> Remember, that parser hasn't been seriously tested.  Also,
> including it requires inclusion of SPARK.  That's not hard
> because it's a single, pure-python file.  I think it should
> be included because of its general usefulness and because
> it isn't a real distribution in its own right.

Good, we'll include it, as well as SPARK, then.  Judging from the recent
traffic on the bioperl list, this is a feature we should have.

Jeff


From jchang at SMI.Stanford.EDU  Mon Dec 18 14:58:21 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] external dependencies...
Message-ID: <Pine.GSO.4.21.0012181151240.25412-100000@riboweb.Stanford.EDU>

How are we going to handle them?

Right now, we have:
Numeric
Martel
mxTextTools (with Martel)
SPARK


- Should we auto-detect whether they have it already installed?

- How do we handle version differences?

- Should we bundle these in the distribution or download them as needed?

- How much help should we provide the user in installing them?  
Completely automatic installation, or just gentle error messages and
URLs?

- Should we maintain copies of these at biopython.org?

- Is there something going on in catalog-sig or somewhere else that can
help us right now?
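
For the "auto-detect" and "gentle error messages" questions, a minimal
sketch in present-day Python 3 follows.  The find_missing helper and
the demo table are hypothetical, and the hints are placeholders rather
than real download locations; the real import names of the packages
listed above would have to be filled in.

```python
import importlib.util


def find_missing(required):
    """Return one friendly message per top-level module that is not found."""
    messages = []
    for module_name, hint in required.items():
        # find_spec locates the module without running the module's code.
        if importlib.util.find_spec(module_name) is None:
            messages.append(f"{module_name} not found: {hint}")
    return messages


# Hypothetical table: import name -> hint shown to the user.
demo_required = {
    "json": "part of the standard library",
    "no_such_module_for_demo": "see the project's install notes",
}
missing = find_missing(demo_required)
for line in missing:
    print(line)
```

This only answers "is it installed?".  Checking versions would need a
per-package rule, since not every package exposes its version the same
way.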

Jeff


From dalke at acm.org  Mon Dec 18 15:43:28 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] questions for next release
Message-ID: <00ab01c06935$b5799080$b0ac323f@josiah>

Me:
>> I think Brad and I still need to do a bit more work on the
>> parser definition.  Neither his original code nor my modified
>> version pass the "fully parses an NCBI file" although it's
>> getting pretty close.

Jeff:
>Is this a "no" vote, then?  Rebuttals?

Upon consideration, no, it is not a no vote.  Go ahead and
include it, but with the proviso that it is still in flux.

Ditto for the location parser.

                    Andrew


From jchang at SMI.Stanford.EDU  Mon Dec 18 17:38:06 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] questions for next release
In-Reply-To: <00ab01c06935$b5799080$b0ac323f@josiah>
Message-ID: <Pine.GSO.4.21.0012181437360.9510-100000@taiyang>

Alright.  We'll plan on including it in the next release, which is
pre-alpha.  Then, we'll pound on it, and other things, for 1.0.

Jeff


From dalke at acm.org  Tue Dec 19 13:17:41 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] questions for next release
Message-ID: <000e01c069e7$fcdede00$28ac323f@josiah>

Jeff
>> >- Are we going to use/bundle mxTextTools 1.2?

Me:
>> Martel should work fine with 1.1.1 or 1.2, so the first is
>> not a concern.  I'll ask Marc-Andre about his release plans.


M-A Lemburg:
> I wouldn't mind if you integrate mxTextTools in your distro.

> My plans are to release all mx extensions using distutils
> and a new packaging strategy sometime in January next year.

> I will distutil the mx packages in three or more distributions
> (base, crypto, commercial) to enable dependencies between
> the packages. mxTextTools will be in the base version which
> will be open source as before only with a more Python 2.0
> like license.

                    Andrew
                    dalke@acm.org


From chapmanb at arches.uga.edu  Wed Dec 20 10:11:49 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] Second go at GenBank parser
Message-ID: <14912.52277.659316.598153@taxus.athen1.ga.home.com>

Hello all;

I've got together a second tarball of the GenBank parser that we've
been working on. You can grab it from:

http://www.bioinformatics.org/bradstuff/bp/gb_parser-20001222.tar.gz

I think this is a huge improvement from the first, mostly due to the
many many helpful comments from everyone here. I really appreciated
everyone's comments and interest, and I think that we've fixed/worked
on all of the points that people raised. I'll try to respond to some
specific mails later today. Sorry to not be able to respond to
everything in a timely manner. I guess if I only have time to write or 
code, it is better to be coding :-).

Anyways, the new version has the following new and
oh-so-incredibly-exciting features:

o Much better Martel syntax for parsing things. This is almost
entirely due to Andrew -- who sent me lots of nice comments and good
tips, and even wrote up his own syntax which I could borrow from. Tons 
of the new syntax is taken from Andrew's stuff, so he deserves a huge
pat on the back for this :-).

o Tested on a bunch of different downloads from the ncbi genbank
directory, so the syntax is much more "battle tested" than the last
and handles lots more cases, including the dreaded "fake /" cases
(found some more hideous ones like that in a bacterial
dataset). GenBank, wow, what a headache!

o I integrated Andrew's SPARK based location parser, and now use it to
parse the locations. spark.py is included in the tarball, but we need
to still figure out how we want to do it in Biopython (once the
GenBank parser is up to snuff). Another big thanks to Andrew for
providing the location parser! I integrated this first before doing
all the testing, so it has been through a workout over here. I found
one case it didn't handle (when you have a "between" location by
itself without parentheses, like '6.27') and made the small fix for
this. Otherwise it performed great!

o Coded up a Record class for GenBank record and added a parser and
consumer that parse GenBank data into it.

o Miscellaneous bug fixes that popped up (hopefully I squashed more
than I introduced :-).

o Better testing -- again thanks to Andrew. Have I mentioned yet that
he is my personal hero?

If people have time to download and test this and give me their
feedback I would really appreciate it. I only want to get it into
Biopython if people feel it is up to par (don't want to bring down the 
good name of Biopython :-). I'm especially interested in feedback on
the following points:

o I would really like to hear about anything that causes errors in any 
of the parsers (or my code!).

o Naming of modules -- right now my naming sucks (the "supplementary"
feature classes, like Location.py and Reference.py are in a module
called 'FeatureInfo', for instance. yeck.), so if people have good
ideas for how to name things I'll definitely take 'em. I'm also not
sure where a good place for spark.py to live in Biopython is (BTW, I
think we should include it :-). Finally, I noticed Jeff put his snazzy 
code in GenBank/__init__.py -- Should my GenBank.py go into
__init__.py? Should it be named something else?

o Data transfer -- is everything being transferred okay? Am I messing
anything up/losing data? People hand checking different records for me 
would be very very helpful.

o HTML -- Cayte expressed concerns about parsing GenBank files with a
bunch o' HTML stuck in them. In my opinion it isn't really worth
worrying about this because it is so easy to get the text flat files
-- do lots of people think I should work on html support, or do they
agree with me?

Thanks again for everyone's feedback on the first version!

Brad


From jchang at SMI.Stanford.EDU  Wed Dec 20 19:23:21 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:55 2005
Subject: [Biopython-dev] Second go at GenBank parser
In-Reply-To: <14912.52277.659316.598153@taxus.athen1.ga.home.com>
Message-ID: <Pine.GSO.4.21.0012201610060.29393-100000@riboweb.Stanford.EDU>

Hi Brad,

This is great!  You've filled two gaping holes in biopython functionality.  
Please check these in, as I'm sure people will want to start using the
code.

> o Tested on a bunch of different downloads from the ncbi genbank
> directory, so the syntax is much more "battle tested" than the last
> and handles lots more cases, including the dreaded "fake /" cases
> (found some more hideous ones like that in a bacterial
> dataset). GenBank, wow, what a headache!

Good.  GenBank is notoriously hard to deal with, and I suspect work on the
format will be ongoing.

> o I integrated Andrew's SPARK based location parser, and now use it to
> parse the locations. spark.py is included in the tarball, but we need
> to still figure out how we want to do it in Biopython

Yep, definitely a good thing.  Using SPARK is the right way to go.


> o Coded up a Record class for GenBank record and added a parser and
> consumer that parse GenBank data into it.

Thanks!


> I only want to get it into Biopython if people feel it is up to par
> (don't want to bring down the good name of Biopython :-).

Heh.  From what I gather, it's runnable.  Let's get this out the door so
people can start using it, and hopefully give good comments and (even
better) patches.


> o Naming of modules -- right now my naming sucks (the "supplementary"
> feature classes, like Location.py and Reference.py are in a module
> called 'FeatureInfo', for instance. yeck.), so if people have good
> ideas for how to name things I'll definitely take 'em.

Are these meant to be used with SeqFeatures?  If so, how about just
SeqFeature.Location and SeqFeature.Reference?


> I'm also not sure where a good place for spark.py to live in Biopython
> is (BTW, I think we should include it :-).

Where you have it now seems as good a place as any (without the
PGML).  Including it is fine with me.


> Finally, I noticed Jeff put his snazzy code in GenBank/__init__.py --
> Should my GenBank.py go into __init__.py?

Yes.  GenBank is a good name for it, and as per Andrew's earlier email, we
should avoid having code in both GenBank/__init__.py and
GenBank/GenBank.py.


> o HTML -- Cayte expressed concerns about parsing GenBank files with a
> bunch o' HTML stuck in them. In my opinion it isn't really worth
> worrying about this because it is so easy to get the text flat files
> -- do lots of people think I should work on html support, or do they
> agree with me?

Are the HTML-formatted files different?  Does it work if you just strip
the HTML tags?  I guess for HTML-formatted data from GenBank, it would be
nice to handle, but very low priority.  HTML-formatted data from other
sources, no.  If someone needs that functionality, they can submit the
patches!  :)
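
Whether stripping the tags is enough is an open question in this
thread, but the stripping step itself is small.  The sketch below uses
the html.parser module from present-day Python 3; strip_html and the
sample line are hypothetical, and the sample is not real GenBank
output.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect the text between tags and discard the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(text):
    """Return text with HTML tags removed and character references decoded."""
    parser = _TextOnly()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


sample = '<pre>LOCUS       <a href="x">AB000001</a>\n</pre>'
print(repr(strip_html(sample)))
```

Decoding character references matters here because flat-file locations
can contain "<" and ">", which an HTML page has to write as "&lt;" and
"&gt;".  The stripped text would still need to go through the normal
parser to find out whether anything else differs.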


Jeff


From katel at worldpath.net  Thu Dec 21 04:50:48 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:56 2005
Subject: [Biopython-dev] Second go at GenBank parser
References: <14912.52277.659316.598153@taxus.athen1.ga.home.com>
Message-ID: <001901c06b33$800b4420$010a0a0a@cadence.com>

I ran several files tonight on my win98 machine.  There is still a pesky
newline problem that shows up only in the feature section.  If the feature
contains a translation, just the first line appears, followed by
backslash-0-1-2.  The translation feature is the only multiline subfeature
I've seen so far.  I can attach the files if you wish.

I haven't seen the  newlines in the other sections, this time.  Apparently,
they've been removed. The lines of output are long, but this is not a
problem,  because the user can break the lines up easily in his script.

                                                          Cayte