From jchang at smi.stanford.edu Wed Jan 2 01:03:32 2002 From: jchang at smi.stanford.edu (Jeffrey Chang) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] Bioformat module In-Reply-To: <001701c191e6$ff3bcfa0$0201a8c0@josiah.dalkescientific.com> References: <001701c191e6$ff3bcfa0$0201a8c0@josiah.dalkescientific.com> Message-ID: <20020101220332.C499@krusty.stanford.edu> On Mon, Dec 31, 2001 at 03:36:24AM -0700, Andrew Dalke wrote: > Code is at http://www.biopython.org/~dalke/Bioformats-0.2.py OK, I've read the README. I'll say it. "Wow! That's cool!" :) This'll really simplify things a lot for people, to have a uniform API for loading and parsing data. Jeff From adalke at mindspring.com Wed Jan 2 08:27:40 2002 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] mixins Message-ID: <003101c19391$413c5ae0$0201a8c0@josiah.dalkescientific.com> Been working on mixins all night. The idea is that only parts of a file are important -- you may just want the sequence, or the cross references, or whatever. If those fields are consistently tagged (been working on that as well) then standard parsers can be used for the different segments. Some of the experimental mixins I have are:

  dbid -- gives the primary/secondary accessions
  description -- gives the main description text
  dbxref -- cross references to other databases
  features -- sequence features
  sequence -- sequence data

A problem with the standard SAX method is that it uses a centralized set of methods, like 'startElement'. Mixins can't each define their own startElement since only one is called. So I made a DispatchHandler which converts calls like startElement('spam', {}) into start_spam('spam', {}). In that way, the different handlers can listen only for their associated event. And for the 'characters' method, I have a stack-based way to start and stop saving characters.
When a mixin is done, it calls a specific function back in the handler; so far these start with 'add_'. But wait, there's more! Jeff pointed out namespace support, which XML supports with a syntax like 'ns:spam'. It's kinda cumbersome using a ':' in a method name, so I've translated that to "ns__spam" when I do the dispatching. This lets people define a new builder with something like:

  class FastaBuilder(dbid, description, sequence, SaveText, DispatchHandler):
      def __init__(self):
          ... call __init__ on the bases
      def start_record(self, tag, attrs):
          self.id = None
          self.description = None
          self.seq = None
      def add_dbid(self, dbid):
          ...
      def add_sequence(self, seq):
          ...
      def end_record(self, tag):
          self.document = FastaRecord(self.id, self.description, self.seq)

Now, writing that list of mixins is cumbersome, so I used new.classobj so you can define:

  FastaBuilderBase = MixinBuilder(dbid, description, sequence)
  class FastaBuilder(FastaBuilderBase):
      ...

Another problem with mixins is that they share the same __dict__. That can lead to hard-to-track-down mixups. So I've written a way for a mixin to acquire methods from another handler, but not share the same __dict__. It looks like this:

  class Handle_sequence(Callback):
      def start_bioformat__sequence(self, tag, attrs):
          self.alphabet = attrs.get("alphabet", "any")
      def end_bioformat__sequence(self, tag):
          seq = Sequence based on the alphabet and the characters
          self.callback(seq)

  # Here's the mixin
  class sequence:
      def __init__(self):
          acquire(self, Handle_sequence(self, self.add_sequence))
      def add_sequence(self):
          pass

The 'acquire' function pulls off all methods starting with 'start_' and 'end_' and sticks them in the mixin's namespace. So it looks like the sequence class implements things but it's really Handle_sequence. And there's no possibility of 'self.alphabet' being overridden by anyone else. (It's actually slightly more complicated than this because the acquisition can put on its own prefix, which helps with code reuse.) Finally!
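[A minimal sketch of the dispatching idea described above, with hypothetical class and tag names -- an illustrative reconstruction, not the actual Bioformats code:]

```python
# Sketch of the dispatch idea: a SAX-style handler that turns
# startElement("spam", attrs) into a call to start_spam(tag, attrs),
# translating "ns:spam" into start_ns__spam, so that mixins can each
# listen only for their own tags.  Names here are illustrative.

class DispatchHandler:
    def startElement(self, tag, attrs):
        method = getattr(self, "start_" + tag.replace(":", "__"), None)
        if method is not None:
            method(tag, attrs)

    def endElement(self, tag):
        method = getattr(self, "end_" + tag.replace(":", "__"), None)
        if method is not None:
            method(tag)

class SequenceListener(DispatchHandler):
    """Hypothetical listener that only cares about bioformat:sequence tags."""
    def __init__(self):
        self.events = []

    def start_bioformat__sequence(self, tag, attrs):
        self.events.append(("start", tag, attrs.get("alphabet", "any")))

    def end_bioformat__sequence(self, tag):
        self.events.append(("end", tag))

handler = SequenceListener()
handler.startElement("bioformat:sequence", {"alphabet": "protein"})
handler.endElement("bioformat:sequence")
handler.startElement("unrelated", {})   # no matching method, silently ignored
```

[The same getattr-based lookup covers both the plain-tag case and the 'ns:spam' to 'ns__spam' translation.]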
Since Python is fully introspective, the DispatchHandler can peer through the class hierarchy to figure out all of the methods which are defined, and map them back to their proper SAX tags. This list of tags can then be used to build a new expression tree with all the other, unused tags filtered out. What that means is, if you want to get more fields, stick in a new mixin, and everything works automatically to get those fields, with only the expected slowdown associated with the extra work to identify and parse those fields. The less data you want, the faster it is. With the bare minimum (what's needed to convert the data into FASTA format), my test set of SWISS-PROT 38 takes (estimated) 10 minutes. With everything needed for the SProt data structure, it's slightly under 30 minutes, which is about what the current code requires. (Times estimated by extrapolation of my smaller test set.) Code is in a state of flux and not really worth others looking at it right now. I'll work on it more tomorrow. Hope to make it available on Friday. Andrew From chapmanb at arches.uga.edu Thu Jan 3 22:22:51 2002 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] Bioformat module In-Reply-To: <001701c191e6$ff3bcfa0$0201a8c0@josiah.dalkescientific.com> References: <001701c191e6$ff3bcfa0$0201a8c0@josiah.dalkescientific.com> Message-ID: <20020103222251.A3135@ci350185-a.athen1.ga.home.com> Hey Andrew and all; [Bioformats] > I think it's at the stage where the code can be added > to Biopython proper. I would like someone else to > take a look at it first, if only to try it out. (It > wouldn't hurt to also say "Wow! That's cool!" :) I'll second the "Wow! That's cool" from Jeff :-). I like the way things are going with the Bioformats module. I just had some time to play with it, and it is very nice. After some small modifications to the GenBank format, I got GenBank minimally working with it.
Snazzy -- conversions to Fasta format:

  >>> from Bioformats import registry
  >>> infile = open("/home/chapmanb/bioppjx/biopython/Tests/GenBank/iro.gb")
  >>> format = registry["sequence"].identify(infile)
  >>> print format.name
  genbank
  >>> from Bioformats import IO
  >>> infile.seek(0)
  >>> writer = IO.io.make_writer(format = "fasta")
  >>> for record in IO.io.readFile(infile):
  ...     print record
  ...     writer.write(record)
  >AL109817.1
  cacaggcccagagccactcctgcctacaggttctgagggctcaggggacctcctgggccctcaggctcttta
  gctgagaataagggccctgagggaactacctgcttctcacatccccgggtctctgaccatctgctgtgtgcc
  [...]

I like it! Attached is the format registration stuff that goes in Bioformats/formats/genbank.py for anyone who is interested in duplicating this. I'm definitely +1 on checking this into CVS. It seems along the same spirit as what Thomas was working on in Bio/SeqIO/generic, but integrates well with Martel. I'm not sure if I really have the full picture of everything yet, but from what I see it looks good! I'm excited about the mixin stuff as well -- it seems like it'll really simplify a lot of repetitive coding for adding new formats. Too bad I already did all the repetitive coding for GenBank :-). At-least-coding-monkeys-get-lots-of-bananas-ly yr's, Brad -- PGP public key available from http://pgp.mit.edu/ From adalke at mindspring.com Fri Jan 4 05:37:51 2002 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] Bioformat module Message-ID: <000601c1950b$dd337340$0201a8c0@josiah.dalkescientific.com> Brad: >I'll second the "Wow! That's cool" from Jeff :-). Thanks! to both of you. And I guess you're running a 2.2 version of Python, since I have some 'yield' statements in there. :) > After some >small modifications to the GenBank format, I got GenBank minimally >working with it. There's going to be a few more changes. I've been working on standard tag names for things like identifiers, cross-references, sequence, and features (with qualifiers).
Seems to work well with SWISS-PROT and EMBL. The idea is to do

  Std.dbid(UntilSep(delimiter = ";"), {"type": "accession"})

and it puts in the correct tags. (BTW, I'm going to change "delimiter" to "sep".)

>Attached is the format registration stuff that
>goes in Bioformats/formats/genbank.py for anyone who is interested
>in duplicating this.

Wasn't attached.

>>>> infile.seek(0)

Shouldn't need that. The identification code should always reseek the file to the beginning after it's finished.

>I'm definitely +1 on checking this into CVS. It seems along the
>same spirit as what Thomas was working on in Bio/SeqIO/generic, but
>integrates well with Martel.

It was. I looked through the mailings to make sure I read his (and others') discussions. It's also (IMNSHO) much better than the Bioperl and BioJava codes because it can handle non-sequence formats, like BLAST results, as well. Should it be under Bio (Bio.Bioformats) or parallel to it? Unlike Martel, I don't see it as being distributed outside of Biopython, so I would think under.
There was also the annoyance of having to __ all object variables in the hopes of not getting conflicts with other classes. So I used a different approach which actually makes things easier to understand, I hope. Like I said, tomorrow evening... Hopefully. Andrew dalke@dalkescientific.com From chapmanb at arches.uga.edu Fri Jan 4 07:03:43 2002 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] Bioformat module In-Reply-To: <000601c1950b$dd337340$0201a8c0@josiah.dalkescientific.com> References: <000601c1950b$dd337340$0201a8c0@josiah.dalkescientific.com> Message-ID: <20020104070342.B3135@ci350185-a.athen1.ga.home.com> Andrew: > Thanks! to both of you. And I guess you're running a > 2.2 version of Python, since I have some 'yield' statements > in there. :) Yup, always on the cutting edge :-). Though, you may want to watch out for 'em -- don't think it'll make people too happy to have code that requires the brand-spanking-new 2.2. Especially since I still see messages with people using 1.5.2 (gack!). > There's going to be a few more changes. Definitely understood. I was just interested in learning what was going on so I thought I would add GenBank to the fray. I should eventually work on a GenBank writer, etc. But not right now :-). > >Attached is the format registration stuff that > >goes in Bioformats/formats/genbank.py for anyone who is interested > >in duplicating this. > > Wasn't attached. Whoops! I can just add this to CVS (if you don't mind), once you check things in. >> >>> infile.seek(0) > > Shouldn't need that. The identification code should always > reseek the file to the beginning after it's finished. Cool. Good to know -- I was just going directly off what was in the README. Don't want to stray too far from the path and get lost! > Should it be under Bio (Bio.Bioformats) or parallel to it? > Unlike Martel, I don't see it as being distributed outside > of Biopython, so I would think under.
And I think the > Biopython code will have hooks to it as well. Okay, so under > it is. Sounds good to me. Idly, do you want Bio.Bioformats or Bio.Formats? Brad -- PGP public key available from http://pgp.mit.edu/ From jchang at smi.stanford.edu Fri Jan 4 11:43:03 2002 From: jchang at smi.stanford.edu (Jeffrey Chang) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] Bioformat module In-Reply-To: <20020104070342.B3135@ci350185-a.athen1.ga.home.com> References: <000601c1950b$dd337340$0201a8c0@josiah.dalkescientific.com> <20020104070342.B3135@ci350185-a.athen1.ga.home.com> Message-ID: <20020104084303.B189@krusty.stanford.edu> On Fri, Jan 04, 2002 at 07:03:43AM -0500, Brad Chapman wrote: > Sounds good to me. Idly, do you want Bio.Bioformats or Bio.Formats? Hmmm... That would be: Bio.Bioformats vs Bio.Formats Bio.Bioformats.Format vs Bio.Formats.Format Bio.Bioformats.formats vs Bio.Formats.formats Actually, I would favor a refactoring that would put end-user modules like IO, Writer, Registry right under Bio. This would be consistent with the idea discussed in the last BOSC of having a wider tree to make it easier for people to find things. Jeff From adalke at mindspring.com Sat Jan 5 10:06:41 2002 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] Bioformat module Message-ID: <001f01c195fa$962ed9e0$0201a8c0@josiah.dalkescientific.com> Brad: >Yup, always on the cutting edge :-). Though, you may want to watch >out for 'em -- don't think it'll make people too happy to have code >that requires the brand-spanking-new-2.2. Especially since I still >see messages with people using 1.5.2 (gack!). I'm not going to be able to look at getting the code to work under older Pythons, at least not for a week or so. There are two places which need to be fixed: - weakref was introduced in 2.1 and is used to prevent cyclical data structures - yield was introduced in 2.2 and is used as part of the iteration return value. 
The first can't be fixed very easily, but I don't think it will leak all that much memory. (Need to investigate.) The second is easy to fix - I just need to make an iterator adapter. But no time for now. :( BTW, I now have 90 minutes to talk at the O'Reilly conference. I'm taking suggestions for what to present. Andrew From adalke at mindspring.com Sat Jan 5 10:24:29 2002 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] Bioformat module Message-ID: <002001c195fd$11a253c0$0201a8c0@josiah.dalkescientific.com> Jeff:

>Hmmm... That would be:
>Bio.Bioformats vs Bio.Formats
>Bio.Bioformats.Format vs Bio.Formats.Format
>Bio.Bioformats.formats vs Bio.Formats.formats
>
>Actually, I would favor a refactoring that would put end-user modules
>like IO, Writer, Registry right under Bio. This would be consistent
>with the idea discussed in the last BOSC of having a wider tree to
>make it easier for people to find things.

Very good point, and I had forgotten about that discussion. Okay, I did a somewhat hybrid solution and put a 'Format' in front of a couple names but otherwise I merged the two trees together. The new modules are:

  Format -- information about a format
  FormatRegistry -- knows how to use the format information
  FormatIO -- knows how to use the format registry
  Std -- defines standard XML tags
  Dispatch -- a set of classes to make it easier to mix and match handlers
  StdHandler -- a set of standard handlers, which use the standard tags to build portions of the data
  ReseekFile -- helps reading from files which don't allow reseeking to the beginning of the file
  _FmtUtils -- internal support modules
  Writer -- base class for the output writers
  formatdefs -- high-level description of the formats
  expressions -- low-level Martel expressions for the formats
  builders -- makes data structures from Martel events
  writers -- turns data structures into output

__init__ contains 'formats', which is an instance of a FormatRegistry.
It reads the 'formatdefs' directory to get the configuration information. SeqRecord contains an 'io' object, which is an instance of FormatIO. As it is right now, the format support is rather weak. There are two formats -- swissprot/38 and an embl variation. There is one output format, FASTA. Here's an example of use:

  >>> filename = "/home/sac/bioperl-live-sac/t/data/roa1.swiss"
  >>> from Bio import SeqRecord
  >>> for record in SeqRecord.io.readFile(open(filename)):
  ...     print record.id
  ...     print record.seq
  ...
  ROA1_HUMAN
  SKSESPKEPEQLRKLFIGGLSFETTDESLRSHFEQWGTLTDCVVMRDPNTKR.....
  >>>

Here's another (the description really is on a single line):

  >>> SeqRecord.io.convert(open(filename))
  >ROA1_HUMAN HETEROGENEOUS NUCLEAR RIBONUCLEOPROTEIN A1 (HELIX- DESTABILIZING PROTEIN) (SINGLE-STRAND BINDING PROTEIN) (HNRNP CORE PROTEIN A1).
  SKSESPKEPEQLRKLFIGGLSFETTDESLRSHFEQWGTLTDCVVMRDPNTKRSRGFGFVTYATVEEVDAAMN
  ARPHKVDGRVVEPKRAVSREDSQRPGAHLTVKKIFVGGIKEDTEEHHLRDYFEQYGKIEVIEIMTDRGSGKK
  RGFAFVTFDDHDSVDKIVIQKYHTVNGHNCEVRKALSKQEMASASSSQRGRSGSGNFGGGRGGGFGGNDNFG
  RGGNFSGRGGFGGSRGGGGYGGSGDGYNGFGNDGGYGGGGPGYSGGSRGYGSGGQGYGNQGSGYGGSGSYDS
  YNNGGGRGFGGGSGSNFGGGGSYNDFGNYNNQSSNFGPMKGGNFGGRSSGPYGGGGQYFAKPRNQGGYGGSS
  SSSSYGSGRRF
  >>>

Quickly speaking, here's what's going on:

1) format detection

The 'formatdefs' contain a description of the different formats. Some formats are really lists of other formats, in a tree. The tree structure looks like this:

  sequence
  |- embl
  |- swissprot
  |  `- swissprot/38
  `- others

The SeqRecord.io contains the default reader format, which is "sequence". The sequence format tries each of its children. Eventually, 'swissprot/38' works, which is returned as the format definition.

2) find the builder

The SeqRecord.io contains a canonical name for the data type. In this case it's "SeqRecord". The file format has its own canonical name, which is "swissprot/38". They also have what I call an abbrev name, which is a name that can be used in the file system.
The format's abbrev name is 'sprot38'. So the initial builder is found in

  Bio/builders/SeqRecord/sprot38.py

However, this doesn't exist. Here's where the hierarchy comes into play. The hierarchy must be such that if Y is a child of X then all the tags which are defined in X must have the same meanings in Y. In that way, the parser to build from X can be used to build from Y. In other words,

  Bio/builders/SeqRecord/swissprot.py
  Bio/builders/SeqRecord/sequence.py

if one exists, should be just as usable as .../sprot38.py. This reduces the O(NxN) problem to an O(N) problem. The Bio.Std module defines standard tag names.

3) the format contains the Martel grammar, so once the builder is found, the file can be parsed.

When a record is parsed, the content handler (the builders) must end up with a ".document" property. This is the object to use for a record. It's also what the DOM objects use. By using this convention I know how to get the 'record' from the builders, to return in the for loop.

4) Output conversion is also done with canonical names. In this case, the SeqRecord also defines a default output format. (If not found, it searches down the hierarchy tree instead of up.) Writers have the following protocol:

  writeHeader() -- usually does nothing
  write(record) -- write a record
  writeFooter() -- usually does nothing

5) The Dispatch classes are designed to help with making new data structures easily. It's too complicated to explain right now.

6) To add a new format definition:
  a) make sure you understand the hierarchy requirement
  b) take a look at the swissprot and embl expressions/, to see how to use the Std module to define tags. (I need to think about 'style' for a while more.)
  c) edit the formatdefs directory to add the new format configuration.

Okay, I can't go any further. I still need to pack for my trip. Be back next week. Enjoy!
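[The builder lookup in step 2 could be sketched roughly as follows; find_builder, the lineage list, and the directory layout are illustrative stand-ins, not the real Bioformats API:]

```python
import os

# Sketch of the step-2 builder lookup: try the format's own abbrev name
# first (e.g. builders/SeqRecord/sprot38.py), then walk up the format
# hierarchy (sprot38 -> swissprot -> sequence).  This works because a
# parent format's tags must mean the same thing in every child format,
# so a parent's builder can parse a child's output.  Hypothetical code.

def find_builder(data_type, format_lineage, base_dir="Bio/builders"):
    """Return the first existing builder module path, most specific first.

    format_lineage is ordered child-to-parent, e.g.
    ["sprot38", "swissprot", "sequence"].
    """
    for abbrev in format_lineage:
        path = os.path.join(base_dir, data_type, abbrev + ".py")
        if os.path.exists(path):
            return path
    return None
```

[A linear walk up one hierarchy per format is what turns the N-formats-times-N-data-types problem into an O(N) one.]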
Andrew dalke@dalkescientific.com From Y.Benita at pharm.uu.nl Mon Jan 7 07:01:31 2002 From: Y.Benita at pharm.uu.nl (Yair Benita) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] Errors in compilation for Mac Message-ID: Hi guys, I am trying to make the new release for the Mac. I get a compilation error in cfastpairwisemodule.c: Error : cannot convert 'char *' to 'unsigned char *' cfastpairwisemodule.c line 113 direction_matrix = (char *)NULL; Project: cfastpairwise.mcp, Target: cfastpairwise, Source File: cfastpairwisemodule.c This error is repeated several times. Any suggestions how to fix it? Thanks, Yair -- Yair Benita Pharmaceutical Proteomics Utrecht University From idoerg at cc.huji.ac.il Mon Jan 7 09:05:36 2002 From: idoerg at cc.huji.ac.il (Iddo Friedberg) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] Errors in compilation for Mac In-Reply-To: Message-ID: Hi Yair, For want of a better answer, could it be that the parameters with which you are running your C compiler are a bit too strict, typewise? Seems like the compiler refuses to cast a char to an unsigned char, which on most compilers can be mitigated to a 'warning' level error, which does not halt compilation. Try looking at your compilation parameters. Don't-know-mac-but-had-my-bout-with-C-comp'ly yours, Iddo On Mon, 7 Jan 2002, Yair Benita wrote: : Hi guys, : I am trying to make the new release for the Mac. : I get a compilation error in cfastpairwisemodule.c: : : Error : cannot convert : 'char *' to : 'unsigned char *' : cfastpairwisemodule.c line 113 direction_matrix = (char *)NULL; : Project: cfastpairwise.mcp, Target: cfastpairwise, Source File: : cfastpairwisemodule.c : : This error is repeated several times. : Any suggestions how to fix it? 
: Thanks, : Yair : -- : Yair Benita : Pharmaceutical Proteomics : Utrecht University : : _______________________________________________ : Biopython-dev mailing list : Biopython-dev@biopython.org : http://biopython.org/mailman/listinfo/biopython-dev : -- Iddo Friedberg | Tel: +972-2-6757374 Dept. of Molecular Genetics and Biotechnology | Fax: +972-2-6757308 The Hebrew University - Hadassah Medical School | email: idoerg@cc.huji.ac.il POB 12272, Jerusalem 91120 | Israel | http://bioinfo.md.huji.ac.il/marg/people-home/iddo/ From Y.Benita at pharm.uu.nl Mon Jan 7 12:00:42 2002 From: Y.Benita at pharm.uu.nl (Yair Benita) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] What is reportlab.lib Message-ID: Hi guys, Still working on the Mac version of the new biopython. What is the module reportlab.lib? It wasn't in the biopython files and it's not in my python library. It is requested from test_GraphicsChromosome.py. Besides that, all tests pass. Iddo, you were right about the compiler, thanks for the tip. Yair -- Yair Benita Pharmaceutical Proteomics Utrecht University From chapmanb at arches.uga.edu Mon Jan 7 15:48:32 2002 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] What is reportlab.lib In-Reply-To: Message-ID: Hi Yair! > Still working on the Mac version of the new biopython. Great to hear you are working on this! > What is the module reportlab.lib? This is the reportlab pdf generation library, which is just needed for some Graphics stuff. You can get it from: http://reportlab.com/download.html I'm pretty sure the module is all python, so I think all you should need to do is download it, unzip it, and stick it in site-packages. Hopefully everything will work okay on Macs... > Besides that, all tests pass. Great to hear!
So we must be doing pretty well at cross-platformness :-) Brad From jchang at smi.stanford.edu Mon Jan 7 17:40:46 2002 From: jchang at smi.stanford.edu (Jeffrey Chang) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] Errors in compilation for Mac In-Reply-To: References: Message-ID: <20020107144011.A1590@krusty.stanford.edu> Thanks for reporting this. Iddo is correct, that this is probably harmless and nothing to worry about. In fact, I don't think gcc has a warning to catch this case. Nevertheless, I went through the file and changed the code to be more careful with typecasts. Thanks, Jeff On Mon, Jan 07, 2002 at 01:01:31PM +0100, Yair Benita wrote: > Hi guys, > I am trying to make the new release for the Mac. > I get a compilation error in cfastpairwisemodule.c: > > Error : cannot convert > 'char *' to > 'unsigned char *' > cfastpairwisemodule.c line 113 direction_matrix = (char *)NULL; > Project: cfastpairwise.mcp, Target: cfastpairwise, Source File: > cfastpairwisemodule.c > > This error is repeated several times. > Any suggestions how to fix it? > Thanks, > Yair > -- > Yair Benita > Pharmaceutical Proteomics > Utrecht University > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev@biopython.org > http://biopython.org/mailman/listinfo/biopython-dev From Y.Benita at pharm.uu.nl Tue Jan 8 05:12:29 2002 From: Y.Benita at pharm.uu.nl (Yair Benita) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] MacBiopython available soon Message-ID: Hi guys, I have almost everything ready. Still some problems with the reportlab module. I don't have more time to spend before the weekend. If you want to post MacBiopython already, everything else is ready. Let me know. 
Yair -- Yair Benita Pharmaceutical Proteomics Utrecht University From jchang at smi.stanford.edu Tue Jan 8 12:21:44 2002 From: jchang at smi.stanford.edu (Jeffrey Chang) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] MacBiopython available soon In-Reply-To: References: Message-ID: <20020108092144.A778@krusty.stanford.edu> Does everything install, except for reportlab? If so, that's probably ok. Correct me if I'm wrong Brad, but I don't think very many people on the mac will be needing this functionality? Jeff On Tue, Jan 08, 2002 at 11:12:29AM +0100, Yair Benita wrote: > Hi guys, > I have almost everything ready. Still some problems with the reportlab > module. > I don't have more time to spend before the weekend. If you want to post > MacBiopython already, everything else is ready. Let me know. > Yair > -- > Yair Benita > Pharmaceutical Proteomics > Utrecht University > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev@biopython.org > http://biopython.org/mailman/listinfo/biopython-dev From chapmanb at arches.uga.edu Tue Jan 8 12:30:57 2002 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] MacBiopython available soon In-Reply-To: <20020108092144.A778@krusty.stanford.edu> References: <20020108092144.A778@krusty.stanford.edu> Message-ID: <20020108123057.A20508@ci350185-a.athen1.ga.home.com> Yair: > > I have almost everything ready. Still some problems with the reportlab > > module. > > I don't have more time to spend before the weekend. If you want to post > > MacBiopython already, everything else is ready. Let me know. That would be great -- I think you should just go ahead and post what you've got -- I can take a look at the reportlab/graphics problems and see if I can figure them out. Jeff: > Does everything install, except for reportlab? If so, that's probably > ok. 
Correct me if I'm wrong Brad, but I don't think very many people > on the mac will be needing this functionality? Yeah, I definitely wouldn't consider non-working reportlab stuff a major problem. I can see if I can figure anything out and iron out any glitches before the next release so that we won't have problems with it. No one will need reportlab unless they want graphics, which is a very minor component. pdf-generation-is-only-necessary-around-poster-making-time-ly yr's, Brad From Y.Benita at pharm.uu.nl Tue Jan 8 12:38:25 2002 From: Y.Benita at pharm.uu.nl (Yair Benita) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] MacBiopython download Message-ID: OK then, everything works except: 1. Reportlab module - I will not fix it unless someone asks for it. 2. Local blast. It is compiled in the most lame way for the Mac. I have to go and fix stuff in the blast code and then recompile. It's too much work for now, but it's on my list. MacBiopython 1.00a4 can be downloaded from: http://homepage.mac.com/ybenita Yair -- Yair Benita Pharmaceutical Proteomics Utrecht University From chapmanb at arches.uga.edu Tue Jan 8 20:32:32 2002 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] MacBiopython download In-Reply-To: References: Message-ID: <20020108203232.A12021@ci350185-a.athen1.ga.home.com> Hi Yair; > 1. Reportlab module - I will not fix it unless someone asks for it. No problem. I think that sounds like a good plan. If I have time I'll try to take a look at it, since I introduced that dependency. > 2. Local blast. It is compiled in the most lame way for the Mac. I have to > go and fix stuff in the blast code and then recompile. It's too much work for > now, but it's on my list. Sounds like a good plan. > MacBiopython 1.00a4 can be downloaded from: > http://homepage.mac.com/ybenita Thanks much for getting it together! I've included it on the biopython download page.
By the way, this release is compiled for python 2.1? Just want to make sure I've got the right information. Thanks again. Brad -- PGP public key available from http://pgp.mit.edu/ From Y.Benita at pharm.uu.nl Wed Jan 9 04:54:51 2002 From: Y.Benita at pharm.uu.nl (Yair Benita) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] MacBiopython download In-Reply-To: <20020108203232.A12021@ci350185-a.athen1.ga.home.com> Message-ID: on 9/1/2002 2:32, Brad Chapman at chapmanb@arches.uga.edu wrote: > Thanks much for getting it together! I've included it on the > biopython download page. By the way, this release is compiled for > python 2.1? Just want to make sure I've got the right information. The new biopython release was compiled for MacPython 2.2, not for 2.1; that's one of the reasons reportlab does not work. I get an error that it was compiled for 2.1, so I have to recompile all their extensions for 2.2. Yair -- Yair Benita Pharmaceutical Proteomics Utrecht University From thomas at cbs.dtu.dk Fri Jan 11 16:09:08 2002 From: thomas at cbs.dtu.dk (Thomas Sicheritz-Ponten) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] WU-Blast parser ? Message-ID: Hej, Does someone have a working WU-blast parser or are there any plans to make an NCBIstandalone-compatible parser for WU-Blast? cheers -thomas -- Sicheritz-Ponten Thomas, Ph.D CBS, Department of Biotechnology thomas@biopython.org The Technical University of Denmark CBS: +45 45 252489 Building 208, DK-2800 Lyngby Fax +45 45 931585 http://www.cbs.dtu.dk/thomas De Chelonian Mobile ... The Turtle Moves ... From adalke at mindspring.com Mon Jan 14 21:08:37 2002 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] WU-Blast parser ? Message-ID: <043d01c19d69$8bd7ff00$0201a8c0@josiah.dalkescientific.com> Hej Thomas, >Does someone have a working WU-blast parser or are there any plans to make an >NCBIstandalone-compatible parser for WU-Blast?
I started working on one last week for Martel. If you send me some sample data files it would help. I've dug up a few, but I don't know how much that format changes over time. I'm also trying to get the Martel tag names to be similar to the Bioperl ones, with the thought of normalizing both the different homology search results across Biopython as well as between Biopython and Bioperl. Andrew From adalke at mindspring.com Tue Jan 15 09:36:16 2002 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] changes Message-ID: <000801c19dd2$01687340$0201a8c0@josiah.dalkescientific.com> I've committed some changes in CVS:

- added 'fasta' reader
- modified the 'genbank' reader to take the new-style Std tags (also fixed the "" bug in feature qualifier values) (the parsing is about 1/2 the performance of SWISS-PROT; haven't figured out why)
- added a 'block' parser, but no builder yet
- added the beginnings of a 'blast' parser
- added a DBXRef module for database cross references
- a couple additions to Martel:
  - SkipLinesUntil / SkipLinesTo ... equivalent to while not pattern.match(line): line = infile.next()
  - can now iterate HeaderFooter records
- the SeqRecord now stores features and database cross references
- the SeqRecord stores a Seq instead of a string
- genbank and swissprot records set the correct alphabet type
- the 'parse' and 'identify' commands now take XML 'source' objects, which can be a URL or a file handle.

Huh. Guess I was busy this evening. Still need to run full regressions against the new formats. BTW, the FASTA reader tries to parse the NCBI fields and generate appropriate dbxref fields for the SeqRecord.
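[A plain-Python equivalent of the SkipLinesUntil idea quoted in the change list -- skip_lines_until is a hypothetical name, not the Martel API, and the sample lines are made-up illustrations:]

```python
import re

# Plain-Python equivalent of the SkipLinesUntil idea: consume lines from
# an iterator until one matches the pattern, returning the matching line.
# Illustrative sketch only; Martel expressed this as a grammar element.

def skip_lines_until(lines, pattern):
    regexp = re.compile(pattern)
    for line in lines:
        if regexp.match(line):
            return line
    return None  # pattern never matched

# usage on made-up SWISS-PROT-like lines
lines = iter(["junk\n", "more junk\n", "ID   ROA1_HUMAN\n", "AC   X00000\n"])
first = skip_lines_until(lines, r"ID   ")
```

[After the call, the iterator is positioned just past the matching line, which is what a record parser wants.]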
Andrew dalke@dalkescientific.com From thomas at cbs.dtu.dk Sun Jan 13 11:09:49 2002 From: thomas at cbs.dtu.dk (Thomas Sicheritz-Ponten) Date: Sat Mar 5 14:43:09 2005 Subject: ARGHHH [Biopython-dev] changes In-Reply-To: "Andrew Dalke"'s message of "Tue, 15 Jan 2002 07:36:16 -0700" References: <000801c19dd2$01687340$0201a8c0@josiah.dalkescientific.com> Message-ID: ... That's bad .... A CVS update breaks _all_ existing code which uses the Fasta reader! (because of the 2.2-specific things .. weakref, generators etc.) _URGENT_: Does anybody know how to undo a CVS update? at-hyperspeed-reading-the-CVS-manual-for-the-first-time'ly yr's -thomas -- Sicheritz-Ponten Thomas, Ph.D CBS, Department of Biotechnology thomas@biopython.org The Technical University of Denmark CBS: +45 45 252489 Building 208, DK-2800 Lyngby Fax +45 45 931585 http://www.cbs.dtu.dk/thomas De Chelonian Mobile ... The Turtle Moves ... From chapmanb at arches.uga.edu Wed Jan 16 11:13:36 2002 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:43:09 2005 Subject: ARGHHH [Biopython-dev] changes In-Reply-To: References: <000801c19dd2$01687340$0201a8c0@josiah.dalkescientific.com> Message-ID: <20020116111336.A56180@ci350185-a.athen1.ga.home.com> > A CVS update breaks _all_ existing code which uses the Fasta reader! > (because of the 2.2-specific things .. weakref, generators etc.) Hmmm, I thought I fixed this... Looks like Andrew backed out my change. Not sure why. Maybe he fixed the 2.2-specific stuff, but I guess not. Don't know. Anyways, if you want to fix things now, the only intrusive part is in Bio/SeqRecord. Just comment out the FormatIO-related stuff at the top of this file, and you should be able to use everything except the new stuff just fine.
No-CVS-mojo-needed-to-comment-out-bad-parts-ly yr's, Brad From adalke at mindspring.com Wed Jan 16 11:29:53 2002 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:09 2005 Subject: ARGHHH [Biopython-dev] changes Message-ID: <008a01c19eab$0793af80$0201a8c0@josiah.dalkescientific.com> >A CVS update breaks _all_ existing code which uses the Fasta reader! >(because of the 2.2-specific things .. weakref, generators etc.) weakref was added in 2.1. I got rid of the 'yield' statements in my code .... but forgot to get rid of the __future__ statement. I've committed changes to make 'Bio.Fasta' importable under 2.0: [dalke@pw600a ~]$ date Wed Jan 16 11:13:47 EST 2002 [dalke@pw600a ~]$ python2.0 Python 2.0 (#4, Dec 8 2000, 21:23:00) [GCC egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)] on linux2 Type "copyright", "credits" or "license" for more information. >>> from Bio import Fasta >>> >_URGENT_: Does anybody know how to undo a CVS update? You can check out based on a date. Or you can update to the newest code in CVS. Sorry about the problems. I tested a few things with 2.1 on my machine but not all the modules, and I didn't try under 2.0 or older at all until just now. Andrew
Jeff From adalke at mindspring.com Wed Jan 16 11:56:44 2002 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:09 2005 Subject: ARGHHH [Biopython-dev] changes Message-ID: <009601c19eae$c7ca3320$0201a8c0@josiah.dalkescientific.com> >How much easier do the >2.1/2.2 features make your lives? Where would you use them, and how >much backwards compatibility would it break? The features I use in 2.1 are: warnings - to generate warnings weakref - to build complex data structures that are appropriately garbage collected The features I would like to use from 2.2 are: yield - can in some cases be done with adapters, but there are a few places where it's nice iterators - have been using adapters, but it's nice I would also like the new __slots__ mechanism in 2.2, but I've been using __getattr__ tricks long enough that it comes naturally to me, but not to others. Andrew From chapmanb at arches.uga.edu Wed Jan 16 12:20:56 2002 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:43:09 2005 Subject: ARGHHH [Biopython-dev] changes In-Reply-To: <20020116084133.A483@krusty.stanford.edu> References: <008a01c19eab$0793af80$0201a8c0@josiah.dalkescientific.com> <20020116084133.A483@krusty.stanford.edu> Message-ID: <20020116122056.A56273@ci350185-a.athen1.ga.home.com> Jeff asks: > Right now, Biopython requires a minimum of Python 2.0. Is it time to > up the specs again? What do people think? How much easier do the > 2.1/2.2 features make your lives? Where would you use them, and how > much backwards compatibility would it break? I'm all for moving forward with the minimum requirement if people think it helps them. The number one thing I like about 2.1 is the warning framework. 2.2 has the nice new Iterators and Generators which I would be keen on learning to use. It's probably too early to require 2.2, but starting to require 2.1 would help move us towards requiring 2.2 sooner :-).
Newer-is-always-better-than-older-except-when-it-comes-to-bourbon-ly yr's, Brad From adalke at mindspring.com Wed Jan 16 12:39:53 2002 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:09 2005 Subject: ARGHHH [Biopython-dev] changes Message-ID: <00d401c19eb4$ce8af680$0201a8c0@josiah.dalkescientific.com> Brad: >It's probably too early to require 2.2, but >starting to require 2.1 would help move us towards requiring 2.2 sooner >:-). I agree. If my clients are any judge, they waited until 2.1 to install a 2.x Python, and they don't want to upgrade just yet. Andrew From jchang at smi.stanford.edu Wed Jan 16 12:51:18 2002 From: jchang at smi.stanford.edu (Jeffrey Chang) Date: Sat Mar 5 14:43:09 2005 Subject: ARGHHH [Biopython-dev] changes In-Reply-To: <00d401c19eb4$ce8af680$0201a8c0@josiah.dalkescientific.com> References: <00d401c19eb4$ce8af680$0201a8c0@josiah.dalkescientific.com> Message-ID: <20020116095118.A484@krusty.stanford.edu> What about you, Thomas? What version are you using, and what do you think about nudging the requirements up to 2.1? I'm currently using an alpha of 2.2 and really like the generators. I'm using them in some other code, but haven't required them in Biopython yet. Jeff On Wed, Jan 16, 2002 at 10:39:53AM -0700, Andrew Dalke wrote: > Brad: > >It's probably too early to require 2.2, but > >starting to require 2.1 would help move us towards requiring 2.2 sooner > >:-). > > I agree. If my clients are any judge, they waited until 2.1 > to install a 2.x Python, and they don't want to upgrade just yet.
> > Andrew > > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev@biopython.org > http://biopython.org/mailman/listinfo/biopython-dev From thomas at cbs.dtu.dk Mon Jan 14 03:15:06 2002 From: thomas at cbs.dtu.dk (Thomas Sicheritz-Ponten) Date: Sat Mar 5 14:43:09 2005 Subject: ARGHHH [Biopython-dev] changes In-Reply-To: Jeffrey Chang's message of "Wed, 16 Jan 2002 09:51:18 -0800" References: <00d401c19eb4$ce8af680$0201a8c0@josiah.dalkescientific.com> <20020116095118.A484@krusty.stanford.edu> Message-ID: Jeffrey Chang writes: > What about you, Thomas? What version are you using, and what do you > think about nudging the requirements up to 2.1? > > I'm currently using an alpha of 2.2 and really like the generators. > I'm using them in some other code, but haven't required them in > Biopython yet. I personally always like to play around with the latest toys, but for the last half year I've become a little lazy. Mostly because any Python update has to be done at the same time on all of my machines/accounts which run PyPhy (currently Denmark, Sweden, England, USA and Canada ... ) Right now I'm running 2.0, but I am willing to upgrade .. I don't think we should require 2.1; either we stay at 2.0 or climb the whole way to 2.2. That would cause the least inconvenience for users (and administrators) (IMHO) So what are the coolest features in 2.1 and 2.2? cheers -thomas -- Sicheritz-Ponten Thomas, Ph.D CBS, Department of Biotechnology thomas@biopython.org The Technical University of Denmark CBS: +45 45 252489 Building 208, DK-2800 Lyngby Fax +45 45 931585 http://www.cbs.dtu.dk/thomas De Chelonian Mobile ... The Turtle Moves ...
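[Editor's note: to make Thomas's question concrete, here is a minimal sketch of what the 2.2-style generator feature discussed in this thread buys for something like a Fasta reader. The function name and shape are hypothetical illustrations, not the Bio.Fasta API.]

```python
def parse_fasta(handle):
    """Yield (title, sequence) pairs from a FASTA file-like object.

    A 'yield'-based reader: records stream out one at a time, with no
    callback or adapter machinery. Hypothetical sketch, not Bio.Fasta.
    """
    title, seq = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(seq)   # emit the finished record
            title, seq = line[1:], []       # start the next record
        elif line:
            seq.append(line)
    if title is not None:
        yield title, "".join(seq)           # emit the final record
```

The caller just writes `for title, seq in parse_fasta(open("test.fasta")):` -- this is the kind of simplification that, pre-yield, required the adapter classes Andrew mentions.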
From katel at worldpath.net Thu Jan 17 18:13:27 2002 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] Effective ways to use Martel Message-ID: <001801c19fac$934d6a00$010a0a0a@cadence.com> Background: Andrew took the time to look into a pesky bug in my saf_format (not yet committed). The consumer occasionally received tags that were sliced at the wrong point. Think of an enzyme that attaches to the wrong base by a few bases. Andrew pointed out a bug where I assumed a name was always 14 characters where it should be a maximum of 14 characters. However, this did not explain the bug. Further investigation showed the problem to be a limitation of EventGenerator. My format used embedded tags and EventGenerator does not support them. Andrew recommended his new Dispatch module as an alternative. Should this be documented? Andrew says in the future Dispatch will be the preferred tool. But without documentation, what keeps users from using the old technique and running into the same issue? Cayte From adalke at mindspring.com Fri Jan 18 07:36:31 2002 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] my latest trick Message-ID: <002b01c1a01c$c2cc9380$0201a8c0@josiah.dalkescientific.com> The FormatIO system now supports BLAST and WU-BLAST, although it isn't yet finished (missing a few details, like locations and the sequence name). Here's what it looks like. First, autodetection of the file format: >>> import Bio >>> format = Bio.formats["search"].identify("blastp.2.wu") >>> format.name 'wu-blastp' >>> It isn't figuring things out from the extension - it's testing the contents of the file. Here's proof. >>> import os >>> os.symlink("blastp.2.wu", "unknown.dat") >>> Bio.formats["search"].identify("unknown.dat").name 'wu-blastp' >>> And I can load this into memory as a 'Search' object.
>>> from Bio import Search >>> search = Search.io.read("unknown.dat").next() # The 'next' is because the FormatIO system assumes multiple # records. I have an idea of how to fix that. >>> search.query.description 'YEL060C vacuolar protease B' >>> search.algorithm.name 'BLASTP' >>> len(search.hits) 12 >>> >>> for hit in search.hits: ... print hit.name, hit.description ... print len(hit.hsps), "hsps" ... for hsp in hit.hsps: ... print hsp.query.seq ... print hsp.homology.seq ... print hsp.subject.seq ... (Lots printed out) Many of the fields are stored, like: >>> for k, v in search.statistics.items(): ... print k, "=", repr(v) ... total_time = ' 0.95u 0.08s 1.03t Elapsed: 00:00:01' num_threads = 1 posted_date = ' 3:27 PM PST Mar 9, 1998' start_time = 'Mon Mar 9 15:57:59 1998' num_dfa_states = ' 569 (112 KB)' total_dfa_size = ' 358 KB (384 KB)' database = ' /tmpblast/PDBUNIQ' release_date = ' unknown' format = ' BLAST' num_sequences_in_database = 2335 num_sequences_satisying_E = 12 end_time = 'Mon Mar 9 15:58:00 1998' neighborhood_generation_time = ' 0.01u 0.00s 0.01t Elapsed: 00:00:00' num_letters_in_database = 479690 search_cpu_time = ' 0.90u 0.05s 0.95t Elapsed: 00:00:01' database_title = ' PDBUNIQ' >>> >>> hsp.query.positives 56 >>> hsp.query.frac_positives 0.45528455284552843 >>> Normalization is still a problem, as you can see from the untrimmed strings. And I don't quite get everything; it's pretty tedious. (There are a few fields in the hit I also don't handle, like what do I do with 'P(2)' compared to 'P'? Someone with a better technical understanding of the details of the algorithms needs to help me. Perhaps in Tucson.) The biggest thing missing is failure cases. The data files I found were all for successful runs. The format expressions get hairy. The biggest problems occur when formatY is almost like formatX except for a small change in the bottom of the expression tree. Then the whole tree needs to be reconstructed, which is noisy.
I'm thinking about possibilities like blastn = Martel.replace_group(blastp, "hsp_info", blastn_hsp_info) All these tree editing methods are ad hoc. I keep wondering what it would be like to convert the tree to a DOM and then use the DOM methods to manipulate the structure. That'll have to wait for Martel version 2. Andrew dalke@dalkescientific.com From Y.Benita at pharm.uu.nl Tue Jan 22 04:05:34 2002 From: Y.Benita at pharm.uu.nl (Yair Benita) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] Bug in urllib of macpython Message-ID: Due to a bug in urllib of MacPython, most of the WWW modules are not functioning well. The bug is known and will be taken care of by the python developers. However, I had to use another function to replace the command urllib.urlopen(). The following files were changed: NCBI.py ExPAsy.py InterPro.py SCOP.py Besides these files, is there any other place where the urllib.urlopen command is used? Yair -- Yair Benita Pharmaceutical Proteomics Utrecht University From adalke at mindspring.com Tue Jan 22 20:32:34 2002 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] latest checkin Message-ID: <00e201c1a3ad$d564b060$0201a8c0@josiah.dalkescientific.com> Okay, I've updated CVS to my latest working version of Martel and the Bioformat code. Most of this commit was related to performance. I can go into details, but it isn't worth it since it isn't there any more. I can currently convert sprot38.dat from SWISS-PROT to a full SeqRecord in just under 20 minutes (extrapolated). This is down from about 28 minutes, so roughly 30% faster. If I really had to, I figured out a hack approach that might shave another few minutes off the time. The ncbi blast parser takes about 4.7 seconds to parse my main test file while Jeff's code takes 3.7 seconds. I managed to whittle it down from 15 seconds, so I'm pretty happy.
The hack approach might get me another half second, but I'm not doing quite everything Jeff does, so no predictions that I can match his performance. The biggest performance problem is Python's function call overhead. For something like <name>Andrew</name> there are normally at least three function calls: startElement("name", {}) characters("Andrew") endElement("name") The ContentHandler can be written as a chain of if/else statements, def startElement(self, tag, attrs): if tag == .. ... elif tag == "name": self.s = "" self.save_characters = 1 ... def characters(self, s): if self.save_characters: self.s += s def endElement(self, tag): ... elif tag == "name": print "I have", self.s self.save_characters = 0 The problem with this is that the comparison tests ("==") are linear in the number of tags. Even if sorted to present the most common tags first, there's still quite a bit of overhead -- perhaps a few dozen equality tests for a handful of lines of "real" code. One thing I tried last year was a "dispatch" handler, which looks like this: def startElement(self, tag, attrs): f = getattr(self, "start_" + tag, None) if f is not None: f(tag, attrs) def endElement(self, tag): f = getattr(self, "end_" + tag, None) if f is not None: f(tag) ... def start_name(self, tag, attrs): self.s = "" self.save_characters = 1 def end_name(self, tag): print "I have", self.s self.save_characters = 0 This replaces the equality tests with a getattr - which is a dictionary lookup - and an extra function call. This turns out to be faster, at least in my test. In the latest Martel distribution, I've one-upped this.
When the Dispatch.Dispatcher() starts up, it introspects itself and finds all the method names which start with "start_" and "end_", then builds a table mapping tag name to function, so the dispatch doesn't need to create an intermediate string: def startElement(self, tag, attrs): f = self._start_table.get(tag) if f is not None: f(tag, attrs) I then tweaked the Martel.Parser so if the ContentHandler is a Dispatcher then specialized parser code reaches into the class to get its _start_table and _end_table. (A la a C++ "friend" class.) This reduces function call overhead in two ways: - there isn't the intermediate startElement/endElement call - if there's no handler for that tag then there are no function calls at all. These tricks really make an appreciable performance difference, don't change the normal API, and aren't all that hard to understand, which is why I can justify breaking some OO boundaries. I think there's one other performance problem in the code, which is that state information is passed around through class attributes. Attribute lookups go through a dict lookup while regular local variables are a constant-time lookup and quite a bit faster. I can't think of any way to get around this, except that new-style Python objects support __slots__, which might be faster. Andrew dalke@dalkescientific.com From adalke at mindspring.com Tue Jan 22 20:53:32 2002 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] Effective ways to use Martel Message-ID: <00e901c1a3b0$c34c1780$0201a8c0@josiah.dalkescientific.com> Cayte: >My format used embedded tags and EventGenerator does not >support them. Andrew recommended his new Dispatch module >as an alternative. The new Dispatch module; an example. Start by reading previous email. Here's a simple format definition for a FASTA file.
from Martel import Str, Group, UntilEol, AssertNot, Rep, AnyEol header = (Str(">") + Group("description", UntilEol()) + AnyEol()) seqline = (AssertNot(Str(">")) + Group("sequence", UntilEol()) + AnyEol()) record = Group("record", header + Rep(seqline)) format = Rep(record) Suppose you want to print the sequence length and the header definition. Here's how to do it with the Dispatcher. from Martel import Dispatch class SeqLength(Dispatch.Dispatcher): def start_record(self, tag, attrs): self.seqlen = 0 def start_description(self, tag, attrs): self.save_characters() def end_description(self, tag): self.description = self.get_characters() def start_sequence(self, tag, attrs): self.save_characters() def end_sequence(self, tag): self.seqlen += len(self.get_characters()) def end_record(self, tag): print self.seqlen, "in", self.description This Dispatcher is a regular ContentHandler, so it is used like this, assuming that "test.fasta" contains a FASTA file. p = format.make_parser() p.setContentHandler(SeqLength()) p.parse(open("test.fasta")) On my test data set, it looks like this: 378 in AK1H_ECOLI/114-431 389 in AKH_HAEIN/114-431 389 in AKH1_MAIZE/117-440 378 in AK2H_ECOLI/112-431 381 in AK1_BACSU/66-374 411 in AK2_BACST/63-370 411 in AK2_BACSU/63-373 411 in AKAB_CORFL/63-379 411 in AKAB_MYCSM/63-379 377 in AK3_ECOLI/106-407 391 in AK_YEAST/134-472 The new thing in this example is the "save_characters()" and "get_characters()". This is a stack-based approach for getting all the characters between a start-tag and an end-tag. So long as the calls are balanced, many different elements can get characters without stepping on each other's toes. Hmmm, need an example which shows this support for overlaps. > Should this be documented? Andrew says in the future Dispatch >will be the preferred tool. But without documentation, what >keeps users from using the >old technique and running into the same issue? Yes, it should be documented.
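[Editor's note: the overlap example asked for above might look like the following. This is only a sketch of the stack-based capture idea, not the actual Martel Dispatch implementation: each save_characters() pushes a fresh buffer, characters() feeds every active buffer, and get_characters() pops the innermost one, so nested elements capture overlapping text independently.]

```python
class CharacterStack:
    """Sketch of stack-based character capture with overlap support.

    Balanced save_characters()/get_characters() calls let an outer
    element (e.g. <record>) and an inner element (e.g. <sequence>)
    both collect text from the same characters() events.
    Illustrative only -- not the Martel Dispatch code.
    """
    def __init__(self):
        self._stack = []

    def save_characters(self):
        self._stack.append([])          # push a fresh buffer

    def characters(self, s):
        for buf in self._stack:         # every active buffer sees the text
            buf.append(s)

    def get_characters(self):
        return "".join(self._stack.pop())  # pop the innermost buffer
```

Here the outer capture sees everything the inner one saw plus the text before and after it, which is exactly the overlap case the SeqLength example relies on.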
It also depends on whether the work I've been doing has gotten to the point where we can start thinking about deprecating the existing code. In which case the documentation is easy - at the top of the module say "DEPRECATED - SEE XXX.py" Andrew dalke@dalkescientific.com From katel at worldpath.net Wed Jan 23 00:01:01 2002 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:43:09 2005 Subject: [Biopython-dev] Effective ways to use Martel References: <00e901c1a3b0$c34c1780$0201a8c0@josiah.dalkescientific.com> Message-ID: <004801c1a3ca$f51fae60$010a0a0a@cadence.com> ----- Original Message ----- From: "Andrew Dalke" To: Sent: Tuesday, January 22, 2002 5:53 PM Subject: Re: [Biopython-dev] Effective ways to use Martel > Cayte: > >My format used embedded tags and EventGenerator does not > >support them. Andrew recommended his new Dispatch module > >as an alternative. > I plan to use Dispatch for saf, but before you got back to me I started porting the ECell perl script. I'd like to finish while perl is fresh in my mind, because I have to relearn the cryptic codes every time I revisit perl. Cayte From mark at acoma.Stanford.EDU Wed Jan 23 15:36:51 2002 From: mark at acoma.Stanford.EDU (Mark Lambrecht) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] Prosite Message-ID: Hi, Thanks for all the excellent Biopython code. I used the Prosite parser and it breaks on a number of CC and MA lines. Maybe there is a new version of the prosite.dat file? We added some code to the Bio/Prosite/__init__.py, and commented it with ## (lambrecht/dyoo). Then everything works again but possibly doesn't use the information in these lines. I attached the __init__.py. Could you take a look? Thanks!!
Mark -------------------------------------------------------------------------- Mark Lambrecht Postdoctoral Research Fellow The Arabidopsis Information Resource FAX: (650) 325-6857 Carnegie Institution of Washington Tel: (650) 325-1521 ext.397 Department of Plant Biology URL: http://arabidopsis.org/ 260 Panama St. Stanford, CA 94305 -------------------------------------------------------------------------- -------------- next part -------------- # Copyright 1999 by Jeffrey Chang. All rights reserved. # This code is part of the Biopython distribution and governed by its # license. Please see the LICENSE file that should have been included # as part of this package. # Copyright 2000 by Jeffrey Chang. All rights reserved. # This code is part of the Biopython distribution and governed by its # license. Please see the LICENSE file that should have been included # as part of this package. """Prosite This module provides code to work with the prosite.dat file from Prosite. http://www.expasy.ch/prosite/ Tested with: Release 15.0, July 1998 Release 16.0, July 1999 Classes: Record Holds Prosite data. PatternHit Holds data from a hit against a Prosite pattern. Iterator Iterates over entries in a Prosite file. Dictionary Accesses a Prosite file using a dictionary interface. ExPASyDictionary Accesses Prosite records from ExPASy. RecordParser Parses a Prosite record into a Record object. _Scanner Scans Prosite-formatted data. _RecordConsumer Consumes Prosite data to a Record object. Functions: scan_sequence_expasy Scan a sequence for occurrences of Prosite patterns. index_file Index a Prosite file for a Dictionary. _extract_record Extract Prosite data from a web page. _extract_pattern_hits Extract Prosite patterns from a web page. 
""" __all__ = [ 'Pattern', 'Prodoc', ] from types import * import string import re import sgmllib from Bio import File from Bio import Index from Bio.ParserSupport import * from Bio.WWW import ExPASy from Bio.WWW import RequestLimiter class Record: """Holds information from a Prosite record. Members: name ID of the record. e.g. ADH_ZINC type Type of entry. e.g. PATTERN, MATRIX, or RULE accession e.g. PS00387 created Date the entry was created. (MMM-YYYY) data_update Date the 'primary' data was last updated. info_update Date data other than 'primary' data was last updated. pdoc ID of the PROSITE DOCumentation. description Free-format description. pattern The PROSITE pattern. See docs. matrix List of strings that describes a matrix entry. rules List of rule definitions. (strings) NUMERICAL RESULTS nr_sp_release SwissProt release. nr_sp_seqs Number of seqs in that release of Swiss-Prot. (int) nr_total Number of hits in Swiss-Prot. tuple of (hits, seqs) nr_positive True positives. tuple of (hits, seqs) nr_unknown Could be positives. tuple of (hits, seqs) nr_false_pos False positives. tuple of (hits, seqs) nr_false_neg False negatives. (int) nr_partial False negatives, because they are fragments. (int) COMMENTS cc_taxo_range Taxonomic range. See docs for format cc_max_repeat Maximum number of repetitions in a protein cc_site Interesting site. list of tuples (pattern pos, desc.) cc_skip_flag Can this entry be ignored? DATA BANK REFERENCES - The following are all lists of tuples (swiss-prot accession, swiss-prot name) dr_positive dr_false_neg dr_false_pos dr_potential Potential hits, but fingerprint region not yet available. dr_unknown Could possibly belong pdb_structs List of PDB entries. 
""" def __init__(self): self.name = '' self.type = '' self.accession = '' self.created = '' self.data_update = '' self.info_update = '' self.pdoc = '' self.description = '' self.pattern = '' self.matrix = [] self.rules = [] self.nr_sp_release = '' self.nr_sp_seqs = '' self.nr_total = (None, None) self.nr_positive = (None, None) self.nr_unknown = (None, None) self.nr_false_pos = (None, None) self.nr_false_neg = None self.nr_partial = None self.cc_taxo_range = '' self.cc_max_repeat = '' self.cc_site = [] self.cc_skip_flag = '' self.dr_positive = [] self.dr_false_neg = [] self.dr_false_pos = [] self.dr_potential = [] self.dr_unknown = [] self.pdb_structs = [] class PatternHit: """Holds information from a hit against a Prosite pattern. Members: name ID of the record. e.g. ADH_ZINC accession e.g. PS00387 pdoc ID of the PROSITE DOCumentation. description Free-format description. matches List of tuples (start, end, sequence) where start and end are indexes of the match, and sequence is the sequence matched. """ def __init__(self): self.name = None self.accession = None self.pdoc = None self.description = None self.matches = [] def __str__(self): lines = [] lines.append("%s %s %s" % (self.accession, self.pdoc, self.name)) lines.append(self.description) lines.append('') if len(self.matches) > 1: lines.append("Number of matches: %s" % len(self.matches)) for i in range(len(self.matches)): start, end, seq = self.matches[i] range_str = "%d-%d" % (start, end) if len(self.matches) > 1: lines.append("%7d %10s %s" % (i+1, range_str, seq)) else: lines.append("%7s %10s %s" % (' ', range_str, seq)) return string.join(lines, '\n') class Iterator: """Returns one record at a time from a Prosite file. Methods: next Return the next record from the stream, or None. """ def __init__(self, handle, parser=None): """__init__(self, handle, parser=None) Create a new iterator. handle is a file-like object. parser is an optional Parser object to change the results into another form. 
If set to None, then the raw contents of the file will be returned. """ if type(handle) is not FileType and type(handle) is not InstanceType: raise ValueError, "I expected a file handle or file-like object" self._uhandle = File.UndoHandle(handle) self._parser = parser def next(self): """next(self) -> object Return the next Prosite record from the file. If no more records, return None. """ # Skip the copyright info, if it's the first record. line = self._uhandle.peekline() if line[:2] == 'CC': while 1: line = self._uhandle.readline() if not line: break if line[:2] == '//': break if line[:2] != 'CC': raise SyntaxError, \ "Oops, where's the copyright?" lines = [] while 1: line = self._uhandle.readline() if not line: break lines.append(line) if line[:2] == '//': break if not lines: return None data = string.join(lines, '') if self._parser is not None: return self._parser.parse(File.StringHandle(data)) return data class Dictionary: """Accesses a Prosite file using a dictionary interface. """ __filename_key = '__filename' def __init__(self, indexname, parser=None): """__init__(self, indexname, parser=None) Open a Prosite Dictionary. indexname is the name of the index for the dictionary. The index should have been created using the index_file function. parser is an optional Parser object to change the results into another form. If set to None, then the raw contents of the file will be returned. """ self._index = Index.Index(indexname) self._handle = open(self._index[Dictionary.__filename_key]) self._parser = parser def __len__(self): return len(self._index) def __getitem__(self, key): start, len = self._index[key] self._handle.seek(start) data = self._handle.read(len) if self._parser is not None: return self._parser.parse(File.StringHandle(data)) return data def __getattr__(self, name): return getattr(self._index, name) class ExPASyDictionary: """Access PROSITE at ExPASy using a read-only dictionary interface. 
""" def __init__(self, delay=5.0, parser=None): """__init__(self, delay=5.0, parser=None) Create a new Dictionary to access PROSITE. parser is an optional parser (e.g. Prosite.RecordParser) object to change the results into another form. If set to None, then the raw contents of the file will be returned. delay is the number of seconds to wait between each query. """ self.parser = parser self.limiter = RequestLimiter(delay) def __len__(self): raise NotImplementedError, "Prosite contains lots of entries" def clear(self): raise NotImplementedError, "This is a read-only dictionary" def __setitem__(self, key, item): raise NotImplementedError, "This is a read-only dictionary" def update(self): raise NotImplementedError, "This is a read-only dictionary" def copy(self): raise NotImplementedError, "You don't need to do this..." def keys(self): raise NotImplementedError, "You don't really want to do this..." def items(self): raise NotImplementedError, "You don't really want to do this..." def values(self): raise NotImplementedError, "You don't really want to do this..." def has_key(self, id): """has_key(self, id) -> bool""" try: self[id] except KeyError: return 0 return 1 def get(self, id, failobj=None): try: return self[id] except KeyError: return failobj raise "How did I get here?" def __getitem__(self, id): """__getitem__(self, id) -> object Return a Prosite entry. id is either the id or accession for the entry. Raises a KeyError if there's an error. """ # First, check to see if enough time has passed since my # last query. self.limiter.wait() try: handle = ExPASy.get_prosite_entry(id) except IOError: raise KeyError, id try: handle = File.StringHandle(_extract_record(handle)) except ValueError: raise KeyError, id if self.parser is not None: return self.parser.parse(handle) return handle.read() class RecordParser(AbstractParser): """Parses Prosite data into a Record object. 
""" def __init__(self): self._scanner = _Scanner() self._consumer = _RecordConsumer() def parse(self, handle): self._scanner.feed(handle, self._consumer) return self._consumer.data class _Scanner: """Scans Prosite-formatted data. Tested with: Release 15.0, July 1998 """ def feed(self, handle, consumer): """feed(self, handle, consumer) Feed in Prosite data for scanning. handle is a file-like object that contains prosite data. consumer is a Consumer object that will receive events as the report is scanned. """ if isinstance(handle, File.UndoHandle): uhandle = handle else: uhandle = File.UndoHandle(handle) while 1: line = uhandle.peekline() if not line: break elif is_blank_line(line): # Skip blank lines between records uhandle.readline() continue elif line[:2] == 'ID': self._scan_record(uhandle, consumer) elif line[:2] == 'CC': self._scan_copyrights(uhandle, consumer) else: raise SyntaxError, "There doesn't appear to be a record" def _scan_copyrights(self, uhandle, consumer): consumer.start_copyrights() self._scan_line('CC', uhandle, consumer.copyright, any_number=1) self._scan_terminator(uhandle, consumer) consumer.end_copyrights() def _scan_record(self, uhandle, consumer): consumer.start_record() for fn in self._scan_fns: fn(self, uhandle, consumer) # In Release 15.0, C_TYPE_LECTIN_1 has the DO line before # the 3D lines, instead of the other way around. # Thus, I'll give the 3D lines another chance after the DO lines # are finished. if fn is self._scan_do.im_func: self._scan_3d(uhandle, consumer) consumer.end_record() def _scan_line(self, line_type, uhandle, event_fn, exactly_one=None, one_or_more=None, any_number=None, up_to_one=None): # Callers must set exactly one of exactly_one, one_or_more, or # any_number to a true value. I do not explicitly check to # make sure this function is called correctly. # This does not guarantee any parameter safety, but I # like the readability. The other strategy I tried was have # parameters min_lines, max_lines. 
if exactly_one or one_or_more: read_and_call(uhandle, event_fn, start=line_type) if one_or_more or any_number: while 1: if not attempt_read_and_call(uhandle, event_fn, start=line_type): break if up_to_one: attempt_read_and_call(uhandle, event_fn, start=line_type) def _scan_id(self, uhandle, consumer): self._scan_line('ID', uhandle, consumer.identification, exactly_one=1) def _scan_ac(self, uhandle, consumer): self._scan_line('AC', uhandle, consumer.accession, exactly_one=1) def _scan_dt(self, uhandle, consumer): self._scan_line('DT', uhandle, consumer.date, exactly_one=1) def _scan_de(self, uhandle, consumer): self._scan_line('DE', uhandle, consumer.description, exactly_one=1) def _scan_pa(self, uhandle, consumer): self._scan_line('PA', uhandle, consumer.pattern, any_number=1) def _scan_ma(self, uhandle, consumer): # ZN2_CY6_FUNGAL_2, DNAJ_2 in Release 15 # contain a CC line buried within an 'MA' line. Need to check # for that. while 1: if not attempt_read_and_call(uhandle, consumer.matrix, start='MA'): line1 = uhandle.readline() line2 = uhandle.readline() uhandle.saveline(line2) uhandle.saveline(line1) if line1[:2] == 'CC' and line2[:2] == 'MA': read_and_call(uhandle, consumer.comment, start='CC') else: break def _scan_ru(self, uhandle, consumer): self._scan_line('RU', uhandle, consumer.rule, any_number=1) def _scan_nr(self, uhandle, consumer): self._scan_line('NR', uhandle, consumer.numerical_results, any_number=1) def _scan_cc(self, uhandle, consumer): self._scan_line('CC', uhandle, consumer.comment, any_number=1) def _scan_dr(self, uhandle, consumer): self._scan_line('DR', uhandle, consumer.database_reference, any_number=1) def _scan_3d(self, uhandle, consumer): self._scan_line('3D', uhandle, consumer.pdb_reference, any_number=1) def _scan_do(self, uhandle, consumer): self._scan_line('DO', uhandle, consumer.documentation, exactly_one=1) def _scan_terminator(self, uhandle, consumer): self._scan_line('//', uhandle, consumer.terminator, exactly_one=1) _scan_fns = 
[ _scan_id, _scan_ac, _scan_dt, _scan_de, _scan_pa, _scan_ma, _scan_ru, _scan_nr, _scan_ma, ## (lambrecht/dyoo) is this right? _scan_nr, ## (lambrecht/dyoo) is this right? _scan_cc, _scan_dr, _scan_3d, _scan_do, _scan_terminator ] class _RecordConsumer(AbstractConsumer): """Consumer that converts a Prosite record to a Record object. Members: data Record with Prosite data. """ def __init__(self): self.data = None def start_record(self): self.data = Record() def end_record(self): self._clean_record(self.data) def identification(self, line): cols = string.split(line) if len(cols) != 3: raise SyntaxError, "I don't understand identification line\n%s" % \ line self.data.name = self._chomp(cols[1]) # don't want ';' self.data.type = self._chomp(cols[2]) # don't want '.' def accession(self, line): cols = string.split(line) if len(cols) != 2: raise SyntaxError, "I don't understand accession line\n%s" % line self.data.accession = self._chomp(cols[1]) def date(self, line): uprline = string.upper(line) cols = string.split(uprline) # Release 15.0 contains both 'INFO UPDATE' and 'INF UPDATE' if cols[2] != '(CREATED);' or \ cols[4] != '(DATA' or cols[5] != 'UPDATE);' or \ cols[7][:4] != '(INF' or cols[8] != 'UPDATE).': raise SyntaxError, "I don't understand date line\n%s" % line self.data.created = cols[1] self.data.data_update = cols[3] self.data.info_update = cols[6] def description(self, line): self.data.description = self._clean(line) def pattern(self, line): self.data.pattern = self.data.pattern + self._clean(line) def matrix(self, line): self.data.matrix.append(self._clean(line)) def rule(self, line): self.data.rules.append(self._clean(line)) def numerical_results(self, line): cols = string.split(self._clean(line), ';') for col in cols: if not col: continue qual, data = map(string.lstrip, string.split(col, '=')) if qual == '/RELEASE': release, seqs = string.split(data, ',') self.data.nr_sp_release = release self.data.nr_sp_seqs = int(seqs) elif qual == '/FALSE_NEG': 
self.data.nr_false_neg = int(data) elif qual == '/PARTIAL': self.data.nr_partial = int(data) ## (lambrecht/dyoo) added temporary fix for qual //MATRIX_TYPE in CC elif qual =='/MATRIX_TYPE': pass elif qual in ['/TOTAL', '/POSITIVE', '/UNKNOWN', '/FALSE_POS']: m = re.match(r'(\d+)\((\d+)\)', data) if not m: raise error, "Broken data %s in comment line\n%s" % \ (repr(data), line) hits = tuple(map(int, m.groups())) if(qual == "/TOTAL"): self.data.nr_total = hits elif(qual == "/POSITIVE"): self.data.nr_positive = hits elif(qual == "/UNKNOWN"): self.data.nr_unknown = hits elif(qual == "/FALSE_POS"): self.data.nr_false_pos = hits else: raise SyntaxError, "Unknown qual %s in comment line\n%s" % \ (repr(qual), line) def comment(self, line): cols = string.split(self._clean(line), ';') for col in cols: # DNAJ_2 in Release 15 has a non-standard comment line: # CC Automatic scaling using reversed database # Throw it away. (Should I keep it?) if not col or col[:17] == 'Automatic scaling': continue qual, data = map(string.lstrip, string.split(col, '=')) if qual in ('/MATRIX_TYPE', '/SCALING_DB', '/AUTHOR', '/FT_KEY', '/FT_DESC'): continue ## (lambrecht/dyoo) This is a temporary fix until we know what ## to do here if qual == '/TAXO-RANGE': self.data.cc_taxo_range = data elif qual == '/MAX-REPEAT': self.data.cc_max_repeat = data elif qual == '/SITE': pos, desc = string.split(data, ',') self.data.cc_site = (int(pos), desc) elif qual == '/SKIP-FLAG': self.data.cc_skip_flag = data else: raise SyntaxError, "Unknown qual %s in comment line\n%s" % \ (repr(qual), line) def database_reference(self, line): refs = string.split(self._clean(line), ';') for ref in refs: if not ref: continue acc, name, type = map(string.strip, string.split(ref, ',')) if type == 'T': self.data.dr_positive.append((acc, name)) elif type == 'F': self.data.dr_false_pos.append((acc, name)) elif type == 'N': self.data.dr_false_neg.append((acc, name)) elif type == 'P': self.data.dr_potential.append((acc, name)) elif 
type == '?': self.data.dr_unknown.append((acc, name)) else: raise SyntaxError, "I don't understand type flag %s" % type def pdb_reference(self, line): cols = string.split(line) for id in cols[1:]: # get all but the '3D' col self.data.pdb_structs.append(self._chomp(id)) def documentation(self, line): self.data.pdoc = self._chomp(self._clean(line)) def terminator(self, line): pass def _chomp(self, word, to_chomp='.,;'): # Remove the punctuation at the end of a word. if word[-1] in to_chomp: return word[:-1] return word def _clean(self, line, rstrip=1): # Clean up a line. if rstrip: return string.rstrip(line[5:]) return line[5:] def scan_sequence_expasy(seq=None, id=None, exclude_frequent=None): """scan_sequence_expasy(seq=None, id=None, exclude_frequent=None) -> list of PatternHit's Search a sequence for occurrences of Prosite patterns. You can specify either a sequence in seq or a SwissProt/trEMBL ID or accession in id. Only one of those should be given. If exclude_frequent is true, then the patterns with the high probability of occurring will be excluded. """ if (seq and id) or not (seq or id): raise ValueError, "Please specify either a sequence or an id" handle = ExPASy.scanprosite1(seq, id, exclude_frequent) return _extract_pattern_hits(handle) def _extract_pattern_hits(handle): """_extract_pattern_hits(handle) -> list of PatternHit's Extract hits from a web page. Raises a ValueError if there was an error in the query. 
""" class parser(sgmllib.SGMLParser): def __init__(self): sgmllib.SGMLParser.__init__(self) self.hits = [] self.broken_message = 'Some error occurred' self._in_pre = 0 self._current_hit = None self._last_found = None # Save state of parsing def handle_data(self, data): if string.find(data, 'try again') >= 0: self.broken_message = data return elif data == 'illegal': self.broken_message = 'Sequence contains illegal characters' return if not self._in_pre: return elif not string.strip(data): return if self._last_found is None and data[:4] == 'PDOC': self._current_hit.pdoc = data self._last_found = 'pdoc' elif self._last_found == 'pdoc': if data[:2] != 'PS': raise SyntaxError, "Expected accession but got:\n%s" % data self._current_hit.accession = data self._last_found = 'accession' elif self._last_found == 'accession': self._current_hit.name = data self._last_found = 'name' elif self._last_found == 'name': self._current_hit.description = data self._last_found = 'description' elif self._last_found == 'description': m = re.findall(r'(\d+)-(\d+) (\w+)', data) for start, end, seq in m: self._current_hit.matches.append( (int(start), int(end), seq)) def do_hr(self, attrs): #
inside a
 section means a new hit.
            if self._in_pre:
                self._current_hit = PatternHit()
                self.hits.append(self._current_hit)
                self._last_found = None
        def start_pre(self, attrs):
            self._in_pre = 1
            self.broken_message = None   # Probably not broken
        def end_pre(self):
            self._in_pre = 0
    p = parser()
    p.feed(handle.read())
    if p.broken_message:
        raise ValueError, p.broken_message
    return p.hits
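The parser in _extract_pattern_hits above is a small state machine: inside a <pre> block, an <hr> opens a new hit and each subsequent non-blank chunk of text fills the next expected field in order (pdoc, accession, name, description). Here is a minimal, self-contained sketch of the same idea -- not the Biopython code; HitParser and the sample page are made up, and it uses html.parser, the modern counterpart of the sgmllib module used above:

```python
from html.parser import HTMLParser

class HitParser(HTMLParser):
    # Fields are expected in this fixed order after each <hr>.
    FIELDS = ['pdoc', 'accession', 'name', 'description']

    def __init__(self):
        HTMLParser.__init__(self)
        self.hits = []          # one dict per hit
        self._in_pre = 0
        self._current = None
        self._last = None       # index of the last field filled

    def handle_starttag(self, tag, attrs):
        if tag == 'pre':
            self._in_pre = 1
        elif tag == 'hr' and self._in_pre:
            # <hr> inside a <pre> section means a new hit.
            self._current = {}
            self.hits.append(self._current)
            self._last = None

    def handle_endtag(self, tag):
        if tag == 'pre':
            self._in_pre = 0

    def handle_data(self, data):
        if not self._in_pre or not data.strip() or self._current is None:
            return
        # Advance the cursor to the next expected field.
        nxt = 0 if self._last is None else self._last + 1
        if nxt < len(self.FIELDS):
            self._current[self.FIELDS[nxt]] = data.strip()
            self._last = nxt

p = HitParser()
p.feed("<pre><hr>PDOC00020<br>PS00018<br>EF_HAND"
       "<br>EF-hand calcium-binding domain</pre>")
```

The tags between fields matter: each tag boundary splits the text into a separate handle_data call, which is what lets a sequential cursor stand in for real field markup.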


def index_file(filename, indexname, rec2key=None):
    """index_file(filename, indexname, rec2key=None)

    Index a Prosite file.  filename is the name of the file.
    indexname is the name of the dictionary.  rec2key is an
    optional callback that takes a Record and generates a unique key
    (e.g. the accession number) for the record.  If not specified,
    the id name will be used.

    """
    if not os.path.exists(filename):
        raise ValueError, "%s does not exist" % filename

    index = Index.Index(indexname, truncate=1)
    index[Dictionary._Dictionary__filename_key] = filename
    
    iter = Iterator(open(filename), parser=RecordParser())
    while 1:
        start = iter._uhandle.tell()
        rec = iter.next()
        length = iter._uhandle.tell() - start
        
        if rec is None:
            break
        if rec2key is not None:
            key = rec2key(rec)
        else:
            key = rec.name
            
        if not key:
            raise KeyError, "empty key was produced"
        elif index.has_key(key):
            raise KeyError, "duplicate key %s found" % key

        index[key] = start, length
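index_file records, for each key, the (start, length) byte span of one record, so a Dictionary lookup later is a single seek plus read instead of a rescan of the whole file. A rough self-contained sketch of that offset-index idea -- build_index and the sample data are hypothetical, not the Biopython API:

```python
import io

def build_index(handle):
    """Map each record's ID token to its (start, length) byte span.

    Assumes every record starts with an 'ID   NAME; ...' line and
    ends with a '//' terminator line, as in prosite.dat.
    """
    index = {}
    start = handle.tell()
    key = None
    for line in iter(handle.readline, ''):
        if key is None:
            # e.g. 'ID   NIR_SIR; PATTERN.' -> 'NIR_SIR'
            key = line.split()[1].rstrip(';')
        if line.startswith('//'):
            end = handle.tell()
            index[key] = (start, end - start)
            start, key = end, None
    return index

data = "ID   PS1; PATTERN.\nDE   First.\n//\nID   PS2; PATTERN.\nDE   Second.\n//\n"
handle = io.StringIO(data)
idx = build_index(handle)

# Retrieving a record is then a single seek + read:
start, length = idx['PS1']
handle.seek(start)
record = handle.read(length)
```

Storing spans rather than record text keeps the index tiny regardless of how large the records themselves are.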

def _extract_record(handle):
    """_extract_record(handle) -> str

    Extract PROSITE data from a web page.  Raises a ValueError if no
    data was found in the web page.

    """
    # All the data appears between tags:
    # <pre>
    # ID   NIR_SIR; PATTERN.
    # </pre>
    class parser(sgmllib.SGMLParser):
        def __init__(self):
            sgmllib.SGMLParser.__init__(self)
            self._in_pre = 0
            self.data = []
        def handle_data(self, data):
            if self._in_pre:
                self.data.append(data)
        def do_br(self, attrs):
            if self._in_pre:
                self.data.append('\n')
        def start_pre(self, attrs):
            self._in_pre = 1
        def end_pre(self):
            self._in_pre = 0
    p = parser()
    p.feed(handle.read())
    if not p.data:
        raise ValueError, "No data found in web page."
    return string.join(p.data, '')

From jchang at smi.stanford.edu  Wed Jan 23 16:27:31 2002
From: jchang at smi.stanford.edu (Jeffrey Chang)
Date: Sat Mar  5 14:43:10 2005
Subject: [Biopython-dev] Bug in urllib of macpython
In-Reply-To: 
References: 
Message-ID: <20020123132731.B578@krusty.stanford.edu>

Doing a quick grep through the .py files in the current CVS, it looks
like the only other file to use it is FormatIO.

What is your workaround?

Jeff

On Tue, Jan 22, 2002 at 10:05:34AM +0100, Yair Benita wrote:
> Due to a bug in urllib of biopython most of the www are not functioning
> well.  The bug is known and will be taken care of by the python developers.
> However, I had to use another function to replace the command
> urllib.urlopen()
> 
> The following files were changed:
> NCBI.py
> ExPAsy.py
> InterPro.py
> SCOP.py
> 
> Besides these files, is there any other place where the urllib.urlopen
> command is used?
> 
> Yair
> -- 
> Yair Benita
> Pharmaceutical Proteomics
> Utrecht University
> 
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev@biopython.org
> http://biopython.org/mailman/listinfo/biopython-dev

From Y.Benita at pharm.uu.nl  Thu Jan 24 04:17:08 2002
From: Y.Benita at pharm.uu.nl (Yair Benita)
Date: Sat Mar  5 14:43:10 2005
Subject: [Biopython-dev] Bug in urllib of macpython
In-Reply-To: <20020123132731.B578@krusty.stanford.edu>
Message-ID: 

on 23/1/2002 22:27, Jeffrey Chang at jchang@smi.stanford.edu wrote:

> Doing a quick grep through the .py files in the current CVS, it looks
> like the only other file to use it is FormatIO.
> 
> What is your workaround?

In the beginning I had a while loop that waits till the file is fully
downloaded.  Now I have an easier solution.  Instead of:

handle = urllib.urlopen(fullcgi)

I use:

handle = open(urllib.urlretrieve(fullcgi)[0])

It appears to work fine now.

Yair
-- 
Yair Benita
Pharmaceutical Proteomics
Utrecht University

From jchang at smi.stanford.edu  Thu Jan 24 17:28:01 2002
From: jchang at smi.stanford.edu (Jeffrey Chang)
Date: Sat Mar  5 14:43:10 2005
Subject: [Biopython-dev] Prosite
In-Reply-To: 
References: 
Message-ID: <20020124142801.D372@krusty.stanford.edu>

Yep, it looks like Release 17 from last month introduced some format
changes that broke the parser.  I've updated the parser to handle the
new lines -- __init__.py is attached.  Please try this out and let me
know how it works.

Thanks for the report and the patch!

Jeff

On Wed, Jan 23, 2002 at 12:36:51PM -0800, Mark Lambrecht wrote:
> Hi,
> 
> Thanks for all the excellent Biopython code.
> I used the Prosite parser and it breaks on a number of CC and MA lines.
> Maybe there is a new version of the prosite.dat file?
> We added some code to the Bio/Prosite/__init__.py, and commented it with
> ## (lambrecht/dyoo)
> Then everything works again but possibly doesn't use the information in
> these lines.
> I attached the __init__.py
> Could you take a look ?
> 
> Thanks !!
> 
> Mark
> 
> 
> --------------------------------------------------------------------------
> Mark Lambrecht
> Postdoctoral Research Fellow
> The Arabidopsis Information Resource       FAX: (650) 325-6857
> Carnegie Institution of Washington         Tel: (650) 325-1521 ext.397
> Department of Plant Biology                URL: http://arabidopsis.org/
> 260 Panama St.
> Stanford, CA 94305
> --------------------------------------------------------------------------
> # Copyright 1999 by Jeffrey Chang.  All rights reserved.
> # This code is part of the Biopython distribution and governed by its
> # license.  Please see the LICENSE file that should have been included
> # as part of this package.
> 
> # Copyright 2000 by Jeffrey Chang.  All rights reserved.
> # This code is part of the Biopython distribution and governed by its
> # license.  Please see the LICENSE file that should have been included
> # as part of this package.
> 
> """Prosite
> 
> This module provides code to work with the prosite.dat file from
> Prosite.
> http://www.expasy.ch/prosite/
> 
> Tested with:
> Release 15.0, July 1998
> Release 16.0, July 1999
> 
> 
> Classes:
> Record                Holds Prosite data.
> PatternHit            Holds data from a hit against a Prosite pattern.
> Iterator              Iterates over entries in a Prosite file.
> Dictionary            Accesses a Prosite file using a dictionary interface.
> ExPASyDictionary      Accesses Prosite records from ExPASy.
> RecordParser          Parses a Prosite record into a Record object.
> 
> _Scanner              Scans Prosite-formatted data.
> _RecordConsumer       Consumes Prosite data to a Record object.
> 
> 
> Functions:
> scan_sequence_expasy  Scan a sequence for occurrences of Prosite patterns.
> index_file            Index a Prosite file for a Dictionary.
> _extract_record       Extract Prosite data from a web page.
> _extract_pattern_hits Extract Prosite patterns from a web page.
> 
> """
> __all__ = [
>     'Pattern',
>     'Prodoc',
>     ]
> from types import *
> import string
> import re
> import sgmllib
> from Bio import File
> from Bio import Index
> from Bio.ParserSupport import *
> from Bio.WWW import ExPASy
> from Bio.WWW import RequestLimiter
> 
> class Record:
>     """Holds information from a Prosite record.
> 
>     Members:
>     name           ID of the record.  e.g. ADH_ZINC
>     type           Type of entry.  e.g. PATTERN, MATRIX, or RULE
>     accession      e.g. PS00387
>     created        Date the entry was created.  (MMM-YYYY)
>     data_update    Date the 'primary' data was last updated.
>     info_update    Date data other than 'primary' data was last updated.
>     pdoc           ID of the PROSITE DOCumentation.
> 
>     description    Free-format description.
>     pattern        The PROSITE pattern.  See docs.
>     matrix         List of strings that describes a matrix entry.
>     rules          List of rule definitions.  (strings)
> 
>     NUMERICAL RESULTS
>     nr_sp_release  SwissProt release.
>     nr_sp_seqs     Number of seqs in that release of Swiss-Prot.  (int)
>     nr_total       Number of hits in Swiss-Prot.  tuple of (hits, seqs)
>     nr_positive    True positives.  tuple of (hits, seqs)
>     nr_unknown     Could be positives.  tuple of (hits, seqs)
>     nr_false_pos   False positives.  tuple of (hits, seqs)
>     nr_false_neg   False negatives.  (int)
>     nr_partial     False negatives, because they are fragments.  (int)
> 
>     COMMENTS
>     cc_taxo_range  Taxonomic range.  See docs for format
>     cc_max_repeat  Maximum number of repetitions in a protein
>     cc_site        Interesting site.  list of tuples (pattern pos, desc.)
>     cc_skip_flag   Can this entry be ignored?
> 
>     DATA BANK REFERENCES - The following are all
>                            lists of tuples (swiss-prot accession,
>                                             swiss-prot name)
>     dr_positive
>     dr_false_neg
>     dr_false_pos
>     dr_potential   Potential hits, but fingerprint region not yet available.
>     dr_unknown     Could possibly belong
> 
>     pdb_structs    List of PDB entries.
> 
>     """
>     def __init__(self):
>         self.name = ''
>         self.type = ''
>         self.accession = ''
>         self.created = ''
>         self.data_update = ''
>         self.info_update = ''
>         self.pdoc = ''
> 
>         self.description = ''
>         self.pattern = ''
>         self.matrix = []
>         self.rules = []
> 
>         self.nr_sp_release = ''
>         self.nr_sp_seqs = ''
>         self.nr_total = (None, None)
>         self.nr_positive = (None, None)
>         self.nr_unknown = (None, None)
>         self.nr_false_pos = (None, None)
>         self.nr_false_neg = None
>         self.nr_partial = None
> 
>         self.cc_taxo_range = ''
>         self.cc_max_repeat = ''
>         self.cc_site = []
>         self.cc_skip_flag = ''
> 
>         self.dr_positive = []
>         self.dr_false_neg = []
>         self.dr_false_pos = []
>         self.dr_potential = []
>         self.dr_unknown = []
> 
>         self.pdb_structs = []
> 
> class PatternHit:
>     """Holds information from a hit against a Prosite pattern.
> 
>     Members:
>     name           ID of the record.  e.g. ADH_ZINC
>     accession      e.g. PS00387
>     pdoc           ID of the PROSITE DOCumentation.
>     description    Free-format description.
>     matches        List of tuples (start, end, sequence) where
>                    start and end are indexes of the match, and sequence is
>                    the sequence matched.
> 
>     """
>     def __init__(self):
>         self.name = None
>         self.accession = None
>         self.pdoc = None
>         self.description = None
>         self.matches = []
>     def __str__(self):
>         lines = []
>         lines.append("%s %s %s" % (self.accession, self.pdoc, self.name))
>         lines.append(self.description)
>         lines.append('')
>         if len(self.matches) > 1:
>             lines.append("Number of matches: %s" % len(self.matches))
>         for i in range(len(self.matches)):
>             start, end, seq = self.matches[i]
>             range_str = "%d-%d" % (start, end)
>             if len(self.matches) > 1:
>                 lines.append("%7d %10s %s" % (i+1, range_str, seq))
>             else:
>                 lines.append("%7s %10s %s" % (' ', range_str, seq))
>         return string.join(lines, '\n')
> 
> class Iterator:
>     """Returns one record at a time from a Prosite file.
> 
>     Methods:
>     next   Return the next record from the stream, or None.
> 
>     """
>     def __init__(self, handle, parser=None):
>         """__init__(self, handle, parser=None)
> 
>         Create a new iterator.  handle is a file-like object.  parser
>         is an optional Parser object to change the results into another form.
>         If set to None, then the raw contents of the file will be returned.
> 
>         """
>         if type(handle) is not FileType and type(handle) is not InstanceType:
>             raise ValueError, "I expected a file handle or file-like object"
>         self._uhandle = File.UndoHandle(handle)
>         self._parser = parser
> 
>     def next(self):
>         """next(self) -> object
> 
>         Return the next Prosite record from the file.  If no more records,
>         return None.
> 
>         """
>         # Skip the copyright info, if it's the first record.
>         line = self._uhandle.peekline()
>         if line[:2] == 'CC':
>             while 1:
>                 line = self._uhandle.readline()
>                 if not line:
>                     break
>                 if line[:2] == '//':
>                     break
>                 if line[:2] != 'CC':
>                     raise SyntaxError, \
>                           "Oops, where's the copyright?"
> 
>         lines = []
>         while 1:
>             line = self._uhandle.readline()
>             if not line:
>                 break
>             lines.append(line)
>             if line[:2] == '//':
>                 break
> 
>         if not lines:
>             return None
> 
>         data = string.join(lines, '')
>         if self._parser is not None:
>             return self._parser.parse(File.StringHandle(data))
>         return data
> 
> class Dictionary:
>     """Accesses a Prosite file using a dictionary interface.
> 
>     """
>     __filename_key = '__filename'
> 
>     def __init__(self, indexname, parser=None):
>         """__init__(self, indexname, parser=None)
> 
>         Open a Prosite Dictionary.  indexname is the name of the
>         index for the dictionary.  The index should have been created
>         using the index_file function.  parser is an optional Parser
>         object to change the results into another form.  If set to None,
>         then the raw contents of the file will be returned.
> 
>         """
>         self._index = Index.Index(indexname)
>         self._handle = open(self._index[Dictionary.__filename_key])
>         self._parser = parser
> 
>     def __len__(self):
>         return len(self._index)
> 
>     def __getitem__(self, key):
>         start, len = self._index[key]
>         self._handle.seek(start)
>         data = self._handle.read(len)
>         if self._parser is not None:
>             return self._parser.parse(File.StringHandle(data))
>         return data
> 
>     def __getattr__(self, name):
>         return getattr(self._index, name)
> 
> class ExPASyDictionary:
>     """Access PROSITE at ExPASy using a read-only dictionary interface.
> 
>     """
>     def __init__(self, delay=5.0, parser=None):
>         """__init__(self, delay=5.0, parser=None)
> 
>         Create a new Dictionary to access PROSITE.  parser is an optional
>         parser (e.g. Prosite.RecordParser) object to change the results
>         into another form.  If set to None, then the raw contents of the
>         file will be returned.  delay is the number of seconds to wait
>         between each query.
> 
>         """
>         self.parser = parser
>         self.limiter = RequestLimiter(delay)
> 
>     def __len__(self):
>         raise NotImplementedError, "Prosite contains lots of entries"
>     def clear(self):
>         raise NotImplementedError, "This is a read-only dictionary"
>     def __setitem__(self, key, item):
>         raise NotImplementedError, "This is a read-only dictionary"
>     def update(self):
>         raise NotImplementedError, "This is a read-only dictionary"
>     def copy(self):
>         raise NotImplementedError, "You don't need to do this..."
>     def keys(self):
>         raise NotImplementedError, "You don't really want to do this..."
>     def items(self):
>         raise NotImplementedError, "You don't really want to do this..."
>     def values(self):
>         raise NotImplementedError, "You don't really want to do this..."
> 
>     def has_key(self, id):
>         """has_key(self, id) -> bool"""
>         try:
>             self[id]
>         except KeyError:
>             return 0
>         return 1
> 
>     def get(self, id, failobj=None):
>         try:
>             return self[id]
>         except KeyError:
>             return failobj
>         raise "How did I get here?"
> 
>     def __getitem__(self, id):
>         """__getitem__(self, id) -> object
> 
>         Return a Prosite entry.  id is either the id or accession
>         for the entry.  Raises a KeyError if there's an error.
> 
>         """
>         # First, check to see if enough time has passed since my
>         # last query.
>         self.limiter.wait()
> 
>         try:
>             handle = ExPASy.get_prosite_entry(id)
>         except IOError:
>             raise KeyError, id
>         try:
>             handle = File.StringHandle(_extract_record(handle))
>         except ValueError:
>             raise KeyError, id
> 
>         if self.parser is not None:
>             return self.parser.parse(handle)
>         return handle.read()
> 
> class RecordParser(AbstractParser):
>     """Parses Prosite data into a Record object.
> 
>     """
>     def __init__(self):
>         self._scanner = _Scanner()
>         self._consumer = _RecordConsumer()
> 
>     def parse(self, handle):
>         self._scanner.feed(handle, self._consumer)
>         return self._consumer.data
> 
> class _Scanner:
>     """Scans Prosite-formatted data.
> 
>     Tested with:
>     Release 15.0, July 1998
> 
>     """
>     def feed(self, handle, consumer):
>         """feed(self, handle, consumer)
> 
>         Feed in Prosite data for scanning.  handle is a file-like
>         object that contains prosite data.  consumer is a
>         Consumer object that will receive events as the report is scanned.
> 
>         """
>         if isinstance(handle, File.UndoHandle):
>             uhandle = handle
>         else:
>             uhandle = File.UndoHandle(handle)
> 
>         while 1:
>             line = uhandle.peekline()
>             if not line:
>                 break
>             elif is_blank_line(line):
>                 # Skip blank lines between records
>                 uhandle.readline()
>                 continue
>             elif line[:2] == 'ID':
>                 self._scan_record(uhandle, consumer)
>             elif line[:2] == 'CC':
>                 self._scan_copyrights(uhandle, consumer)
>             else:
>                 raise SyntaxError, "There doesn't appear to be a record"
> 
>     def _scan_copyrights(self, uhandle, consumer):
>         consumer.start_copyrights()
>         self._scan_line('CC', uhandle, consumer.copyright, any_number=1)
>         self._scan_terminator(uhandle, consumer)
>         consumer.end_copyrights()
> 
>     def _scan_record(self, uhandle, consumer):
>         consumer.start_record()
>         for fn in self._scan_fns:
>             fn(self, uhandle, consumer)
> 
>             # In Release 15.0, C_TYPE_LECTIN_1 has the DO line before
>             # the 3D lines, instead of the other way around.
>             # Thus, I'll give the 3D lines another chance after the DO lines
>             # are finished.
>             if fn is self._scan_do.im_func:
>                 self._scan_3d(uhandle, consumer)
>         consumer.end_record()
> 
>     def _scan_line(self, line_type, uhandle, event_fn,
>                    exactly_one=None, one_or_more=None, any_number=None,
>                    up_to_one=None):
>         # Callers must set exactly one of exactly_one, one_or_more, or
>         # any_number to a true value.  I do not explicitly check to
>         # make sure this function is called correctly.
> 
>         # This does not guarantee any parameter safety, but I
>         # like the readability.  The other strategy I tried was have
>         # parameters min_lines, max_lines.
> 
>         if exactly_one or one_or_more:
>             read_and_call(uhandle, event_fn, start=line_type)
>         if one_or_more or any_number:
>             while 1:
>                 if not attempt_read_and_call(uhandle, event_fn,
>                                              start=line_type):
>                     break
>         if up_to_one:
>             attempt_read_and_call(uhandle, event_fn, start=line_type)
> 
>     def _scan_id(self, uhandle, consumer):
>         self._scan_line('ID', uhandle, consumer.identification, exactly_one=1)
> 
>     def _scan_ac(self, uhandle, consumer):
>         self._scan_line('AC', uhandle, consumer.accession, exactly_one=1)
> 
>     def _scan_dt(self, uhandle, consumer):
>         self._scan_line('DT', uhandle, consumer.date, exactly_one=1)
> 
>     def _scan_de(self, uhandle, consumer):
>         self._scan_line('DE', uhandle, consumer.description, exactly_one=1)
> 
>     def _scan_pa(self, uhandle, consumer):
>         self._scan_line('PA', uhandle, consumer.pattern, any_number=1)
> 
>     def _scan_ma(self, uhandle, consumer):
>         # ZN2_CY6_FUNGAL_2, DNAJ_2 in Release 15
>         # contain a CC line buried within an 'MA' line.  Need to check
>         # for that.
>         while 1:
>             if not attempt_read_and_call(uhandle, consumer.matrix, start='MA'):
>                 line1 = uhandle.readline()
>                 line2 = uhandle.readline()
>                 uhandle.saveline(line2)
>                 uhandle.saveline(line1)
>                 if line1[:2] == 'CC' and line2[:2] == 'MA':
>                     read_and_call(uhandle, consumer.comment, start='CC')
>                 else:
>                     break
> 
>     def _scan_ru(self, uhandle, consumer):
>         self._scan_line('RU', uhandle, consumer.rule, any_number=1)
> 
>     def _scan_nr(self, uhandle, consumer):
>         self._scan_line('NR', uhandle, consumer.numerical_results,
>                         any_number=1)
> 
>     def _scan_cc(self, uhandle, consumer):
>         self._scan_line('CC', uhandle, consumer.comment, any_number=1)
> 
>     def _scan_dr(self, uhandle, consumer):
>         self._scan_line('DR', uhandle, consumer.database_reference,
>                         any_number=1)
> 
>     def _scan_3d(self, uhandle, consumer):
>         self._scan_line('3D', uhandle, consumer.pdb_reference,
>                         any_number=1)
> 
>     def _scan_do(self, uhandle, consumer):
>         self._scan_line('DO', uhandle, consumer.documentation, exactly_one=1)
> 
>     def _scan_terminator(self, uhandle, consumer):
>         self._scan_line('//', uhandle, consumer.terminator, exactly_one=1)
> 
>     _scan_fns = [
>         _scan_id,
>         _scan_ac,
>         _scan_dt,
>         _scan_de,
>         _scan_pa,
>         _scan_ma,
>         _scan_ru,
>         _scan_nr,
>         _scan_ma,   ## (lambrecht/dyoo) is this right?
>         _scan_nr,   ## (lambrecht/dyoo) is this right?
>         _scan_cc,
>         _scan_dr,
>         _scan_3d,
>         _scan_do,
>         _scan_terminator
>         ]
> 
> class _RecordConsumer(AbstractConsumer):
>     """Consumer that converts a Prosite record to a Record object.
> 
>     Members:
>     data    Record with Prosite data.
> 
>     """
>     def __init__(self):
>         self.data = None
> 
>     def start_record(self):
>         self.data = Record()
> 
>     def end_record(self):
>         self._clean_record(self.data)
> 
>     def identification(self, line):
>         cols = string.split(line)
>         if len(cols) != 3:
>             raise SyntaxError, "I don't understand identification line\n%s" % \
>                   line
>         self.data.name = self._chomp(cols[1])    # don't want ';'
>         self.data.type = self._chomp(cols[2])    # don't want '.'
> 
>     def accession(self, line):
>         cols = string.split(line)
>         if len(cols) != 2:
>             raise SyntaxError, "I don't understand accession line\n%s" % line
>         self.data.accession = self._chomp(cols[1])
> 
>     def date(self, line):
>         uprline = string.upper(line)
>         cols = string.split(uprline)
> 
>         # Release 15.0 contains both 'INFO UPDATE' and 'INF UPDATE'
>         if cols[2] != '(CREATED);' or \
>            cols[4] != '(DATA' or cols[5] != 'UPDATE);' or \
>            cols[7][:4] != '(INF' or cols[8] != 'UPDATE).':
>             raise SyntaxError, "I don't understand date line\n%s" % line
> 
>         self.data.created = cols[1]
>         self.data.data_update = cols[3]
>         self.data.info_update = cols[6]
> 
>     def description(self, line):
>         self.data.description = self._clean(line)
> 
>     def pattern(self, line):
>         self.data.pattern = self.data.pattern + self._clean(line)
> 
>     def matrix(self, line):
>         self.data.matrix.append(self._clean(line))
> 
>     def rule(self, line):
>         self.data.rules.append(self._clean(line))
> 
>     def numerical_results(self, line):
>         cols = string.split(self._clean(line), ';')
>         for col in cols:
>             if not col:
>                 continue
>             qual, data = map(string.lstrip, string.split(col, '='))
>             if qual == '/RELEASE':
>                 release, seqs = string.split(data, ',')
>                 self.data.nr_sp_release = release
>                 self.data.nr_sp_seqs = int(seqs)
>             elif qual == '/FALSE_NEG':
>                 self.data.nr_false_neg = int(data)
>             elif qual == '/PARTIAL':
>                 self.data.nr_partial = int(data)
>             ## (lambrecht/dyoo) added temporary fix for qual //MATRIX_TYPE in CC
>             elif qual == '/MATRIX_TYPE':
>                 pass
>             elif qual in ['/TOTAL', '/POSITIVE', '/UNKNOWN', '/FALSE_POS']:
>                 m = re.match(r'(\d+)\((\d+)\)', data)
>                 if not m:
>                     raise error, "Broken data %s in comment line\n%s" % \
>                           (repr(data), line)
>                 hits = tuple(map(int, m.groups()))
>                 if(qual == "/TOTAL"):
>                     self.data.nr_total = hits
>                 elif(qual == "/POSITIVE"):
>                     self.data.nr_positive = hits
>                 elif(qual == "/UNKNOWN"):
>                     self.data.nr_unknown = hits
>                 elif(qual == "/FALSE_POS"):
>                     self.data.nr_false_pos = hits
>             else:
>                 raise SyntaxError, "Unknown qual %s in comment line\n%s" % \
>                       (repr(qual), line)
> 
>     def comment(self, line):
>         cols = string.split(self._clean(line), ';')
>         for col in cols:
>             # DNAJ_2 in Release 15 has a non-standard comment line:
>             # CC   Automatic scaling using reversed database
>             # Throw it away.  (Should I keep it?)
>             if not col or col[:17] == 'Automatic scaling':
>                 continue
>             qual, data = map(string.lstrip, string.split(col, '='))
>             if qual in ('/MATRIX_TYPE', '/SCALING_DB', '/AUTHOR',
>                         '/FT_KEY', '/FT_DESC'):
>                 continue ## (lambrecht/dyoo) This is a temporary fix until we know what
>                          ## to do here
>             if qual == '/TAXO-RANGE':
>                 self.data.cc_taxo_range = data
>             elif qual == '/MAX-REPEAT':
>                 self.data.cc_max_repeat = data
>             elif qual == '/SITE':
>                 pos, desc = string.split(data, ',')
>                 self.data.cc_site = (int(pos), desc)
>             elif qual == '/SKIP-FLAG':
>                 self.data.cc_skip_flag = data
>             else:
>                 raise SyntaxError, "Unknown qual %s in comment line\n%s" % \
>                       (repr(qual), line)
> 
>     def database_reference(self, line):
>         refs = string.split(self._clean(line), ';')
>         for ref in refs:
>             if not ref:
>                 continue
>             acc, name, type = map(string.strip, string.split(ref, ','))
>             if type == 'T':
>                 self.data.dr_positive.append((acc, name))
>             elif type == 'F':
>                 self.data.dr_false_pos.append((acc, name))
>             elif type == 'N':
>                 self.data.dr_false_neg.append((acc, name))
>             elif type == 'P':
>                 self.data.dr_potential.append((acc, name))
>             elif type == '?':
>                 self.data.dr_unknown.append((acc, name))
>             else:
>                 raise SyntaxError, "I don't understand type flag %s" % type
> 
>     def pdb_reference(self, line):
>         cols = string.split(line)
>         for id in cols[1:]:   # get all but the '3D' col
>             self.data.pdb_structs.append(self._chomp(id))
> 
>     def documentation(self, line):
>         self.data.pdoc = self._chomp(self._clean(line))
> 
>     def terminator(self, line):
>         pass
> 
>     def _chomp(self, word, to_chomp='.,;'):
>         # Remove the punctuation at the end of a word.
>         if word[-1] in to_chomp:
>             return word[:-1]
>         return word
> 
>     def _clean(self, line, rstrip=1):
>         # Clean up a line.
>         if rstrip:
>             return string.rstrip(line[5:])
>         return line[5:]
> 
> def scan_sequence_expasy(seq=None, id=None, exclude_frequent=None):
>     """scan_sequence_expasy(seq=None, id=None, exclude_frequent=None) ->
>     list of PatternHit's
> 
>     Search a sequence for occurrences of Prosite patterns.  You can
>     specify either a sequence in seq or a SwissProt/trEMBL ID or accession
>     in id.  Only one of those should be given.  If exclude_frequent
>     is true, then the patterns with the high probability of occurring
>     will be excluded.
> 
>     """
>     if (seq and id) or not (seq or id):
>         raise ValueError, "Please specify either a sequence or an id"
>     handle = ExPASy.scanprosite1(seq, id, exclude_frequent)
>     return _extract_pattern_hits(handle)
> 
> def _extract_pattern_hits(handle):
>     """_extract_pattern_hits(handle) -> list of PatternHit's
> 
>     Extract hits from a web page.  Raises a ValueError if there
>     was an error in the query.
> 
>     """
>     class parser(sgmllib.SGMLParser):
>         def __init__(self):
>             sgmllib.SGMLParser.__init__(self)
>             self.hits = []
>             self.broken_message = 'Some error occurred'
>             self._in_pre = 0
>             self._current_hit = None
>             self._last_found = None   # Save state of parsing
>         def handle_data(self, data):
>             if string.find(data, 'try again') >= 0:
>                 self.broken_message = data
>                 return
>             elif data == 'illegal':
>                 self.broken_message = 'Sequence contains illegal characters'
>                 return
>             if not self._in_pre:
>                 return
>             elif not string.strip(data):
>                 return
>             if self._last_found is None and data[:4] == 'PDOC':
>                 self._current_hit.pdoc = data
>                 self._last_found = 'pdoc'
>             elif self._last_found == 'pdoc':
>                 if data[:2] != 'PS':
>                     raise SyntaxError, "Expected accession but got:\n%s" % data
>                 self._current_hit.accession = data
>                 self._last_found = 'accession'
>             elif self._last_found == 'accession':
>                 self._current_hit.name = data
>                 self._last_found = 'name'
>             elif self._last_found == 'name':
>                 self._current_hit.description = data
>                 self._last_found = 'description'
>             elif self._last_found == 'description':
>                 m = re.findall(r'(\d+)-(\d+) (\w+)', data)
>                 for start, end, seq in m:
>                     self._current_hit.matches.append(
>                         (int(start), int(end), seq))
> 
>         def do_hr(self, attrs):
>             #
<HR> inside a <PRE> section means a new hit.
>             if self._in_pre:
>                 self._current_hit = PatternHit()
>                 self.hits.append(self._current_hit)
>                 self._last_found = None
>         def start_pre(self, attrs):
>             self._in_pre = 1
>             self.broken_message = None   # Probably not broken
>         def end_pre(self):
>             self._in_pre = 0
>     p = parser()
>     p.feed(handle.read())
>     if p.broken_message:
>         raise ValueError, p.broken_message
>     return p.hits
> 
> 
>         
>     
> def index_file(filename, indexname, rec2key=None):
>     """index_file(filename, indexname, rec2key=None)
> 
>     Index a Prosite file.  filename is the name of the file.
>     indexname is the name of the dictionary.  rec2key is an
>     optional callback that takes a Record and generates a unique key
>     (e.g. the accession number) for the record.  If not specified,
>     the id name will be used.
> 
>     """
>     if not os.path.exists(filename):
>         raise ValueError, "%s does not exist" % filename
> 
>     index = Index.Index(indexname, truncate=1)
>     index[Dictionary._Dictionary__filename_key] = filename
>     
>     iter = Iterator(open(filename), parser=RecordParser())
>     while 1:
>         start = iter._uhandle.tell()
>         rec = iter.next()
>         length = iter._uhandle.tell() - start
>         
>         if rec is None:
>             break
>         if rec2key is not None:
>             key = rec2key(rec)
>         else:
>             key = rec.name
>             
>         if not key:
>             raise KeyError, "empty key was produced"
>         elif index.has_key(key):
>             raise KeyError, "duplicate key %s found" % key
> 
>         index[key] = start, length
> 
> def _extract_record(handle):
>     """_extract_record(handle) -> str
> 
>     Extract PROSITE data from a web page.  Raises a ValueError if no
>     data was found in the web page.
> 
>     """
>     # All the data appears between tags:
>     # <pre>ID   NIR_SIR; PATTERN.
>     # </pre>
> class parser(sgmllib.SGMLParser): > def __init__(self): > sgmllib.SGMLParser.__init__(self) > self._in_pre = 0 > self.data = [] > def handle_data(self, data): > if self._in_pre: > self.data.append(data) > def do_br(self, attrs): > if self._in_pre: > self.data.append('\n') > def start_pre(self, attrs): > self._in_pre = 1 > def end_pre(self): > self._in_pre = 0 > p = parser() > p.feed(handle.read()) > if not p.data: > raise ValueError, "No data found in web page." > return string.join(p.data, '') > -------------- next part -------------- # Copyright 1999 by Jeffrey Chang. All rights reserved. # This code is part of the Biopython distribution and governed by its # license. Please see the LICENSE file that should have been included # as part of this package. # Copyright 2000 by Jeffrey Chang. All rights reserved. # This code is part of the Biopython distribution and governed by its # license. Please see the LICENSE file that should have been included # as part of this package. """Prosite This module provides code to work with the prosite.dat file from Prosite. http://www.expasy.ch/prosite/ Tested with: Release 15.0, July 1998 Release 16.0, July 1999 Release 17.0, Dec 2001 Classes: Record Holds Prosite data. PatternHit Holds data from a hit against a Prosite pattern. Iterator Iterates over entries in a Prosite file. Dictionary Accesses a Prosite file using a dictionary interface. ExPASyDictionary Accesses Prosite records from ExPASy. RecordParser Parses a Prosite record into a Record object. _Scanner Scans Prosite-formatted data. _RecordConsumer Consumes Prosite data to a Record object. Functions: scan_sequence_expasy Scan a sequence for occurrences of Prosite patterns. index_file Index a Prosite file for a Dictionary. _extract_record Extract Prosite data from a web page. _extract_pattern_hits Extract Prosite patterns from a web page. 
""" __all__ = [ 'Pattern', 'Prodoc', ] from types import * import string import re import sgmllib from Bio import File from Bio import Index from Bio.ParserSupport import * from Bio.WWW import ExPASy from Bio.WWW import RequestLimiter class Record: """Holds information from a Prosite record. Members: name ID of the record. e.g. ADH_ZINC type Type of entry. e.g. PATTERN, MATRIX, or RULE accession e.g. PS00387 created Date the entry was created. (MMM-YYYY) data_update Date the 'primary' data was last updated. info_update Date data other than 'primary' data was last updated. pdoc ID of the PROSITE DOCumentation. description Free-format description. pattern The PROSITE pattern. See docs. matrix List of strings that describes a matrix entry. rules List of rule definitions. (strings) NUMERICAL RESULTS nr_sp_release SwissProt release. nr_sp_seqs Number of seqs in that release of Swiss-Prot. (int) nr_total Number of hits in Swiss-Prot. tuple of (hits, seqs) nr_positive True positives. tuple of (hits, seqs) nr_unknown Could be positives. tuple of (hits, seqs) nr_false_pos False positives. tuple of (hits, seqs) nr_false_neg False negatives. (int) nr_partial False negatives, because they are fragments. (int) COMMENTS cc_taxo_range Taxonomic range. See docs for format cc_max_repeat Maximum number of repetitions in a protein cc_site Interesting site. list of tuples (pattern pos, desc.) cc_skip_flag Can this entry be ignored? cc_matrix_type cc_scaling_db cc_author cc_ft_key cc_ft_desc DATA BANK REFERENCES - The following are all lists of tuples (swiss-prot accession, swiss-prot name) dr_positive dr_false_neg dr_false_pos dr_potential Potential hits, but fingerprint region not yet available. dr_unknown Could possibly belong pdb_structs List of PDB entries. 
""" def __init__(self): self.name = '' self.type = '' self.accession = '' self.created = '' self.data_update = '' self.info_update = '' self.pdoc = '' self.description = '' self.pattern = '' self.matrix = [] self.rules = [] self.nr_sp_release = '' self.nr_sp_seqs = '' self.nr_total = (None, None) self.nr_positive = (None, None) self.nr_unknown = (None, None) self.nr_false_pos = (None, None) self.nr_false_neg = None self.nr_partial = None self.cc_taxo_range = '' self.cc_max_repeat = '' self.cc_site = [] self.cc_skip_flag = '' self.dr_positive = [] self.dr_false_neg = [] self.dr_false_pos = [] self.dr_potential = [] self.dr_unknown = [] self.pdb_structs = [] class PatternHit: """Holds information from a hit against a Prosite pattern. Members: name ID of the record. e.g. ADH_ZINC accession e.g. PS00387 pdoc ID of the PROSITE DOCumentation. description Free-format description. matches List of tuples (start, end, sequence) where start and end are indexes of the match, and sequence is the sequence matched. """ def __init__(self): self.name = None self.accession = None self.pdoc = None self.description = None self.matches = [] def __str__(self): lines = [] lines.append("%s %s %s" % (self.accession, self.pdoc, self.name)) lines.append(self.description) lines.append('') if len(self.matches) > 1: lines.append("Number of matches: %s" % len(self.matches)) for i in range(len(self.matches)): start, end, seq = self.matches[i] range_str = "%d-%d" % (start, end) if len(self.matches) > 1: lines.append("%7d %10s %s" % (i+1, range_str, seq)) else: lines.append("%7s %10s %s" % (' ', range_str, seq)) return string.join(lines, '\n') class Iterator: """Returns one record at a time from a Prosite file. Methods: next Return the next record from the stream, or None. """ def __init__(self, handle, parser=None): """__init__(self, handle, parser=None) Create a new iterator. handle is a file-like object. parser is an optional Parser object to change the results into another form. 
If set to None, then the raw contents of the file will be returned. """ if type(handle) is not FileType and type(handle) is not InstanceType: raise ValueError, "I expected a file handle or file-like object" self._uhandle = File.UndoHandle(handle) self._parser = parser def next(self): """next(self) -> object Return the next Prosite record from the file. If no more records, return None. """ # Skip the copyright info, if it's the first record. line = self._uhandle.peekline() if line[:2] == 'CC': while 1: line = self._uhandle.readline() if not line: break if line[:2] == '//': break if line[:2] != 'CC': raise SyntaxError, \ "Oops, where's the copyright?" lines = [] while 1: line = self._uhandle.readline() if not line: break lines.append(line) if line[:2] == '//': break if not lines: return None data = string.join(lines, '') if self._parser is not None: return self._parser.parse(File.StringHandle(data)) return data class Dictionary: """Accesses a Prosite file using a dictionary interface. """ __filename_key = '__filename' def __init__(self, indexname, parser=None): """__init__(self, indexname, parser=None) Open a Prosite Dictionary. indexname is the name of the index for the dictionary. The index should have been created using the index_file function. parser is an optional Parser object to change the results into another form. If set to None, then the raw contents of the file will be returned. """ self._index = Index.Index(indexname) self._handle = open(self._index[Dictionary.__filename_key]) self._parser = parser def __len__(self): return len(self._index) def __getitem__(self, key): start, len = self._index[key] self._handle.seek(start) data = self._handle.read(len) if self._parser is not None: return self._parser.parse(File.StringHandle(data)) return data def __getattr__(self, name): return getattr(self._index, name) class ExPASyDictionary: """Access PROSITE at ExPASy using a read-only dictionary interface. 
""" def __init__(self, delay=5.0, parser=None): """__init__(self, delay=5.0, parser=None) Create a new Dictionary to access PROSITE. parser is an optional parser (e.g. Prosite.RecordParser) object to change the results into another form. If set to None, then the raw contents of the file will be returned. delay is the number of seconds to wait between each query. """ self.parser = parser self.limiter = RequestLimiter(delay) def __len__(self): raise NotImplementedError, "Prosite contains lots of entries" def clear(self): raise NotImplementedError, "This is a read-only dictionary" def __setitem__(self, key, item): raise NotImplementedError, "This is a read-only dictionary" def update(self): raise NotImplementedError, "This is a read-only dictionary" def copy(self): raise NotImplementedError, "You don't need to do this..." def keys(self): raise NotImplementedError, "You don't really want to do this..." def items(self): raise NotImplementedError, "You don't really want to do this..." def values(self): raise NotImplementedError, "You don't really want to do this..." def has_key(self, id): """has_key(self, id) -> bool""" try: self[id] except KeyError: return 0 return 1 def get(self, id, failobj=None): try: return self[id] except KeyError: return failobj raise "How did I get here?" def __getitem__(self, id): """__getitem__(self, id) -> object Return a Prosite entry. id is either the id or accession for the entry. Raises a KeyError if there's an error. """ # First, check to see if enough time has passed since my # last query. self.limiter.wait() try: handle = ExPASy.get_prosite_entry(id) except IOError: raise KeyError, id try: handle = File.StringHandle(_extract_record(handle)) except ValueError: raise KeyError, id if self.parser is not None: return self.parser.parse(handle) return handle.read() class RecordParser(AbstractParser): """Parses Prosite data into a Record object. 
""" def __init__(self): self._scanner = _Scanner() self._consumer = _RecordConsumer() def parse(self, handle): self._scanner.feed(handle, self._consumer) return self._consumer.data class _Scanner: """Scans Prosite-formatted data. Tested with: Release 15.0, July 1998 """ def feed(self, handle, consumer): """feed(self, handle, consumer) Feed in Prosite data for scanning. handle is a file-like object that contains prosite data. consumer is a Consumer object that will receive events as the report is scanned. """ if isinstance(handle, File.UndoHandle): uhandle = handle else: uhandle = File.UndoHandle(handle) while 1: line = uhandle.peekline() if not line: break elif is_blank_line(line): # Skip blank lines between records uhandle.readline() continue elif line[:2] == 'ID': self._scan_record(uhandle, consumer) elif line[:2] == 'CC': self._scan_copyrights(uhandle, consumer) else: raise SyntaxError, "There doesn't appear to be a record" def _scan_copyrights(self, uhandle, consumer): consumer.start_copyrights() self._scan_line('CC', uhandle, consumer.copyright, any_number=1) self._scan_terminator(uhandle, consumer) consumer.end_copyrights() def _scan_record(self, uhandle, consumer): consumer.start_record() for fn in self._scan_fns: fn(self, uhandle, consumer) # In Release 15.0, C_TYPE_LECTIN_1 has the DO line before # the 3D lines, instead of the other way around. # Thus, I'll give the 3D lines another chance after the DO lines # are finished. if fn is self._scan_do.im_func: self._scan_3d(uhandle, consumer) consumer.end_record() def _scan_line(self, line_type, uhandle, event_fn, exactly_one=None, one_or_more=None, any_number=None, up_to_one=None): # Callers must set exactly one of exactly_one, one_or_more, or # any_number to a true value. I do not explicitly check to # make sure this function is called correctly. # This does not guarantee any parameter safety, but I # like the readability. The other strategy I tried was have # parameters min_lines, max_lines. 
if exactly_one or one_or_more: read_and_call(uhandle, event_fn, start=line_type) if one_or_more or any_number: while 1: if not attempt_read_and_call(uhandle, event_fn, start=line_type): break if up_to_one: attempt_read_and_call(uhandle, event_fn, start=line_type) def _scan_id(self, uhandle, consumer): self._scan_line('ID', uhandle, consumer.identification, exactly_one=1) def _scan_ac(self, uhandle, consumer): self._scan_line('AC', uhandle, consumer.accession, exactly_one=1) def _scan_dt(self, uhandle, consumer): self._scan_line('DT', uhandle, consumer.date, exactly_one=1) def _scan_de(self, uhandle, consumer): self._scan_line('DE', uhandle, consumer.description, exactly_one=1) def _scan_pa(self, uhandle, consumer): self._scan_line('PA', uhandle, consumer.pattern, any_number=1) def _scan_ma(self, uhandle, consumer): self._scan_line('MA', uhandle, consumer.matrix, any_number=1) ## # ZN2_CY6_FUNGAL_2, DNAJ_2 in Release 15 ## # contain a CC line buried within an 'MA' line. Need to check ## # for that. 
## while 1: ## if not attempt_read_and_call(uhandle, consumer.matrix, start='MA'): ## line1 = uhandle.readline() ## line2 = uhandle.readline() ## uhandle.saveline(line2) ## uhandle.saveline(line1) ## if line1[:2] == 'CC' and line2[:2] == 'MA': ## read_and_call(uhandle, consumer.comment, start='CC') ## else: ## break def _scan_ru(self, uhandle, consumer): self._scan_line('RU', uhandle, consumer.rule, any_number=1) def _scan_nr(self, uhandle, consumer): self._scan_line('NR', uhandle, consumer.numerical_results, any_number=1) def _scan_cc(self, uhandle, consumer): self._scan_line('CC', uhandle, consumer.comment, any_number=1) def _scan_dr(self, uhandle, consumer): self._scan_line('DR', uhandle, consumer.database_reference, any_number=1) def _scan_3d(self, uhandle, consumer): self._scan_line('3D', uhandle, consumer.pdb_reference, any_number=1) def _scan_do(self, uhandle, consumer): self._scan_line('DO', uhandle, consumer.documentation, exactly_one=1) def _scan_terminator(self, uhandle, consumer): self._scan_line('//', uhandle, consumer.terminator, exactly_one=1) _scan_fns = [ _scan_id, _scan_ac, _scan_dt, _scan_de, _scan_pa, _scan_ma, _scan_ru, _scan_nr, _scan_cc, # This is a really dirty hack, and should be fixed properly at # some point. ZN2_CY6_FUNGAL_2, DNAJ_2 in Rel 15 and PS50309 # in Rel 17 have lines out of order. Thus, I have to rescan # these, which decreases performance. _scan_ma, _scan_nr, _scan_cc, _scan_dr, _scan_3d, _scan_do, _scan_terminator ] class _RecordConsumer(AbstractConsumer): """Consumer that converts a Prosite record to a Record object. Members: data Record with Prosite data. 
""" def __init__(self): self.data = None def start_record(self): self.data = Record() def end_record(self): self._clean_record(self.data) def identification(self, line): cols = string.split(line) if len(cols) != 3: raise SyntaxError, "I don't understand identification line\n%s" % \ line self.data.name = self._chomp(cols[1]) # don't want ';' self.data.type = self._chomp(cols[2]) # don't want '.' def accession(self, line): cols = string.split(line) if len(cols) != 2: raise SyntaxError, "I don't understand accession line\n%s" % line self.data.accession = self._chomp(cols[1]) def date(self, line): uprline = string.upper(line) cols = string.split(uprline) # Release 15.0 contains both 'INFO UPDATE' and 'INF UPDATE' if cols[2] != '(CREATED);' or \ cols[4] != '(DATA' or cols[5] != 'UPDATE);' or \ cols[7][:4] != '(INF' or cols[8] != 'UPDATE).': raise SyntaxError, "I don't understand date line\n%s" % line self.data.created = cols[1] self.data.data_update = cols[3] self.data.info_update = cols[6] def description(self, line): self.data.description = self._clean(line) def pattern(self, line): self.data.pattern = self.data.pattern + self._clean(line) def matrix(self, line): self.data.matrix.append(self._clean(line)) def rule(self, line): self.data.rules.append(self._clean(line)) def numerical_results(self, line): cols = string.split(self._clean(line), ';') for col in cols: if not col: continue qual, data = map(string.lstrip, string.split(col, '=')) if qual == '/RELEASE': release, seqs = string.split(data, ',') self.data.nr_sp_release = release self.data.nr_sp_seqs = int(seqs) elif qual == '/FALSE_NEG': self.data.nr_false_neg = int(data) elif qual == '/PARTIAL': self.data.nr_partial = int(data) elif qual in ['/TOTAL', '/POSITIVE', '/UNKNOWN', '/FALSE_POS']: m = re.match(r'(\d+)\((\d+)\)', data) if not m: raise error, "Broken data %s in comment line\n%s" % \ (repr(data), line) hits = tuple(map(int, m.groups())) if(qual == "/TOTAL"): self.data.nr_total = hits elif(qual == 
"/POSITIVE"): self.data.nr_positive = hits elif(qual == "/UNKNOWN"): self.data.nr_unknown = hits elif(qual == "/FALSE_POS"): self.data.nr_false_pos = hits else: raise SyntaxError, "Unknown qual %s in comment line\n%s" % \ (repr(qual), line) def comment(self, line): cols = string.split(self._clean(line), ';') for col in cols: # DNAJ_2 in Release 15 has a non-standard comment line: # CC Automatic scaling using reversed database # Throw it away. (Should I keep it?) if not col or col[:17] == 'Automatic scaling': continue qual, data = map(string.lstrip, string.split(col, '=')) if qual == '/TAXO-RANGE': self.data.cc_taxo_range = data elif qual == '/MAX-REPEAT': self.data.cc_max_repeat = data elif qual == '/SITE': pos, desc = string.split(data, ',') self.data.cc_site = (int(pos), desc) elif qual == '/SKIP-FLAG': self.data.cc_skip_flag = data elif qual == '/MATRIX_TYPE': self.data.cc_matrix_type = data elif qual == '/SCALING_DB': self.data.cc_scaling_db = data elif qual == '/AUTHOR': self.data.cc_author = data elif qual == '/FT_KEY': self.data.cc_ft_key = data elif qual == '/FT_DESC': self.data.cc_ft_desc = data else: raise SyntaxError, "Unknown qual %s in comment line\n%s" % \ (repr(qual), line) def database_reference(self, line): refs = string.split(self._clean(line), ';') for ref in refs: if not ref: continue acc, name, type = map(string.strip, string.split(ref, ',')) if type == 'T': self.data.dr_positive.append((acc, name)) elif type == 'F': self.data.dr_false_pos.append((acc, name)) elif type == 'N': self.data.dr_false_neg.append((acc, name)) elif type == 'P': self.data.dr_potential.append((acc, name)) elif type == '?': self.data.dr_unknown.append((acc, name)) else: raise SyntaxError, "I don't understand type flag %s" % type def pdb_reference(self, line): cols = string.split(line) for id in cols[1:]: # get all but the '3D' col self.data.pdb_structs.append(self._chomp(id)) def documentation(self, line): self.data.pdoc = self._chomp(self._clean(line)) def 
terminator(self, line):
        pass

    def _chomp(self, word, to_chomp='.,;'):
        # Remove the punctuation at the end of a word.
        if word[-1] in to_chomp:
            return word[:-1]
        return word

    def _clean(self, line, rstrip=1):
        # Clean up a line.
        if rstrip:
            return string.rstrip(line[5:])
        return line[5:]

def scan_sequence_expasy(seq=None, id=None, exclude_frequent=None):
    """scan_sequence_expasy(seq=None, id=None, exclude_frequent=None) ->
    list of PatternHit's

    Search a sequence for occurrences of Prosite patterns.  You can
    specify either a sequence in seq or a SwissProt/trEMBL ID or accession
    in id.  Only one of those should be given.  If exclude_frequent
    is true, then the patterns with the high probability of occurring
    will be excluded.

    """
    if (seq and id) or not (seq or id):
        raise ValueError, "Please specify either a sequence or an id"
    handle = ExPASy.scanprosite1(seq, id, exclude_frequent)
    return _extract_pattern_hits(handle)

def _extract_pattern_hits(handle):
    """_extract_pattern_hits(handle) -> list of PatternHit's

    Extract hits from a web page.  Raises a ValueError if there
    was an error in the query.

    """
    class parser(sgmllib.SGMLParser):
        def __init__(self):
            sgmllib.SGMLParser.__init__(self)
            self.hits = []
            self.broken_message = 'Some error occurred'
            self._in_pre = 0
            self._current_hit = None
            self._last_found = None   # Save state of parsing
        def handle_data(self, data):
            if string.find(data, 'try again') >= 0:
                self.broken_message = data
                return
            elif data == 'illegal':
                self.broken_message = 'Sequence contains illegal characters'
                return
            if not self._in_pre:
                return
            elif not string.strip(data):
                return
            if self._last_found is None and data[:4] == 'PDOC':
                self._current_hit.pdoc = data
                self._last_found = 'pdoc'
            elif self._last_found == 'pdoc':
                if data[:2] != 'PS':
                    raise SyntaxError, "Expected accession but got:\n%s" % data
                self._current_hit.accession = data
                self._last_found = 'accession'
            elif self._last_found == 'accession':
                self._current_hit.name = data
                self._last_found = 'name'
            elif self._last_found == 'name':
                self._current_hit.description = data
                self._last_found = 'description'
            elif self._last_found == 'description':
                m = re.findall(r'(\d+)-(\d+) (\w+)', data)
                for start, end, seq in m:
                    self._current_hit.matches.append(
                        (int(start), int(end), seq))

        def do_hr(self, attrs):
            # <HR> inside a <PRE> section means a new hit.
            if self._in_pre:
                self._current_hit = PatternHit()
                self.hits.append(self._current_hit)
                self._last_found = None
        def start_pre(self, attrs):
            self._in_pre = 1
            self.broken_message = None   # Probably not broken
        def end_pre(self):
            self._in_pre = 0
    p = parser()
    p.feed(handle.read())
    if p.broken_message:
        raise ValueError, p.broken_message
    return p.hits
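[Editor's note: the scraping pattern above — flip a flag on <PRE>, collect text runs, treat <HR> as a record separator — is easy to lose among the SGMLParser callbacks. Here is a minimal self-contained sketch of the same state-machine idea using html.parser, sgmllib's rough modern counterpart; the class name, chunk layout, and sample HTML are invented for illustration, not part of the Biopython code.]

```python
from html.parser import HTMLParser

class PreScraper(HTMLParser):
    """Collect text inside <pre> blocks; <hr> inside <pre> starts a new chunk."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []        # one list of text runs per <hr>-separated block
        self._in_pre = False
    def handle_starttag(self, tag, attrs):
        if tag == 'pre':
            self._in_pre = True
            self.chunks.append([])
        elif tag == 'hr' and self._in_pre:
            # Mirrors do_hr above: a rule inside <pre> means a new hit.
            self.chunks.append([])
    def handle_endtag(self, tag):
        if tag == 'pre':
            self._in_pre = False
    def handle_data(self, data):
        # Mirrors handle_data above: ignore text outside <pre> and blank runs.
        if self._in_pre and data.strip():
            self.chunks[-1].append(data.strip())

p = PreScraper()
p.feed("<html><pre>PDOC00064 PS00387<hr>PDOC00065 PS00388</pre></html>")
print([' '.join(c) for c in p.chunks])
```

[The flag-plus-state approach avoids parsing the surrounding page at all: only the `<pre>` payload is interpreted, which is why the real parser can tolerate arbitrary ExPASy page chrome.]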


        
    
def index_file(filename, indexname, rec2key=None):
    """index_file(filename, indexname, rec2key=None)

    Index a Prosite file.  filename is the name of the file.
    indexname is the name of the index to create.  rec2key is an
    optional callback that takes a Record and generates a unique key
    (e.g. the accession number) for the record.  If not specified,
    the record's name (the ID) will be used.

    """
    if not os.path.exists(filename):
        raise ValueError, "%s does not exist" % filename

    index = Index.Index(indexname, truncate=1)
    index[Dictionary._Dictionary__filename_key] = filename
    
    iter = Iterator(open(filename), parser=RecordParser())
    while 1:
        start = iter._uhandle.tell()
        rec = iter.next()
        length = iter._uhandle.tell() - start
        
        if rec is None:
            break
        if rec2key is not None:
            key = rec2key(rec)
        else:
            key = rec.name
            
        if not key:
            raise KeyError, "empty key was produced"
        elif index.has_key(key):
            raise KeyError, "duplicate key %s found" % key

        index[key] = start, length
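[Editor's note: the index built here is just a mapping from record key to a (start, length) byte span, which Dictionary.__getitem__ later uses to seek() and read() exactly one record. A minimal self-contained sketch of that scheme follows, with a plain dict standing in for Index.Index and two invented records; it is an illustration, not the Biopython API.]

```python
import io

def build_index(handle):
    # Map each record's ID to its (start, length) span in the stream.
    # Records are terminated by a '//' line, as in prosite.dat.
    index = {}
    while True:
        start = handle.tell()
        lines = []
        while True:
            line = handle.readline()
            if not line:
                break
            lines.append(line)
            if line.startswith('//'):
                break
        if not lines:
            break
        # 'ID   ADH_ZINC; PATTERN.' -> 'ADH_ZINC'
        key = lines[0].split()[1].rstrip(';')
        index[key] = (start, handle.tell() - start)
    return index

data = ("ID   ADH_ZINC; PATTERN.\nAC   PS00387;\n//\n"
        "ID   DNAJ_2; MATRIX.\nAC   PS50076;\n//\n")
handle = io.StringIO(data)
index = build_index(handle)

# Random access: seek to the stored offset and read exactly that many bytes.
start, length = index['ADH_ZINC']
handle.seek(start)
print(handle.read(length))
```

[Storing offsets instead of parsed records keeps the index tiny and lets the parser be swapped at lookup time, which is exactly why index_file records iter._uhandle.tell() before and after each record.]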

def _extract_record(handle):
    """_extract_record(handle) -> str

    Extract PROSITE data from a web page.  Raises a ValueError if no
    data was found in the web page.

    """
    # All the data appears between tags:
    # <pre>ID   NIR_SIR; PATTERN.
    # </pre>
class parser(sgmllib.SGMLParser): def __init__(self): sgmllib.SGMLParser.__init__(self) self._in_pre = 0 self.data = [] def handle_data(self, data): if self._in_pre: self.data.append(data) def do_br(self, attrs): if self._in_pre: self.data.append('\n') def start_pre(self, attrs): self._in_pre = 1 def end_pre(self): self._in_pre = 0 p = parser() p.feed(handle.read()) if not p.data: raise ValueError, "No data found in web page." return string.join(p.data, '') From mark at acoma.Stanford.EDU Thu Jan 24 17:48:39 2002 From: mark at acoma.Stanford.EDU (Mark Lambrecht) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] Prosite In-Reply-To: <20020124142801.D372@krusty.stanford.edu> Message-ID: Jeff, Everything works fine now. You saved my day : I needed the info in the prosite.dat file. Thanks, Mark On Thu, 24 Jan 2002, Jeffrey Chang wrote: > Yep, it looks like Release 17 from last month introduced some format > changes that broke the parser. I've updated the parser to handle the > new lines -- __init__.py is attached. Please try this out and let me > know how it works. Thanks for the report and the patch! > > Jeff > > > On Wed, Jan 23, 2002 at 12:36:51PM -0800, Mark Lambrecht wrote: > > Hi, > > > > Thanks for all the excellent Biopython code. > > I used the Prosite parser and it breaks on a number of CC and MA lines. > > Maybe there is a new version of the prosite.dat file ? > > We added some code to the Bio/Prosite/__init__.py , and commented it with > > ## (lambrecht/dyoo) > > Then everything works again but possibly doesn't use the information in > > these lines. > > I attached the __init__.py > > Could you take a look ? > > > > Thanks !! 
> > > > Mark > > > > > > -------------------------------------------------------------------------- > > Mark Lambrecht > > Postdoctoral Research Fellow > > The Arabidopsis Information Resource FAX: (650) 325-6857 > > Carnegie Institution of Washington Tel: (650) 325-1521 ext.397 > > Department of Plant Biology URL: http://arabidopsis.org/ > > 260 Panama St. > > Stanford, CA 94305 > > -------------------------------------------------------------------------- > > > # Copyright 1999 by Jeffrey Chang. All rights reserved. > > # This code is part of the Biopython distribution and governed by its > > # license. Please see the LICENSE file that should have been included > > # as part of this package. > > > > # Copyright 2000 by Jeffrey Chang. All rights reserved. > > # This code is part of the Biopython distribution and governed by its > > # license. Please see the LICENSE file that should have been included > > # as part of this package. > > > > """Prosite > > > > This module provides code to work with the prosite.dat file from > > Prosite. > > http://www.expasy.ch/prosite/ > > > > Tested with: > > Release 15.0, July 1998 > > Release 16.0, July 1999 > > > > > > Classes: > > Record Holds Prosite data. > > PatternHit Holds data from a hit against a Prosite pattern. > > Iterator Iterates over entries in a Prosite file. > > Dictionary Accesses a Prosite file using a dictionary interface. > > ExPASyDictionary Accesses Prosite records from ExPASy. > > RecordParser Parses a Prosite record into a Record object. > > > > _Scanner Scans Prosite-formatted data. > > _RecordConsumer Consumes Prosite data to a Record object. > > > > > > Functions: > > scan_sequence_expasy Scan a sequence for occurrences of Prosite patterns. > > index_file Index a Prosite file for a Dictionary. > > _extract_record Extract Prosite data from a web page. > > _extract_pattern_hits Extract Prosite patterns from a web page. 
> > > > """ > > __all__ = [ > > 'Pattern', > > 'Prodoc', > > ] > > from types import * > > import string > > import re > > import sgmllib > > from Bio import File > > from Bio import Index > > from Bio.ParserSupport import * > > from Bio.WWW import ExPASy > > from Bio.WWW import RequestLimiter > > > > class Record: > > """Holds information from a Prosite record. > > > > Members: > > name ID of the record. e.g. ADH_ZINC > > type Type of entry. e.g. PATTERN, MATRIX, or RULE > > accession e.g. PS00387 > > created Date the entry was created. (MMM-YYYY) > > data_update Date the 'primary' data was last updated. > > info_update Date data other than 'primary' data was last updated. > > pdoc ID of the PROSITE DOCumentation. > > > > description Free-format description. > > pattern The PROSITE pattern. See docs. > > matrix List of strings that describes a matrix entry. > > rules List of rule definitions. (strings) > > > > NUMERICAL RESULTS > > nr_sp_release SwissProt release. > > nr_sp_seqs Number of seqs in that release of Swiss-Prot. (int) > > nr_total Number of hits in Swiss-Prot. tuple of (hits, seqs) > > nr_positive True positives. tuple of (hits, seqs) > > nr_unknown Could be positives. tuple of (hits, seqs) > > nr_false_pos False positives. tuple of (hits, seqs) > > nr_false_neg False negatives. (int) > > nr_partial False negatives, because they are fragments. (int) > > > > COMMENTS > > cc_taxo_range Taxonomic range. See docs for format > > cc_max_repeat Maximum number of repetitions in a protein > > cc_site Interesting site. list of tuples (pattern pos, desc.) > > cc_skip_flag Can this entry be ignored? > > > > DATA BANK REFERENCES - The following are all > > lists of tuples (swiss-prot accession, > > swiss-prot name) > > dr_positive > > dr_false_neg > > dr_false_pos > > dr_potential Potential hits, but fingerprint region not yet available. > > dr_unknown Could possibly belong > > > > pdb_structs List of PDB entries. 
> > > > """ > > def __init__(self): > > self.name = '' > > self.type = '' > > self.accession = '' > > self.created = '' > > self.data_update = '' > > self.info_update = '' > > self.pdoc = '' > > > > self.description = '' > > self.pattern = '' > > self.matrix = [] > > self.rules = [] > > > > self.nr_sp_release = '' > > self.nr_sp_seqs = '' > > self.nr_total = (None, None) > > self.nr_positive = (None, None) > > self.nr_unknown = (None, None) > > self.nr_false_pos = (None, None) > > self.nr_false_neg = None > > self.nr_partial = None > > > > self.cc_taxo_range = '' > > self.cc_max_repeat = '' > > self.cc_site = [] > > self.cc_skip_flag = '' > > > > self.dr_positive = [] > > self.dr_false_neg = [] > > self.dr_false_pos = [] > > self.dr_potential = [] > > self.dr_unknown = [] > > > > self.pdb_structs = [] > > > > class PatternHit: > > """Holds information from a hit against a Prosite pattern. > > > > Members: > > name ID of the record. e.g. ADH_ZINC > > accession e.g. PS00387 > > pdoc ID of the PROSITE DOCumentation. > > description Free-format description. > > matches List of tuples (start, end, sequence) where > > start and end are indexes of the match, and sequence is > > the sequence matched. 
> > > > """ > > def __init__(self): > > self.name = None > > self.accession = None > > self.pdoc = None > > self.description = None > > self.matches = [] > > def __str__(self): > > lines = [] > > lines.append("%s %s %s" % (self.accession, self.pdoc, self.name)) > > lines.append(self.description) > > lines.append('') > > if len(self.matches) > 1: > > lines.append("Number of matches: %s" % len(self.matches)) > > for i in range(len(self.matches)): > > start, end, seq = self.matches[i] > > range_str = "%d-%d" % (start, end) > > if len(self.matches) > 1: > > lines.append("%7d %10s %s" % (i+1, range_str, seq)) > > else: > > lines.append("%7s %10s %s" % (' ', range_str, seq)) > > return string.join(lines, '\n') > > > > class Iterator: > > """Returns one record at a time from a Prosite file. > > > > Methods: > > next Return the next record from the stream, or None. > > > > """ > > def __init__(self, handle, parser=None): > > """__init__(self, handle, parser=None) > > > > Create a new iterator. handle is a file-like object. parser > > is an optional Parser object to change the results into another form. > > If set to None, then the raw contents of the file will be returned. > > > > """ > > if type(handle) is not FileType and type(handle) is not InstanceType: > > raise ValueError, "I expected a file handle or file-like object" > > self._uhandle = File.UndoHandle(handle) > > self._parser = parser > > > > def next(self): > > """next(self) -> object > > > > Return the next Prosite record from the file. If no more records, > > return None. > > > > """ > > # Skip the copyright info, if it's the first record. > > line = self._uhandle.peekline() > > if line[:2] == 'CC': > > while 1: > > line = self._uhandle.readline() > > if not line: > > break > > if line[:2] == '//': > > break > > if line[:2] != 'CC': > > raise SyntaxError, \ > > "Oops, where's the copyright?" 
> > > > lines = [] > > while 1: > > line = self._uhandle.readline() > > if not line: > > break > > lines.append(line) > > if line[:2] == '//': > > break > > > > if not lines: > > return None > > > > data = string.join(lines, '') > > if self._parser is not None: > > return self._parser.parse(File.StringHandle(data)) > > return data > > > > class Dictionary: > > """Accesses a Prosite file using a dictionary interface. > > > > """ > > __filename_key = '__filename' > > > > def __init__(self, indexname, parser=None): > > """__init__(self, indexname, parser=None) > > > > Open a Prosite Dictionary. indexname is the name of the > > index for the dictionary. The index should have been created > > using the index_file function. parser is an optional Parser > > object to change the results into another form. If set to None, > > then the raw contents of the file will be returned. > > > > """ > > self._index = Index.Index(indexname) > > self._handle = open(self._index[Dictionary.__filename_key]) > > self._parser = parser > > > > def __len__(self): > > return len(self._index) > > > > def __getitem__(self, key): > > start, len = self._index[key] > > self._handle.seek(start) > > data = self._handle.read(len) > > if self._parser is not None: > > return self._parser.parse(File.StringHandle(data)) > > return data > > > > def __getattr__(self, name): > > return getattr(self._index, name) > > > > class ExPASyDictionary: > > """Access PROSITE at ExPASy using a read-only dictionary interface. > > > > """ > > def __init__(self, delay=5.0, parser=None): > > """__init__(self, delay=5.0, parser=None) > > > > Create a new Dictionary to access PROSITE. parser is an optional > > parser (e.g. Prosite.RecordParser) object to change the results > > into another form. If set to None, then the raw contents of the > > file will be returned. delay is the number of seconds to wait > > between each query. 
> > > > """ > > self.parser = parser > > self.limiter = RequestLimiter(delay) > > > > def __len__(self): > > raise NotImplementedError, "Prosite contains lots of entries" > > def clear(self): > > raise NotImplementedError, "This is a read-only dictionary" > > def __setitem__(self, key, item): > > raise NotImplementedError, "This is a read-only dictionary" > > def update(self): > > raise NotImplementedError, "This is a read-only dictionary" > > def copy(self): > > raise NotImplementedError, "You don't need to do this..." > > def keys(self): > > raise NotImplementedError, "You don't really want to do this..." > > def items(self): > > raise NotImplementedError, "You don't really want to do this..." > > def values(self): > > raise NotImplementedError, "You don't really want to do this..." > > > > def has_key(self, id): > > """has_key(self, id) -> bool""" > > try: > > self[id] > > except KeyError: > > return 0 > > return 1 > > > > def get(self, id, failobj=None): > > try: > > return self[id] > > except KeyError: > > return failobj > > raise "How did I get here?" > > > > def __getitem__(self, id): > > """__getitem__(self, id) -> object > > > > Return a Prosite entry. id is either the id or accession > > for the entry. Raises a KeyError if there's an error. > > > > """ > > # First, check to see if enough time has passed since my > > # last query. > > self.limiter.wait() > > > > try: > > handle = ExPASy.get_prosite_entry(id) > > except IOError: > > raise KeyError, id > > try: > > handle = File.StringHandle(_extract_record(handle)) > > except ValueError: > > raise KeyError, id > > > > if self.parser is not None: > > return self.parser.parse(handle) > > return handle.read() > > > > class RecordParser(AbstractParser): > > """Parses Prosite data into a Record object. 
> > > > """ > > def __init__(self): > > self._scanner = _Scanner() > > self._consumer = _RecordConsumer() > > > > def parse(self, handle): > > self._scanner.feed(handle, self._consumer) > > return self._consumer.data > > > > class _Scanner: > > """Scans Prosite-formatted data. > > > > Tested with: > > Release 15.0, July 1998 > > > > """ > > def feed(self, handle, consumer): > > """feed(self, handle, consumer) > > > > Feed in Prosite data for scanning. handle is a file-like > > object that contains prosite data. consumer is a > > Consumer object that will receive events as the report is scanned. > > > > """ > > if isinstance(handle, File.UndoHandle): > > uhandle = handle > > else: > > uhandle = File.UndoHandle(handle) > > > > while 1: > > line = uhandle.peekline() > > if not line: > > break > > elif is_blank_line(line): > > # Skip blank lines between records > > uhandle.readline() > > continue > > elif line[:2] == 'ID': > > self._scan_record(uhandle, consumer) > > elif line[:2] == 'CC': > > self._scan_copyrights(uhandle, consumer) > > else: > > raise SyntaxError, "There doesn't appear to be a record" > > > > def _scan_copyrights(self, uhandle, consumer): > > consumer.start_copyrights() > > self._scan_line('CC', uhandle, consumer.copyright, any_number=1) > > self._scan_terminator(uhandle, consumer) > > consumer.end_copyrights() > > > > def _scan_record(self, uhandle, consumer): > > consumer.start_record() > > for fn in self._scan_fns: > > fn(self, uhandle, consumer) > > > > # In Release 15.0, C_TYPE_LECTIN_1 has the DO line before > > # the 3D lines, instead of the other way around. > > # Thus, I'll give the 3D lines another chance after the DO lines > > # are finished. 
> > if fn is self._scan_do.im_func: > > self._scan_3d(uhandle, consumer) > > consumer.end_record() > > > > def _scan_line(self, line_type, uhandle, event_fn, > > exactly_one=None, one_or_more=None, any_number=None, > > up_to_one=None): > > # Callers must set exactly one of exactly_one, one_or_more, or > > # any_number to a true value. I do not explicitly check to > > # make sure this function is called correctly. > > > > # This does not guarantee any parameter safety, but I > > # like the readability. The other strategy I tried was have > > # parameters min_lines, max_lines. > > > > if exactly_one or one_or_more: > > read_and_call(uhandle, event_fn, start=line_type) > > if one_or_more or any_number: > > while 1: > > if not attempt_read_and_call(uhandle, event_fn, > > start=line_type): > > break > > if up_to_one: > > attempt_read_and_call(uhandle, event_fn, start=line_type) > > > > def _scan_id(self, uhandle, consumer): > > self._scan_line('ID', uhandle, consumer.identification, exactly_one=1) > > > > def _scan_ac(self, uhandle, consumer): > > self._scan_line('AC', uhandle, consumer.accession, exactly_one=1) > > > > def _scan_dt(self, uhandle, consumer): > > self._scan_line('DT', uhandle, consumer.date, exactly_one=1) > > > > def _scan_de(self, uhandle, consumer): > > self._scan_line('DE', uhandle, consumer.description, exactly_one=1) > > > > def _scan_pa(self, uhandle, consumer): > > self._scan_line('PA', uhandle, consumer.pattern, any_number=1) > > > > def _scan_ma(self, uhandle, consumer): > > # ZN2_CY6_FUNGAL_2, DNAJ_2 in Release 15 > > # contain a CC line buried within an 'MA' line. Need to check > > # for that. 
> > while 1: > > if not attempt_read_and_call(uhandle, consumer.matrix, start='MA'): > > line1 = uhandle.readline() > > line2 = uhandle.readline() > > uhandle.saveline(line2) > > uhandle.saveline(line1) > > if line1[:2] == 'CC' and line2[:2] == 'MA': > > read_and_call(uhandle, consumer.comment, start='CC') > > else: > > break > > > > def _scan_ru(self, uhandle, consumer): > > self._scan_line('RU', uhandle, consumer.rule, any_number=1) > > > > def _scan_nr(self, uhandle, consumer): > > self._scan_line('NR', uhandle, consumer.numerical_results, > > any_number=1) > > > > def _scan_cc(self, uhandle, consumer): > > self._scan_line('CC', uhandle, consumer.comment, any_number=1) > > > > def _scan_dr(self, uhandle, consumer): > > self._scan_line('DR', uhandle, consumer.database_reference, > > any_number=1) > > > > def _scan_3d(self, uhandle, consumer): > > self._scan_line('3D', uhandle, consumer.pdb_reference, > > any_number=1) > > > > def _scan_do(self, uhandle, consumer): > > self._scan_line('DO', uhandle, consumer.documentation, exactly_one=1) > > > > def _scan_terminator(self, uhandle, consumer): > > self._scan_line('//', uhandle, consumer.terminator, exactly_one=1) > > > > _scan_fns = [ > > _scan_id, > > _scan_ac, > > _scan_dt, > > _scan_de, > > _scan_pa, > > _scan_ma, > > _scan_ru, > > _scan_nr, > > _scan_ma, ## (lambrecht/dyoo) is this right? > > _scan_nr, ## (lambrecht/dyoo) is this right? > > _scan_cc, > > _scan_dr, > > _scan_3d, > > _scan_do, > > _scan_terminator > > ] > > > > class _RecordConsumer(AbstractConsumer): > > """Consumer that converts a Prosite record to a Record object. > > > > Members: > > data Record with Prosite data. 
> > > > """ > > def __init__(self): > > self.data = None > > > > def start_record(self): > > self.data = Record() > > > > def end_record(self): > > self._clean_record(self.data) > > > > def identification(self, line): > > cols = string.split(line) > > if len(cols) != 3: > > raise SyntaxError, "I don't understand identification line\n%s" % \ > > line > > self.data.name = self._chomp(cols[1]) # don't want ';' > > self.data.type = self._chomp(cols[2]) # don't want '.' > > > > def accession(self, line): > > cols = string.split(line) > > if len(cols) != 2: > > raise SyntaxError, "I don't understand accession line\n%s" % line > > self.data.accession = self._chomp(cols[1]) > > > > def date(self, line): > > uprline = string.upper(line) > > cols = string.split(uprline) > > > > # Release 15.0 contains both 'INFO UPDATE' and 'INF UPDATE' > > if cols[2] != '(CREATED);' or \ > > cols[4] != '(DATA' or cols[5] != 'UPDATE);' or \ > > cols[7][:4] != '(INF' or cols[8] != 'UPDATE).': > > raise SyntaxError, "I don't understand date line\n%s" % line > > > > self.data.created = cols[1] > > self.data.data_update = cols[3] > > self.data.info_update = cols[6] > > > > def description(self, line): > > self.data.description = self._clean(line) > > > > def pattern(self, line): > > self.data.pattern = self.data.pattern + self._clean(line) > > > > def matrix(self, line): > > self.data.matrix.append(self._clean(line)) > > > > def rule(self, line): > > self.data.rules.append(self._clean(line)) > > > > def numerical_results(self, line): > > cols = string.split(self._clean(line), ';') > > for col in cols: > > if not col: > > continue > > qual, data = map(string.lstrip, string.split(col, '=')) > > if qual == '/RELEASE': > > release, seqs = string.split(data, ',') > > self.data.nr_sp_release = release > > self.data.nr_sp_seqs = int(seqs) > > elif qual == '/FALSE_NEG': > > self.data.nr_false_neg = int(data) > > elif qual == '/PARTIAL': > > self.data.nr_partial = int(data) > > ## (lambrecht/dyoo) added 
temporary fix for qual //MATRIX_TYPE in CC > > elif qual == '/MATRIX_TYPE': > > pass > > elif qual in ['/TOTAL', '/POSITIVE', '/UNKNOWN', '/FALSE_POS']: > > m = re.match(r'(\d+)\((\d+)\)', data) > > if not m: > > raise SyntaxError, "Broken data %s in comment line\n%s" % \ > > (repr(data), line) > > hits = tuple(map(int, m.groups())) > > if(qual == "/TOTAL"): > > self.data.nr_total = hits > > elif(qual == "/POSITIVE"): > > self.data.nr_positive = hits > > elif(qual == "/UNKNOWN"): > > self.data.nr_unknown = hits > > elif(qual == "/FALSE_POS"): > > self.data.nr_false_pos = hits > > else: > > raise SyntaxError, "Unknown qual %s in comment line\n%s" % \ > > (repr(qual), line) > > > > def comment(self, line): > > cols = string.split(self._clean(line), ';') > > for col in cols: > > # DNAJ_2 in Release 15 has a non-standard comment line: > > # CC Automatic scaling using reversed database > > # Throw it away. (Should I keep it?) > > if not col or col[:17] == 'Automatic scaling': > > continue > > qual, data = map(string.lstrip, string.split(col, '=')) > > if qual in ('/MATRIX_TYPE', '/SCALING_DB', '/AUTHOR', > > '/FT_KEY', '/FT_DESC'): > > continue ## (lambrecht/dyoo) This is a temporary fix until we know what > > ## to do here > > if qual == '/TAXO-RANGE': > > self.data.cc_taxo_range = data > > elif qual == '/MAX-REPEAT': > > self.data.cc_max_repeat = data > > elif qual == '/SITE': > > pos, desc = string.split(data, ',') > > self.data.cc_site.append((int(pos), desc)) > > elif qual == '/SKIP-FLAG': > > self.data.cc_skip_flag = data > > else: > > raise SyntaxError, "Unknown qual %s in comment line\n%s" % \ > > (repr(qual), line) > > > > def database_reference(self, line): > > refs = string.split(self._clean(line), ';') > > for ref in refs: > > if not ref: > > continue > > acc, name, type = map(string.strip, string.split(ref, ',')) > > if type == 'T': > > self.data.dr_positive.append((acc, name)) > > elif type == 'F': > > self.data.dr_false_pos.append((acc, name)) > > elif type == 'N': > 
> self.data.dr_false_neg.append((acc, name)) > > elif type == 'P': > > self.data.dr_potential.append((acc, name)) > > elif type == '?': > > self.data.dr_unknown.append((acc, name)) > > else: > > raise SyntaxError, "I don't understand type flag %s" % type > > > > def pdb_reference(self, line): > > cols = string.split(line) > > for id in cols[1:]: # get all but the '3D' col > > self.data.pdb_structs.append(self._chomp(id)) > > > > def documentation(self, line): > > self.data.pdoc = self._chomp(self._clean(line)) > > > > def terminator(self, line): > > pass > > > > def _chomp(self, word, to_chomp='.,;'): > > # Remove the punctuation at the end of a word. > > if word[-1] in to_chomp: > > return word[:-1] > > return word > > > > def _clean(self, line, rstrip=1): > > # Clean up a line. > > if rstrip: > > return string.rstrip(line[5:]) > > return line[5:] > > > > def scan_sequence_expasy(seq=None, id=None, exclude_frequent=None): > > """scan_sequence_expasy(seq=None, id=None, exclude_frequent=None) -> > > list of PatternHit's > > > > Search a sequence for occurrences of Prosite patterns. You can > > specify either a sequence in seq or a SwissProt/trEMBL ID or accession > > in id. Only one of those should be given. If exclude_frequent > > is true, then the patterns with the high probability of occurring > > will be excluded. > > > > """ > > if (seq and id) or not (seq or id): > > raise ValueError, "Please specify either a sequence or an id" > > handle = ExPASy.scanprosite1(seq, id, exclude_frequent) > > return _extract_pattern_hits(handle) > > > > def _extract_pattern_hits(handle): > > """_extract_pattern_hits(handle) -> list of PatternHit's > > > > Extract hits from a web page. Raises a ValueError if there > > was an error in the query. 
> > > > """ > > class parser(sgmllib.SGMLParser): > > def __init__(self): > > sgmllib.SGMLParser.__init__(self) > > self.hits = [] > > self.broken_message = 'Some error occurred' > > self._in_pre = 0 > > self._current_hit = None > > self._last_found = None # Save state of parsing > > def handle_data(self, data): > > if string.find(data, 'try again') >= 0: > > self.broken_message = data > > return > > elif data == 'illegal': > > self.broken_message = 'Sequence contains illegal characters' > > return > > if not self._in_pre: > > return > > elif not string.strip(data): > > return > > if self._last_found is None and data[:4] == 'PDOC': > > self._current_hit.pdoc = data > > self._last_found = 'pdoc' > > elif self._last_found == 'pdoc': > > if data[:2] != 'PS': > > raise SyntaxError, "Expected accession but got:\n%s" % data > > self._current_hit.accession = data > > self._last_found = 'accession' > > elif self._last_found == 'accession': > > self._current_hit.name = data > > self._last_found = 'name' > > elif self._last_found == 'name': > > self._current_hit.description = data > > self._last_found = 'description' > > elif self._last_found == 'description': > > m = re.findall(r'(\d+)-(\d+) (\w+)', data) > > for start, end, seq in m: > > self._current_hit.matches.append( > > (int(start), int(end), seq)) > > > > def do_hr(self, attrs): > > #
<hr> inside a <pre> section means a new hit.
> >             if self._in_pre:
> >                 self._current_hit = PatternHit()
> >                 self.hits.append(self._current_hit)
> >                 self._last_found = None
> >         def start_pre(self, attrs):
> >             self._in_pre = 1
> >             self.broken_message = None   # Probably not broken
> >         def end_pre(self):
> >             self._in_pre = 0
> >     p = parser()
> >     p.feed(handle.read())
> >     if p.broken_message:
> >         raise ValueError, p.broken_message
> >     return p.hits
> >
> >
> >
> >
> > def index_file(filename, indexname, rec2key=None):
> >     """index_file(filename, indexname, rec2key=None)
> >
> >     Index a Prosite file.  filename is the name of the file.
> >     indexname is the name of the dictionary.  rec2key is an
> >     optional callback that takes a Record and generates a unique key
> >     (e.g. the accession number) for the record.  If not specified,
> >     the id name will be used.
> >
> >     """
> >     if not os.path.exists(filename):
> >         raise ValueError, "%s does not exist" % filename
> >
> >     index = Index.Index(indexname, truncate=1)
> >     index[Dictionary._Dictionary__filename_key] = filename
> >
> >     iter = Iterator(open(filename), parser=RecordParser())
> >     while 1:
> >         start = iter._uhandle.tell()
> >         rec = iter.next()
> >         length = iter._uhandle.tell() - start
> >
> >         if rec is None:
> >             break
> >         if rec2key is not None:
> >             key = rec2key(rec)
> >         else:
> >             key = rec.name
> >
> >         if not key:
> >             raise KeyError, "empty key was produced"
> >         elif index.has_key(key):
> >             raise KeyError, "duplicate key %s found" % key
> >
> >         index[key] = start, length
> >
> > def _extract_record(handle):
> >     """_extract_record(handle) -> str
> >
> >     Extract PROSITE data from a web page.  Raises a ValueError if no
> >     data was found in the web page.
> >
> >     """
> >     # All the data appears between tags:
> >     # <pre>
> >     # ID   NIR_SIR; PATTERN.
> >     # </pre>
> > class parser(sgmllib.SGMLParser): > > def __init__(self): > > sgmllib.SGMLParser.__init__(self) > > self._in_pre = 0 > > self.data = [] > > def handle_data(self, data): > > if self._in_pre: > > self.data.append(data) > > def do_br(self, attrs): > > if self._in_pre: > > self.data.append('\n') > > def start_pre(self, attrs): > > self._in_pre = 1 > > def end_pre(self): > > self._in_pre = 0 > > p = parser() > > p.feed(handle.read()) > > if not p.data: > > raise ValueError, "No data found in web page." > > return string.join(p.data, '') > > > > -------------------------------------------------------------------------- Mark Lambrecht Postdoctoral Research Fellow The Arabidopsis Information Resource FAX: (650) 325-6857 Carnegie Institution of Washington Tel: (650) 325-1521 ext.397 Department of Plant Biology URL: http://arabidopsis.org/ 260 Panama St. Stanford, CA 94305 -------------------------------------------------------------------------- From johann at egenetics.com Fri Jan 25 03:11:29 2002 From: johann at egenetics.com (Johann Visagie) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] Re: [Zopymed] Hello fellow snakes !!! In-Reply-To: References: <3C4FC5BA.59361A92@fagmed.uit.no> Message-ID: <20020125081129.GA56371@fling.sanbi.ac.za> chrisf@fagmed.uit.no on 2002-01-24: > > Biopython zope objects complete with interface (Blast, .....) and output > (BioXML,....) > Bioperl zope objects complete with interface (T-coffee) and output > (BioXML, GAME, ...) > Web maintainers could just install the products and Voila. > Biopython ZClasses (and maybe BioPerl objects?) could be subclassed for > special uses through properties interface. > The output of one zope product could be fed into another allowing for > complex scripts. > A weird example. > From Swissprot choose trypsin->fasta->phylogeny->align each > group->consensus for each group->common restriction enzymes, ..... 
I think these are the sort of ideas that need to be shared with the Biopython (and possibly Bioperl) developers! :-) Jon Edwards on 2002-01-25 (Fri) at 00:35:01 -0000: > > For those of us not familiar with the BioInformatics field, could you give a > little more explanation of some of those terms? > [ snip ] > > Johann Visagie mentioned in an earlier post - > > "The concept of using Zope to build a set of "bio-web-widgets" on > top of Biopython has even been mooted at times." > > - is that the sort of thing you mean? Quite, yes. I should mention that it had been mooted mostly by me, and mentioned during the BioPython BoF at BOSC 2000. To my knowledge, no actual work - or even serious discussion - along these lines has yet been undertaken by anyone. At least not using Zope! > Would this be mainly for the > Bioinformatics community, or would it also be useful for other medical > fields? Mostly Bioinformatics, I would assume. > Please excuse my ignorance, I'm from a techie, not a medical background :-) Ditto. :-) -- V From biopython-bugs at bioperl.org Sat Jan 26 13:22:55 2002 From: biopython-bugs at bioperl.org (biopython-bugs@bioperl.org) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] Notification: incoming/55 Message-ID: <200201261822.g0QIMtA00439@pw600a.bioperl.org> JitterBug notification new message incoming/55 Message summary for PR#55 From: newgene@bigfoot.com Subject: GenBank parser problem? Date: Sat, 26 Jan 2002 13:22:55 -0500 0 replies 0 followups ====> ORIGINAL MESSAGE FOLLOWS <==== >From newgene@bigfoot.com Sat Jan 26 13:22:55 2002 Received: from localhost (localhost [127.0.0.1]) by pw600a.bioperl.org (8.11.2/8.11.2) with ESMTP id g0QIMtA00434 for ; Sat, 26 Jan 2002 13:22:55 -0500 Date: Sat, 26 Jan 2002 13:22:55 -0500 Message-Id: <200201261822.g0QIMtA00434@pw600a.bioperl.org> From: newgene@bigfoot.com To: biopython-bugs@bioperl.org Subject: GenBank parser problem? 
Full_Name: Chunlei Wu Module: Bio/GenBank Version: 1.00a4 OS: win2000 Submission from: pathg01-178.mdacc.tmc.edu (143.111.173.178) Python version: ActivePython 2.1.1 Symptom: >>> from Bio import GenBank >>> gi=GenBank.search_for("NM_007355")[0] >>> ncbi_dict=GenBank.NCBIDictionary(parser=GenBank.FeatureParser()) >>> record=ncbi_dict[gi] Traceback (most recent call last): File "<stdin>", line 1, in ? File "C:\Python21\Bio\GenBank\__init__.py", line 1555, in __getitem__ return self.parser.parse(handle) File "C:\Python21\Bio\GenBank\__init__.py", line 268, in parse self._scanner.feed(handle, self._consumer) File "C:\Python21\Bio\GenBank\__init__.py", line 1250, in feed self._parser.parseFile(handle) File "C:\Python21\Martel\Parser.py", line 230, in parseFile self.parseString(fileobj.read()) File "C:\Python21\Martel\Parser.py", line 258, in parseString self._err_handler.fatalError(result) File "C:\Python21\lib\xml\sax\handler.py", line 38, in fatalError raise exception ParserPositionException: error parsing at or beyond character 55 >>> Did GenBank change the format? Thanks. Chunlei From r.grenyer at ic.ac.uk Sun Jan 27 19:50:17 2002 From: r.grenyer at ic.ac.uk (Rich Grenyer) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] Notification: incoming/55 In-Reply-To: <200201261822.g0QIMtA00439@pw600a.bioperl.org> References: <200201261822.g0QIMtA00439@pw600a.bioperl.org> Message-ID: Found the same problem on Friday with BioPython1.00a4 on both a MacPython2.2 and a Linux Python2.1 installation. Rich >JitterBug notification > >new message incoming/55 > >Message summary for PR#55 > From: newgene@bigfoot.com > Subject: GenBank parser problem? 
> Date: Sat, 26 Jan 2002 13:22:55 -0500 > 0 replies 0 followups > >====> ORIGINAL MESSAGE FOLLOWS <==== > >>From newgene@bigfoot.com Sat Jan 26 13:22:55 2002 >Received: from localhost (localhost [127.0.0.1]) > by pw600a.bioperl.org (8.11.2/8.11.2) with ESMTP id g0QIMtA00434 > for ; Sat, 26 Jan 2002 >13:22:55 -0500 >Date: Sat, 26 Jan 2002 13:22:55 -0500 >Message-Id: <200201261822.g0QIMtA00434@pw600a.bioperl.org> >From: newgene@bigfoot.com >To: biopython-bugs@bioperl.org >Subject: GenBank parser problem? > >Full_Name: Chunlei Wu >Module: Bio/GenBank >Version: 1.00a4 >OS: win2000 >Submission from: pathg01-178.mdacc.tmc.edu (143.111.173.178) > > >Python version: ActivePython 2.1.1 > >Symptom: > >>>> from Bio import GenBank >>>> gi=GenBank.search_for("NM_007355")[0] >>>> ncbi_dict=GenBank.NCBIDictionary(parser=GenBank.FeatureParser()) >>>> record=ncbi_dict[gi] >Traceback (most recent call last): > File "<stdin>", line 1, in ? > File "C:\Python21\Bio\GenBank\__init__.py", line 1555, in __getitem__ > return self.parser.parse(handle) > File "C:\Python21\Bio\GenBank\__init__.py", line 268, in parse > self._scanner.feed(handle, self._consumer) > File "C:\Python21\Bio\GenBank\__init__.py", line 1250, in feed > self._parser.parseFile(handle) > File "C:\Python21\Martel\Parser.py", line 230, in parseFile > self.parseString(fileobj.read()) > File "C:\Python21\Martel\Parser.py", line 258, in parseString > self._err_handler.fatalError(result) > File "C:\Python21\lib\xml\sax\handler.py", line 38, in fatalError > raise exception >ParserPositionException: error parsing at or beyond character 55 >>>> > > >Did GenBank change the format? >Thanks. 
> >Chunlei > > > >_______________________________________________ >Biopython-dev mailing list >Biopython-dev@biopython.org >http://biopython.org/mailman/listinfo/biopython-dev -- ___________________________ Rich Grenyer Mammalian Evolution and Conservation Department of Biology and Biochemistry Imperial College at Silwood Park Sunningdale Berkshire SL5 7PY UNITED KINGDOM Tel: +00 44 (0)20 7594 2328 Fax: +00 44 (0)20 7594 2339 Mob: +00 44 (0)7967 632093 email: r.grenyer@ic.ac.uk WWW: http://www.bio.ic.ac.uk/evolve/people/rich ___________________________ From pewilkinson at informaxinc.com Mon Jan 28 11:15:06 2002 From: pewilkinson at informaxinc.com (Peter Wilkinson) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] Notification: incoming/55 In-Reply-To: <200201271704.g0RH4Lit020756@pw600a.bioperl.org> Message-ID: <000401c1a816$f43c8b80$c920a8c0@l001696w00> Yes, release 127 is different from 126 Peter W. > Message: 1 > Date: Sat, 26 Jan 2002 13:22:55 -0500 > From: biopython-bugs@bioperl.org > To: biopython-dev@biopython.org > Subject: [Biopython-dev] Notification: incoming/55 > > JitterBug notification > > new message incoming/55 > > Message summary for PR#55 > From: newgene@bigfoot.com > Subject: GenBank parser problem? > Date: Sat, 26 Jan 2002 13:22:55 -0500 > 0 replies 0 followups > > ====> ORIGINAL MESSAGE FOLLOWS <==== > > >From newgene@bigfoot.com Sat Jan 26 13:22:55 2002 > Received: from localhost (localhost [127.0.0.1]) > by pw600a.bioperl.org (8.11.2/8.11.2) with ESMTP id g0QIMtA00434 > for ; Sat, 26 Jan > 2002 13:22:55 -0500 > Date: Sat, 26 Jan 2002 13:22:55 -0500 > Message-Id: <200201261822.g0QIMtA00434@pw600a.bioperl.org> > From: newgene@bigfoot.com > To: biopython-bugs@bioperl.org > Subject: GenBank parser problem? 
> > Full_Name: Chunlei Wu > Module: Bio/GenBank > Version: 1.00a4 > OS: win2000 > Submission from: pathg01-178.mdacc.tmc.edu (143.111.173.178) > > > Python version: ActivePython 2.1.1 > > Symptom: > > >>> from Bio import GenBank > >>> gi=GenBank.search_for("NM_007355")[0] > >>> ncbi_dict=GenBank.NCBIDictionary(parser=GenBank.FeatureParser()) > >>> record=ncbi_dict[gi] > Traceback (most recent call last): > File "<stdin>", line 1, in ? > File "C:\Python21\Bio\GenBank\__init__.py", line 1555, in > __getitem__ > return self.parser.parse(handle) > File "C:\Python21\Bio\GenBank\__init__.py", line 268, in parse > self._scanner.feed(handle, self._consumer) > File "C:\Python21\Bio\GenBank\__init__.py", line 1250, in feed > self._parser.parseFile(handle) > File "C:\Python21\Martel\Parser.py", line 230, in parseFile > self.parseString(fileobj.read()) > File "C:\Python21\Martel\Parser.py", line 258, in parseString > self._err_handler.fatalError(result) > File "C:\Python21\lib\xml\sax\handler.py", line 38, in fatalError > raise exception > ParserPositionException: error parsing at or beyond character 55 > >>> > > > Did GenBank change the format? > Thanks. > > Chunlei > > > > > > --__--__-- > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev@biopython.org > http://biopython.org/mailman/listinfo/biopython-dev > > > End of Biopython-dev Digest From katel at worldpath.net Wed Jan 30 05:27:09 2002 From: katel at worldpath.net (Cayte) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] ECell Message-ID: <000801c1a978$ad4f2760$010a0a0a@cadence.com> I just committed ECell. ECell passes my test on DOS, at least. It needs more documentation though, so I plan to add more of an explanation. 
Cayte From biopython-bugs at bioperl.org Wed Jan 30 09:12:55 2002 From: biopython-bugs at bioperl.org (biopython-bugs@bioperl.org) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] Notification: incoming/54 Message-ID: <200201301412.g0UECtit019709@pw600a.bioperl.org> JitterBug notification chapmanb moved PR#54 from incoming to trash Message summary for PR#54 From: Subject: toner cartridges Date: Tue, 20 Nov 2001 17:48:48 0 replies 0 followups From biopython-bugs at bioperl.org Wed Jan 30 09:13:53 2002 From: biopython-bugs at bioperl.org (biopython-bugs@bioperl.org) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] Notification: incoming/55 Message-ID: <200201301413.g0UEDrit019728@pw600a.bioperl.org> JitterBug notification chapmanb changed notes Message summary for PR#55 From: newgene@bigfoot.com Subject: GenBank parser problem? Date: Sat, 26 Jan 2002 13:22:55 -0500 0 replies 0 followups Notes: NCBI added a new "linear" word to the LOCUS line which broke the parser here. Fixed in revision 1.17 of genbank_format.py, and tests added for this case. 
====> ORIGINAL MESSAGE FOLLOWS <==== >From newgene@bigfoot.com Sat Jan 26 13:22:55 2002 Received: from localhost (localhost [127.0.0.1]) by pw600a.bioperl.org (8.11.2/8.11.2) with ESMTP id g0QIMtA00434 for ; Sat, 26 Jan 2002 13:22:55 -0500 Date: Sat, 26 Jan 2002 13:22:55 -0500 Message-Id: <200201261822.g0QIMtA00434@pw600a.bioperl.org> From: newgene@bigfoot.com To: biopython-bugs@bioperl.org Subject: GenBank parser problem? Full_Name: Chunlei Wu Module: Bio/GenBank Version: 1.00a4 OS: win2000 Submission from: pathg01-178.mdacc.tmc.edu (143.111.173.178) Python version: ActivePython 2.1.1 Symptom: >>> from Bio import GenBank >>> gi=GenBank.search_for("NM_007355")[0] >>> ncbi_dict=GenBank.NCBIDictionary(parser=GenBank.FeatureParser()) >>> record=ncbi_dict[gi] Traceback (most recent call last): File "", line 1, in ? File "C:\Python21\Bio\GenBank\__init__.py", line 1555, in __getitem__ return self.parser.parse(handle) File "C:\Python21\Bio\GenBank\__init__.py", line 268, in parse self._scanner.feed(handle, self._consumer) File "C:\Python21\Bio\GenBank\__init__.py", line 1250, in feed self._parser.parseFile(handle) File "C:\Python21\Martel\Parser.py", line 230, in parseFile self.parseString(fileobj.read()) File "C:\Python21\Martel\Parser.py", line 258, in parseString self._err_handler.fatalError(result) File "C:\Python21\lib\xml\sax\handler.py", line 38, in fatalError raise exception ParserPositionException: error parsing at or beyond character 55 >>> Did GenBank change the format? Thanks. Chunlei From biopython-bugs at bioperl.org Wed Jan 30 09:13:54 2002 From: biopython-bugs at bioperl.org (biopython-bugs@bioperl.org) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] Notification: incoming/55 Message-ID: <200201301413.g0UEDsit019730@pw600a.bioperl.org> JitterBug notification chapmanb moved PR#55 from incoming to fixed-bugs Message summary for PR#55 From: newgene@bigfoot.com Subject: GenBank parser problem? 
Date: Sat, 26 Jan 2002 13:22:55 -0500 0 replies 0 followups Notes: NCBI added a new "linear" word to the LOCUS line which broke the parser here. Fixed in revision 1.17 of genbank_format.py, and tests added for this case. 
Chunlei From chapmanb at arches.uga.edu Wed Jan 30 09:26:47 2002 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] GenBank parsing problem In-Reply-To: <200201261822.g0QIMtA00439@pw600a.bioperl.org> References: <200201261822.g0QIMtA00439@pw600a.bioperl.org> Message-ID: <20020130092647.C56210@ci350185-a.athen1.ga.home.com> Hi Chunlei; Thanks for reporting the problem (and thanks to others who verified it). > >>> from Bio import GenBank > >>> gi=GenBank.search_for("NM_007355")[0] > >>> ncbi_dict=GenBank.NCBIDictionary(parser=GenBank.FeatureParser()) > >>> record=ncbi_dict[gi] > Traceback (most recent call last): [...] > ParserPositionException: error parsing at or beyond character 55 > >>> > > Did GenBank change the format? Yup, it looks like they added a new "linear" word to the LOCUS line, to complement "circular" I guess: LOCUS AC091001 177066 bp DNA linear PRI 06-DEC-2001 Sorry, I'd tried to prepare for the new format changes, but hadn't realized this change was going to happen. The diff to Bio/GenBank/genbank_format.py is attached (fixes and tests for this case are also in CVS). I checked it out on a PRI download from NCBI, and it seems to be working for me. Thanks again for the report! I hope this fixes your problem. Please let me know if you have any questions. Brad -------------- next part -------------- Index: genbank_format.py =================================================================== RCS file: /home/repository/biopython/biopython/Bio/GenBank/genbank_format.py,v retrieving revision 1.16 retrieving revision 1.17 diff -c -r1.16 -r1.17 *** genbank_format.py 2002/01/05 22:09:58 1.16 --- genbank_format.py 2002/01/30 13:54:05 1.17 *************** *** 106,112 **** Martel.Opt(Martel.Alt(*residue_prefixes)) + Martel.Opt(Martel.Alt(*residue_types)) + Martel.Opt(Martel.Opt(blank_space) + ! 
Martel.Str("circular"))) date = Martel.Group("date", Martel.Re("[-\w]+")) --- 106,113 ---- Martel.Opt(Martel.Alt(*residue_prefixes)) + Martel.Opt(Martel.Alt(*residue_types)) + Martel.Opt(Martel.Opt(blank_space) + ! Martel.Alt(Martel.Str("circular"), ! Martel.Str("linear")))) date = Martel.Group("date", Martel.Re("[-\w]+")) From reillywu at yahoo.com Wed Jan 30 13:02:28 2002 From: reillywu at yahoo.com (Chunlei Wu) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] GenBank parsing problem In-Reply-To: <20020130092647.C56210@ci350185-a.athen1.ga.home.com> Message-ID: <20020130180228.41161.qmail@web20507.mail.yahoo.com> Hi Brad, Thank you for your fix, but it seems I cannot get the latest version of genbank_format.py from CVS. The current one is still the old revision, 1.16. Maybe there is some delay on the server? Chunlei --- Brad Chapman wrote: > Hi Chunlei; > Thanks for reporting the problem (and thanks to > others who verified it). > > > >>> from Bio import GenBank > > >>> gi=GenBank.search_for("NM_007355")[0] > > >>> > ncbi_dict=GenBank.NCBIDictionary(parser=GenBank.FeatureParser()) > > >>> record=ncbi_dict[gi] > > Traceback (most recent call last): > [...] > > ParserPositionException: error parsing at or > beyond character 55 > > >>> > > > > Did GenBank change the format? > > Yup, it looks like they added a new "linear" word to > the LOCUS line, to > complement "circular" I guess: > > LOCUS AC091001 177066 bp DNA > linear PRI 06-DEC-2001 > > Sorry, I'd tried to prepare for the new format > changes, but hadn't > realized this change was going to happen. The diff > to > Bio/GenBank/genbank_format.py is attached (fixes and > tests for this case > are also in CVS). I checked it out on a PRI download > from NCBI, and it > seems to be working for me. > > Thanks again for the report! I hope this fixes your > problem. Please let > me know if you have any questions. 
> Brad > > Index: genbank_format.py > =================================================================== > RCS file: > /home/repository/biopython/biopython/Bio/GenBank/genbank_format.py,v > retrieving revision 1.16 > retrieving revision 1.17 > diff -c -r1.16 -r1.17 > *** genbank_format.py 2002/01/05 22:09:58 1.16 > --- genbank_format.py 2002/01/30 13:54:05 1.17 > *************** > *** 106,112 **** > > Martel.Opt(Martel.Alt(*residue_prefixes)) + > > Martel.Opt(Martel.Alt(*residue_types)) + > > Martel.Opt(Martel.Opt(blank_space) + > ! > Martel.Str("circular"))) > > date = Martel.Group("date", > Martel.Re("[-\w]+")) > --- 106,113 ---- > > Martel.Opt(Martel.Alt(*residue_prefixes)) + > > Martel.Opt(Martel.Alt(*residue_types)) + > > Martel.Opt(Martel.Opt(blank_space) + > ! > Martel.Alt(Martel.Str("circular"), > ! > Martel.Str("linear")))) > > date = Martel.Group("date", > Martel.Re("[-\w]+")) > __________________________________________________ Do You Yahoo!? Great stuff seeking new owners in Yahoo! Auctions! http://auctions.yahoo.com From chapmanb at arches.uga.edu Wed Jan 30 14:39:17 2002 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] GenBank parsing problem In-Reply-To: <20020130180228.41161.qmail@web20507.mail.yahoo.com> References: <20020130092647.C56210@ci350185-a.athen1.ga.home.com> <20020130180228.41161.qmail@web20507.mail.yahoo.com> Message-ID: <20020130143917.A56849@ci350185-a.athen1.ga.home.com> Hi Chunlei; > Thank you for your fix. But it seems I can not get > the latest version of genbank_format.py from CVS. The > current one is still the old one ver. 1.16. Maybe some > delay of the server? Hmm, you're right. It still is 1.16. The anonymous CVS normally syncs up with the read/write access CVS in a few hours, so maybe something is wrong with anonymous CVS (most of the administrators are having fun in the sun in Arizona right now). 
Anyways, until the fix moves to anonymous CVS, you can grab the changed file from: http://www.bioinformatics.org/bradstuff/bp/genbank_format-1.17.py Sorry about the pain! Brad From reillywu at yahoo.com Wed Jan 30 16:53:09 2002 From: reillywu at yahoo.com (Chunlei Wu) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] GenBank parsing problem In-Reply-To: <20020130143917.A56849@ci350185-a.athen1.ga.home.com> Message-ID: <20020130215309.88303.qmail@web20507.mail.yahoo.com> Hi Brad, It works now on my computer, but there are two things I want to point out: 1. The current release of biopython 1.00a4 doesn't include "Std.py", which is needed by this new genbank_format.py. So I updated my biopython from CVS, including Martel, and updated genbank_format.py to 1.17. 2. The code works fine under the Python shell and the IDLE environment, but it does not work under PythonWin's IDE, which raised exactly the same error message. I can't figure out why; it was really strange and frustrating at first, but when I switched to the Python shell it worked fine. Anyway, I think this is probably a problem with PythonWin. I hope this experience helps other people who want to update genbank_format.py. Chunlei --- Brad Chapman wrote: > Hi Chunlei; > > > Thank you for your fix. But it seems I can not > get > > the latest version of genbank_format.py from CVS. > The > > current one is still the old one ver. 1.16. Maybe > some > > delay of the server? > > Hmm, you're right. It still is 1.16. The anonymous > CVS normally syncs up > with the read/write access CVS in a few hours, so > maybe something is > wrong with anonymous CVS (most of the administrators > are having fun in > the sun in Arizona right now). > > Anyways, until the fix moves to anonymous CVS, you > can grab the changed > file from: > > http://www.bioinformatics.org/bradstuff/bp/genbank_format-1.17.py > > Sorry about the pain! 
> Brad > _______________________________________________ > Biopython-dev mailing list > Biopython-dev@biopython.org > http://biopython.org/mailman/listinfo/biopython-dev __________________________________________________ Do You Yahoo!? Great stuff seeking new owners in Yahoo! Auctions! http://auctions.yahoo.com From biopython-bugs at bioperl.org Wed Jan 30 18:29:12 2002 From: biopython-bugs at bioperl.org (biopython-bugs@bioperl.org) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] Notification: incoming/56 Message-ID: <200201302329.g0UNTCit023916@pw600a.bioperl.org> JitterBug notification new message incoming/56 Message summary for PR#56 From: rree@ucdavis.edu Subject: GenBank parser Date: Wed, 30 Jan 2002 18:29:11 -0500 0 replies 0 followups ====> ORIGINAL MESSAGE FOLLOWS <==== >From rree@ucdavis.edu Wed Jan 30 18:29:12 2002 Received: from localhost (localhost [127.0.0.1]) by pw600a.bioperl.org (8.12.2/8.12.2) with ESMTP id g0UNTBit023912 for ; Wed, 30 Jan 2002 18:29:12 -0500 Date: Wed, 30 Jan 2002 18:29:11 -0500 Message-Id: <200201302329.g0UNTBit023912@pw600a.bioperl.org> From: rree@ucdavis.edu To: biopython-bugs@bioperl.org Subject: GenBank parser Full_Name: Rick Ree Module: GenBank/genbank_format.py Version: CVS 1.16 OS: Linux Submission from: loco.ucdavis.edu (169.237.66.27) Genbank parser was choking on the plant flat files from NCBI's ftp site -- on the LOCUS line of the record, the parser was expecting 'circular' where my file had 'linear'. Here is a diff that fixes the problem, but the formatting is all wonky 'cos of this HTML form, sorry. 
-Rick Index: genbank_format.py =================================================================== RCS file: /home/repository/biopython/biopython/Bio/GenBank/genbank_format.py,v retrieving revision 1.16 diff -u -r1.16 genbank_format.py --- genbank_format.py 5 Jan 2002 22:09:58 -0000 1.16 +++ genbank_format.py 30 Jan 2002 23:31:47 -0000 @@ -106,7 +106,10 @@ Martel.Opt(Martel.Alt(*residue_prefixes)) + Martel.Opt(Martel.Alt(*residue_types)) + Martel.Opt(Martel.Opt(blank_space) + - Martel.Str("circular"))) + Martel.Str("circular")) + + Martel.Opt(Martel.Opt(blank_space) + + Martel.Str("linear")) + ) date = Martel.Group("date", Martel.Re("[-\w]+")) From biopython-bugs at bioperl.org Wed Jan 30 19:46:32 2002 From: biopython-bugs at bioperl.org (biopython-bugs@bioperl.org) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] Notification: incoming/56 Message-ID: <200201310046.g0V0kWit024255@pw600a.bioperl.org> JitterBug notification chapmanb changed notes Message summary for PR#56 From: rree@ucdavis.edu Subject: GenBank parser Date: Wed, 30 Jan 2002 18:29:11 -0500 0 replies 0 followups Notes: Thanks Rick. I actually just fixed this bug this morning :-). Your fix is basically identical to mine. Thanks for the report; those sneaky fellas at NCBI got another change by me! 
From biopython-bugs at bioperl.org Wed Jan 30 19:46:32 2002 From: biopython-bugs at bioperl.org (biopython-bugs@bioperl.org) Date: Sat Mar 5 14:43:10 2005 Subject: [Biopython-dev] Notification: incoming/56 Message-ID: <200201310046.g0V0kWit024257@pw600a.bioperl.org> JitterBug notification chapmanb moved PR#56 from incoming to fixed-bugs Message summary for PR#56 From: rree@ucdavis.edu Subject: GenBank parser Date: Wed, 30 Jan 2002 18:29:11 -0500 0 replies 0 followups Notes: Thanks Rick. I actually just fixed this bug this morning :-). Your fix is basically identical to mine. Thanks for the report; those sneaky fellas at NCBI got another change by me!
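[Archive note] The LOCUS-line change discussed in this thread can be illustrated outside of Martel with a plain regular expression. This is a simplified, hypothetical sketch (the pattern and the field names are the annotator's, not Biopython's actual grammar); the point is that the topology token is optional and, after NCBI's format change, may be either "circular" or "linear":

```python
import re

# Hypothetical, simplified LOCUS-line pattern -- a sketch of the fix
# discussed above, not the real genbank_format.py grammar.  The
# topology group is optional and accepts "linear" as well as "circular".
LOCUS_RE = re.compile(
    r"^LOCUS\s+(?P<name>\S+)\s+(?P<length>\d+)\s+bp"
    r"\s+(?P<moltype>\S+)"
    r"(?:\s+(?P<topology>circular|linear))?"
    r"\s+(?P<division>[A-Z]{3})\s+(?P<date>[-\w]+)\s*$"
)

def parse_locus(line):
    """Return a dict of LOCUS fields, or None if the line doesn't match."""
    m = LOCUS_RE.match(line)
    return m.groupdict() if m else None
```

With this sketch, Brad's example line "LOCUS AC091001 177066 bp DNA linear PRI 06-DEC-2001" parses with topology "linear", while older records with "circular" or with no topology token at all still match, which is exactly the tolerance the Martel fix adds.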