From katel at worldpath.net  Sat Dec  1 22:39:02 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:43:07 2005
Subject: [Biopython-dev] Align
Message-ID: <000701c17ae2$e42349c0$010a0a0a@cadence.com>

  The sig in Generic.Alignment function, add_sequence,
    def add_sequence(self, descriptor, sequence, start = None, end = None,
                     weight = 1.0):

does not allow the caller to pass in the name of the sequence.  I think the
descriptor should have a default of the empty string and the name should be
part of the signature.
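
A minimal sketch of what the proposed signature could look like. The
Alignment class here is a toy stand-in written for illustration, not the
actual Generic.Alignment code, and the argument order is an assumption:

```python
class Alignment:
    """Toy stand-in for an alignment container (hypothetical, for illustration)."""

    def __init__(self):
        self.records = []

    def add_sequence(self, name, sequence, descriptor="", start=None,
                     end=None, weight=1.0):
        # 'name' is a short tag; 'descriptor' is optional free text
        # and defaults to the empty string, as proposed above.
        self.records.append({
            "name": name,
            "descriptor": descriptor,
            "sequence": sequence,
            "start": start,
            "end": end,
            "weight": weight,
        })


align = Alignment()
align.add_sequence("seq1", "ACGT-ACGT")
align.add_sequence("seq2", "ACGTTACGT", descriptor="example description line")
```

Note that moving descriptor out of the first position would change the
meaning of existing positional calls, so callers would need updating.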

  I need the Alignment class for the SAF parser, because the SAF format
represents alignments rather than isolated sequences.

                                    Cayte


From idoerg at cc.huji.ac.il  Mon Dec  3 05:22:20 2001
From: idoerg at cc.huji.ac.il (Iddo Friedberg)
Date: Sat Mar  5 14:43:07 2005
Subject: [Biopython-dev] Server request
Message-ID: <Pine.GSO.4.40_heb2.09.0112031018420.21569-100000@new-shum>

Hi all,

I am not sure to whom this request should be addressed, but as this may be
of general interest to most people on the list, I am putting it here.

I have recently completed the first stage of what I call the PeCoP
("pea-cop") server. Briefly, the user enters a sequence, and receives an
annotated output of conserved positions, as determined by multiple
PSI-BLAST runs. Due to some recent manpower reshuffle, my faculty is
not-that-equipped to handle mounting of CGI-script driven pages. So I'm
bumming around. Any chance of getting this hosted on biopython.org?

As PeCoP drives a modified version of standalone PSI-BLAST,
it needs the following:

1) An installed standalone version of "my" PSI-BLAST (blastpgpI). Probably
my binary, compiled on  Linux  2.2.16-22 will work. If not, I can always
recompile.

2) The biggie: NCBI database versions of sequence databases. Currently I
use nrgb, whose size is in the 390MB region. Actually, that's an old nrgb.
The latest version is probably a bit larger.

For now, biopython is used in a very rudimentary fashion in PeCoP. Parsing
fasta format, mainly. But the use of it will grow as I add features...

Takers?

Iddo

--

Iddo Friedberg                                  | Tel: +972-2-6757374
Dept. of Molecular Genetics and Biotechnology   | Fax: +972-2-6757308
The Hebrew University - Hadassah Medical School | email: idoerg@cc.huji.ac.il
POB 12272, Jerusalem 91120                      |
Israel                                          |
http://bioinfo.md.huji.ac.il/marg/people-home/iddo/


From chapmanb at arches.uga.edu  Thu Dec  6 08:08:20 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:43:07 2005
Subject: [Biopython-dev] Align
In-Reply-To: <000701c17ae2$e42349c0$010a0a0a@cadence.com>
References: <000701c17ae2$e42349c0$010a0a0a@cadence.com>
Message-ID: <20011206080820.C45321@ci350185-a.athen1.ga.home.com>

Hi Cayte;

>   The sig in Generic.Alignment function, add_sequence,
>     def add_sequence(self, descriptor, sequence, start = None, end = None,
>                      weight = 1.0):
> 
> does not allow the caller to pass in the name of the sequence.  

Hmmm... this is what the "descriptor" argument is supposed to be for. Do
you need to pass in more than just this? 

One thing we could do is add an additional function along the lines of:

def add_seq_record(self, record, start = None, end = None, weight = 1.0):
    
which would allow you to build up a SeqRecord object with names,
annotations, features and whatever else, and then add it into an
alignment. Would this solve your problem?
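
A rough sketch of how such a method might behave. Both classes below are
simplified stand-ins made up for this example (the real SeqRecord has more
fields), and filling in start/end from the sequence length is an assumption:

```python
from dataclasses import dataclass, field


@dataclass
class SeqRecord:
    """Simplified stand-in for a sequence record (hypothetical fields)."""
    seq: str
    name: str = ""
    description: str = ""
    annotations: dict = field(default_factory=dict)


class Alignment:
    """Toy alignment that stores whole records rather than bare strings."""

    def __init__(self):
        self.rows = []

    def add_seq_record(self, record, start=None, end=None, weight=1.0):
        # Assumed defaults: cover the whole sequence when no range is given.
        if start is None:
            start = 0
        if end is None:
            end = len(record.seq)
        self.rows.append((record, start, end, weight))


rec = SeqRecord("ACGT-ACGT", name="seq1", description="first sequence")
align = Alignment()
align.add_seq_record(rec)
```

The caller builds the record once, with its name and description kept
separate, and the alignment does not need to know about either.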

Brad

From idoerg at cc.huji.ac.il  Thu Dec  6 10:55:42 2001
From: idoerg at cc.huji.ac.il (Iddo Friedberg)
Date: Sat Mar  5 14:43:07 2005
Subject: [Biopython-dev] SubsMat CVS update
In-Reply-To: <20011206080820.C45321@ci350185-a.athen1.ga.home.com>
Message-ID: <Pine.GSO.4.40_heb2.09.0112061753260.23660-100000@new-shum>

Hi,

1) Fixed a bug in SubsMat. Half-matrices are no longer generated
automatically in the class constructor.

2) Fixed the "different-float-representations-on-different-platforms"
bug(?). Now let us hope that all look at integers in the same fashion :)

Iddo

--

Iddo Friedberg                                  | Tel: +972-2-6757374
Dept. of Molecular Genetics and Biotechnology   | Fax: +972-2-6757308
The Hebrew University - Hadassah Medical School | email: idoerg@cc.huji.ac.il
POB 12272, Jerusalem 91120                      |
Israel                                          |
http://bioinfo.md.huji.ac.il/marg/people-home/iddo/


From jchang at smi.stanford.edu  Thu Dec  6 13:05:18 2001
From: jchang at smi.stanford.edu (Jeffrey Chang)
Date: Sat Mar  5 14:43:07 2005
Subject: [Biopython-dev] SubsMat CVS update
In-Reply-To: <Pine.GSO.4.40_heb2.09.0112061753260.23660-100000@new-shum>
References: <20011206080820.C45321@ci350185-a.athen1.ga.home.com> <Pine.GSO.4.40_heb2.09.0112061753260.23660-100000@new-shum>
Message-ID: <20011206100518.A374@krusty.stanford.edu>

Thanks!  These notes are helpful for people working with this code,
and also helpful for me when I generate the release notes for the next
release.

Jeff


On Thu, Dec 06, 2001 at 05:55:42PM +0200, Iddo Friedberg wrote:
> Hi,
> 
> 1) Fixed a bug in SubsMat. Half-matrices are no longer generated
> automatically in the class constructor.
> 
> 2) Fixed the "different-float-representations-on-different-platforms"
> bug(?). Now let us hope that all look at integers in the same fashion :)
> 
> Iddo
> 

From gec at compbio.berkeley.edu  Thu Dec  6 17:54:36 2001
From: gec at compbio.berkeley.edu (Gavin E. Crooks)
Date: Sat Mar  5 14:43:07 2005
Subject: [Biopython-dev] Failed Tests
Message-ID: <01120614593201.13517@sienna.berkeley.edu>

I now have 3 tests failing. (Which is a lot better than a month ago.)
test_intelligenetics and test_metatool still fail, as does test_nbrf.

Gavin Crooks
gec@compbio.berkeley.edu
http://threeplusone.com

======================================================================
ERROR: test_intelligenetics
----------------------------------------------------------------------
Traceback (most recent call last):
  File "run_tests.py", line 136, in runTest
    __import__(self.test_name)
  File "test_intelligenetics.py", line 29, in ?
    src_handle = open( datafile )
IOError: [Errno 2] No such file or directory: 'IntelliGenetics/TAT_mase_nuc.txt'
======================================================================
ERROR: test_metatool
----------------------------------------------------------------------
Traceback (most recent call last):
  File "run_tests.py", line 136, in runTest
    __import__(self.test_name)
  File "./test_metatool.py", line 29, in ?
    src_handle = open( datafile )
IOError: [Errno 2] No such file or directory: 'MetaTool/meta9.out'
======================================================================
ERROR: test_nbrf
----------------------------------------------------------------------
Traceback (most recent call last):
  File "run_tests.py", line 136, in runTest
    __import__(self.test_name)
  File "test_nbrf.py", line 6, in ?
    import Bio.NBRF
  File "/n/teal.berkeley.edu/home/gec/local/lib/python2.1/site-packages/Bio/NBRF/__init__.py", line 24, in ?
    import nbrf_format
  File "/n/teal.berkeley.edu/home/gec/local/lib/python2.1/site-packages/Bio/NBRF/nbrf_format.py", line 40, in ?
    from Bio.NBRF.ValSeq import valid_sequence_dict
ImportError: No module named ValSeq
----------------------------------------------------------------------
Ran 32 tests in 76.650s         

From chapmanb at arches.uga.edu  Fri Dec  7 10:20:05 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:43:07 2005
Subject: [Biopython-dev] Failed Tests
In-Reply-To: <01120614593201.13517@sienna.berkeley.edu>
References: <01120614593201.13517@sienna.berkeley.edu>
Message-ID: <20011207102005.A47951@ci350185-a.athen1.ga.home.com>

Hi Gavin;

> I now have 3 tests failing. (Which is a lot better than a month ago.)

I'm glad it improved :-). I did some cross-version, cross-platform work
on the tests, so it's good that I actually fixed some tests.

> test_intelligenetics and test_metatool still fail, as does test_nbrf.

Hmmm, these are all Cayte's tests, and it looks like all of the failures
are due to non-committed files (NBRF was not in the setup.py file, which
I fixed, but it still fails after that).

Cayte, could you do a checkout of the CVS code in a fresh directory, and
make sure that you've committed all of the files for these tests and
modules? It looks like all of the problems are missing files, which you
probably have in your local working directory, but you haven't committed
to the CVS repository.

Thanks Gavin for the heads up and making us look at this!
Brad

From jchang at smi.stanford.edu  Fri Dec  7 12:08:32 2001
From: jchang at smi.stanford.edu (Jeffrey Chang)
Date: Sat Mar  5 14:43:07 2005
Subject: [Biopython-dev] Failed Tests
In-Reply-To: <20011207102005.A47951@ci350185-a.athen1.ga.home.com>
References: <01120614593201.13517@sienna.berkeley.edu> <20011207102005.A47951@ci350185-a.athen1.ga.home.com>
Message-ID: <20011207090832.A644@krusty.stanford.edu>

People, please check to make sure your regression tests are working
before you check them in.  Having regression tests that always fail is
worse than having no regression tests and causes a lot of wasted time.

Thanks to Gavin and Brad for checking into this!

Jeff

From katel at worldpath.net  Fri Dec  7 18:36:31 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] Failed Tests
References: <01120614593201.13517@sienna.berkeley.edu> <20011207102005.A47951@ci350185-a.athen1.ga.home.com>
Message-ID: <000701c17f78$01453380$010a0a0a@cadence.com>

----- Original Message -----
From: "Brad Chapman" <chapmanb@arches.uga.edu>
To: <biopython-dev@biopython.org>
Sent: Friday, December 07, 2001 7:20 AM
Subject: Re: [Biopython-dev] Failed Tests


> Hi Gavin;
>
> > I now have 3 tests failing. (Which is a lot better than a month ago.)
>
> I'm glad it improved :-). I did some cross-version, cross-platform work
> on the tests, so it's good that I actually fixed some tests.
>
> > test_intelligenetics and test_metatool still fail, as does test_nbrf.
>
> Hmmm, these are all Cayte's tests, and it looks like all of the failures
> are due to non-committed files (NBRF was not in the setup.py file, which
> I fixed, but it still fails after that).
>
> Cayte, could you do a checkout of the CVS code in a fresh directory, and
> make sure that you've committed all of the files for these test and
> modules? It looks like all of the problems are missing files, which you
> probably have in your local working directory, but you haven't committed
> to the CVS repository.
>
  I plan to look into it.  I was planning to fix it yesterday but my
keyboard conked out, requiring a run to CompUSA.

                                              Cayte


From katel at worldpath.net  Fri Dec  7 18:54:21 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] Failed Tests
References: <01120614593201.13517@sienna.berkeley.edu> <20011207102005.A47951@ci350185-a.athen1.ga.home.com> <20011207090832.A644@krusty.stanford.edu>
Message-ID: <001d01c17f7a$7ee8acc0$010a0a0a@cadence.com>

----- Original Message -----
From: "Jeffrey Chang" <jchang@smi.stanford.edu>
To: <biopython-dev@biopython.org>
Sent: Friday, December 07, 2001 9:08 AM
Subject: Re: [Biopython-dev] Failed Tests


> People, please check to make sure your regression tests are working
> before you check them in.  Having regression tests that always fail is
> worse than having no regression tests and causes a lot of wasted time.
>
> Thanks to Gavin and Brad for checking into this!
>
  MetaTool worked on my system because it's Windows/DOS, which is not case
sensitive.
Ideally I should run these tests on the Unix system, but I'm queasy about
running them since I don't own the computer. (At work we have a MITS
department to fix crashed computers.)  The tests worked on my local system.
Also, in the CVS docs I didn't see a tree listing command.  This would help
a lot in checking for missing uploads if someone knows what it is.  I
think Tarjei added ValSeq since the tests.
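
On the case-sensitivity point: one way to catch this kind of mismatch on a
case-insensitive file system is to compare each path component against the
actual directory listing. This is only a sketch; the helper name is made up
and the demo directory is a throwaway temp directory:

```python
import os
import tempfile


def exists_with_exact_case(path, root="."):
    """Return True only if each component of a simple relative path
    matches a directory entry exactly, including letter case."""
    current = root
    for part in os.path.normpath(path).split(os.sep):
        try:
            entries = os.listdir(current)
        except OSError:
            return False
        if part not in entries:
            return False
        current = os.path.join(current, part)
    return True


# Demo in a temporary directory laid out like the test data.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "MetaTool"))
    with open(os.path.join(tmp, "MetaTool", "meta9.out"), "w") as handle:
        handle.write("placeholder\n")

    exact = exists_with_exact_case("MetaTool/meta9.out", root=tmp)
    wrong_case = exists_with_exact_case("metatool/meta9.out", root=tmp)
```

Because os.listdir reports names as stored on disk, the second lookup is
rejected even on systems where open() would have accepted it.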

                                                                Cayte



From tarjei at genome.wi.mit.edu  Fri Dec  7 16:09:19 2001
From: tarjei at genome.wi.mit.edu (Tarjei Mikkelsen)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] Failed Tests
In-Reply-To: <001d01c17f7a$7ee8acc0$010a0a0a@cadence.com>
Message-ID: <000401c17f63$6fb35c80$67135512@mit.edu>

>I think Tarjei added ValSeq since the tests.

ValSeq? Nope, not me...


- Tarjei


From katel at worldpath.net  Fri Dec  7 19:10:40 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] Failed Tests
References: <000401c17f63$6fb35c80$67135512@mit.edu>
Message-ID: <003a01c17f7c$c677d960$010a0a0a@cadence.com>

----- Original Message -----
From: "Tarjei Mikkelsen" <tarjei@genome.wi.mit.edu>
To: "'Cayte'" <katel@worldpath.net>; "'Jeffrey Chang'"
<jchang@smi.stanford.edu>; <biopython-dev@biopython.org>
Sent: Friday, December 07, 2001 1:09 PM
Subject: RE: [Biopython-dev] Failed Tests


> >I think Tarjei added ValSeq since the tests.
>
> ValSeq? Nope, not me...
>
>
> - Tarjei
 I'll check it in anyway.

                           Cayte


From katel at worldpath.net  Fri Dec  7 19:25:29 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] Failed Tests
References: <000401c17f63$6fb35c80$67135512@mit.edu>
Message-ID: <004601c17f7e$d83bf760$010a0a0a@cadence.com>

----- Original Message -----
From: "Tarjei Mikkelsen" <tarjei@genome.wi.mit.edu>
To: "'Cayte'" <katel@worldpath.net>; "'Jeffrey Chang'"
<jchang@smi.stanford.edu>; <biopython-dev@biopython.org>
Sent: Friday, December 07, 2001 1:09 PM
Subject: RE: [Biopython-dev] Failed Tests


> >I think Tarjei added ValSeq since the tests.
>
> ValSeq? Nope, not me...
>
>
  Sorry, I was thinking of the Pathway/metatool stuff.

                                   Cayte


From gec at compbio.berkeley.edu  Fri Dec  7 15:55:14 2001
From: gec at compbio.berkeley.edu (Gavin E. Crooks)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] Failed Tests
In-Reply-To: <001d01c17f7a$7ee8acc0$010a0a0a@cadence.com>
References: <01120614593201.13517@sienna.berkeley.edu> <20011207090832.A644@krusty.stanford.edu> <001d01c17f7a$7ee8acc0$010a0a0a@cadence.com>
Message-ID: <01120713045402.13517@sienna.berkeley.edu>

>   MetaTool worked on my system because its Windows/Dos which is not case
> sensitive.
> Ideally I should run these tests on the Unix system but I'm queasy about
> running it since I don't own the computer. (  At work we have a MITS
> department to fix crashed computers ).  The tests worked on my local system.

One possible improvement would be to use Continuous Integration.

http://www.martinfowler.com/articles/continuousIntegration.html

For example, we could have a daemon that runs once a day. It would do a
clean installation, build and test of biopython, and send out warning emails
if anything goes wrong. This would ensure that tests never stay broken for
very long.

I kind of like this idea, and I may have a go at implementing it sometime.
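
A minimal sketch of the core loop such a daemon might run, assuming the
real steps (fresh checkout, install, test run) are supplied as commands. The
commands below are placeholders that only imitate a passing build and a
failing test run, and the emailing part is left out:

```python
import subprocess
import sys


def run_step(name, command):
    """Run one command and return (name, succeeded, combined output)."""
    result = subprocess.run(command, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    return name, result.returncode == 0, result.stdout


def nightly_check(steps):
    """Run the steps in order, stopping at the first failure.

    Returns a list of (step name, output) for failed steps."""
    failures = []
    for name, command in steps:
        step_name, ok, output = run_step(name, command)
        if not ok:
            failures.append((step_name, output))
            break  # later steps depend on earlier ones
    return failures


# Placeholder steps standing in for the real checkout/build/test commands.
demo_steps = [
    ("build", [sys.executable, "-c", "print('build ok')"]),
    ("test", [sys.executable, "-c",
              "import sys; print('1 test failed'); sys.exit(1)"]),
]

failures = nightly_check(demo_steps)
for name, output in failures:
    print("step %r failed:\n%s" % (name, output))
```

A scheduler such as cron could call this once a day, and the failure list
could be turned into a warning email to the list.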

Gavin 

From gec at compbio.berkeley.edu  Fri Dec  7 16:15:35 2001
From: gec at compbio.berkeley.edu (Gavin E. Crooks)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] Failed Tests
In-Reply-To: <01120713045402.13517@sienna.berkeley.edu>
References: <01120614593201.13517@sienna.berkeley.edu> <001d01c17f7a$7ee8acc0$010a0a0a@cadence.com> <01120713045402.13517@sienna.berkeley.edu>
Message-ID: <01120713202303.13517@sienna.berkeley.edu>

test_metatool is now giving an even more mysterious error message!? 

Gavin

======================================================================
ERROR: test_metatool
----------------------------------------------------------------------
Traceback (most recent call last):
  File "run_tests.py", line 136, in runTest
    __import__(self.test_name)
  File "./test_metatool.py", line 32, in ?
    print data
  File "/n/teal.berkeley.edu/home/gec/local/lib/python2.1/site-packages/Bio/MetaTool/Record.py", line 119, in __str__
  File "/n/teal.berkeley.edu/home/gec/local/lib/python2.1/site-packages/Bio/MetaTool/Record.py", line 51, in __str__
    if( self.matrix != None ):
  File "/n/teal.berkeley.edu/home/gec/local/lib/python2.1/site-packages/Numeric/UserArray.py", line 163, in __ne__
    def __ne__(self,other): return self._rc(not_equal(self.array,other))
SystemError: Objects/object.c:727: bad argument to internal function
----------------------------------------------------------------------
Ran 1 tests in 1.541s             
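
The traceback shows "self.matrix != None" being dispatched to the array's
__ne__ method. Whatever the underlying cause of the SystemError, an identity
test avoids that call entirely. The Matrix class below is a toy written for
this example, not Numeric's UserArray, and this is not necessarily the fix
that was applied:

```python
class Matrix:
    """Toy array-like class whose != is element-wise (illustration only)."""

    def __init__(self, values):
        self.values = list(values)

    def __ne__(self, other):
        # Element-wise comparison; comparing against None makes no sense here.
        if other is None:
            raise TypeError("cannot compare elements with None")
        return [value != other for value in self.values]


matrix = Matrix([1, 2, 3])

# Identity test: never calls Matrix.__ne__, so it works for any object.
has_matrix = matrix is not None

# Equality-style test: goes through __ne__ and fails for this class.
try:
    matrix != None  # noqa: E711 (deliberately the fragile form)
    comparison_failed = False
except TypeError:
    comparison_failed = True
```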

From jchang at smi.stanford.edu  Fri Dec  7 16:55:40 2001
From: jchang at smi.stanford.edu (Jeffrey Chang)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] Failed Tests
In-Reply-To: <01120713045402.13517@sienna.berkeley.edu>
References: <01120614593201.13517@sienna.berkeley.edu> <20011207090832.A644@krusty.stanford.edu> <001d01c17f7a$7ee8acc0$010a0a0a@cadence.com> <01120713045402.13517@sienna.berkeley.edu>
Message-ID: <20011207135540.D644@krusty.stanford.edu>

On Fri, Dec 07, 2001 at 12:55:14PM -0800, Gavin E. Crooks wrote:
> For example, we could have a daemon that runs once a day. It would do a
> clean installation, build and test of biopython, and send out warning emails
> if anything goes wrong. This would ensure that tests never stay broken for
> very long.
> 
> I kind of like this idea, and I may have a go at implementing it sometime.

Yeah, that would be cool.  All the bio* projects have been needing
such a system for build and regression tests.  If you're interested
in working on this, or have ideas about how it should be done, please
contact Chris Dagdigian (dag@sonsorol.org) and the folks on the
website mailing list:
http://bioperl.org/mailman/listinfo/webteam

Jeff

From katel at worldpath.net  Fri Dec  7 20:01:25 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] Align
References: <000701c17ae2$e42349c0$010a0a0a@cadence.com> <20011206080820.C45321@ci350185-a.athen1.ga.home.com>
Message-ID: <005b01c17f83$dd897120$010a0a0a@cadence.com>

----- Original Message -----
From: "Brad Chapman" <chapmanb@arches.uga.edu>
To: <biopython-dev@biopython.org>
Sent: Thursday, December 06, 2001 5:08 AM
Subject: Re: [Biopython-dev] Align


> Hi Cayte;
>
> >   The sig in Generic.Alignment function, add_sequence,
> >     def add_sequence(self, descriptor, sequence, start = None, end = None,
> >                      weight = 1.0):
> >
> > does not allow the caller to pass in the name of the sequence.
>
> Hmmm... this is what the "descriptor" argument is supposed to be for. Do
> you need to pass in more than just this?
>
  Usually I think of name as a tag and descriptor as a line of text that
elaborates.

                                          Cayte


From katel at worldpath.net  Fri Dec  7 20:27:35 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] Failed Tests
References: <01120614593201.13517@sienna.berkeley.edu> <001d01c17f7a$7ee8acc0$010a0a0a@cadence.com> <01120713045402.13517@sienna.berkeley.edu> <01120713202303.13517@sienna.berkeley.edu>
Message-ID: <006301c17f87$87cf3220$010a0a0a@cadence.com>

----- Original Message -----
From: "Gavin E. Crooks" <gec@compbio.berkeley.edu>
To: <biopython-dev@biopython.org>
Cc: "Cayte" <katel@worldpath.net>
Sent: Friday, December 07, 2001 1:15 PM
Subject: Re: [Biopython-dev] Failed Tests


> test_metatool is now giving an even more mysterious error message!?
>
> Gavin
>
> ======================================================================
> ERROR: test_metatool
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "run_tests.py", line 136, in runTest
>     __import__(self.test_name)
>   File "./test_metatool.py", line 32, in ?
>     print data
>   File "/n/teal.berkeley.edu/home/gec/local/lib/python2.1/site-packages/Bio/MetaTool/Record.py", line 119, in __str__
>   File "/n/teal.berkeley.edu/home/gec/local/lib/python2.1/site-packages/Bio/MetaTool/Record.py", line 51, in __str__
>     if( self.matrix != None ):
>   File "/n/teal.berkeley.edu/home/gec/local/lib/python2.1/site-packages/Numeric/UserArray.py", line 163, in __ne__
>     def __ne__(self,other): return self._rc(not_equal(self.array,other))
> SystemError: Objects/object.c:727: bad argument to internal function
> ----------------------------------------------------------------------
Looks like a possible versioning problem.  My version is 20.1 but it looks
like 20.2 came down the pike in September.

I think you've got a great idea with continuous integration.  It would solve
versioning too. Let me know if I can help even though my system is Windows
for now.  Somehow I'm not enthusiastic about XP so I may go for Linux.

                                                         Cayte


From jchang at smi.stanford.edu  Tue Dec 11 13:26:51 2001
From: jchang at smi.stanford.edu (Jeffrey Chang)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] ready for release?
Message-ID: <20011211102650.B412@krusty.stanford.edu>

Hello everybody,

We've got a lot of new stuff, so I think it's time to roll a new
release.  This will still be an alpha release, which means that new
features are ok, as long as they're relatively bug-free.

For core developers, please let me know if this is a good time to do
it, when it might be possible (e.g. after this nasty core dump gets
fixed today :), or any other issues that might be related to sending
this code out into the world...

Jeff

From chapmanb at arches.uga.edu  Tue Dec 11 13:59:05 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] ready for release?
In-Reply-To: <20011211102650.B412@krusty.stanford.edu>
References: <20011211102650.B412@krusty.stanford.edu>
Message-ID: <20011211135905.A6612@ci350185-a.athen1.ga.home.com>

Hey Jeff;
Sweet. Glad we're getting this together. I just finished my last paper
early Monday morning and am rolling together some lab stuff I've been
working on right now, so I should have some time over the next week to
work on this. So it's a good time, I think :-)

> We've got a lot of new stuff, so I think it's time to roll a new
> release.  This will still be an alpha release, which means that new
> features are ok, as long as they're relatively bug-free.

Okay, I've got a bunch of code that could be checked in. Here's the
list:

=> Generic Application Framework (Bio/Application). This is basically
what I wrote about previously; a general way to construct commandlines
for programs. This includes a commandline for BLAST
(Bio/Blast/Program.py) and functionality for running any commandline.
This Application stuff also interacts with BioCorba, so it is very cool;
I think :-)

=> Parsers and commandline interfaces for some Emboss primer-related
programs (primer3 and primersearch). Bio/Emboss/Primer.py and Program.py
plus some martel definitions.

=> Neural Network code (Bio/NeuralNetwork). Back propagation neural
networks, plus code to convert sequences as inputs into Neural networks.

=> Basic Hidden Markov Models (Bio/HMM). This includes Standard
and Baum Welch trainers and Viterbi prediction, all based heavily on
the Durbin et al book. 

=> Genetic Algorithm code (Bio/GA). This includes a fairly general
Genetic Algorithm framework, so isn't biology specific, but useful.

=> Drawing code that interacts with the reportlab pdf generation library
(Bio/Graphics). This makes it easier to draw pretty pictures of
chromosomes, and some other chart and graph stuff.

Whew, I think that's it. The code has all been used in real life
applications (which is why I wrote it :-), and has fairly good tests
written in the standard biopython style. The lacking thing is
documentation; I haven't been able to get myself up to writing docs for
a while (too many damn papers for classes, I guess :-).

What do people think? Do you want any of this? Which modules? Do you
want me to make a tarball of the code so you can look at it? If you just
want to glance, this is in CVS at:

http://bioinformatics.org/cgi-bin/cvsweb.cgi/biopy-pgml/

Let me know what you guys think. I'm very happy to donate this to
biopython if you want it, and think I should have time over the next
week to check it all in and everything.

All-done-blathering-now-ly yr's
Brad

From katel at worldpath.net  Tue Dec 11 21:36:56 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] ready for release?
References: <20011211102650.B412@krusty.stanford.edu>
Message-ID: <007601c182b5$dec18340$010a0a0a@cadence.com>

----- Original Message -----
From: "Jeffrey Chang" <jchang@smi.stanford.edu>
To: <biopython-dev@biopython.org>
Sent: Tuesday, December 11, 2001 10:26 AM
Subject: [Biopython-dev] ready for release?


> Hello everybody,
>
> We've got a lot of new stuff, so I think it's time to roll a new
> release.  This will still be an alpha release, which means that new
> features are ok, as long as they're relatively bug-free.
>
> For core developers, please let me know if this is a good time to do
> it, when it might be possible (e.g. after this nasty core dump gets
> fixed today :), or any other issues that might be related to sending
> this code out into the world...
  I downloaded the latest version of NumPy and made a change to my code
which fixed a problem  Gavin pointed out.  However, my tests rely on the
repr of Matrix.  A change in the latest rev of NumPy causes the printout of
the matrices to be a little different.  I'll need to submit a new baseline.

  Also, something on Gavin's system is causing test_nbrf to fail.  I
downloaded the nbrf files from the CVS
browser and ran again and it passed on my Windows system.  On Gavin's system
it fails near the line feed.

                                                   Cayte


From idoerg at cc.huji.ac.il  Tue Dec 11 19:33:13 2001
From: idoerg at cc.huji.ac.il (Iddo Friedberg)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] ready for release?
In-Reply-To: <20011211102650.B412@krusty.stanford.edu>
Message-ID: <Pine.GSO.4.40_heb2.09.0112120232050.16224-100000@new-shum>

Fine with me. Unless there is a bug in one of "my" modules (SubsMat, FSSP),
in which case I cannot do anything about it before the middle of next
week.

Iddo

On Tue, 11 Dec 2001, Jeffrey Chang wrote:

: Hello everybody,
:
: We've got a lot of new stuff, so I think it's time to roll a new
: release.  This will still be an alpha release, which means that new
: features are ok, as long as they're relatively bug-free.
:
: For core developers, please let me know if this is a good time to do
: it, when it might be possible (e.g. after this nasty core dump gets
: fixed today :), or any other issues that might be related to sending
: this code out into the world...
:
: Jeff

--

Iddo Friedberg                                  | Tel: +972-2-6757374
Dept. of Molecular Genetics and Biotechnology   | Fax: +972-2-6757308
The Hebrew University - Hadassah Medical School | email: idoerg@cc.huji.ac.il
POB 12272, Jerusalem 91120                      |
Israel                                          |
http://bioinfo.md.huji.ac.il/marg/people-home/iddo/


From adalke at mindspring.com  Wed Dec 12 05:21:59 2001
From: adalke at mindspring.com (Andrew Dalke)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] Martel changes
Message-ID: <021e01c182f6$d5cafbe0$0301a8c0@josiah.dalkescientific.com>

I'm doing some work with Martel again.  Cayte asked
earlier about a way to simplify working with Martel
callbacks.  I outlined a possible simplification.

I've implemented it.  It's called 'SimpleFields'.
You can take a look at it at
  http://www.biopython.org/~dalke/SimpleFields.py

It supports both callback and iterative styles.
(See the module docstring for examples of each.)

I'm going to add it to CVS as soon as I think up
better names than 'SimpleFields' and 'groups'.
(Although I do like 'LAX' :)

Is anyone using the iterator facility in Martel?
I would like to change the API.  Currently you pass
it the factory function which produces SAX handlers.
I would rather just pass it a SAX handler, and
trust the handler to reset itself properly with the
startDocument/endDocument methods.  (Those which
don't can easily be wrapped.)

The problem with the current API is that when the handler
needs parameters, you need to create something
which passes those parameters to each instance.  It's
ugly, and it's common... I think.  I also don't like
that the object is created for every record instead
of reusing the existing one.

I don't think anyone uses this feature, so I'll go
ahead and change it unless someone gives me a good
reason otherwise.
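
To illustrate the handler-reuse idea, here is a small sketch using the
standard library's xml.sax rather than Martel (whose iterator API is not
shown in this message). The FieldCounter class and the sample records are
made up for the example:

```python
import xml.sax
from xml.sax.handler import ContentHandler


class FieldCounter(ContentHandler):
    """Counts elements with a given tag; resets itself for each document."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag    # parameter supplied once, at construction
        self.count = 0

    def startDocument(self):
        self.count = 0    # per-record state is reset here

    def startElement(self, name, attrs):
        if name == self.tag:
            self.count += 1


records = [b"<record><seq/><seq/></record>", b"<record><seq/></record>"]

handler = FieldCounter("seq")   # one instance, reused for every record
counts = []
for record in records:
    xml.sax.parseString(record, handler)
    counts.append(handler.count)
```

The parameter ("seq") is passed once, and no factory is needed to carry it
to a fresh instance per record; the cost is that every handler must reset
its own state in startDocument.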

Finally, I'm adding some common patterns to the top-level
Martel/__init__.py.  These are for things like 'Word'
which is

def Word(name = None, attrs = None):
  exp = Re(r"\w+")
  if name is None:
    if attrs is not None:
        raise TypeError("....")
    return exp
  return Group(name, exp, attrs)

The idea is to make it easier to specify, say, a list
of words on a line

format = Word("species") + Whitespace() + \
         Word("count") + Whitespace() + \
         ToEol("sequence")

Has anyone started building up a collection of those
common patterns?  I've got Integer, SignedInteger, Float,
Word, and Whitespace.  I'll probably add Spaces (for
only " "), NonSpaces (up to a " ").
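
For readers without Martel at hand, the same composition idea can be
sketched with plain regular-expression strings. The helper names below are
made up for this example and are not Martel's API; named groups play the
role of the tag names:

```python
import re


def _named(name, pattern):
    """Wrap a pattern in a named group, or return it bare if name is None."""
    if name is None:
        return pattern
    return "(?P<%s>%s)" % (name, pattern)


def word(name=None):
    return _named(name, r"\w+")


def integer(name=None):
    return _named(name, r"\d+")


def whitespace(name=None):
    return _named(name, r"\s+")


def to_eol(name=None):
    # Everything up to the newline is captured; the newline itself is not.
    return _named(name, r"[^\n]*") + r"\n"


line_format = (word("species") + whitespace() +
               integer("count") + whitespace() +
               to_eol("sequence"))

match = re.match(line_format, "human 12 ACGTACGT\n")
fields = match.groupdict()
```

Here fields maps "species", "count" and "sequence" to the three columns of
the sample line, all as strings.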


Comments on any of these?

                    Andrew
                    dalke@dalkescientific.com


From adalke at mindspring.com  Wed Dec 12 05:23:44 2001
From: adalke at mindspring.com (Andrew Dalke)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] ready for release?
Message-ID: <021f01c182f7$14855560$0301a8c0@josiah.dalkescientific.com>

Jeff:
>We've got a lot of new stuff, so I think it's time to roll a new
>release.  This will still be an alpha release, which means that new
>features are ok, as long as they're relatively bug-free.
>
>For core developers, please let me know if this is a good time to do
>it, when it might be possible (e.g. after this nasty core dump gets
>fixed today :), or any other issues that might be related to sending
>this code out into the world...

Can you hold up until Friday?  I want to get these last bits
of Martel changes written, tested, and into CVS.  Then I can
make Johann happy by having a new Martel release.

                    Andrew


From jchang at smi.stanford.edu  Wed Dec 12 12:52:12 2001
From: jchang at smi.stanford.edu (Jeffrey Chang)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] ready for release?
In-Reply-To: <021f01c182f7$14855560$0301a8c0@josiah.dalkescientific.com>
References: <021f01c182f7$14855560$0301a8c0@josiah.dalkescientific.com>
Message-ID: <20011212095212.B304@krusty.stanford.edu>

On Wed, Dec 12, 2001 at 03:23:44AM -0700, Andrew Dalke wrote:
> Can you hold up until Friday?  I want to get these last bits
> of Martel changes written, tested, and into CVS.  Then I can
> make Johann happy by having a new Martel release.

Yeah, no problem.  Please have a contingency plan so that you can back
things out if the changes are taking longer than planned.  I'm going
on vacation at the end of next week and would like to roll the release
before then!  :)

Thanks,
Jeff

From jchang at smi.stanford.edu  Wed Dec 12 13:07:27 2001
From: jchang at smi.stanford.edu (Jeffrey Chang)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] Martel changes
In-Reply-To: <021e01c182f6$d5cafbe0$0301a8c0@josiah.dalkescientific.com>
References: <021e01c182f6$d5cafbe0$0301a8c0@josiah.dalkescientific.com>
Message-ID: <20011212100727.C304@krusty.stanford.edu>

On Wed, Dec 12, 2001 at 03:21:59AM -0700, Andrew Dalke wrote:
> Is anyone using the iterator facility in Martel?

Yes.  I'm using it in Bio/Medline/NLMMedlineXML to parse the
XML-formatted PubMed records.  Each XML file contains about 30000
records and is too big to keep in memory at once.

> I would like to change the API.  Currently you pass
> it the factory function which produces SAX handlers.
> I would rather just pass it a SAX handler, and
> trust the handler to reset itself properly with the
> startDocument/endDocument methods.  (Those which
> don't can easily be wrapped.)
> 
> The problem with the current API is when the handler
> needs parameters then you need to create something
> which passes those parameters to each instance.  It's
> ugly, and it's common... I think.  I also don't like
> that the object is created for every record instead
> of reusing the existing one.

Sure.  Let me know if you do it, so that I can update my files
accordingly.  I don't think it'll be hard to handle what you describe.

> Has anyone started building up a collection of those
> common patterns?  I've got Integer, SignedInteger, Float,
> Word, and Whitespace.  I'll probably add Spaces (for
> only " "), NonSpaces (up to a " ").

Sounds good.  Looking through my code, other ones I use are Digits
(more general name for Integer), Punctuation,
and Unprintable(AnyBut(string.printable)).

Actually, could you make more general equivalents of some of the
names?  For example, presumably Digits and Integer would match the
same things, but a lot of times you want to match some numerical
characters and calling it an integer might be a tad confusing...

Jeff

From adalke at mindspring.com  Wed Dec 12 15:05:55 2001
From: adalke at mindspring.com (Andrew Dalke)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] Martel changes
Message-ID: <02fd01c18348$6918c700$0301a8c0@josiah.dalkescientific.com>

Me:
>> Is anyone using the iterator facility in Martel?

Jeff:
>Yes.  I'm using it in Bio/Medline/NLMMedlineXML to parse the
>XML-formatted PubMed records.  Each XML file contains about ~30000
>records and is too big to keep in memory at once.

Do you pass in just the constructor (no args) or do you need
to create a factory function instance which knows how to
pass in the args?

Can the handler object you use be reinitialized via calling
'startDocument'?
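
To make the question concrete, here is a rough sketch of a handler that
can be reused across records because it resets itself in startDocument.
It uses only the standard xml.sax module, not Martel's iterator; the
TitleCollector class, its prefix argument and the sample XML are made up
for illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TitleCollector(ContentHandler):
    """Hypothetical handler: collects the text of every <title> element."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix      # constructor argument, set only once
        self.startDocument()      # begin from a clean state

    def startDocument(self):
        # The parser calls this at the start of every document, so all
        # per-document state is reset here rather than in __init__.
        self.titles = []
        self._in_title = False
        self._chunks = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._chunks = []

    def characters(self, content):
        if self._in_title:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append(self.prefix + "".join(self._chunks))
            self._in_title = False


# One handler instance, two documents: no factory function is needed.
handler = TitleCollector("PMID-")
xml.sax.parseString(b"<rec><title>one</title></rec>", handler)
first = list(handler.titles)
xml.sax.parseString(b"<rec><title>two</title></rec>", handler)
second = list(handler.titles)
```

With this shape the constructor arguments are passed once, and nothing
from the first record leaks into the second.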

>Sure.  Let me know if you do it, so that I can update my files
>accordingly.  I don't think it'll be hard to handle what you describe.

It shouldn't be.  I remember the reasons I didn't do it that
way the first time, and I want to see if my concerns (mentioned
above) are true or not.

>Looking through my code, other ones I use are Digits
>(more general name for Integer), Punctuation,
>and Unprintable(AnyBut(string.printable)).
>
>Actually, could you make more general equivalents of some of the
>names?  For example, presumably Digits and Integer would match the
>same things, but a lot of times you want to match some numerical
>characters and calling it an integer might be a tad confusing...

Ah! Yes, 'Digits' is better than 'Integer'.  It also lets
me replace 'SignedInteger' with 'Integer'.

When do you use Unprintable?  When do you use Punctuation?

My 'Float' isn't very powerful, as it only understands
numbers of the form (with optional +/-)
  1
  1.
  1.2
  .2

It doesn't handle things like 1E-3, or IEEE values
like NaN or +Inf.  I could (and probably should) support
the first of these.  I'm not sure if I should support the second.
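
As a rough sketch of what supporting the first case might look like,
here is a plain regular expression (standard re module, not Martel's
actual Float) that covers the four forms above plus an integer exponent.
The names FLOAT and is_float are made up for illustration, and IEEE
values such as NaN are deliberately left out.

```python
import re

# Optional sign, then either digits with an optional fraction ("1", "1.",
# "1.2") or a bare fraction (".2"), then an optional integer exponent.
FLOAT = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?")


def is_float(text):
    """Return True when the whole string is a number of the forms above."""
    return FLOAT.fullmatch(text) is not None
```

Matching with fullmatch matters here: a plain match would accept the
leading "1" of a string like "1E" and silently ignore the rest.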

                    Andrew
                    dalke@dalkescientific.com


From katel at worldpath.net  Thu Dec 13 20:26:46 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] ready for release?
References: <021f01c182f7$14855560$0301a8c0@josiah.dalkescientific.com>
Message-ID: <003301c1843e$68ca1f00$010a0a0a@cadence.com>

   I just updated the MetaTool stuff to handle empty matrices with the
latest rev of NumPy.

                                                                    Cayte


From gec at compbio.berkeley.edu  Thu Dec 13 18:41:15 2001
From: gec at compbio.berkeley.edu (Gavin E. Crooks)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] ready for release?
In-Reply-To: <003301c1843e$68ca1f00$010a0a0a@cadence.com>
References: <021f01c182f7$14855560$0301a8c0@josiah.dalkescientific.com> <003301c1843e$68ca1f00$010a0a0a@cadence.com>
Message-ID: <0112131545390R.13517@sienna.berkeley.edu>


All regression tests pass! Well, on my machine at any rate.

Hopefully nothing will break before Jeff gets the release out!


Gavin


From jchang at smi.stanford.edu  Fri Dec 14 01:47:45 2001
From: jchang at smi.stanford.edu (Jeffrey Chang)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] ready for release?
In-Reply-To: <20011211135905.A6612@ci350185-a.athen1.ga.home.com>
References: <20011211102650.B412@krusty.stanford.edu> <20011211135905.A6612@ci350185-a.athen1.ga.home.com>
Message-ID: <20011213224745.B627@krusty.stanford.edu>

On Tue, Dec 11, 2001 at 01:59:05PM -0500, Brad Chapman wrote:
> Okay, I've got a bunch of code that could be checked in. Here's the
> list:

[cut impressive list of new functionality]

> What do people think? Do you want any of this? Which modules? Do you
> want me to make a tarball of the code so you can look at it? If you just
> want to glance, this is in CVS at:

It all looks like useful functionality.  Please check it in, provided
it's working!  :)

Jeff

From jchang at smi.stanford.edu  Fri Dec 14 02:01:59 2001
From: jchang at smi.stanford.edu (Jeffrey Chang)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] Martel changes
In-Reply-To: <02fd01c18348$6918c700$0301a8c0@josiah.dalkescientific.com>
References: <02fd01c18348$6918c700$0301a8c0@josiah.dalkescientific.com>
Message-ID: <20011213230159.C627@krusty.stanford.edu>

On Wed, Dec 12, 2001 at 01:05:55PM -0700, Andrew Dalke wrote:
> Me:
> >> Is anyone using the iterator facility in Martel?
> 
> Jeff:
> >Yes.  I'm using it in Bio/Medline/NLMMedlineXML to parse the
> >XML-formatted PubMed records.  Each XML file contains about ~30000
> >records and is too big to keep in memory at once.

Oops, I just looked over the code.  I'm in fact not using the
iterator, but the RecordReader.  Sorry about the confusion!


[adding Word, Integer, ... as built-in expressions]

> When do you use Unprintable?  When do you use Punctuation?

I use them both for matching things in english text.  Sometimes the
text contains unprintable characters from foreign character sets.

> My 'Float' isn't very powerful, as it only understands
> numbers of the form (with optional +/-)
>   1
>   1.
>   1.2
>   .2
> 
> It doesn't handle things like 1E-3, or IEEE values
> like NaN or +Inf.  I could (and probably should) support
> the first of these.  I'm not sure if I should support the second.

It gets pretty complicated, e.g.
1.315E2.24

Jeff

From adalke at mindspring.com  Fri Dec 14 07:22:18 2001
From: adalke at mindspring.com (Andrew Dalke)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] Martel changes
Message-ID: <002b01c18499$f94b2a00$0301a8c0@josiah.dalkescientific.com>

Jeff:
>Oops, I just looked over the code.  I'm in fact not using the
>iterator, but the RecordReader.  Sorry about the confusion!

No problem, and fewer changes for you!

Me:
>> When do you use Unprintable?  When do you use Punctuation?

>I use them both for matching things in english text.  Sometimes the
>text contains unprintable characters from foreign character sets.

Okay, if you say it's useful, I'll add it.  What do you
define as punctuation?

>> My 'Float' isn't very powerful, as it only understands
>> numbers of the form (with optional +/-)

>It gets pretty complicated, e.g.
>1.315E2.24

That's not a valid floating point number -- the exponent must
be an integer.

BTW, I'm working on a 'Time' submodule, which should make it
easier to parse time and date data structures.  The language
I used is based on strptime, plus some experimental extensions
to make it easier for me to use.

The idea is to make it easier to parse something like
  1970-08-22
using a pattern like
  %(4-year)-%m-%d
than having to write
  (?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})
all the time.

(Plus, the patterns I use are stricter, in that you can't
use a day like "43".)


For example, (with judicious newlines for clarity)

  >>> from Martel import Time
  >>> print Time.make_pattern("%m/%d/%Y")
  (?P<month?type=numeric>(0[0-9]|1[012]))/
  (?P<day?type=numeric>(0[1-9]|[12][0-9]|3[01]))/
  (?P<year?type=long>\d{4})
  >>>

  >>> parser = Time.make_expression("%(Jan) %(year)\n").make_parser()
  >>> from xml.sax import saxutils
  >>> parser.setContentHandler(saxutils.XMLGenerator())
  >>> parser.parseString("Dec 2001\n")
  <?xml version="1.0" encoding="iso-8859-1"?>
  <month type="short">Dec</month> <year type="any">2001</year>
  >>>
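
For readers without Time.py at hand, the general idea of turning a
strptime-like format into a stricter regular expression can be sketched
with the standard re module alone.  This is not the Time.py
implementation: the FIELDS table and format_to_regex are hypothetical
names, only three codes are handled, and the month pattern here also
rejects "00", which differs from the output shown above.

```python
import re

# Hypothetical mapping from a few strptime-style codes to named groups.
# The month and day patterns reject out-of-range values such as "43".
FIELDS = {
    "%Y": r"(?P<year>\d{4})",
    "%m": r"(?P<month>0[1-9]|1[012])",
    "%d": r"(?P<day>0[1-9]|[12][0-9]|3[01])",
}


def format_to_regex(fmt):
    """Translate a format such as "%Y-%m-%d" into a regular expression."""
    parts = []
    pos = 0
    while pos < len(fmt):
        code = fmt[pos:pos + 2]
        if code in FIELDS:
            parts.append(FIELDS[code])
            pos += 2
        else:
            # Anything else is literal text and must match exactly.
            parts.append(re.escape(fmt[pos]))
            pos += 1
    return "".join(parts)


date_re = re.compile(format_to_regex("%Y-%m-%d"))
match = date_re.fullmatch("1970-08-22")
```

The named groups give the same field names a SAX handler would see as
element names, and a date such as "1970-08-43" simply fails to match.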

It's nearly done - only about an hour of work left.  Then
to add the useful patterns, and the SimpleFields (or whatever
I decide to call it).  I should be able to finish it by
Friday .. today.

The code is temporarily at
  http://www.biopython.org/~dalke/Time.py

but it uses a new 'NullOp' Expression not yet in CVS for
doing the 'make_expression' function.

                    Andrew
                    dalke@dalkescientific.com


From chapmanb at arches.uga.edu  Fri Dec 14 11:36:50 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] ready for release?
In-Reply-To: <20011213224745.B627@krusty.stanford.edu>
References: <20011211102650.B412@krusty.stanford.edu> <20011211135905.A6612@ci350185-a.athen1.ga.home.com> <20011213224745.B627@krusty.stanford.edu>
Message-ID: <20011214113650.A16036@ci350185-a.athen1.ga.home.com>

[I blather on about the modules I wrote that could be checked in]

> It all looks like useful functionality.  Please check it in, provided
> it's working!  :)

Okee dokee. Checked in. Whew. And I think it all works :-). I've checked
this on a couple of computers and on Windows, so I think all the tests
are cross-platform-good and all files are checked in. If other people 
could get the CVS and make sure all the tests pass on their computer, 
I would be very appreciative! You do need reportlab installed for 
the graphics tests to pass. (http://www.reportlab.com/download.html).

By the way, I've just checked the current CVS on Windows and all tests
pass. Yay! Thanks to everyone who worked on the cross-platform tests.

So I think we're good to go from my side, as long as I didn't
muck anything up with my checkins.

Brad

From adalke at mindspring.com  Sat Dec 15 04:42:36 2001
From: adalke at mindspring.com (Andrew Dalke)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] Martel changes
Message-ID: <004d01c1854c$d480bce0$0301a8c0@josiah.dalkescientific.com>

Me to Jeff:
>What do you define as punctuation?

Duh!  I see there's a "string.punctuation".
"Punctuation" added to CVS.

Also added:
  Digits == \d+
  Word == \w+
  Spaces == same as \s+ except not including newline
  Unprintable == AnyBut(string.printable)

These all take an optional name and attributes for a Group.

Changed "Integer" to "[+-]?\d+" (It had been the same
as what Digits is now.)

Removed SignedInteger.
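
For anyone updating code, the renamed patterns can be approximated with
plain regular expressions.  The sketch below uses the standard re module
rather than Martel; the PATTERNS dictionary and the matches helper are
made up for illustration, and treating Spaces as "any whitespace except
newline" is an assumption based on the description above.

```python
import re
import string

# Plain regular-expression stand-ins; the dictionary itself is hypothetical.
PATTERNS = {
    "Digits": r"\d+",
    "Integer": r"[+-]?\d+",
    "Word": r"\w+",
    # Any whitespace except "\n" (an assumption about what Spaces allows).
    "Spaces": r"[^\S\n]+",
    # One character from, or outside of, the corresponding string constant.
    "Punctuation": "[" + re.escape(string.punctuation) + "]",
    "Unprintable": "[^" + re.escape(string.printable) + "]",
}


def matches(name, text):
    """Return True when the whole of `text` matches the named pattern."""
    return re.fullmatch(PATTERNS[name], text) is not None
```

The practical difference is that "-42" matches Integer but not Digits,
so code that relied on the old unsigned Integer should switch to Digits.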

Added a new type of Expression -- NullOp.  This simplified
the implementation of Time.py

New submodule "Time.py" for building patterns and/or expressions
for parsing strings.  Has a full regression test and docstring.

Added "LAX" as a new way to handle "simple" XML records.
Docstring may need some updating.  (It's too late for me to
think clearly enough to tell if the documentation is reasonable.)
Also, additional documentation on the topic, which I sent earlier
today to c.l.py, is attached to this email.

Bug fixed! - someone in personal email pointed out the named
group backreferences ("(?P=name)" construct) weren't working.
Turned out I didn't even have a regression test for that
case.  Both problems now fixed.
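
As a reminder of what the construct does, here is a small check written
against the standard re module, not Martel's regression suite.  The
QUOTED pattern and quoted_body function are hypothetical examples.

```python
import re

# (?P<quote>...) names the group; (?P=quote) must then match the very same
# text again, so the closing quote has to equal the opening one.
QUOTED = re.compile(r"(?P<quote>['\"])(?P<body>.*?)(?P=quote)")


def quoted_body(text):
    """Return the text between matching quotes, or None if they differ."""
    match = QUOTED.fullmatch(text)
    return match.group("body") if match else None
```

A regression test for this case needs at least one input where the two
quotes differ, since that is the input a broken backreference accepts.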

Regression tests added for all the new code.  All tests pass.

Some cleanup here and there.

Excepting that it would be nice if others could check that
my new code (and changes) really does work, I'm ready for
a new release.  Even ready for a new Martel release.

                    Andrew
                    dalke@dalkescientific.com

-------------- next part --------------
An embedded message was scrubbed...
From: "Andrew Dalke" <dalke@dalkescientific.com>
Subject: Re: XML parsing besides SAX and DOM
Date: Fri, 14 Dec 2001 14:13:06 -0700
Size: 4013
Url: http://portal.open-bio.org/pipermail/biopython-dev/attachments/20011215/6bfd2f7a/ReXMLparsingbesidesSAXandDOM.nws
From chapmanb at arches.uga.edu  Sat Dec 15 10:08:02 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] Martel changes
In-Reply-To: <004d01c1854c$d480bce0$0301a8c0@josiah.dalkescientific.com>
References: <004d01c1854c$d480bce0$0301a8c0@josiah.dalkescientific.com>
Message-ID: <20011215100802.A334@ci350185-a.athen1.ga.home.com>

Andrew; 
Thanks for all the Martel changes -- the new additions look great.

> Changed "Integer" to "[+-]?\d+" (It had been the same
> as what Digits is now.)
> 
> Removed SignedInteger.

The only problem I noticed was that this broke Cayte's metatool
parser since it used SignedInteger. I updated metatool to use Digits
in place of Integer and Integer in place of SignedInteger. I think
this is right based on reading your e-mails, and all tests pass now.

Cayte, can you doublecheck and make sure I did the right thing? I
don't want to break any of your code. I attached a diff with the
changes I made.

Brad
-- 
PGP public key available from http://pgp.mit.edu/
-------------- next part --------------
Index: metatool_format.py
===================================================================
RCS file: /home/repository/biopython/biopython/Bio/MetaTool/metatool_format.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -c -r1.2 -r1.3
*** metatool_format.py	2001/09/13 00:28:47	1.2
--- metatool_format.py	2001/12/15 14:56:05	1.3
***************
*** 18,24 ****
  import string
  
  # Martel
! from Martel import Opt, Alt, Integer, SignedInteger, Group, Str, MaxRepeat
  from Martel import Any, AnyBut, RepN, Rep, Rep1, ToEol, AnyEol
  from Martel import Expression
  from Martel import RecordReader
--- 18,24 ----
  import string
  
  # Martel
! from Martel import Opt, Alt, Digits, Integer, Group, Str, MaxRepeat
  from Martel import Any, AnyBut, RepN, Rep, Rep1, ToEol, AnyEol
  from Martel import Expression
  from Martel import RecordReader
***************
*** 32,40 ****
  lower_case_letter = Group( "lower_case_letter", Any( "abcdefghijklmnopqrstuvwxyz" ) )
  digits = "0123456789"
  
! enzyme = Group( "enzyme", optional_blank_space + Integer() +
      optional_blank_space + Str( ':' ) + ToEol() )
! reaction = Group( "reaction", optional_blank_space + Integer() +
      optional_blank_space + Str( ":" ) + ToEol() )
  not_found_line = Group( "not_found_line", optional_blank_space + Str( "- not found -" ) +
      ToEol() )
--- 32,40 ----
  lower_case_letter = Group( "lower_case_letter", Any( "abcdefghijklmnopqrstuvwxyz" ) )
  digits = "0123456789"
  
! enzyme = Group( "enzyme", optional_blank_space + Digits() +
      optional_blank_space + Str( ':' ) + ToEol() )
! reaction = Group( "reaction", optional_blank_space + Digits() +
      optional_blank_space + Str( ":" ) + ToEol() )
  not_found_line = Group( "not_found_line", optional_blank_space + Str( "- not found -" ) +
      ToEol() )
***************
*** 54,61 ****
      reactions_list )
  
  rev = Group( "rev", Opt( lower_case_letter ) )
! version = Group( "version", Integer( "version_major") + Any( "." ) +
!     Integer( "version_minor") + rev )
  metatool_tag = Str( "METATOOL OUTPUT" )
  metatool_line = Group( "metatool_line", metatool_tag + blank_space +
      Str( "Version" ) + blank_space + version + ToEol() )
--- 54,61 ----
      reactions_list )
  
  rev = Group( "rev", Opt( lower_case_letter ) )
! version = Group( "version", Digits( "version_major") + Any( "." ) +
!     Digits( "version_minor") + rev )
  metatool_tag = Str( "METATOOL OUTPUT" )
  metatool_line = Group( "metatool_line", metatool_tag + blank_space +
      Str( "Version" ) + blank_space + version + ToEol() )
***************
*** 66,85 ****
  
  metabolite_count_tag = Str( "INTERNAL METABOLITES:" )
  metabolite_count_line = Group( "metabolite_count_line",  metabolite_count_tag +
!     blank_space + Integer( "num_int_metabolites" ) + ToEol() )
  
  reaction_count_tag = Str( "REACTIONS:" )
  reaction_count_line = Group( "reaction_count_line", reaction_count_tag + blank_space +
!     Integer( "num_reactions" ) + ToEol() )
  
  type_metabolite = Group( "type_metabolite", Alt( Str( "int" ), \
      Str( "external" ) ) )
  metabolite_info = Group( "metabolite_info", optional_blank_space +
!     Integer() + blank_space + type_metabolite + blank_space +
  #    Integer() + blank_space + Rep1( lower_case_letter ) +
      Rep1( AnyBut( white_space ) ) )
  metabolite_line = Group( "metabolite_line", metabolite_info + ToEol() )
! metabolites_summary = Group( "metabolites_summary", optional_blank_space + Integer() +
      blank_space + Str( "metabolites" ) + ToEol() )
  metabolites_block = Group( "metabolites_block", Rep1( metabolite_line ) +
      metabolites_summary + Rep( blank_line ) )
--- 66,85 ----
  
  metabolite_count_tag = Str( "INTERNAL METABOLITES:" )
  metabolite_count_line = Group( "metabolite_count_line",  metabolite_count_tag +
!     blank_space + Digits( "num_int_metabolites" ) + ToEol() )
  
  reaction_count_tag = Str( "REACTIONS:" )
  reaction_count_line = Group( "reaction_count_line", reaction_count_tag + blank_space +
!     Digits( "num_reactions" ) + ToEol() )
  
  type_metabolite = Group( "type_metabolite", Alt( Str( "int" ), \
      Str( "external" ) ) )
  metabolite_info = Group( "metabolite_info", optional_blank_space +
!     Digits() + blank_space + type_metabolite + blank_space +
  #    Integer() + blank_space + Rep1( lower_case_letter ) +
      Rep1( AnyBut( white_space ) ) )
  metabolite_line = Group( "metabolite_line", metabolite_info + ToEol() )
! metabolites_summary = Group( "metabolites_summary", optional_blank_space + Digits() +
      blank_space + Str( "metabolites" ) + ToEol() )
  metabolites_block = Group( "metabolites_block", Rep1( metabolite_line ) +
      metabolites_summary + Rep( blank_line ) )
***************
*** 87,99 ****
  graph_structure_heading = Group( "graph_structure_heading", optional_blank_space +
      Str( "edges" ) + blank_space + Str( "frequency of nodes" ) + ToEol() )
  graph_structure_line = Group( "graph_structure_line", optional_blank_space +
!     Integer( "edge_count" ) + blank_space + Integer( "num_nodes" ) + ToEol() )
  graph_structure_block =  Group( "graph_structure_block", \
      graph_structure_heading + Rep( blank_line ) +
      Rep1( graph_structure_line ) + Rep( blank_line ) )
  
  sum_is_constant_line = Group( "sum_is_constant_line", optional_blank_space +
!     Integer() + optional_blank_space + Any( ":" ) + optional_blank_space +
      Rep1( AnyBut( white_space ) ) +
      Rep( blank_space + Any( "+" ) + blank_space + Rep1( AnyBut( white_space ) ) ) +
      optional_blank_space + Str( "=" ) + ToEol() )
--- 87,99 ----
  graph_structure_heading = Group( "graph_structure_heading", optional_blank_space +
      Str( "edges" ) + blank_space + Str( "frequency of nodes" ) + ToEol() )
  graph_structure_line = Group( "graph_structure_line", optional_blank_space +
!     Digits( "edge_count" ) + blank_space + Digits( "num_nodes" ) + ToEol() )
  graph_structure_block =  Group( "graph_structure_block", \
      graph_structure_heading + Rep( blank_line ) +
      Rep1( graph_structure_line ) + Rep( blank_line ) )
  
  sum_is_constant_line = Group( "sum_is_constant_line", optional_blank_space +
!     Digits() + optional_blank_space + Any( ":" ) + optional_blank_space +
      Rep1( AnyBut( white_space ) ) +
      Rep( blank_space + Any( "+" ) + blank_space + Rep1( AnyBut( white_space ) ) ) +
      optional_blank_space + Str( "=" ) + ToEol() )
***************
*** 114,121 ****
  
  reduced_system_tag = Group( "reduced_system_tag", Str( "REDUCED SYSTEM" ) )
  reduced_system_line = Group( "reduced_system_line", reduced_system_tag +
!     Rep1(  AnyBut( digits ) ) + Integer( "branch_points" ) +
!     Rep1( AnyBut( digits ) ) + Integer() + ToEol() )
  
  kernel_tag = Group( "kernel_tag", Str( "KERNEL" ) )
  kernel_line = Group( "kernel_line", kernel_tag + ToEol() )
--- 114,121 ----
  
  reduced_system_tag = Group( "reduced_system_tag", Str( "REDUCED SYSTEM" ) )
  reduced_system_line = Group( "reduced_system_line", reduced_system_tag +
!     Rep1(  AnyBut( digits ) ) + Digits( "branch_points" ) +
!     Rep1( AnyBut( digits ) ) + Digits() + ToEol() )
  
  kernel_tag = Group( "kernel_tag", Str( "KERNEL" ) )
  kernel_line = Group( "kernel_line", kernel_tag + ToEol() )
***************
*** 134,146 ****
  elementary_modes_line = Group( "elementary_modes_line", \
      elementary_modes_tag + ToEol() )
  
! num_rows = Group( "num_rows", Integer() )
! num_cols = Group( "num_cols", Integer() )
  matrix_header = Group( "matrix_header", optional_blank_space +
      Str( "matrix dimension" ) + blank_space  + Any( "r" ) +
      num_rows + blank_space +  Any( "x" ) + blank_space +
      Any( "c" ) + num_cols + optional_blank_space + AnyEol() )
! matrix_element = Group( "matrix_element", SignedInteger() )
  matrix_row = Group( "matrix_row", MaxRepeat( optional_blank_space + matrix_element, \
      "num_cols", "num_cols" ) + ToEol() )
  matrix = Group( "matrix", MaxRepeat( matrix_row, "num_rows", "num_rows" ) )
--- 134,146 ----
  elementary_modes_line = Group( "elementary_modes_line", \
      elementary_modes_tag + ToEol() )
  
! num_rows = Group( "num_rows", Digits() )
! num_cols = Group( "num_cols", Digits() )
  matrix_header = Group( "matrix_header", optional_blank_space +
      Str( "matrix dimension" ) + blank_space  + Any( "r" ) +
      num_rows + blank_space +  Any( "x" ) + blank_space +
      Any( "c" ) + num_cols + optional_blank_space + AnyEol() )
! matrix_element = Group( "matrix_element", Integer() )
  matrix_row = Group( "matrix_row", MaxRepeat( optional_blank_space + matrix_element, \
      "num_cols", "num_cols" ) + ToEol() )
  matrix = Group( "matrix", MaxRepeat( matrix_row, "num_rows", "num_rows" ) )
***************
*** 166,175 ****
      blank_space + Str( "reactions" ) + ToEol() )
  branch_metabolite = Group( "branch_metabolite", optional_blank_space +
      Rep1( AnyBut( white_space ) ) + blank_space +
!     RepN( Integer() + blank_space, 3 ) + Rep1( Any( "ir" ) ) + ToEol() )
  non_branch_metabolite = Group( "non_branch_metabolite", optional_blank_space +
      Rep1( AnyBut( white_space ) ) + blank_space +
!     RepN( Integer() + blank_space, 3 ) + Rep1( Any( "ir" ) ) + ToEol() )
  branch_metabolite_block = Group( "branch_metabolite_block", \
      metabolite_roles_heading +
      metabolite_role_cols + Rep( branch_metabolite ) )
--- 166,175 ----
      blank_space + Str( "reactions" ) + ToEol() )
  branch_metabolite = Group( "branch_metabolite", optional_blank_space +
      Rep1( AnyBut( white_space ) ) + blank_space +
!     RepN( Digits() + blank_space, 3 ) + Rep1( Any( "ir" ) ) + ToEol() )
  non_branch_metabolite = Group( "non_branch_metabolite", optional_blank_space +
      Rep1( AnyBut( white_space ) ) + blank_space +
!     RepN( Digits() + blank_space, 3 ) + Rep1( Any( "ir" ) ) + ToEol() )
  branch_metabolite_block = Group( "branch_metabolite_block", \
      metabolite_roles_heading +
      metabolite_role_cols + Rep( branch_metabolite ) )
***************
*** 235,238 ****
     metabolite_count_block + reaction_count_block + stoichiometric_block +
      Opt( not_balanced_block ) + kernel_block + subsets_block +
      reduced_system_block + convex_basis_block + conservation_relations_block +
!     elementary_modes_block )
\ No newline at end of file
--- 235,238 ----
     metabolite_count_block + reaction_count_block + stoichiometric_block +
      Opt( not_balanced_block ) + kernel_block + subsets_block +
      reduced_system_block + convex_basis_block + conservation_relations_block +
!     elementary_modes_block )
From katel at worldpath.net  Mon Dec 17 03:48:47 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] Martel changes
References: <004d01c1854c$d480bce0$0301a8c0@josiah.dalkescientific.com> <20011215100802.A334@ci350185-a.athen1.ga.home.com>
Message-ID: <002301c186d7$a5b6dd40$010a0a0a@cadence.com>

> Cayte, can you doublecheck and make sure I did the right thing? I
> don't want to break any of your code. I attached a diff with the
> changes I made.
> 
  Seems OK.

                                                 Cayte


From adalke at mindspring.com  Mon Dec 17 01:10:23 2001
From: adalke at mindspring.com (Andrew Dalke)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] Martel changes
Message-ID: <008a01c186c1$841281a0$0301a8c0@josiah.dalkescientific.com>

Brad:
>The only problem I noticed was that this broke Cayte's metatool
>parser since it used SignedInteger. I updated metatool to use Digits
>in place of Integer and Integer in place of SignedInteger. I think
>this is right based on reading your e-mails, and all tests pass now.

My apologies.  I was so concerned with getting Martel out that
I forgot to run the regression tests in biopython.

I checked and it seems that's the only use of Integer or
SignedInteger in the biopython release.

>Cayte, can you doublecheck and make sure I did the right thing? I
>don't want to break any of your code. I attached a diff with the
>changes I made.

Cayte answered this.  I concur, although I haven't run the
tests.

                    Andrew


From jchang at smi.stanford.edu  Mon Dec 17 15:46:36 2001
From: jchang at smi.stanford.edu (Jeffrey Chang)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] rolling release tonight!
Message-ID: <20011217124636.B330@krusty.stanford.edu>

Hey Developers,

The regression tests seem to be working, with the exception of the
test_GraphicsXXX ones that are failing on my system because I don't
have reportlab installed.  I think the proper way to fix this is to
skip tests on systems that don't have the required components
installed.  Unless someone wants to implement this today, I'm going to
let this slide for this release, and put a note about it in the README
file.

I'm going to build this release tonight, unless I get a red flag from
someone.  If everything goes right, we'll have a shiny new biopython
in the morning!

Jeff

From chapmanb at arches.uga.edu  Mon Dec 17 16:35:47 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] rolling release tonight!
In-Reply-To: <20011217124636.B330@krusty.stanford.edu>
References: <20011217124636.B330@krusty.stanford.edu>
Message-ID: <20011217163547.A4422@ci350185-a.athen1.ga.home.com>

Hey Jeff;

> The regression tests seem to be working, with the exception of the
> test_GraphicsXXX ones that are failing on my system because I don't
> have reportlab installed.  I think the proper way to fix this is to
> skip tests on systems that don't have the required components
> installed.  Unless someone wants to implement this today, I'm going to
> let this slide for this release, and put a note about it in the README
> file.

Okee dokee, I coded this in. There really isn't any skip support in
pyunit (which appears to be a deliberate design decision), so now
things look like this when there is an import problem:

[...]
test_GenBankFormat ... ok
test_GraphicsChromosome ... Skipping test because of import error:
No module named reportlab.pdfgen
ok
test_GraphicsDistribution ... Skipping test because of import error:
No module named reportlab.pdfgen
ok
test_GraphicsGeneral ... Skipping test because of import error: No
module named reportlab.pdfgen
ok
test_HMMCasino ... ok
[...]

Hopefully this works okay for ya. No major error, but at least some
notice of skipping the test.
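
The skip-on-import-error idea can be sketched roughly as below.  This is
not the code that was checked in: run_or_skip and its arguments are
hypothetical, and unlike the output above it reports "skipped" rather
than "ok" so the two outcomes can be told apart.

```python
import importlib


def run_or_skip(test_name, required_module, run):
    """Run `run()` unless `required_module` cannot be imported.

    Returns "ok" or "skipped"; the names here are hypothetical.
    """
    try:
        importlib.import_module(required_module)
    except ImportError as err:
        # Report the reason but do not count it as a failure.
        print("%s ... Skipping test because of import error: %s"
              % (test_name, err))
        return "skipped"
    run()
    print("%s ... ok" % test_name)
    return "ok"
```

A call such as run_or_skip("test_GraphicsGeneral", "reportlab.pdfgen",
some_test_function) would then run only where reportlab is installed.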

> I'm going to build this release tonight, unless I get a red flag from
> someone.  If everything goes right, we'll have a shiny new biopython
> in the morning!

Sweet!

Giving-biopython-to-all-my-relatives-for-Christmas-ly yr's,
Brad
-- 
PGP public key available from http://pgp.mit.edu/

From adalke at mindspring.com  Mon Dec 17 16:52:43 2001
From: adalke at mindspring.com (Andrew Dalke)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] rolling release tonight!
Message-ID: <00bd01c18745$283a7f20$0301a8c0@josiah.dalkescientific.com>

Jeff:
>I'm going to build this release tonight, unless I get a red flag from
>someone.  If everything goes right, we'll have a shiny new biopython
>in the morning!

There are a couple minor changes to Martel; mostly in the
documentation and the setup.py.  They do not affect the build
but I'll work on finishing them up this afternoon so there
can be an independent Martel release in parallel to the
spiffy new biopython.

There are also a couple of really minor changes to make to
the new code, so they shouldn't affect anyone.

I'll send email when it's ready, and if it's after 6pm my
time I'll be surprised.

                    Andrew


From adalke at mindspring.com  Tue Dec 18 00:20:06 2001
From: adalke at mindspring.com (Andrew Dalke)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] rolling release tonight!
Message-ID: <000c01c18783$a825ce40$0301a8c0@josiah.dalkescientific.com>

Me:
>I'll send email when it's ready, and if it's after 6pm my
>time I'll be surprised.

I'm surprised.

Well, I did spend more time working on the documentation
(in the README).  The biggest problem was that distutils
doesn't let me install some of the test data files where
I thought they should go.  But it's for the best, as now
the regression tests aren't installed at all; they're
only part of the build.

I can also make a 'setup.py sdist' and that works.

Oh, I added two more definitions to Martel
  Martel.ToSep -- parse up to a separator character (or
      one of several separator characters)
  Martel.DelimitedFields -- parse text separated by a
      delimiter character (or characters)

and made the default iterator return LAX records.


For example, the easiest way to parse /etc/passwd, say, to
print out which account uses which shell, is

import Martel
format = Martel.Rep(Martel.Group("record",
                       Martel.DelimitedFields("field", ":")))
infile = open("/etc/passwd")
for record in format.make_iterator("record").iterateFile(infile):
    print record["field"][0], "uses", record["field"][-1]
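
For comparison, the same report can be sketched with nothing but string
splitting.  This is not Martel and produces no SAX events; the
account_shells function is a made-up name, and the SAMPLE text stands in
for /etc/passwd so the sketch runs anywhere.

```python
import io

# Made-up sample data in /etc/passwd layout.
SAMPLE = (
    "root:x:0:0:root:/root:/bin/sh\n"
    "guest:x:1000:1000::/home/guest:/bin/tcsh\n"
)


def account_shells(infile):
    """Yield (account, shell) pairs from colon-delimited lines."""
    for line in infile:
        fields = line.rstrip("\n").split(":")
        # First field is the account name, last field is the shell.
        yield fields[0], fields[-1]


pairs = list(account_shells(io.StringIO(SAMPLE)))
for account, shell in pairs:
    print(account, "uses", shell)
```

The Martel version earns its keep once the same format definition also
has to drive callbacks or be combined with other expressions.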


                    Andrew


From jchang at smi.stanford.edu  Tue Dec 18 03:24:12 2001
From: jchang at smi.stanford.edu (Jeffrey Chang)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] rolling release tonight!
In-Reply-To: <20011217163547.A4422@ci350185-a.athen1.ga.home.com>
References: <20011217124636.B330@krusty.stanford.edu> <20011217163547.A4422@ci350185-a.athen1.ga.home.com>
Message-ID: <20011218002412.C320@krusty.stanford.edu>

On Mon, Dec 17, 2001 at 04:35:47PM -0500, Brad Chapman wrote:
> > The regression tests seem to be working, with the exception of the
> > test_GraphicsXXX ones that are failing on my system because I don't
> > have reportlab installed.  I think the proper way to fix this is to
> > skip tests on systems that don't have the required components
> > installed.  Unless someone wants to implement this today, I'm going to
> > let this slide for this release, and put a note about it in the README
> > file.
> 
> Okee dokee, I coded this in. There really isn't any skip support in
> pyunit (which appears to be a deliberate design decision), so now
> things look like this when there is an import problem:

Great!  Thanks for getting on this so fast!  The release is now out.

Thanks,
Jeff

From adalke at mindspring.com  Fri Dec 21 06:02:17 2001
From: adalke at mindspring.com (Andrew Dalke)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] format autodection
Message-ID: <003a01c18a0e$f4bd33a0$0301a8c0@josiah.dalkescientific.com>

Hey all,

I'm getting back to working on Biopython.  I want to spend some time
on the file parsing code.  (Like, duh! :) The topics I want to work on
next include:

  - automatic file identification
  - iterating through records in a file
  - support for different record types
  - converting/writing records to a given format

I'll send an email for each point, starting now.

I have some ideas on file identification.  In theory, Martel could be
used by just |'ing the terms, except that:
  - some files may be parsable by multiple formats
  - a Martel definition parses the whole file, when file type
     identification need only parse part of the file
  - it's a linear search

What I'm toying around with is something like this:

def _recognizeFile(parser, infile):
    pos = infile.tell()
    err_h = ... something which can distinguish between a bad
        parse, and a successful one where unparsed text remains
        (I'm changing Martel to distinguish the two.)
    parser.setErrorHandler(err_h)
    try:
        try:
            parser.parseFile(infile)
        except Martel.Parser.ParserError:
            pass
    finally:
        infile.seek(pos)
    return err_h.successful_parse

class Format:
  def __init__(self, format_name, expression, recognize_expression = None,
               provider_url = None, documentation_url = None,
               description, short_description, maintainer...):
    if recognize_expression is None:
        recognize_expression = expression
    self.expression = expression
    ...
  def recognizeFile(self, infile):
    if _recognizeFile(self.recognize_expression.make_parser(), infile):
        return self
    return None

class RecognizeFormats:
  def __init__(self, recognize_expression, formats = None):
    ...
  def recognizeFile(self, infile):
    if _recognizeFile(self.recognize_expression.make_parser(), infile):
        for format in self.formats:
            x = format.recognizeFile(infile)
            if x is not None:
                return x
    return None


This makes it possible to say

  from bioformats import swissprot
  swissprot38 = Format("swissprot/version=38",
                       expression = swissprot.swissprot38.format,
                       recognize_expression = swissprot.swissprot38.record)
  swissprot39 = Format("swissprot/version=39",
                       expression = swissprot.swissprot39.format,
                       recognize_expression = swissprot.swissprot39.record)
  swissprot40 = Format("swissprot/version=40",
                       expression = swissprot.swissprot40.format,
                       recognize_expression = swissprot.swissprot40.record)
  swissprot = RecognizeFormats(
                Martel.Str("ID  ") + Martel.ToEol() + \
                Martel.Str("AC  ") + Martel.ToEol(),
                [swissprot40, swissprot39, swissprot38])

  swissprot_like = RecognizeFormats(
                Martel.Re(r"[^ ][^ ]   "),
                [swissprot, ipi, ...])

  # This has GenBank records in a row/ no header
  genbank_records = Format("genbank", ...)
  # This has the header for the Genbank release
  genbank_release = Format("genbank-release", ...)

  genbank = RecognizeFormats(None, [genbank_records, genbank_release])
                             

  # Not saying this is the best prefilter
  pdb = RecognizeFormats(Martel.Re("ATOM  |HETATM|HEADER"),
                         [many variations])

  sequence_format = RecognizeFormats(None,
                       [swissprot_like, genbank, pdb, ...])

  structure_format = RecognizeFormats(None, [pdb, mdl, ...])

  any = RecognizeFormats(None, [sequence, alignment, structure])


The result can be used like this:

  format = sequence_format.recognizeFile(open("unknown.file"))
  print "It's a", format.name

I've tried this out.  It works.  Given a file or string, I can get a
Format definition which (claims to) parse it.
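
The control flow of such a recognition tree can be tried out without
Martel.  The sketch below is an assumption-laden stand-in: plain
predicate functions replace Martel expressions, the prefixes used to
tell formats apart are deliberately crude, and a prefilter of None is
treated as "always try the children", which is one possible reading of
the definitions above.

```python
import io


def _accepts(predicate, infile):
    """Run predicate on infile, then restore the file position."""
    pos = infile.tell()
    try:
        return bool(predicate(infile))
    finally:
        infile.seek(pos)


class Format:
    def __init__(self, name, accepts):
        self.name = name
        self.accepts = accepts

    def recognizeFile(self, infile):
        return self if _accepts(self.accepts, infile) else None


class RecognizeFormats:
    def __init__(self, prefilter, formats):
        self.prefilter = prefilter  # None means "no quick test available"
        self.formats = formats

    def recognizeFile(self, infile):
        if self.prefilter is not None and not _accepts(self.prefilter, infile):
            return None
        # Linear search through the children; the first match wins.
        for fmt in self.formats:
            found = fmt.recognizeFile(infile)
            if found is not None:
                return found
        return None


def starts_with(prefix):
    """Build a toy predicate: does the input start with prefix?"""
    return lambda infile: infile.read(len(prefix)) == prefix


fasta = Format("fasta", starts_with(">"))
swissprot = Format("swissprot", starts_with("ID   "))
genbank = Format("genbank", starts_with("LOCUS"))
sequence_format = RecognizeFormats(None, [swissprot, genbank, fasta])

found = sequence_format.recognizeFile(io.StringIO(">seq1\nACGT\n"))
print("It's a", found.name)
```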


There are several things I haven't figured out:

1) How are the formats named?  I made up "swissprot/version=38".  Is
the version attribute enough?  If there are other attributes, is there
a canonical ordering of attributes?

2) Does the word "recognize" make sense in this context?  I tried
"identifier" but that's also a commonly used noun.  (I chose
"recognize" from a post of Thomas's from the end of summer.)

3) Is information about the intermediate nodes in the tree useful?

4) How are new formats registered?  Manually?  Or is there a way to
autoadd them by dropping files in appropriately designated
directories?

5) The top-level definitions require all the lower-level definitions
to be available.  If there are 50 formats, that might take a while.
There needs to be some way to defer loading modules until the parent
RecognizeFormats class is asked to recognize something.

6) Version detection depends on tell/seek working.  There needs to be
a simple wrapper for inputs (like URLs, and sys.stdin) which don't
support that action.  Jeff added something like this already.

7) What do I do with the format definition once I have it?

8) Does this idea make sense to others?
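
On point 6, one possible shape for such a wrapper is to remember
everything read so far, so that tell/seek work within that remembered
part.  The class names RewindableInput and OneWayStream are
hypothetical, the sketch works on bytes, and it keeps all data it has
read in memory, which may be unacceptable for large inputs.

```python
import io


class RewindableInput:
    """Give a read-only stream tell()/seek() over the data read so far."""

    def __init__(self, stream):
        self._stream = stream
        self._buffer = io.BytesIO()  # everything read from the stream so far

    def read(self, size=-1):
        # Serve as much as possible from the remembered data.
        data = self._buffer.read(size)
        if size < 0 or len(data) < size:
            want = -1 if size < 0 else size - len(data)
            fresh = self._stream.read(want)
            # The buffer position is at its end here, so this appends.
            self._buffer.write(fresh)
            data += fresh
        return data

    def tell(self):
        return self._buffer.tell()

    def seek(self, pos):
        # Only positions previously returned by tell() are meaningful.
        self._buffer.seek(pos)


class OneWayStream:
    """Stand-in for sys.stdin or a URL handle: read() only, no seek()."""

    def __init__(self, data):
        self._src = io.BytesIO(data)

    def read(self, size=-1):
        return self._src.read(size)


wrapped = RewindableInput(OneWayStream(b"ID   TEST\nAC   P12345;\n"))
start = wrapped.tell()
head = wrapped.read(5)     # peek at the start, as a recognizer would
wrapped.seek(start)        # rewind
everything = wrapped.read()
print(head, everything)
```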

                                   Andrew
                                   dalke@dalkescientific.com


From adalke at mindspring.com  Fri Dec 21 06:02:26 2001
From: adalke at mindspring.com (Andrew Dalke)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] record iteration
Message-ID: <003b01c18a0e$fa1463a0$0301a8c0@josiah.dalkescientific.com>

Most data files are of this form:

<dataset>
<header>...</header>  (optional)
<record>...</record>  (one or more)
<footer>...</footer>  (optional)
</dataset>


Nearly everyone only wants to read the records from this file, using a
mechanism like this:

for record in file:
    do_something(record)

and don't care about the header and footer information.  In Martel
this can be done by passing in the tag name of the record boundary to
the make_iterator method.

iterator = format.make_iterator("record")
for record in iterator.parseFile(open(filename), Builder()):
    do_something(record.document)

If we standardize on the tag name of "record" then this will work for
everything.

The existing formats I wrote do not use this standard because they
only allowed a tag name.  They had things like "swissprot38_record".
With the changes I made this summer, Martel grammars can include
attributes for the element, as in:

<record format="swissprot" version="38">
 ...
</record>

So my proposal is to standardize on certain tag names, to be shared
across all of the Biopython/Martel grammars.  These include:
  dataset
  record
  header
  footer

and allow for a standard scaffold for parsing sequence records.

BTW, those standard tag names should also include
  primary_id
  description (free-form text)
  sequence   (single letter codes)
  sequence3  (three letter code)
  xref       (cross reference to another database)
  ... others?

As we rework the format definitions, some of these will become apparent.
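
Since the events follow the XML-like layout shown above, the effect of
a shared "record" tag can be illustrated with the standard library's
XML parser instead of Martel.  This is a sketch only: iter_records and
the sample document are invented for illustration, and real Martel
output is delivered as SAX events rather than as an XML file.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(infile, tag="record"):
    """Yield each <tag> element once its closing tag has been read."""
    for event, elem in ET.iterparse(infile, events=("end",)):
        if elem.tag == tag:
            yield elem
            # The caller is done with this record; release its children.
            elem.clear()


sample = b"""<dataset>
<header>release notes</header>
<record format="swissprot" version="38"><primary_id>P12345</primary_id></record>
<record format="swissprot" version="38"><primary_id>P67890</primary_id></record>
<footer/>
</dataset>"""

# Header and footer are skipped; only the records are visited.
ids = [rec.findtext("primary_id") for rec in iter_records(io.BytesIO(sample))]
print(ids)
```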

This starts getting into BioXML-type work.

                                   Andrew
                                   dalke@dalkescientific.com


From adalke at mindspring.com  Fri Dec 21 06:03:03 2001
From: adalke at mindspring.com (Andrew Dalke)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] building a data object
Message-ID: <003c01c18a0f$120a6c20$0301a8c0@josiah.dalkescientific.com>

Bioperl has only one sequence record data object.  One of the points
behind Biopython's two parsing systems was to allow the building of
different objects without having to rewrite the parser as well.
(BioJava has a similar goal, but is more akin to the first Biopython
parser and not the Martel one.)

Take the example I gave in my previous post:

iterator = format.make_iterator("record")
for record in iterator.parseFile(open(filename), Builder()):
    do_something(record.document)

In this case, the 'Builder()' is an object which translates SAX events
from whichever format is given into a 'document' of whatever is
desired.  For example, it could be a

  Swissprot2SeqRecordBuilder
  GenBank2LightweightSeqBuilder
  ...

Basically, there are two free variables -- input file type and object
to make.  So this needs some sort of double dispatch mechanism.

(That's not strictly true.  A GenBank specific data type may only
support being built from a GenBank record.  For example, a GenBank
record to HTML converter need only support GenBank.)

Because of the combinatorial explosion, there won't be all that many
generalized intermediate formats.  I can think of perhaps four:
   - a "standard" sequence record
   - a "lightweight" sequence record, when FASTA-style data is enough
       (If the tag names and semantics are consistent across the
        different formats, this can be nearly trivial.)
   - an alignment record
   - some sort of structure data type.


Since there is (will be) format detection, there needs to be some way
to determine the right builder given only the requested output type.
The implementation is something like this:

def readFile(class_to_build, infile):
    format = set_of_allowed_possibilities.recognizeFile(infile)
    iterator = format.make_iterator("record")
    Builder = figure_out_builder(format, class_to_build)
    for record in iterator.parseFile(infile, Builder()):
        yield record.document

so someone can say

from Bio import SeqRecord, IO

for record in IO.readFile(SeqRecord.SeqRecord, open("unknown.dat")):
   do_something(record)

(there should also be a readString, for symmetry with the XML code in
Martel.)


I think the best way to implement 'figure_out_builder' is to
ask the class for it, perhaps via a static class method.

   class_to_build.get_builder(format)

then this requires either a registration system or some way to
determine the builder's location as a module.

(eg, the Builder to convert a "swissprot/version=38" format into
a SeqRecord could be returned by calling
  Bio.bioformats.swissprot.SeqRecord.get_builder({"version": "38"})
)


Another way to do the API is to make 'readFile' a static method
of the SeqRecord object.  This gets rid of the 'IO' module.

from Bio import SeqRecord
for record in SeqRecord.SeqRecord.readFile(open("unknown.dat")):
   do_something(record)

This looks funny to me, especially since Python doesn't really have
static methods.  Python 2.2 makes them easier to write.  A third
option is to use a function in the module namespace, as in

from Bio import SeqRecord
for record in SeqRecord.readFile(open("unknown.dat")):
   do_something(record)

This is probably the most traditional and appropriate solution.  On
the other hand, the functionality can't be added automatically through
inheritance, which makes it harder to remember what to do.  There will
need to be an explicit creation of the function, as in

from Bio import IO

readFile = IO.ReadFile(SeqRecord)


Expanding even further, perhaps there should be an "io" object, with
this and the write methods (next email):

from Bio import SeqRecord
for record in SeqRecord.io.readFile(open("unknown.dat")):
   do_something(record)


My problem is that I know this is a double dispatch problem, but I
don't know the right way to solve it.  I can think of many - perhaps
too many. :(
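
One of the many possibilities is a plain lookup table keyed on the pair
(format name, class to build).  The sketch below assumes that approach;
register_builder, the fallback from "swissprot/version=38" to the bare
"swissprot" family, and the empty placeholder classes are all
hypothetical and only show the dispatch, not real builders.

```python
_builders = {}


def register_builder(format_name, target_class):
    """Class decorator: file a builder under (format_name, target_class)."""
    def decorate(builder_class):
        _builders[(format_name, target_class)] = builder_class
        return builder_class
    return decorate


def figure_out_builder(format_name, target_class):
    """Try the exact format name first, then the family without attributes."""
    family = format_name.split("/", 1)[0]
    for name in (format_name, family):
        try:
            return _builders[(name, target_class)]
        except KeyError:
            pass
    raise LookupError("no builder makes %s from %s"
                      % (target_class.__name__, format_name))


# Placeholder data types and builders, for illustration only.
class SeqRecord:
    pass


class LightweightSeq:
    pass


@register_builder("swissprot", SeqRecord)
class Swissprot2SeqRecordBuilder:
    pass


@register_builder("genbank", LightweightSeq)
class GenBank2LightweightSeqBuilder:
    pass


print(figure_out_builder("swissprot/version=38", SeqRecord).__name__)
```

A table like this keeps the data classes free of parser knowledge, at
the cost of needing every builder module to be imported (or lazily
registered) before the lookup happens.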


                                Andrew
                                dalke@dalkescientific.com


From adalke at mindspring.com  Fri Dec 21 06:03:09 2001
From: adalke at mindspring.com (Andrew Dalke)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] writing
Message-ID: <003d01c18a0f$143bf220$0301a8c0@josiah.dalkescientific.com>

Only one more email after this!  (And it's a summary.)

The opposite to reading is writing.

I want to make file conversion easy.  Here's the example in Bioperl's
SeqIO perldoc:

  $format1 = shift;
  $format2 = shift || die "Usage: reformat format1 format2 < input > output";

  use Bio::SeqIO;

  $in  = Bio::SeqIO->newFh(-format => $format1 );
  $out = Bio::SeqIO->newFh(-format => $format2 );
  print $out $_ while <$in>;

It should be just as easy for Biopython -- even easier since we have
autodetection.

  import sys
  from Bio import SeqRecord
  if len(sys.argv) != 2:
    sys.exit("Usage: reformat output_format < input > output")

  writer = SeqRecord.make_writer(sys.argv[1])
  for record in SeqRecord.readFile():
      writer.write(record)

(Same number of lines, about the same number of characters, and
I could have done
  map(SeqRecord.make_writer(sys.argv[1]).write, SeqRecord.readFile())
instead of the last three lines :)


Again, there needs to be some resolution system, to figure out the
output converter associated with a given format name.  There's a twist
here that Bioperl doesn't capture - versions.  People are going to
want the output in "swissprot" format, and there may be support for
writing it in "swissprot/version=38" and "swissprot/version=39"
versions, so something needs to figure out that 39 is probably better
than 38 (or force the user to disambiguate).
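
A minimal sketch of that resolution step, assuming format names carry at
most one attribute written as "/version=N" with a numeric N, and that
"highest version wins" is the wanted rule.  The function name
resolve_writer_name is hypothetical.

```python
def resolve_writer_name(requested, available):
    """Pick the registered format name that best matches the request."""
    if requested in available:
        return requested  # exact match, including an explicit version
    candidates = []
    for name in available:
        base, _, version = name.partition("/version=")
        if base == requested and version.isdigit():
            candidates.append((int(version), name))
    if not candidates:
        raise LookupError("no writer for %r" % (requested,))
    # Highest version number wins.
    return max(candidates)[1]


available = ["fasta", "swissprot/version=38", "swissprot/version=39"]
print(resolve_writer_name("swissprot", available))
```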


There are a few other things I haven't figured out here.

I make the writer with 'make_writer'.  This is a function in the
SeqRecord module scope.  It looks like this:

  def make_writer(output_format = "fasta", outfile = sys.stdout):
    ...

The 'Writer' object created writes SeqRecord objects in the correct
format, on the given file handle.  I am somewhat worried that finer
control may be needed, eg, for "minimal" vs. "complete" output
generation.  I decided to defer worrying until there is more than one
output generator for a given format.

I am not sure that "write" is the appropriate method name.  There's
something to be said for "append", since that's the opposite of
iteration.  Ie

results = []
for x in data:
  results.append(x)

has exactly the same functional form as

writer = make_writer()
for x in data:
  writer.write(x)

It's also possible that some writers will return strings, rather than
write to a file, as in

convert = toString(output_format)
for x in data:
  sys.stdout.write(convert(x))

In this case you can see that 'write' in Python traditionally
takes a string, not an object.

On the other hand, it isn't obvious that 'append' is how to write a
record, and nearly everyone will be writing them.

I'm still thinking about that "io" object, used like this

  writer = SeqRecord.io.make_writer(sys.argv[1])
  for record in SeqRecord.io.readFile():
      writer.write(record)

That makes it easier to standardize the interface, since integration
is then a matter of:

io = StandardIOFramework(SeqRecord)

and 'io' can have

io.register_reader(format, builder)
io.register_writer(format, writer)
builder = io.resolve_reader(format)
writer = io.resolve_writer(format)
for record in io.readFile(open("something.txt")):
  ...
for record in io.readString("SFSDFSDFSDF"):
  ...


                                Andrew
                                dalke@dalkescientific.com


From adalke at mindspring.com  Fri Dec 21 06:04:00 2001
From: adalke at mindspring.com (Andrew Dalke)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] parsing summary
Message-ID: <003e01c18a0f$31cc4e20$0301a8c0@josiah.dalkescientific.com>

To summarize:

I'm working on a way to minimize the amount of work needed to handle
the standard case of

for record in data_file:
  do_something(record)
  write record to output_file

I think I have an API, which is easy to use

from Bio import SeqRecord

writer = SeqRecord.io.make_writer("genbank")
for record in SeqRecord.io.readFile(open("unknown.dat")):
  do_something(record)
  writer.write(record)

and can handle different intermediate data types

from Bio import SimpleSeq

writer = SimpleSeq.io.make_writer("fasta")
for record in SimpleSeq.io.readFile(open("unknown.dat")):
  do_something(record)
  writer.write(record)

And it's all built on powerful lower-level forms which are still
relatively easy to use.

The biggest problem I have is in registration of all the different
format and conversion types.  Ideally, adding a new format shouldn't
affect performance until its presence is needed.  That speaks for some
sort of file-based discovery mechanism.  The simplest solution is to
load all files at once, but I expect that to yield poor performance.
So there needs to be some sort of deferred loading mechanism.  Or at
least such a mechanism should not be precluded.
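
A deferred loading mechanism could be as small as a table of module
names that are imported on first use.  The sketch below assumes that
design; LazyRegistry is a hypothetical name, and the standard library
modules "json" and "csv" merely stand in for format definition modules.

```python
import importlib


class LazyRegistry:
    """Map format names to module names; import only when resolved."""

    def __init__(self):
        self._module_names = {}
        self._loaded = {}

    def register(self, format_name, module_name):
        # Cheap: records the name, does not import anything.
        self._module_names[format_name] = module_name

    def resolve(self, format_name):
        if format_name not in self._loaded:
            module_name = self._module_names[format_name]
            self._loaded[format_name] = importlib.import_module(module_name)
        return self._loaded[format_name]

    def loaded(self):
        """Names of the formats whose modules have been imported so far."""
        return sorted(self._loaded)


registry = LazyRegistry()
registry.register("json-demo", "json")  # stand-in for a format module
registry.register("csv-demo", "csv")    # stand-in for a format module

print(registry.loaded())                # nothing resolved yet
module = registry.resolve("json-demo")
print(registry.loaded())
```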


What I want to do requires coming up with standardized names and data
types.  These include file formats, field types, and data structures.

Thank you for letting me write all this.  It's helped clear
up what my bottlenecks are in this work.  Hopefully you all have
some ideas - or you can say I'm trying to be too clever for my
own good !

                                Andrew
                                dalke@dalkescientific.com


From jchang at smi.stanford.edu  Thu Dec 20 22:40:35 2001
From: jchang at smi.stanford.edu (Jeffrey Chang)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] Re: [BioPython] parse IPI data with biopythons SwissProt parser
In-Reply-To: <3C1F0A9B.9080803@proceryon.at>
References: <3C1F0A9B.9080803@proceryon.at>
Message-ID: <20011220214035.C1338@krusty.dsl.hstntx.swbell.net>

(moved from biopython)

Hi Wolfgang,

These look like relatively minor changes.  I'd like to incorporate
them into the SProt.py file in the standard distribution, if you don't
mind.  However, I'm having a little bit of trouble reconstructing the
patch from the description given.  Do you mind sending me your
SProt.py file with all the changes necessary?

Thanks,
Jeff


On Tue, Dec 18, 2001 at 10:21:31AM +0100, Wolfgang Schueler wrote:
> Hi all,
> 
> the IPI database at EBI contains proteins from the human genome
> from SWISS-PROT, TrEMBL, RefSeq and Ensembl and is available in a 
> SWISS-PROT format.
> Nevertheless there are minor differences to real SWISS-PROT data which 
> prevent the use of the SWISS-PROT parser of Biopython1.00.a3
> 
> The following modifications of Sprot.py allowed the parsing of the 
> IPI-data (find IPI in http://www.ebi.ac.uk/IPI/IPIhelp.html).
> 
> Maybe it is helpful for someone.
> Wolfgang
> 
> 
> 
> 
> # ws: changes in _RecordConsumer.date()            for IPI
> #                _RecordConsumer.identification()  for IPI
> #                _Scanner.scanReference()          crashing SwissProt entry
> #                _Scanner.scanDT()                 for IPI
> #                _Scanner.scanDE()                 for IPI
> 
>     def _scan_dt(self, uhandle, consumer):
>         self._scan_line('DT', uhandle, consumer.date, exactly_one=1)
> #        self._scan_line('DT', uhandle, consumer.date, exactly_one=1)
> #ws:2001-12-05----------------------------------------v========v----  # 
> IPI does not use 'last annotation update'	
>         self._scan_line('DT', uhandle, consumer.date, one_or_more=1)  #
> # 
>                                               ^========^------#
>  #       self._scan_line('DT', uhandle, consumer.date, exactly_one=1) #
> #^--------------------------------------------------------------------# 
> 
> 
>     def _scan_de(self, uhandle, consumer):
> #ws:2001-12-05-----------------------------------------------v========v---- 
>  # IPI IPI00029727.2: no DE entry	
>         self._scan_line('DE', uhandle, consumer.description, 
> any_number=1) # was one_or_more
> #------------------------------------------------------------^========^
>     def _scan_reference(self, uhandle, consumer):
>         while 1:
>             if safe_peekline(uhandle)[:2] != 'RN':
>                 break
>             self._scan_rn(uhandle, consumer)
>             self._scan_rp(uhandle, consumer)
>             self._scan_rc(uhandle, consumer)
>             self._scan_rx(uhandle, consumer)
> # ws:2001-12-05 added, entry exists with RL before RA
> # ----------v==============================v
>             self._scan_rl(uhandle, consumer)
> #-----------^==============================^ 
> 
>             self._scan_ra(uhandle, consumer)
>             self._scan_rt(uhandle, consumer)
>             self._scan_rl(uhandle, consumer)
> 
> 
>     def identification(self, line):
>         cols = string.split(line)
>         self.data.entry_name = cols[1]
>         self.data.data_class = self._chomp(cols[2])    # don't want ';'
>         self.data.molecule_type = self._chomp(cols[3]) # don't want ';'
>         self.data.sequence_length = int(cols[4])
> 
>         # data class can be 'STANDARD' or 'PRELIMINARY'
> # ws:2001-12-05 added to be IPI conform -------------------------v=====v
>         if self.data.data_class not in ['STANDARD','PRELIMINARY','IPI']:
> # ---------------------------------------------------------------^=====^
>             raise SyntaxError, "Unrecognized data class %s is in 
> line\n%s" % \
>                   (self.data.data_class, line)
>         # molecule_type should be 'PRT' for PRoTein
>         if self.data.molecule_type != 'PRT':
>             raise SyntaxError, "Unrecognized molecule type %s in 
> line\n%s" % \
>                   (self.data.molecule_type, line)
> 
>     def date(self, line):
>         uprline = string.upper(line)
>         if string.find(uprline, 'CREATED') >= 0:
>             cols = string.split(line)
> # ws:2001-12-05 added lines to prevent crash at (IPIrel. , created) !no 
> number given!
>             if self._chomp(cols[3]) == '':                            #<=
> 	       self.data.created = cols[1], 0                         #<=
> 	    else:	                                              #<=
>                self.data.created = cols[1], int(self._chomp(cols[3]))
> #-----------^=^--------------------------------------------------------
>         elif string.find(uprline, 'LAST SEQUENCE UPDATE') >= 0:
>             cols = string.split(line)
> # ws:2001-12-05 added lines to prevent crash at '(IPIrel. , created)' 
> !no number given!
>             if self._chomp(cols[3]) == '': 
>        #<=
> 	       self.data.sequence_update = cols[1], 0                         #<=
> 	    else:                                                             #<=
>                self.data.sequence_update = cols[1], 
> int(self._chomp(cols[3]))
> #-----------^=^----------------------------------------------------------------
>         elif string.find(uprline, 'LAST ANNOTATION UPDATE') >= 0:
>             cols = string.split(line)
> # ws:2001-12-05 added lines to prevent crash at '(IPIrel. , created)' 
> !no number given!
>             if self._chomp(cols[3]) == '': 
>        #<=
> 	       self.data.annotation_update = cols[1], 0                       #<=
> 	    else:                                                             #<=
>                self.data.annotation_update = cols[1], 
> int(self._chomp(cols[3]))  #<=
> #-----------^=^----------------------------------------------------------------
>         else:
>             raise SyntaxError, "I don't understand the date line %s" % line
> 
> 
> _______________________________________________
> BioPython mailing list  -  BioPython@biopython.org
> http://biopython.org/mailman/listinfo/biopython

From katel at worldpath.net  Thu Dec 27 18:59:57 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] Martel question
Message-ID: <004701c18f32$98a3e860$010a0a0a@cadence.com>

  I'm not sure what is wrong with my saf format.  I used my usual approach
when I'm stuck: put it aside for a few days and revisit it.  But I'm still
puzzled. My instrumentation shows it's picking up a tag of "#\nBovine" when it
should pick up a comment line, then "Bovine".  Like a restriction enzyme
cutting at the wrong site. Could it be a DOS line-ending issue?

 Andrew, can I send an attachment?

                                                                    Cayte


From katel at worldpath.net  Thu Dec 27 23:16:28 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] for the nfinite Martel request  queue
Message-ID: <006301c18f56$6f152260$010a0a0a@cadence.com>

  In porting the ecell script from perl to python, I'd have to add case
restrictions or sprinkle the  code with
structures like


       Alt ( Str( "reactor", "Reactor", "REACTOR" ) )

  This is because perl has a one letter case insensitive option.

                                            Cayte


From adalke at mindspring.com  Sun Dec 30 08:37:12 2001
From: adalke at mindspring.com (Andrew Dalke)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] Bioformat module
Message-ID: <001201c19137$16a315a0$0201a8c0@josiah.dalkescientific.com>

Hey all,

  Here's the first go at a module based off the set of emails
I wrote last week.  It's at
  http://www.biopython.org/~dalke/Bioformat-0.1.tar.gz

No setup.py or anything fancy like that.  Though you do
need the latest version of CVS Martel.  (List of changes
in the next email.)

In theory this provides a platform for:
  - automatic format recognition
  - using the format information to build a data structure
  - writing that data structure to another format

For example, these parts can be put together for simple,
generic file conversion, as in:

import sys
from Bio import SeqRecord

writer = SeqRecord.io.make_writer(sys.stdout, "fasta")
writer.writeHeader()  # needed for some formats
for record in SeqRecord.io.read(open("file.unknown")):
  writer.write(record)
writer.writeFooter()  # needed for some formats

(Actually, with the code as-is, this is done with

from Bioformat import IO
IO.io.convert(infile = open("file.unknown"), output_format = "fasta")

:)
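
The header/write/footer protocol used above can be illustrated with a
toy FASTA writer.  This is a sketch and not the Bioformat code: the
class FastaWriter, the helper convert, and the use of (name, sequence)
tuples as records are assumptions made for the example.

```python
import io


class FastaWriter:
    """Write (name, sequence) pairs as FASTA; header and footer are no-ops."""

    def __init__(self, outfile, width=60):
        self.outfile = outfile
        self.width = width

    def writeHeader(self):
        pass  # FASTA has no file header; other formats may need one

    def write(self, record):
        name, sequence = record
        self.outfile.write(">%s\n" % name)
        # Wrap the sequence at a fixed line width.
        for i in range(0, len(sequence), self.width):
            self.outfile.write(sequence[i:i + self.width] + "\n")

    def writeFooter(self):
        pass  # FASTA has no file footer either


def convert(records, writer):
    """Drive any writer that follows the header/write/footer protocol."""
    writer.writeHeader()
    for record in records:
        writer.write(record)
    writer.writeFooter()


out = io.StringIO()
convert([("seq1", "ACGTACGT"), ("seq2", "TT")], FastaWriter(out, width=4))
print(out.getvalue(), end="")
```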

The README includes some examples of how to use this module.
Please take a look.

More after I have a chance to get some sleep.  This project
was harder than I thought it would be.  OTOH, it's something
that should be very exciting for the O'Reilly conference.

                    Andrew
                    dalke@dalkescientific.com


From adalke at mindspring.com  Sun Dec 30 08:37:26 2001
From: adalke at mindspring.com (Andrew Dalke)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] Martel changes
Message-ID: <001301c19137$1eca7ca0$0201a8c0@josiah.dalkescientific.com>

I needed to make a few changes to Martel to support the
Bioformat module I was working on.  They are:

INCOMPATIBLE CHANGE:

 - the record readers now support attributes as well as a
tag name.  I forgot to make those changes last summer.

This only affects HeaderFooter and ParseRecord formats.
I couldn't figure out a nice way to make the API backwards
compatible, so used my "it isn't 1.0" prerogative.  This
affected a couple of the existing Biopython modules (needed
to add a {}).  I fixed them up and all the regressions pass.

I was able to make the change in such a way that code using
the old API dies immediately, and it includes a hint on what
needs to be changed.


Other changes:
  - a few speed tweaks to the iterator code; my test case of
reading a subset of sprot38 into a SeqRecord object is now
10% faster.  (The 'characters' callback is used a lot, so
I shortened its path.)

 - the default iterator boundary tag is 'record'

 - it's possible for an expression to go to completion but
allow some text to remain unparsed.  This now throws a new
exception (a subtype of the old one) to allow the handlers
to do something different for that case.  This is used for
the Bioformat format recognition code.

 - Martel.SimpleRecordFilter is used by the Bioformat code
to write a quick test filter, to determine if more
identification work should be done.
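
The "subtype of the old exception" idea can be sketched as follows.
The exception names ParseFailed and UnparsedTextRemains and the toy
prefix parser are hypothetical; they are not Martel's actual classes,
only an illustration of how a recognizer can treat leftover text as a
success while old handlers still catch both cases.

```python
class ParseFailed(Exception):
    """The input did not match the expression."""


class UnparsedTextRemains(ParseFailed):
    """The expression matched, but some input was left over."""

    def __init__(self, position):
        super().__init__("unparsed text remains at position %d" % position)
        self.position = position


def parse_prefix(text, prefix):
    """Toy parser: the 'expression' is just a literal prefix."""
    if not text.startswith(prefix):
        raise ParseFailed("no match at position 0")
    if len(text) > len(prefix):
        raise UnparsedTextRemains(len(prefix))


def recognizes(text, prefix):
    """For format recognition, leftover text still counts as a match."""
    try:
        parse_prefix(text, prefix)
    except UnparsedTextRemains:
        return True
    except ParseFailed:
        return False
    return True


print(recognizes("ID   TEST_HUMAN", "ID   "), recognizes("LOCUS", "ID   "))
```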

                    Andrew
                    dalke@dalkescientific.com


From katel at worldpath.net  Sun Dec 30 21:39:05 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:43:08 2005
Subject: [Biopython-dev] FilteredReader
Message-ID: <002801c191a4$54bfbb00$010a0a0a@cadence.com>

  I added FilteredReader to prefilter text before  passing to Martel.  The
ECell input allows blank lines just about anywhere.  Rather than putting
Alt( blank_line, read_line ) everywhere, I wrote a filter.  To make the
routine general, it contains a variable called filter_chain.  The user can
set it to a list of any low level filters that
have the same signature as the default filters.
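
The general idea of a chain of line filters can be sketched like this.
It is not the FilteredReader code itself: the class name
ChainedLineReader and the two filters are hypothetical, and each filter
is assumed to take an iterable of lines and return an iterable of lines.

```python
import io


def drop_blank_lines(lines):
    """Pass through only lines that contain something besides whitespace."""
    for line in lines:
        if line.strip():
            yield line


def strip_trailing_whitespace(lines):
    """Normalize line endings and remove trailing blanks."""
    for line in lines:
        yield line.rstrip() + "\n"


class ChainedLineReader:
    """Apply each filter in filter_chain, in order, to the lines of infile."""

    def __init__(self, infile, filter_chain=(drop_blank_lines,)):
        self.infile = infile
        self.filter_chain = list(filter_chain)

    def __iter__(self):
        lines = iter(self.infile)
        for line_filter in self.filter_chain:
            lines = line_filter(lines)
        return lines


source = io.StringIO("reactor one \n\n   \nreactor two\n")
reader = ChainedLineReader(source, [drop_blank_lines, strip_trailing_whitespace])
cleaned = list(reader)
print(cleaned)
```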

 Hopefully, I'm not duplicating Andrew's SimpleRecordFilter?

                                      Cayte


From adalke at mindspring.com  Sun Dec 30 18:54:03 2001
From: adalke at mindspring.com (Andrew Dalke)
Date: Sat Mar  5 14:43:09 2005
Subject: [Biopython-dev] for the nfinite Martel request  queue
Message-ID: <001801c1918d$431a5520$0201a8c0@josiah.dalkescientific.com>

Cayte:
>  In porting the ecell script from perl to python, I'd have to add case
>restrictions or sprinkle the  code with
>structures like
> Alt ( Str( "reactor", "Reactor", "REACTOR" ) )

Yeah, I've been using Re("[Rr][Ee][Aa][Cc][Tt][Oo][Rr]") for
this, but it's cumbersome to do that by hand.

There is a stubbed 'Martel.NoCase' which will eventually
support this.  Looks like it just needs to duplicate the
expression then replace Str, Any, and Literal terms.  Shouldn't
be too hard, and the result will let you do

  NoCase(Str("reactor"))
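
Until then, the by-hand pattern can be generated.  The helper below is
a sketch using only the standard re module; nocase_pattern is a
hypothetical name, it assumes ASCII text, and it handles a literal
string only, not a full expression tree as NoCase is meant to.

```python
import re


def nocase_pattern(text):
    """Turn a literal string into a regex matching it in any letter case."""
    parts = []
    for ch in text:
        if ch.lower() != ch.upper():
            # A letter: accept both cases.
            parts.append("[%s%s]" % (ch.upper(), ch.lower()))
        else:
            # Not a letter: match it literally.
            parts.append(re.escape(ch))
    return "".join(parts)


pattern = nocase_pattern("reactor")
print(pattern)
```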

                    Andrew
                    dalke@dalkescientific.com


From jchang at smi.stanford.edu  Mon Dec 31 02:18:18 2001
From: jchang at smi.stanford.edu (Jeffrey Chang)
Date: Sat Mar  5 14:43:09 2005
Subject: [Biopython-dev] format autodection
In-Reply-To: <003a01c18a0e$f4bd33a0$0301a8c0@josiah.dalkescientific.com>
References: <003a01c18a0e$f4bd33a0$0301a8c0@josiah.dalkescientific.com>
Message-ID: <20011230231818.F1032@krusty.stanford.edu>

On Fri, Dec 21, 2001 at 04:02:17AM -0700, Andrew Dalke wrote:

> 2) Does the word "recognize" make sense in this context?  I tried
> "identifier" but that's also a commonly used noun.  (I choose
> "recognize" from a post of Thomas's from the end of summer.)q

I was confused about what was going on in the code until I realized
that there are actually two slightly different uses of the word
"recognize."  In the first use,
> def _recognizeFile(parser, infile):
recognize is used as a predicate for whether the parser can handle the
format of the data in infile.

In the second,
> class RecognizeFormats:
> [...]
>   def recognizeFile(self, infile):
recognize selects between multiple formats and returns the appropriate
one for the data.

It would clear things up if one of them were renamed something else,
e.g. the first use is renamed as "handlesFile" or "acceptsFile".


> 6) Version detection depends on tell/seek working.  There needs to be
> a simple wrapper for inputs (like URLs, and sys.stdin) which don't
> support that action.  Jeff added something like this already.

The file-like handle in File.py is incomplete for this purpose.  It
can only push back stuff as lines, and not as other blocks of data.
It should not be hard to add that capability, though.

> 8) Does this idea make sense to others?

Yes!  And it's sorely needed!  :)

Jeff

From jchang at smi.stanford.edu  Mon Dec 31 02:32:12 2001
From: jchang at smi.stanford.edu (Jeffrey Chang)
Date: Sat Mar  5 14:43:09 2005
Subject: [Biopython-dev] record iteration
In-Reply-To: <003b01c18a0e$fa1463a0$0301a8c0@josiah.dalkescientific.com>
References: <003b01c18a0e$fa1463a0$0301a8c0@josiah.dalkescientific.com>
Message-ID: <20011230233211.G1032@krusty.stanford.edu>

On Fri, Dec 21, 2001 at 04:02:26AM -0700, Andrew Dalke wrote:
> So my proposal is to standardize on certain tag names, to be shared
> across all of the Biopython/Martel grammars.  These include:
>   dataset
>   record
>   header
>   footer

[...]

> BTW, those standard tag names should also include
>   primary_id
>   description (free-form text)
>   sequence   (single letter codes)
>   sequence3  (three letter code)
>   xref       (cross reference to another database)
>   ... others?

This all looks good.  Do you have a sense of whether there should be a
unique prefix or suffix to indicate that a standardized name is being
used?  e.g.  _m_dataset or something like that.  Since these names are
pretty common, especially in this domain, it might be easy to use a
standard tag name when none was intended...

Jeff

From adalke at mindspring.com  Mon Dec 31 05:28:17 2001
From: adalke at mindspring.com (Andrew Dalke)
Date: Sat Mar  5 14:43:09 2005
Subject: [Biopython-dev] for the nfinite Martel request  queue
Message-ID: <001201c191e5$dd359860$0201a8c0@josiah.dalkescientific.com>

Me:
>the result will let you do
>
>  NoCase(Str("reactor"))

Implemented.

>>> from Martel import *
>>> print NoCase(Str("reactor = ") + Re("[A-D]"))
[Rr][Ee][Aa][Cc][Tt][Oo][Rr] = [A-Da-d]
>>>
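
For anyone curious how such an expansion might work, here is a rough
standalone sketch for the literal-string case only.  It is not
Martel's implementation: nocase_literal is a made-up helper, and it
does not handle character classes such as [A-D].

```python
import re


def nocase_literal(text):
    # Hypothetical helper, not part of Martel: build a regex pattern
    # that matches `text` in any mix of upper and lower case.
    parts = []
    for ch in text:
        if ch.isalpha() and ch.upper() != ch.lower():
            # Cased letter: accept either case.
            parts.append("[%s%s]" % (ch.upper(), ch.lower()))
        else:
            # Anything else is matched literally.
            parts.append(re.escape(ch))
    return "".join(parts)


pattern = nocase_literal("reactor = ")
```

The resulting pattern can be handed to re.match or re.fullmatch like
any other regular expression string.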

                    Andrew


From adalke at mindspring.com  Mon Dec 31 05:36:24 2001
From: adalke at mindspring.com (Andrew Dalke)
Date: Sat Mar  5 14:43:09 2005
Subject: [Biopython-dev] Bioformat module
Message-ID: <001701c191e6$ff3bcfa0$0201a8c0@josiah.dalkescientific.com>

Okay, I cleaned up the code and added support for the
embl65 format.  After fixing some bugs, I just dropped
in the new format definitions and .. poof! I was building
SeqRecords and writing FASTA.

Code is at http://www.biopython.org/~dalke/Bioformats-0.2.py

There's a small bit more cleanup to do.  And documentation.

I think it's at the stage where the code can be added
to Biopython proper.  I would like someone else to
take a look at it first, if only to try it out.  (It
wouldn't hurt to also say "Wow! That's cool!" :)

Next is to work on writing format definitions
with tags that meet some sort of API.  It really is cool
that I could just drop in the embl format definition,
which (with a minor change) met the minimal API needed
to build SeqRecords - and have everything just work.

                    Andrew


From adalke at mindspring.com  Mon Dec 31 05:50:08 2001
From: adalke at mindspring.com (Andrew Dalke)
Date: Sat Mar  5 14:43:09 2005
Subject: [Biopython-dev] format autodection
Message-ID: <002201c191e8$ea5713e0$0201a8c0@josiah.dalkescientific.com>

Jeff:
>I was confused about what was going on in the code until I realized
>that there are actually two slightly different uses of the word
>"recognize."

After some early attempts, I decided that "recognize" just
wasn't the right word to use.  I've decided to use "identify",
and my solution to the confusion in wording is that
identify returns a 'Format'.

format = Bioformat.identify(open("file.dat"))
if format is not None:
    print format.name


>  In the first use,
>> def _recognizeFile(parser, infile):
>recognize is used as a predicate for whether the parser can handle the
>format of the data in infile.

I've kept that usage internally.

>In the second,
>> class RecognizeFormats:
>> [...]
>>   def recognizeFile(self, infile):
>recognize selects between multiple formats and returns the appropriate
>one for the data.

This form is now known as 'identify'.

I wasn't explicitly aware of the distinction, but what happened
to me was it didn't scan well in English.  I wrote some sample
code and tried to make the names fit the way I described what
was going on.  I ended up with:
   "I want to identify the format used"
   "First, we see if this recognizes the format"


>It would clear things up if one of them were renamed something else,
>e.g. the first use is renamed as "handlesFile" or "acceptsFile".

Done.

>The file-like handle in File.py is incomplete for this purpose.  It
>can only push back stuff as lines, and not as other blocks of data.
>It should not be hard to add that capability, though.

Yeah, I saw that.  I've included a 'ReseekFile' which buffers
everything read, and allows reseeking to the original position
(and only the original position).  It only supports the 'read'
method, since that's all Martel needs.  It only allows tell()
at the beginning, and only allows seeks to that position.

It has a new method called 'nobuffer', which clears the buffer
after it's all been (re)read.  This prevents the ReseekFile
from storing everything even after the file has been parsed.
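
The behaviour described above can be sketched roughly as follows.
This is an illustration, not the code in Bioformats-0.2.py: the
attribute names are invented, it assumes a binary input, and refusing
seek() after nobuffer() is an assumption of the sketch.

```python
import io


class ReseekFile:
    """Sketch: let a non-seekable input be rewound to its start."""

    def __init__(self, infile):
        self._infile = infile
        self._saved = io.BytesIO()  # everything read so far
        self._pos = 0               # bytes handed out since the start
        self._keep = True           # still saving what we read?

    def read(self, size=-1):
        # Serve saved data first, then fall through to the real input.
        data = self._saved.read(size)
        if size < 0 or len(data) < size:
            want = -1 if size < 0 else size - len(data)
            fresh = self._infile.read(want)
            if self._keep:
                self._saved.write(fresh)
            else:
                # Saved data has been fully re-read; free it.
                self._saved = io.BytesIO()
            data += fresh
        self._pos += len(data)
        return data

    def tell(self):
        if self._pos != 0:
            raise IOError("tell() only works at the starting position")
        return 0

    def seek(self, offset, whence=0):
        if offset != 0 or whence != 0 or not self._keep:
            raise IOError("can only seek back to the starting position")
        self._saved.seek(0)
        self._pos = 0

    def nobuffer(self):
        # Stop saving new data; what is already saved can be re-read once.
        self._keep = False


source = io.BytesIO(b"ID   X56734\nSQ   ...\n")  # stands in for a pipe or URL
handle = ReseekFile(source)
start = handle.tell()
peek = handle.read(5)       # a format detector looks at the first bytes
handle.seek(start)          # rewind for the real parser
handle.nobuffer()
everything = handle.read()
```

The intended pattern is the one at the end: peek, seek back to the
start, call nobuffer(), then hand the wrapper to the parser.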

>> 8) Does this idea make sense to others?
>
>Yes!  And it's sorely needed!  :)

Thanks!  Now, take a look at the code to see what the result
looks like :)

                    Andrew


From adalke at mindspring.com  Mon Dec 31 06:03:37 2001
From: adalke at mindspring.com (Andrew Dalke)
Date: Sat Mar  5 14:43:09 2005
Subject: [Biopython-dev] record iteration
Message-ID: <002f01c191ea$ccbb8260$0201a8c0@josiah.dalkescientific.com>

Me, on standardized Martel tag names.

Jeff:
>This all looks good.  Do you have a sense of whether there should be a
>unique prefix or suffix to indicate that a standardized name is being
>used?

Well, it's XML so the best solution is probably XML namespaces.
They would look like this:
  biopython:sequence

but I don't know enough about them.  I've only been doing
non-namespace work.  To make things more fun, the SAX 2.0 API
is slightly different for namespace tags as compared to
non-namespace tags, and I don't know how they are supposed
to work.  The documentation I have is pre-2.0, and my
"Python and XML" book hasn't arrived yet.

> e.g.  _m_dataset or something like that.  Since these names are
>pretty common, especially in this domain, it might be easy to use a
>standard tag name when none was intended...

My plan is several-fold:
 - make up my own tag names, since it's just us for now
 - document when/how they are used
 - look for existing tag names (BTW, I don't understand GAME
     all that well)
 - convince others to use Martel, or at least the ideas behind
    Martel (if no one else uses it, there won't be a namespace
    problem)
 - keep going blithely until there is a problem and/or until
    someone tells me how to use SAX 2.0-style namespaces
 - present the general Martel/parsing idea in my Biopython talk
     at the O'Reilly conference
 - bring up the specific problem in a lightning talk
 - convince others at the hackathon to help me, as I don't have
     enough breadth to get things right.
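
On the SAX 2.0 namespace question above, here is a minimal sketch
using the xml.sax module from the Python standard library.  The
namespace URI and the TagCollector class are placeholders made up for
the example; nothing here is an agreed Biopython or Martel convention.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

# Placeholder URI for illustration only, not an official namespace.
BIOPYTHON_NS = "http://example.org/biopython"


class TagCollector(ContentHandler):
    def __init__(self):
        ContentHandler.__init__(self)
        self.seen = []

    def startElementNS(self, name, qname, attrs):
        # With namespace processing on, name is a (uri, localname) pair.
        uri, localname = name
        self.seen.append((uri, localname))


doc = (
    '<record xmlns:biopython="%s">'
    '<biopython:sequence>ACGT</biopython:sequence>'
    '</record>' % BIOPYTHON_NS
).encode("ascii")

parser = xml.sax.make_parser()
parser.setFeature(feature_namespaces, True)  # call the *NS handler methods
collector = TagCollector()
parser.setContentHandler(collector)
parser.parse(io.BytesIO(doc))
```

A handler written this way keys on the (uri, localname) pair rather
than on the prefixed string, so the "biopython:" prefix itself is free
to change from document to document.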

Going to be a busy January. :)

                    Andrew