From dalke at acm.org  Thu Nov  2 23:36:52 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] unicode and sequences
Message-ID: <014801c0454f$b1b47320$19ac323f@josiah>

I was browsing through how Unicode works in Python 2.0.  I found
it interesting that it's similar to how the biopython sequence class
works, in that unicode strings take a sequence of bytes and an optional
encoding, just like the sequence takes a string of bytes and an
alphabet.

There is a difference - I think the unicode code converts everything
into UTF-8 encoded Unicode.  Still, I liked the similarity so wanted
to point it out :)
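
A minimal sketch of the parallel, written in present-day Python rather
than 2.0.  Raw bytes plus an encoding give text, much as raw letters
plus an alphabet give a sequence.  The Sequence class and DNA_ALPHABET
below are hypothetical stand-ins for illustration, not the Biopython API.

```python
# Hypothetical stand-in for a sequence class; not the real Biopython Seq.
DNA_ALPHABET = frozenset("ACGT")


class Sequence:
    def __init__(self, data, alphabet=None):
        # The alphabet says how to interpret the raw letters,
        # just as an encoding says how to interpret raw bytes.
        if alphabet is not None and not set(data) <= alphabet:
            raise ValueError("data contains letters outside the alphabet")
        self.data = data
        self.alphabet = alphabet


# Raw bytes plus an encoding -> text.
raw = b"caf\xc3\xa9"
text = raw.decode("utf-8")

# Raw letters plus an optional alphabet -> sequence.
seq = Sequence("GATTACA", DNA_ALPHABET)
```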

                    Andrew


From katel at worldpath.net  Sat Nov  4 21:04:13 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] MaxRepeat
References: <014801c0454f$b1b47320$19ac323f@josiah>
Message-ID: <000b01c046cc$b34218e0$010a0a0a@cadence.com>

  My unit tests for MaxRepeat, with one parameter, failed.  I think the
problem is that the tech description shows the lower limit with a default of
0.  The code has no default for the lower limit.


                                                    Cayte


From katel at worldpath.net  Sun Nov  5 18:41:13 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] Assert in Martel
Message-ID: <000901c04781$e340c540$010a0a0a@cadence.com>

  The value in the invert parameter is undefined for values other than 0 or
1.  2 acts like 0, 3 acts like 1.

                                Cayte


From katel at worldpath.net  Sun Nov  5 20:43:04 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] Martel Test Cases
Message-ID: <000901c04792$e88065e0$010a0a0a@cadence.com>

  Today was a cold, rainy Sunday, perfect for coding and testing. :)  I
committed a bunch of test cases for Martel.  I should add more for
Group/GroupRef and some for the operator overloads.

  My experience is that the unit test tool works well for this kind of test but
not for parsers, where you'd have to drag a lot of context around.

  Should we be thinking about a Gui for the new parsers?


                                               Cayte


From dalke at acm.org  Sun Nov  5 18:52:39 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] Re: MaxRepeat
Message-ID: <027801c04783$7c705b80$b9ab323f@josiah>

Cayte:
>I think the problem is that the tech description shows the lower limit
>with a default of 0.  The code has no default for the lower limit.

You're right.  It should have a default min value of 0.
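
Roughly what that fix looks like, as a sketch only.  The class below is
a hypothetical stand-in for Martel's MaxRepeat; the parameter names
min_count and max_count are assumptions, and only the argument handling
is shown.

```python
# Hypothetical stand-in for Martel's MaxRepeat; only the argument
# handling is sketched, and the parameter names are assumptions.
class MaxRepeat:
    def __init__(self, expression, min_count=0, max_count=None):
        if min_count < 0:
            raise ValueError("min_count must be >= 0")
        if max_count is not None and max_count < min_count:
            raise ValueError("max_count must be >= min_count")
        self.expression = expression
        self.min_count = min_count
        self.max_count = max_count  # None means "no upper limit"


# A one-parameter call now falls back to a lower limit of 0.
repeat = MaxRepeat("A")
```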

>  The value in the invert parameter is undefined for values other than
> 0 or 1.  2 acts like 0, 3 acts like 1.

Can you give me a test case?  The "invert" option should be a test for
true/false using the Python definition of 0, [], (), {}, all being false
(as well as anything where __nonzero__ or __len__ returns 0) and 1, 2, 3, ...
should all be considered true.

I looked over the code for how invert is used.  There are a few problems
with it, like using 'exp.invert == 0' instead of 'not exp.invert', so
I'll fix those as well, but they won't give the behaviour you're talking
about.
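
To make the difference between the two tests concrete, here is a small
sketch in present-day Python.  The Exp class is a hypothetical stand-in
that carries nothing but the invert flag.

```python
# Hypothetical expression object carrying only an "invert" flag.
class Exp:
    def __init__(self, invert):
        self.invert = invert


def is_plain_by_equality(exp):
    # The problematic form: only values equal to 0 count as "not inverted".
    return exp.invert == 0


def is_plain_by_truth(exp):
    # The intended form: any false value counts as "not inverted".
    return not exp.invert


for value in (0, 1, 2, 3, [], (), {}, None):
    exp = Exp(value)
    print(repr(value), is_plain_by_equality(exp), is_plain_by_truth(exp))
```

The two forms agree for every integer and differ only for empty
containers and None, so the '== 0' form cannot make 2 behave like 0.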

> Today was a cold, rainy Sunday, perfect for coding and testing.

You sure you aren't in Santa Fe? :)  We've had some snow in town, although
it's about 40 degrees so it isn't sticking.  The freeze level seems to
be about 500 feet above us, at least, that's where the snow line appears
on the mountains.  Guess winter is settling on everyone - 'cept my family
in Florida.

> My experience is that the unit test tool works well for this kind of test
> but not for parsers, where you'd have to drag a lot of context around.

I got that feeling as well.  That's why there ended up being a lot of
test scaffolding for my regression tests, and my tests don't even check
to see if the code matched the right things (they just check to see if
it matched *something*).

I've been tempted to have a set of golden data, with a bunch of data
files for each grammar, then converting the parsed result to a canonical
XML form and comparing the result to the gold reference.

I haven't gone towards it since I get the feeling that that sort of
regression code is too fragile for code still under development.
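
For what it's worth, the golden-data idea could be sketched like this
in present-day Python.  The flat list of (tag, text) pairs is a made-up
stand-in for a parsed result, and the record/id/length names are
invented for the example; none of it is Martel's actual output.

```python
import difflib
import xml.etree.ElementTree as ET


def to_canonical_xml(tagged_fields):
    """Turn (tag, text) pairs into a stable XML string.

    The flat (tag, text) input is a made-up stand-in for a parsed result.
    """
    root = ET.Element("record")
    for tag, text in tagged_fields:
        child = ET.SubElement(root, tag)
        child.text = text
    return ET.tostring(root, encoding="unicode")


def diff_against_golden(parsed, golden_xml):
    """Return unified-diff lines; an empty list means the outputs match."""
    actual = to_canonical_xml(parsed)
    return list(difflib.unified_diff(
        golden_xml.splitlines(), actual.splitlines(),
        "golden", "actual", lineterm=""))


golden = "<record><id>P12345</id><length>42</length></record>"
parsed = [("id", "P12345"), ("length", "42")]
differences = diff_against_golden(parsed, golden)
```

Any change to the canonical form invalidates every golden file at once,
which is the fragility mentioned above.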

> Should we be thinking about a Gui for the new parsers?

I'm not sure what that means.  Something like the Tools/redemo.py in the
Python distribution?  That is, a window with two text regions, one to
build the regexp and one containing the text to match.  The regions that
match can be highlighted, perhaps with a mouseover to show the tag name
for a given region.  Umm, that won't work since there can be many tags
describing a region.

                    Andrew


From katel at worldpath.net  Mon Nov  6 01:20:33 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] Re: MaxRepeat
References: <027801c04783$7c705b80$b9ab323f@josiah>
Message-ID: <002001c047b9$ac38cf60$010a0a0a@cadence.com>

  I added more test cases.  test_n2 fails, probably something weird about
backslash.  I can't post the test case, because MIME won't take it, but I
checked  it in.


>
> >  The value in the invert parameter is undefined for values other than
> > 0 or 1.  2 acts like 0, 3 acts like 1.
>
> Can you give me a test case?  The "invert" option should be a test for
> true/false using the Python definition of 0, [], (), {}, all being false
> (as well as anything where __nonzero__ or __len__ returns 0) and 1, 2, 3, ...
> should all be considered true.
>
  I can't reproduce it.  Write it off as eyestrain. :)


                                             Cayte


From chapmanb at arches.uga.edu  Tue Nov  7 04:45:07 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] Proposed addition to Standalone BLAST
In-Reply-To: <Pine.GSO.4.21.0010311853280.16965-100000@riboweb.Stanford.EDU>
References: <14844.19384.149965.283975@taxus.athen1.ga.home.com>
	<Pine.GSO.4.21.0010311853280.16965-100000@riboweb.Stanford.EDU>
Message-ID: <14855.53027.829310.153321@taxus.athen1.ga.home.com>

Jeff:
> Sure.  Having some code that would help to diagnose errors in BLAST
> reports would be a very nice feature.  Certainly more user friendly than
> having SyntaxError this or SyntaxError that.
> 
> We would have to build this on top of the current exceptions, though.  
> It's still nice to have the SyntaxErrors under the hood, as an explanation
> on why the parser is complaining in the first place.

Okay, I went ahead and tried to implement something to do what we are
talking about. The code is attached as a diff to the current
NCBIStandalone module. Basically, what I did was implement a class
BlastErrorParser that uses the regular BlastParser, but catches
SyntaxErrors and tries to figure out the problems with them. It will
also optionally save any BLAST reports that cause syntax errors to a
file (which I think is a useful feature if you want to look at the
records that are causing the errors in a big ol' file of BLAST
results).

I use copy.deepcopy() to copy the handle, and since I was curious
about how this would affect the parsing time, I did a little timing
test. This wasn't anything scientific, just a big BLAST
report that I had to parse which had errors in it. The results are:

using BlastErrorParser -> 1 hour and 31 minutes
Starting parsing at: Mon Nov  6 22:38:32 2000
Stopped parsing at: Tue Nov  7 00:09:04 2000

using BlastParser -> 1 hour and 30 minutes
Starting parsing at: Tue Nov  7 00:37:56 2000
Stopped parsing at: Tue Nov  7 02:07:57 2000

So I guess the overhead is minimal, and this makes me happy -- if
anyone else knows more about timings and wants to do tests, I would be 
happy to hear about them.
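
By the timestamps above the two runs are 1:30:32 and 1:30:01, so about
half a minute apart.  For anyone who wants to repeat the comparison on a
smaller input, a timing helper might look like the sketch below
(present-day Python).  The count_lines workload is a hypothetical
stand-in for a parser run.

```python
import time


def time_call(func, *args, repeats=3):
    """Return the best wall-clock time in seconds over several runs."""
    best = None
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        elapsed = time.perf_counter() - start
        if best is None or elapsed < best:
            best = elapsed
    return best


# Hypothetical workload standing in for a parser run.
def count_lines(text):
    return len(text.splitlines())


report = "line\n" * 100000
seconds = time_call(count_lines, report)
```

Taking the best of several runs reduces the effect of whatever else the
machine is doing, which a single run cannot rule out.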

Anyways, this does everything I was originally writing about wanting
to happen, and I like it, but I'd like to hear people's opinions and
comments on it. If people are for including it, then I can check it in 
and also add a test that uses it to the regression tests.

Thanks for all the input on this so far!

Brad

-------------- next part --------------
*** NCBIStandalone.py.orig	Thu Oct 12 13:32:21 2000
--- NCBIStandalone.py	Mon Nov  6 22:28:16 2000
***************
*** 36,41 ****
--- 36,42 ----
  import re
  import popen2
  from types import *
+ import copy
  
  from Bio import File
  from Bio.ParserSupport import *
***************
*** 471,476 ****
--- 472,563 ----
  
          consumer.end_parameters()
  
+ class LowQualityBlastError(Exception):
+     """Error caused by running a low quality sequence through BLAST.
+ 
+     When low quality sequences (like GenBank entries containing only
+     stretches of a single nucleotide) are BLASTed, they will result in
+     BLAST generating an error and not being able to perform the BLAST
+     search. This error should be raised for the BLAST reports produced
+     in this case.
+     """
+     pass
+ 
+ class BlastErrorParser:
+     """Attempt to catch and diagnose BLAST errors while parsing.
+ 
+     This utilizes the BlastParser module but adds an additional layer
+     of complexity on top of it by attempting to diagnose SyntaxError's
+     that may actually indicate problems during BLAST parsing.
+ 
+     Current BLAST problems this detects are:
+     o LowQualityBlastError - When BLASTing really low quality sequences
+     (ie. some GenBank entries which are just short stretches of a single
+     nucleotide), BLAST will report an error with the sequence and be
+     unable to search with this. This will lead to a badly formatted
+     BLAST report that the parsers choke on. The parser will convert the
+     SyntaxError to a LowQualityBlastError and attempt to provide useful
+     information.
+     """
+     def __init__(self, bad_report_file = None):
+         """Initialize a parser that tries to catch BlastErrors.
+ 
+         Arguments:
+         o bad_report_file - An optional argument specifying a file to
+         write any reports that raise errors to. If not specified, these
+         reports will not be saved.
+         """
+         self._bad_report_file = bad_report_file
+         # if the report file exists, we want to clear the info in it
+         if self._bad_report_file and os.path.exists(self._bad_report_file):
+             tmp = open(self._bad_report_file, 'w')
+             tmp.close()
+         
+         self._b_parser = BlastParser()
+ 
+     def parse(self, handle):
+         """Parse a handle, attempting to diagnose errors.
+         """
+         # copy the handle so we have it if we find an error
+         copy_handle = copy.deepcopy(handle)
+ 
+         try:
+             return self._b_parser.parse(handle)
+         except SyntaxError, msg:
+             # if we have a bad_report_file, save the info to it first
+             if self._bad_report_file:
+                 # copy the handle so we can write it
+                 error_handle = copy.deepcopy(copy_handle)
+                 # append the info to the file
+                 error_file = open(self._bad_report_file, 'a')
+                 error_file.write(error_handle.read())
+                 error_file.close()
+ 
+             # now we want to try and diagnose the error
+             self._diagnose_error(copy_handle, self._b_parser._consumer.data)
+ 
+             # if we got here we can't figure out the problem
+             # so we should pass along the syntax error we got
+             raise SyntaxError, msg
+ 
+     def _diagnose_error(self, handle, data_record):
+         """Attempt to diagnose an error in the passed handle.
+ 
+         Arguments:
+         o handle - The handle potentially containing the error
+         o data_record - The data record partially created by the consumer.
+         """
+         line = handle.readline()
+ 
+         while line:
+             # 'Searchingdone' instead of 'Searching......done' seems
+             # to indicate a failure to perform the BLAST due to
+             # low quality sequence
+             if line[:13] == 'Searchingdone':
+                 raise LowQualityBlastError("Blast failure occurred on query: ",
+                                            data_record.query)
+             line = handle.readline()
+             
  class BlastParser:
      """Parses BLAST data into a Record.Blast object.
  
From jchang at SMI.Stanford.EDU  Tue Nov  7 17:47:51 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] Proposed addition to Standalone BLAST
In-Reply-To: <14855.53027.829310.153321@taxus.athen1.ga.home.com>
Message-ID: <Pine.GSO.4.21.0011071404280.11361-100000@riboweb.Stanford.EDU>

> I use copy.deepcopy() to copy the handle

Are you sure you can copy file handles in this way?  It's not working for
me using Python 2.0 on Solaris:
Python 2.0 (#1, Oct 17 2000, 12:05:31) 
[GCC 2.8.1] on sunos5
Type "copyright", "credits" or "license" for more information.
>>> from Bio.Blast import NCBIStandalone
>>> parser = NCBIStandalone.BlastErrorParser()
>>> rec = parser.parse(open('bt001'))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/home/jchang/lib/jchang/pylib/Bio/Blast/NCBIStandalone.py", line 1578, in parse
    copy_handle = copy.deepcopy(handle)
  File "/home/jchang/lib/python2.0/copy.py", line 147, in deepcopy
    raise error, \
copy.Error: un-deep-copyable object of type <type 'file'>
>>> 

I'm trying to parse blast test bt001.
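
The same failure can be reproduced in present-day Python, where it
shows up as a TypeError instead of copy.Error.  The sketch below also
shows one way around it: read the text once and wrap it in as many
in-memory handles as needed.  The file contents are made up.

```python
import copy
import io
import os
import tempfile

# Write a small file to stand in for a BLAST report (contents made up).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as out:
    out.write("header line\nSearchingdone\n")

with open(path) as handle:
    try:
        copy.deepcopy(handle)
        deepcopy_worked = True
    except TypeError:
        # Real file objects cannot be deep-copied.
        deepcopy_worked = False

    # Reading the text once and re-wrapping it works for any handle.
    text = handle.read()

first_pass = io.StringIO(text)
second_pass = io.StringIO(text)
os.remove(path)
```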


+     def __init__(self, bad_report_file = None):
+         """Initialize a parser that tries to catch BlastErrors.
+ 
+         Arguments:
+         o bad_report_file - An optional argument specifying a file to
+         write any reports that raise errors to. If not specified, these
+         reports will not be saved.

Can we make this function take a handle instead of the name of a file?  
That would allow people to use sys.stderr, if they want the bad files to
go to STDERR.  The tradeoff is that it would place the burden of creating
a handle on the client.

Another option is to allow people to pass in either a file name or a
handle.  While I'm not crazy about this, there is at least one instance of
this in Python (see uu.py), and tabnanny.py has a function that takes the
name of either a file or directory.  Perhaps this is a case of
practicality beating purity.
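
A sketch of the "either one" option in present-day Python.  The helper
names are hypothetical, and treating anything with a write() method as
a handle is an assumption, not how uu.py decides.

```python
import io


def open_report_sink(target):
    """Return (handle, should_close) for a file name or a writable handle.

    Hypothetical helper: anything with a write() method is treated as a
    handle and left open; anything else is treated as a file name.
    """
    if hasattr(target, "write"):
        return target, False
    return open(target, "a"), True


def save_bad_report(target, report_text):
    handle, should_close = open_report_sink(target)
    try:
        handle.write(report_text)
    finally:
        # Only close handles this function opened itself.
        if should_close:
            handle.close()


buffer = io.StringIO()
save_bad_report(buffer, "Searchingdone\n")
```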

Jeff


From dalke at acm.org  Tue Nov  7 18:55:40 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] Proposed addition to Standalone BLAST
Message-ID: <031a01c04916$9ec20f00$89ac323f@josiah>

Jeff:
>Another option is to allow people to pass in either a file name or a
>handle.  While I'm not crazy about this, there is at least one instance of
>this in Python (see uu.py), and tabnanny.py has a function that takes the
>name of either a file or directory.  Perhaps this is a case of
>practicality beating purity.

Guess I'm a purist. (Does that mean I should be using Lisp? :) 
Passing file handles is The Right Thing.

> That would allow people to use sys.stderr

Or a StringIO.  I have the belief that if there's output it should be
useful, and if it's useful, it should be programmatically accessible.
Using file names is awkward and cumbersome, since you have to find
some writable directory (eg, mktemp and all the problems that entails).
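
A small sketch of that point in present-day Python, with io.StringIO
standing in for the StringIO module.  The function name and the record
strings are made up for the example.

```python
import io


def report_bad_records(records, out_handle):
    """Write each bad record to out_handle; the caller owns the handle."""
    count = 0
    for record in records:
        out_handle.write(record)
        count += 1
    return count


bad = ["report one\n", "report two\n"]

# Capture in memory: no temporary file or writable directory needed.
captured = io.StringIO()
written = report_bad_records(bad, captured)
lines = captured.getvalue().splitlines()
```

Passing sys.stderr instead of captured sends the same text to standard
error without changing the function.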

I'm currently working with a Python library which uses a lot of file
names instead of handle.  (It evolved from a set of shell scripts.)
It's pretty awkward since I have to wrap everything with functions or
objects which hide that it's referencing a file.

Otherwise, don't mind me - I haven't been following this thread.

                    Andrew
                    dalke@acm.org


From chapmanb at arches.uga.edu  Tue Nov  7 19:30:55 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] Proposed addition to Standalone BLAST
In-Reply-To: <Pine.GSO.4.21.0011071404280.11361-100000@riboweb.Stanford.EDU>
References: <14855.53027.829310.153321@taxus.athen1.ga.home.com>
	<Pine.GSO.4.21.0011071404280.11361-100000@riboweb.Stanford.EDU>
Message-ID: <14856.40639.744589.612069@taxus.athen1.ga.home.com>

Me:
> > I use copy.deepcopy() to copy the handle

Jeff checks on me:
> Are you sure you can copy file handles in this way?  It's not working for
> me using Python 2.0 on Solaris:
[]

Ooops -- I should have checked the simplest case. Doh! Thanks for the
good catch. Apparently, copy.deepcopy() can copy your magical
File.StringHandles but not regular ol' file handles. I was just using
the output from the iterator to parse, so I completely missed this. A new
version is attached which should work for this case -- it converts
things that aren't StringHandles to a StringHandle before
proceeding. This way there shouldn't be any extra overhead for using
the iterator, but it can handle taking a simple file.

[BlastErrorParser taking a file to write bad reports to]
Jeff asks:
> Can we make this function take a handle instead of the name of a file?  
> That would allow people to use sys.stderr, if they want the bad files to
> go to STDERR.  The tradeoff is that it would place the burden of creating
> a handle on the client.

Andrew agrees:
> Guess I'm a purist. (Does that mean I should be using Lisp? :) 
> Passing file handles is The Right Thing.

Agreed on all counts. Biopython does use file handles for almost
everything, so not having a handle here is actually strange and
awkward. I've switched this over in the new attached patch.

Thanks for the comments! Please let me know of anything else at all.

Brad

-------------- next part --------------
*** NCBIStandalone.py.orig	Thu Oct 12 13:32:21 2000
--- NCBIStandalone.py	Tue Nov  7 19:17:35 2000
***************
*** 36,41 ****
--- 36,42 ----
  import re
  import popen2
  from types import *
+ import copy
  
  from Bio import File
  from Bio.ParserSupport import *
***************
*** 471,476 ****
--- 472,563 ----
  
          consumer.end_parameters()
  
+ class LowQualityBlastError(Exception):
+     """Error caused by running a low quality sequence through BLAST.
+ 
+     When low quality sequences (like GenBank entries containing only
+     stretches of a single nucleotide) are BLASTed, they will result in
+     BLAST generating an error and not being able to perform the BLAST
+     search. This error should be raised for the BLAST reports produced
+     in this case.
+     """
+     pass
+ 
+ class BlastErrorParser:
+     """Attempt to catch and diagnose BLAST errors while parsing.
+ 
+     This utilizes the BlastParser module but adds an additional layer
+     of complexity on top of it by attempting to diagnose SyntaxError's
+     that may actually indicate problems during BLAST parsing.
+ 
+     Current BLAST problems this detects are:
+     o LowQualityBlastError - When BLASTing really low quality sequences
+     (ie. some GenBank entries which are just short stretches of a single
+     nucleotide), BLAST will report an error with the sequence and be
+     unable to search with this. This will lead to a badly formatted
+     BLAST report that the parsers choke on. The parser will convert the
+     SyntaxError to a LowQualityBlastError and attempt to provide useful
+     information.
+     """
+     def __init__(self, bad_report_handle = None):
+         """Initialize a parser that tries to catch BlastErrors.
+ 
+         Arguments:
+         o bad_report_handle - An optional argument specifying a handle
+         where bad reports should be sent. This would allow you to save
+         all of the bad reports to a file, for instance. If no handle
+         is specified, the bad reports will not be saved.
+         """
+         self._bad_report_handle = bad_report_handle
+         
+         self._b_parser = BlastParser()
+ 
+     def parse(self, handle):
+         """Parse a handle, attempting to diagnose errors.
+         """
+         if isinstance(handle, File.StringHandle):
+             shandle = handle
+         else:
+             shandle = File.StringHandle(handle.read())
+ 
+         # copy the handle so we have it if we find an error
+         copy_handle = copy.deepcopy(shandle)
+ 
+         try:
+             return self._b_parser.parse(shandle)
+         except SyntaxError, msg:
+             # if we have a bad_report_handle, save the info to it first
+             if self._bad_report_handle:
+                 # copy the handle so we can write it
+                 error_handle = copy.deepcopy(copy_handle)
+                 # send the info to the error handle
+                 self._bad_report_handle.write(error_handle.read())
+ 
+             # now we want to try and diagnose the error
+             self._diagnose_error(copy_handle, self._b_parser._consumer.data)
+ 
+             # if we got here we can't figure out the problem
+             # so we should pass along the syntax error we got
+             raise SyntaxError, msg
+ 
+     def _diagnose_error(self, handle, data_record):
+         """Attempt to diagnose an error in the passed handle.
+ 
+         Arguments:
+         o handle - The handle potentially containing the error
+         o data_record - The data record partially created by the consumer.
+         """
+         line = handle.readline()
+ 
+         while line:
+             # 'Searchingdone' instead of 'Searching......done' seems
+             # to indicate a failure to perform the BLAST due to
+             # low quality sequence
+             if line[:13] == 'Searchingdone':
+                 raise LowQualityBlastError("Blast failure occurred on query: ",
+                                            data_record.query)
+             line = handle.readline()
+             
  class BlastParser:
      """Parses BLAST data into a Record.Blast object.
  
From katel at worldpath.net  Wed Nov  8 02:36:09 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] platform independence for eol routines in Martel
Message-ID: <001d01c04956$910653e0$010a0a0a@cadence.com>

  os.name gives the python name of the os.  We could have a test and
different handling for posix and nt.

                                          Cayte


From jchang at SMI.Stanford.EDU  Wed Nov  8 02:09:56 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] platform independence for eol routines in Martel
In-Reply-To: <001d01c04956$910653e0$010a0a0a@cadence.com>
Message-ID: <Pine.GSO.4.21.0011072246350.14822-100000@taiyang>

I've always wanted an os.eol variable that's set to the proper end of line
character(s) for your platform.  I think it's been brought up before on
comp.lang.python.  I don't remember why the idea was shot down.

Jeff


On Tue, 7 Nov 2000, Cayte wrote:

>   os.name gives the python name of the os.  We could have a test and
> different handling for posix and nt.
> 
>                                           Cayte


From dalke at acm.org  Wed Nov  8 03:02:40 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] platform independence for eol routines in Martel
Message-ID: <034301c0495a$5162fd20$89ac323f@josiah>

Jeff:
>I've always wanted an os.eol variable that's set to the proper end of line
>character(s) for your platform.  I think it's been brought up before on
>comp.lang.python.  I don't remember why the idea was shot down.

I think it's because "\n" is always supposed to be newline, and you
need to use the right open flags ("t" instead of "b") to get the
translation.  The raw "\r", "\r\n", or "\n" is only seen in binary mode.
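
A sketch of that translation in present-day Python, where the os module
does carry the platform's line ending as os.linesep.  The example
forces the "\r\n" convention with the newline argument so it behaves
the same on every platform; the file contents are made up.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

# Text mode, forcing the Windows convention so the result is the same
# on every platform; the default would use os.linesep instead.
with open(path, "w", newline="\r\n") as out:
    out.write("first\nsecond\n")

# Binary mode shows the bytes that are really on disk.
with open(path, "rb") as raw:
    on_disk = raw.read()

# Text mode with the default newline handling maps them back to "\n".
with open(path, "r") as text_in:
    as_text = text_in.read()

os.remove(path)
```

Code that only ever reads and writes in text mode sees "\n" throughout
and needs no per-platform branch on os.name.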

Still, I agree.

                    Andrew


From jchang at SMI.Stanford.EDU  Wed Nov  8 20:40:43 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] Proposed addition to Standalone BLAST
In-Reply-To: <14856.40639.744589.612069@taxus.athen1.ga.home.com>
Message-ID: <Pine.GSO.4.21.0011081735020.14559-100000@riboweb.Stanford.EDU>

Thanks for the updates.

One more thing: Passing the results around as a string would be
essentially the same as doing deepcopies of a StringHandle.  It would save
the overhead of doing a deep copy of an object, and then reading the
results.  The copy module is nice for arbitrary objects that we don't know
about a priori, but when we only deal with StringHandles, it's OK to just
create one directly when we need it.

+     def parse(self, handle):
+         """Parse a handle, attempting to diagnose errors.
+         """
+         if isinstance(handle, File.StringHandle):
+             shandle = handle
+         else:
+             shandle = File.StringHandle(handle.read())

would be:

results = handle.read()


+         try:
+             return self._b_parser.parse(shandle)

    return self._b_parser.parse(File.StringHandle(results))


+         except SyntaxError, msg:
+             # if we have a bad_report_file, save the info to it first
+             if self._bad_report_handle:
+                 # copy the handle so we can write it
+                 error_handle = copy.deepcopy(copy_handle)
+                 # send the info to the error handle
+                 self._bad_report_handle.write(error_handle.read())


    if self._bad_report_handle:
        self._bad_report_handle.write(results)


etc
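
Put together, the string-passing version might look like the sketch
below.  It is written in present-day Python, with io.StringIO standing
in for File.StringHandle and a hypothetical strict_parse function
standing in for BlastParser.parse; it is not the code that was checked
in.

```python
import io


class LowQualityBlastError(Exception):
    pass


def strict_parse(handle):
    # Hypothetical stand-in for BlastParser.parse: rejects one bad line.
    for line in handle:
        if line.startswith("Searchingdone"):
            raise SyntaxError("unexpected line: %r" % line)
    return "parsed"


class ErrorParserSketch:
    def __init__(self, bad_report_handle=None):
        self._bad_report_handle = bad_report_handle

    def parse(self, handle):
        # Read once; the string can be re-wrapped as often as needed.
        results = handle.read()
        try:
            return strict_parse(io.StringIO(results))
        except SyntaxError:
            if self._bad_report_handle:
                self._bad_report_handle.write(results)
            self._diagnose_error(io.StringIO(results))
            raise  # diagnosis found nothing; pass the original error on

    def _diagnose_error(self, handle):
        for line in handle:
            if line.startswith("Searchingdone"):
                raise LowQualityBlastError("Blast failure occurred")


bad_reports = io.StringIO()
parser = ErrorParserSketch(bad_reports)
try:
    parser.parse(io.StringIO("Query= test\nSearchingdone\n"))
    outcome = "parsed"
except LowQualityBlastError:
    outcome = "low quality"
```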


Thanks,
Jeff


From jchang at SMI.Stanford.EDU  Fri Nov 10 20:37:05 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] 0.90-d04 coming soon...
Message-ID: <Pine.GSO.4.21.0011101634120.17431-100000@riboweb.Stanford.EDU>

Hello everybody,

I'm getting ready to make a 0.90-d04 release of Biopython.  A few things
need to be done before it:

- Brad, I've checked in your BlastErrorParser.  I'm saving the results as
a string instead of a StringHandle.  Could you please look this over and
let me know if this is working and acceptable to you?  Thanks.

- The test_gobase regression tests are failing.  The output from the
test_gobase.py file doesn't match the golden output.  Cayte, could you
look into this?

- The test_prodoc regression tests are failing.  This is mostly my fault,
as the previous version of Prodoc didn't allow copyrights at the end of
records.  However, this has been fixed.  Cayte, do you mind having another
go at the tests, and checking in the verified output?

- The test_align regression tests are failing for me.  It complains that
saxlib is missing.  Do we need to install xmllib for this?  I'm using
Python 2.0.  I thought this came with a SAX api?

- Brad, is your new similarity matrix code ready to check in?

Thanks,
Jeff


From katel at worldpath.net  Sat Nov 11 05:12:32 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] 0.90-d04 coming soon...
References: <Pine.GSO.4.21.0011101634120.17431-100000@riboweb.Stanford.EDU>
Message-ID: <003301c04bc7$e8c25980$010a0a0a@cadence.com>

----- Original Message -----
From: "Jeffrey Chang" <jchang@SMI.Stanford.EDU>
To: <biopython-dev@biopython.org>
Sent: Friday, November 10, 2000 5:37 PM
Subject: [Biopython-dev] 0.90-d04 coming soon...


> Hello everybody,
>
> - The test_gobase regression tests are failing.  The output from the
> test_gobase.py file doesn't match the golden output.  Cayte, could you
> look into this?
>
  I ran test_gobase.py and then ran a diff between the output and the file,
test_gobase in output.  The diff didn't show any differences.  I checked by
eye too.  Can you send me the output that is failing?  Is it platform
dependent?  Or did you retrieve the test htm files from gobase again?  Maybe
there are changes?  I need your output and input.

                                             Cayte


From chapmanb at arches.uga.edu  Sat Nov 11 07:17:17 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] 0.90-d04 coming soon...
In-Reply-To: <Pine.GSO.4.21.0011101634120.17431-100000@riboweb.Stanford.EDU>
References: <Pine.GSO.4.21.0011101634120.17431-100000@riboweb.Stanford.EDU>
Message-ID: <14861.14541.155986.608067@taxus.athen1.ga.home.com>

Hi Jeff!

> I'm getting ready to make a 0.90-d04 release of Biopython.  

Great! Do you need any help with anything (besides the points below,
of course :-)? Also, what is the deadline for rolling this? I just
wanted to write some more docs (PubMed and SwissProt are next on the
list, I think). I'm not going to hold you up on this, but if I can get 
them done before the deadline, I'll try to do it.

Also, can we name it '0.90d04' and not '0.90-d04' (ie. no dash in
there). When I was playing with making rpms, rpm was complaining about 
the dash in the name.

> - Brad, I've checked in your BlastErrorParser.  I'm saving the results as
> a string instead of a StringHandle.  Could you please look this over and
> let me know if this is working and acceptable to you?  Thanks.

This looks great, Jeff, thanks much for checking it in! I definitely
agree that using the results directly instead of making them into a
StringHandle is much cleaner looking. Thanks again for all of your
suggestions on this.
 
Jeff:
> - The test_gobase regression tests are failing.  The output from the
> test_gobase.py file doesn't match the golden output.  Cayte, could you
> look into this? 

Cayte:
>  I ran test_gobase.py and then ran a diff between the output and 
> the file, test_gobase in output.  The diff didn't show any
> differences.

test_gobase in the regression test also fails for me (although just
running test_gobase.py works fine). My output/test_gobase file (which
should be exactly what is in CVS) looks like:

testing G405967.htm

And that's it, which explains why the regression test fails for
me. Cayte, perhaps you have a more recent copy of output/test_gobase 
than what is in CVS?

> - The test_align regression tests are failing for me.  It complains that
> saxlib is missing.  Do we need to install xmllib for this?  I'm using
> Python 2.0.  I thought this came with a SAX api?

I think the saxlib errors are coming from Martel, which is not 2.0
friendly, yet. I attached a patch to Martel-0.3/Martel/Parser.py which 
should make Martel work with only the 2.0 libraries (ie. no need to
install the PyXML package). I believe this should also work with 1.5.2 
with PyXML 0.6.1 installed, but I haven't verified this.
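
For reference, the same pieces as they exist in the standard library of
present-day Python, as a sketch.  It uses the public name
xml.sax.SAXException instead of the private _exceptions module and
passes an empty dict to AttributesImpl; the TagCollector class is
hypothetical and only there to exercise the empty attribute set.

```python
from xml.sax import SAXException, handler, xmlreader

# An attribute set that is always empty, built from the standard class.
EMPTY_ATTRS = xmlreader.AttributesImpl({})


class StateTableException(SAXException):
    """Used when a parse cannot be done."""


class TagCollector(handler.ContentHandler):
    """Record the element names passed to startElement."""

    def __init__(self):
        handler.ContentHandler.__init__(self)
        self.names = []

    def startElement(self, name, attrs):
        self.names.append((name, attrs.getLength()))


collector = TagCollector()
collector.startElement("entry", EMPTY_ATTRS)
```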

If this patch doesn't fix anything and you still get errors from
test_align, could you send me your trace? I definitely want to fix
any problems with it!

> - Brad, is your new similarity matrix code ready to check in?

Well, actually this is all Iddo's code (substitution matrices) --
we've just been talking back and forth about things and working on it
together. But, it is ready to go in. It is sitting in my local copy
working without a problem -- it also has working tests and
documentation (already in Tutorial.tex).

Do you want this to go in the next release? I think the code is
good to go (it gets the Brad-seal-of-approval :-), but it is up to
you. Just give me the word and I can check it in.

Thanks again for getting this together!

Brad


-------------- next part --------------
*** Parser.py.orig	Mon Oct  9 06:41:10 2000
--- Parser.py	Thu Oct 12 20:16:35 2000
***************
*** 30,36 ****
  """
  
  import urllib, pprint
! from xml.sax import saxlib
  import TextTools
  
  try:
--- 30,38 ----
  """
  
  import urllib, pprint
! from xml.sax import xmlreader
! from xml.sax import _exceptions
! from xml.sax import handler
  import TextTools
  
  try:
***************
*** 55,61 ****
  # The SAX startElements take an AttributeList as the second argument.
  # Martel's attributes are always empty, so make a simple class which
  # doesn't do anything and which I can guarantee won't be modified.
! class MartelAttributeList(saxlib.AttributeList):
      def getLength(self):
          return 0
      def getName(self, i):
--- 57,63 ----
  # The SAX startElements take an AttributeList as the second argument.
  # Martel's attributes are always empty, so make a simple class which
  # doesn't do anything and which I can guarantee won't be modified.
! class MartelAttributeList(xmlreader.AttributesImpl):
      def getLength(self):
          return 0
      def getName(self, i):
***************
*** 83,89 ****
          return alternative
  
  # singleton object shared amoung all startElement calls
! _attribute_list = MartelAttributeList()
  
  
  def _do_callback(s, begin, end, taglist, doc_handler):
--- 85,91 ----
          return alternative
  
  # singleton object shared amoung all startElement calls
! _attribute_list = MartelAttributeList([])
  
  
  def _do_callback(s, begin, end, taglist, doc_handler):
***************
*** 128,134 ****
          doc_handler.characters(s, begin, end-begin)
  
  # These exceptions are liable to change in the future
! class StateTableException(saxlib.SAXException):
      """used when a parse cannot be done"""
      pass
  
--- 130,136 ----
          doc_handler.characters(s, begin, end-begin)
  
  # These exceptions are liable to change in the future
! class StateTableException(_exceptions.SAXException):
      """used when a parse cannot be done"""
      pass
  
***************
*** 156,162 ****
  
      # Special case text for the base DocumentHandler since I know that
      # object does nothing and I want to test the method call overhead.
!     if doc_handler.__class__ != saxlib.DocumentHandler:
          # Send any tags to the client (there can be some even if there
          _do_callback(s, 0, pos, taglist, doc_handler)
  
--- 158,164 ----
  
      # Special case text for the base DocumentHandler since I know that
      # object does nothing and I want to test the method call overhead.
!     if doc_handler.__class__ != handler.ContentHandler:
          # Send any tags to the client (there can be some even if there
          _do_callback(s, 0, pos, taglist, doc_handler)
  
***************
*** 168,178 ****
          return None
  
  # This needs an interface like the standard XML parser
! class Parser(saxlib.Parser):
      """Parse the input data all in memory"""
  
      def __init__(self, tagtable, want_groupref_names = 0):
!         saxlib.Parser.__init__(self)
  
          assert type(tagtable) == type( () ), "mxTextTools only allows a tuple tagtable"
          self.tagtable = tagtable
--- 170,180 ----
          return None
  
  # This needs an interface like the standard XML parser
! class Parser(xmlreader.XMLReader):
      """Parse the input data all in memory"""
  
      def __init__(self, tagtable, want_groupref_names = 0):
!         xmlreader.XMLReader.__init__(self)
  
          assert type(tagtable) == type( () ), "mxTextTools only allows a tuple tagtable"
          self.tagtable = tagtable
***************
*** 206,239 ****
          XXX will be removed with the switch to Python 2.0, where parse()
          takes an 'InputSource'
          """
!         self.doc_handler.startDocument()
  
          if self.want_groupref_names:
              _match_group.clear()
  
          # parse the text and send the SAX events
!         result = _parse_elements(s, self.tagtable, self.doc_handler)
  
          if result is None:
              # Successful parse
!             self.doc_handler.endDocument()
!             return
  
!         elif isinstance(result, saxlib.SAXException):
              # could not parse record, and wasn't EOF
!             self.err_handler.fatalError(result)
!             return
          
          else:
              # Reached EOF
              pos = result
!             self.err_handler.fatalError(StateTableEOFException(pos))
!             return
  
      def close(self):
          pass
  
! class RecordParser(saxlib.Parser):
      """Parse the input data a record at a time"""
      def __init__(self, format_name, record_tagtable, want_groupref_names,
                   make_reader, reader_args = ()):
--- 208,241 ----
          XXX will be removed with the switch to Python 2.0, where parse()
          takes an 'InputSource'
          """
!         self._cont_handler.startDocument()
  
          if self.want_groupref_names:
              _match_group.clear()
  
          # parse the text and send the SAX events
!         result = _parse_elements(s, self.tagtable, self._cont_handler)
  
          if result is None:
              # Successful parse
!             pass
  
!         elif isinstance(result, _exceptions.SAXException):
              # could not parse record, and wasn't EOF
!             self._err_handler.fatalError(result)
          
          else:
              # Reached EOF
              pos = result
!             self._err_handler.fatalError(StateTableEOFException(pos))
! 
!         # send an endDocument event even after errors
!         self._cont_handler.endDocument()
  
      def close(self):
          pass
  
! class RecordParser(xmlreader.XMLReader):
      """Parse the input data a record at a time"""
      def __init__(self, format_name, record_tagtable, want_groupref_names,
                   make_reader, reader_args = ()):
***************
*** 249,255 ****
          reader_args - optional arguments to pass to make_reader after the
                input file object
          """
!         saxlib.Parser.__init__(self)
          
          self.format_name = format_name
          assert type(record_tagtable) == type( () ), \
--- 251,257 ----
          reader_args - optional arguments to pass to make_reader after the
                input file object
          """
!         xmlreader.XMLReader.__init__(self)
          
          self.format_name = format_name
          assert type(record_tagtable) == type( () ), \
***************
*** 272,305 ****
          """
          reader = apply(self.make_reader, (fileobj,) + self.reader_args)
  
!         self.doc_handler.startDocument()
          
          if self.want_groupref_names:
              _match_group.clear()
          
!         self.doc_handler.startElement(self.format_name, _attribute_list)
          filepos = 0  # XXX can get mixed up with DOS style "\r\n"
          while 1:
              record = reader.next()  # XXX what if an exception is raised?
              if record is None:
                  break
!             result = _parse_elements(record, self.tagtable, self.doc_handler)
              if result is None:
                  # Successfully read the record
                  continue
!             elif isinstance(result, saxlib.SAXException):
                  # Wrong format
!                 self.err_handler.fatalError(result)
                  return
              else:
                  # did not reach end of string
                  pos = filepos + result
!                 self.err_handler.fatalError(StateTableEOFException(pos))
  
              filepos = filepos + len(record)
  
!         self.doc_handler.endElement(self.format_name)
!         self.doc_handler.endDocument()
  
      def parse(self, systemId):
          """parse using the URL"""
--- 274,307 ----
          """
          reader = apply(self.make_reader, (fileobj,) + self.reader_args)
  
!         self._cont_handler.startDocument()
          
          if self.want_groupref_names:
              _match_group.clear()
          
!         self._cont_handler.startElement(self.format_name, _attribute_list)
          filepos = 0  # XXX can get mixed up with DOS style "\r\n"
          while 1:
              record = reader.next()  # XXX what if an exception is raised?
              if record is None:
                  break
!             result = _parse_elements(record, self.tagtable, self._cont_handler)
              if result is None:
                  # Successfully read the record
                  continue
!             elif isinstance(result, _exceptions.SAXException):
                  # Wrong format
!                 self._err_handler.fatalError(result)
                  return
              else:
                  # did not reach end of string
                  pos = filepos + result
!                 self._err_handler.fatalError(StateTableEOFException(pos))
  
              filepos = filepos + len(record)
  
!         self._cont_handler.endElement(self.format_name)
!         self._cont_handler.endDocument()
  
      def parse(self, systemId):
          """parse using the URL"""
From chapmanb at arches.uga.edu  Sat Nov 11 12:49:10 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] SwissProt parser
Message-ID: <14861.34454.85255.758400@taxus.athen1.ga.home.com>

Hello all;
      I was writing docs for SwissProt, and noticed the parser
breaking with some of the sequences I was playing around with. I don't 
normally use SwissProt, so I have no idea if these entries are
representative of the format or anything, but the following entries
gave me problems: '023729', '023730', '023731' (some nice Chalcone
synthases). 

The problem was that there is a reference to the NCBI taxonomy id in
the entries that the parser wasn't looking for. It occurs right after
the organism info and looks like:

OX   NCBI_TaxID=41205;

Anyways, I modified the parser so that it would accept this, and added 
the possible information to the sequence class. It seems to work okay
with the entries I mentioned, and still passes the regression
tests. The patch for this is attached.
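
A minimal sketch of the OX handling described above, written as a
standalone function rather than as the consumer method in the attached
patch. The name parse_ox_line is hypothetical, and the code assumes the
line looks exactly like the example above (two-letter line code, data
starting at column 5, a single '=' and a trailing ';'):

```python
def parse_ox_line(line):
    """Return the taxonomy id from a line such as 'OX   NCBI_TaxID=41205;'.

    Hypothetical helper for illustration; not part of SProt.py.
    """
    if not line.startswith("OX"):
        raise ValueError("not an OX line: %r" % line)
    # The first five columns hold the line code and padding.
    data = line[5:].strip()
    # Drop the ';' that terminates the field, if present.
    if data.endswith(";"):
        data = data[:-1]
    # Split 'NCBI_TaxID=41205' into the label and the id itself.
    descr, tax_id = data.split("=", 1)
    return tax_id.strip()


print(parse_ox_line("OX   NCBI_TaxID=41205;\n"))  # prints 41205
```

The id is kept as a string, matching the default of '' that the patch
gives taxonomy_id.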

Please let me know if there are any problems with the patch or
anything. Thanks!

Brad

-------------- next part --------------
*** SProt.py.orig	Sun Jul 16 19:18:57 2000
--- SProt.py	Sat Nov 11 12:30:16 2000
***************
*** 61,66 ****
--- 61,67 ----
      organelle         The origin of the sequence.
      organism_classification  The taxonomy classification.  List of strings.
                               (http://www.ncbi.nlm.nih.gov/Taxonomy/)
+     taxonomy_id       NCBI taxonomy id
      references        List of Reference objects.
      comments          List of strings.
      cross_references  List of tuples (db, id1[, id2][, id3]).  See the docs.
***************
*** 89,94 ****
--- 90,96 ----
          self.organism = ''
          self.organelle = ''
          self.organism_classification = []
+         self.taxonomy_id = ''
          self.references = []
          self.comments = []
          self.cross_references = []
***************
*** 391,396 ****
--- 393,402 ----
          self._scan_line('OC', uhandle, consumer.organism_classification,
                          one_or_more=1)
  
+     def _scan_ox(self, uhandle, consumer):
+         self._scan_line('OX', uhandle, consumer.taxonomy_id,
+                         one_or_more=1)
+ 
      def _scan_reference(self, uhandle, consumer):
          while 1:
              if safe_peekline(uhandle)[:2] != 'RN':
***************
*** 462,467 ****
--- 468,474 ----
          _scan_os,
          _scan_og,
          _scan_oc,
+         _scan_ox,
          _scan_reference,
          _scan_cc,
          _scan_dr,
***************
*** 540,545 ****
--- 547,557 ----
          cols = string.split(line, ';')
          for col in cols:
              self.data.organism_classification.append(string.lstrip(col))
+ 
+     def taxonomy_id(self, line):
+         line = self._chomp(string.rstrip(line[5:]))
+         descr, tax_id = string.split(line, '=')
+         self.data.taxonomy_id = tax_id
      
      def reference_number(self, line):
          rn = string.rstrip(line[5:])
From jchang at SMI.Stanford.EDU  Sat Nov 11 15:34:58 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] 0.90-d04 coming soon...
In-Reply-To: <14861.14541.155986.608067@taxus.athen1.ga.home.com>
Message-ID: <Pine.GSO.4.21.0011111234030.28234-100000@taiyang>

> Cayte:
> >  I ran test_gobase.py and then ran a diff between the output and 
> > the file, test_gobase in output.  The diff didn't show any
> > differences.
> 
> test_gobase in the regression test also fails for me (although just
> running test_gobase.py works fine). My output/test_gobase file (which
> should be exactly what is in CVS) looks like:
> 
> testing G405967.htm
> 
> And that's it, which explains why the regression test fails for
> me. Cayte, perhaps you have a more recent copy of output/test_gobase
> than what is in CVS?

Yep, that's what's happening to me as well.  The output/test_gobase file
contains only that single line, but running test_gobase.py generates 65
lines of output.  It looks like output/test_gobase isn't up to date.
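
To make the failure mode concrete, here is a rough sketch of comparing
a stored output file against fresh test output. It is not how
br_regrtest.py does it; compare_output is a hypothetical helper, and
the extra line in the example is a placeholder, not real gobase output:

```python
import difflib


def compare_output(expected_lines, actual_lines):
    """Return unified-diff lines; an empty list means the outputs match."""
    return list(difflib.unified_diff(
        expected_lines, actual_lines,
        fromfile="output/test_gobase", tofile="actual", lineterm=""))


# A stale one-line baseline versus longer fresh output (placeholder text).
expected = ["testing G405967.htm"]
actual = ["testing G405967.htm", "placeholder for additional output"]

for line in compare_output(expected, actual):
    print(line)
```

Any non-empty diff counts as a failure, so a baseline that was checked
in before the test grew more output will fail until it is regenerated.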

Jeff


From jchang at SMI.Stanford.EDU  Sat Nov 11 16:04:36 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] SwissProt parser
In-Reply-To: <14861.34454.85255.758400@taxus.athen1.ga.home.com>
Message-ID: <Pine.GSO.4.21.0011111302090.28234-100000@taiyang>

Heh.  It looks like ExPASy snuck in a new tag for us.  Thanks for the
update and patch!  It's now checked in.

Jeff


On Sat, 11 Nov 2000, Brad Chapman wrote:

> Hello all;
>       I was writing docs for SwissProt, and noticed the parser
> breaking with some of the sequences I was playing around with. I don't 
> normally use SwissProt, so I have no idea if these entries are
> representative of the format or anything, but the following entries
> gave me problems: '023729', '023730', '023731' (some nice Chalcone
> synthases). 
> 
> The problem was that there is a reference to the NCBI taxonomy id in
> the entries that the parser wasn't looking for. It occurs right after
> the organism info and looks like:
> 
> OX   NCBI_TaxID=41205;
> 
> Anyways, I modified the parser so that it would accept this, and added 
> the possible information to the sequence class. It seems to work okay
> with the entries I mentioned, and still passes the regression
> tests. The patch for this is attached.
> 
> Please let me know if there are any problems with the patch or
> anything. Thanks!
> 
> Brad
> 
> 


From katel at worldpath.net  Sun Nov 12 01:16:37 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] 0.90-d04 coming soon...
References: <Pine.GSO.4.21.0011101634120.17431-100000@riboweb.Stanford.EDU>
Message-ID: <001f01c04c70$6cc26060$010a0a0a@cadence.com>

> - The test_prodoc regression tests are failing.  This is mostly my fault,
> as the previous version of Prodoc didn't allow copyrights at the end of
> records.  However, this has been fixed.  Cayte, do you mind having another
> go at the tests, and checking in the verified output?
>
  One of the test files has a leading linefeed, 10 decimal,  that messes up
the start tag.  I need to dig up a hex editor to remove it.  For the future,
maybe Prodoc.py should strip white space before the first tag.

                    Cayte


From dalke at acm.org  Sun Nov 12 04:56:21 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] 0.90-d04 coming soon...
Message-ID: <009b01c04c8e$d13bd440$43ac323f@josiah>

Brad:
>I think the saxlib errors are coming from Martel, which is not 2.0
>friendly, yet. I attached a patch to Martel-0.3/Martel/Parser.py which
>should make Martel work with only the 2.0 libraries (ie. no need to
>install the PyXML package). I believe this should also work with 1.5.2
>with PyXML 0.6.1 installed, but I haven't verified this.

There are saxlib errors with Martel-0.3 when using Python 2.0.  Several
things changed between the old PyXML package and the new builtin module.
They include:
  o  switch from SAX 1.0 to 2.0 support
    - different methods (eg, 'setContentHandler' instead of
         'setDocumentHandler')
    - different method arguments (eg, 'characters(content)' instead of
         'characters(text, start, size)' )
  o  removal or renaming of several classes
    - DocumentHandler -> ContentHandler
    - no XML Canonicalization class
    - no 'BaseHandler'
    - no ErrorRaiser class - functionality merged into ErrorHandler
          and ErrorHandler now needs its __init__ to be called.
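
A small sketch of the SAX 2.0 style named in the list above, using
only the standard library xml.sax package: the handler derives from
ContentHandler, is registered with setContentHandler, and receives
characters(content) with a single argument. The TextCollector class
and the sample document are made up for illustration and are written
for current Python rather than 2.0:

```python
import io
import xml.sax
from xml.sax import handler


class TextCollector(handler.ContentHandler):
    """Record element names and character data as they are reported."""

    def __init__(self):
        handler.ContentHandler.__init__(self)
        self.names = []
        self.chunks = []

    def startElement(self, name, attrs):
        self.names.append(name)

    def characters(self, content):
        # SAX 2.0: one argument holding the text itself,
        # not (text, start, size) as in SAX 1.0.
        self.chunks.append(content)


collector = TextCollector()
parser = xml.sax.make_parser()
parser.setContentHandler(collector)  # was setDocumentHandler in SAX 1.0
parser.parse(io.BytesIO(b"<record><id>P12345</id></record>"))

print(collector.names)
print("".join(collector.chunks))
```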

Brad's patch doesn't catch all of the problems.  This evening I finally
switched all my code over to use Python 2.0 - at least enough that my
regression tests work :)

These changes should probably be included in the upcoming version.  However,
they are *not* backwards compatible either to the Martel 0.3 API or to
Python 1.5.2.  How does that affect the 0.90-d04 release?  How does a
dependency on 2.0 affect a 1.0 release?

(Actually, I should say it's dependent on the PyXML package and not 1.5.2
per se.  It's still tricky because of the API changes between SAX 1.0 and
SAX 2.0 and because I've started using Python 2.0 syntax, like "import
X as Y".)

I've also finished off the iterator support Brad wanted, excepting for
some documentation.  It works, but it's built on top of the callback
method so will always be slower than the SAX-like interface - until
someone spends the time needed to rewrite the code to talk to mxTextTools
directly.
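
To illustrate what "built on top of the callback method" can look
like, here is a sketch of an iterator layered over SAX callbacks with
the standard library parser. It is not Martel's make_iterator; the
names RecordCollector and iter_records and the sample input are
assumptions for the example:

```python
import xml.sax
from xml.sax import handler


class RecordCollector(handler.ContentHandler):
    """Collect the text of each completed record element."""

    def __init__(self, record_tag):
        handler.ContentHandler.__init__(self)
        self.record_tag = record_tag
        self.records = []
        self._text = None

    def startElement(self, name, attrs):
        if name == self.record_tag:
            self._text = []

    def characters(self, content):
        if self._text is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == self.record_tag:
            self.records.append("".join(self._text))
            self._text = None


def iter_records(chunks, record_tag):
    """Feed input chunks to the parser and yield records as they finish."""
    collector = RecordCollector(record_tag)
    parser = xml.sax.make_parser()
    parser.setContentHandler(collector)
    for chunk in chunks:
        parser.feed(chunk)
        while collector.records:
            yield collector.records.pop(0)
    parser.close()
    # Yield anything the parser only reported when it was closed.
    while collector.records:
        yield collector.records.pop(0)


for record in iter_records([b"<db><rec>one</rec><re", b"c>two</rec></db>"], "rec"):
    print(record)
```

Every record still passes through the callbacks and an intermediate
list before the caller sees it, which is the extra cost compared with
handling the events directly.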

Here's my to-do list for Martel, not all of which will be done for a
hypothetical 1.0:

   o resolve the newline issue

   o interface for version detection
       - only need to read part of a file to determine the format/version
       - support categories?  (Eg, "a PDB format" or "a sequence format")

   o cache tag tables for faster parser creation

   o attribute lists and XML namespaces
       - could be useful for version labels (eg, <swissprot version="38">
              instead of <swissprot38>
       - how to store in a regular expression pattern string
       - I just don't know enough about namespaces to know if I'm doing
              this one correctly.  Any offers to help?

    o better debugging support
       - somehow identify the lastmost character attempted to parse
            (perhaps with a specialized tag table?  Or modify mxTextTools?)
       - SAX Locator support

    o more formats, examples, testing, documentation, etc.


However, I think the core API is now stable, which means it should be
stable enough for people to start writing parsers based off of it
and not have things change from underneath.

So Jeff, how would you like things to be scheduled?

                    Andrew


From dalke at acm.org  Mon Nov 13 08:59:26 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] Martel-0.35
Message-ID: <000a01c04d79$f0b43f60$b2ab323f@josiah>

Okay, over the weekend I ended up not doing the work I was paid for
and instead worked on Martel.  It's available as usual from
http://www.biopython.org/~dalke/Martel .  This is version 0.35.
Here's the change log from the README:

  Migrated to Python 2.0 and its xml package.  No longer runs under
  older (1.x) Pythons.

  Added more RecordReaders (Until, CountLines, Nothing, Everything).

  Changed the RecordReader protocol to seed the line buffer (in the
  constructor) and to get the final state for the input file and line
  buffer (using remainder()).  Needed to allow chaining of different
  reader types as with headers and footers.

  Added a HeaderFooter Parser for formats like Prosite and PIR which
  have a header and/or a footer with records in between.

  Renamed the StateTable exception to Parser exceptions and removed the
  EOF exception.

  Experimental Iterator support ("make_iterator") as an alternate for
  the pure SAX callback method.

  Improved error reporting.  make_parser and make_iterator takes an
  optional "debug_level".  Better error location is available with
  debug_level == 1 and if it == 2, print current match information to
  stdout.  Warning: debug_level == 1 is about 11 times slower than
  debug_level == 0, which is why it is off by default.

  Support for both the 1.1 and 1.2 mxTextTools.


For people like Brad who are learning how to use Martel, try
"expression.make_parser(debug_level = 1)" or debug_level = 2.  That
really helps pin down where an error is likely located.

BTW, I started a Prosite parser.  The documentation isn't all that
helpful, and I've already found a few errors.  For example,
in the prosite 39 release, PTS_EIIA_2 has a 5 digit date!  (Interestingly,
the online version from expasy.ch has only 4 digits but the INFO
UPDATE is from 1995.)

                    Andrew
                    dalke@acm.org


From johann at egenetics.com  Mon Nov 13 09:44:05 2000
From: johann at egenetics.com (Johann Visagie)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] Martel-0.35
In-Reply-To: <000a01c04d79$f0b43f60$b2ab323f@josiah>; from dalke@acm.org on Mon, Nov 13, 2000 at 06:59:26AM -0700
References: <000a01c04d79$f0b43f60$b2ab323f@josiah>
Message-ID: <20001113164405.A41426@fling.sanbi.ac.za>

Andrew Dalke on 2000-11-13 (Mon) at 06:59:26 -0700:
> 
> Here's the change log from the README:
> 
>   Migrated to Python 2.0 and its xml package.

Just to make extra sure I understand:  Does that mean Martel now only uses
the xml package as installed as part of Python 2.0's standard libraries, and
not the "extended" xml package as installed by PyXML 0.6.1 (a.k.a _xmlplus)?

-- Johann

From dalke at acm.org  Mon Nov 13 15:16:29 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] Martel-0.35
Message-ID: <00a501c04daf$f4938ce0$b2ab323f@josiah>

Johann Visagie <johann@egenetics.com>:
>Just to make extra sure I understand:  Does that mean Martel now only uses
>the xml package as installed as part of Python 2.0's standard libraries,
>and not the "extended" xml package as installed by PyXML 0.6.1 (a.k.a
>_xmlplus)?

That is correct.  There are no dependencies on PyXML.  The core Martel
code uses all stock Python 2.0.  I figured that was a good thing.

                    Andrew
                    dalke@acm.org


From chapmanb at arches.uga.edu  Tue Nov 14 00:18:58 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] Martel-0.35
In-Reply-To: <000a01c04d79$f0b43f60$b2ab323f@josiah>
References: <000a01c04d79$f0b43f60$b2ab323f@josiah>
Message-ID: <14864.52034.569146.154418@taxus.athen1.ga.home.com>

Andrew:
> Okay, over the weekend I ended up not doing the work I was paid for
> and instead worked on Martel.  

Join the club :-). Seriously, thanks for this -- the new version looks 
great!

>   Migrated to Python 2.0 and its xml package.  No longer runs under
>   older (1.x) Pythons.

Thanks for catching all of the changes I missed for 2.0 support. 
This new version flushed out some errors I made in my Clustalw parser 
(changes are committed to CVS).

>   Experimental Iterator support ("make_iterator") as an alternate for
>   the pure SAX callback method.

I had a chance to play with this a little, and seem to be grokking
things a lot better. I modified my Martel based Fasta.py parser to
use an iterator, so it now acts a little more like the biopython 
Fasta parser and only reads one record if a file is passed to it.

Looks nice, although I definitely need to play with it a lot more.

> For people like Brad who are learning how to use Martel, try
> "expression.make_parser(debug_level = 1)" or debug_level = 2.  That
> really helps pin down where an error is likely located.

This is a really nice feature. Thanks, this'll be a big help.

BTW, I took a minute to distutilize Martel (takes about as long as
copying everything to site-packages :-), which I guess we'll need
to do anyways to include it in the next release. I put everything into 
a Martel top level package, and install it like that. Anyways, do you
want this? 

Thanks again for the new release.

Brad


From katel at worldpath.net  Tue Nov 14 04:47:54 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] 0.90-d04 coming soon...
References: <Pine.GSO.4.21.0011111234030.28234-100000@taiyang>
Message-ID: <001d01c04e20$05ae4180$010a0a0a@cadence.com>

  Further investigation of prodoc showed that it choked on TRAILING
whitespace.  The parser read the first record ok.  pdoc00472.txt had some
white space that caused the parser to look for another record.  IMHO, white
space between records should be ignored.
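
  A rough sketch of what I mean, not the Prodoc.py code.  The
split_records name is made up, and the '{END}' terminator line is
assumed here for illustration:

```python
def split_records(text, terminator="{END}"):
    """Split text into records, ignoring blank lines between them."""
    records = []
    current = []
    for line in text.splitlines():
        if not current and not line.strip():
            # Whitespace before the first record, between records,
            # or after the last one: skip it.
            continue
        current.append(line)
        if line.strip() == terminator:
            records.append("\n".join(current))
            current = []
    if current:
        raise ValueError("unterminated record")
    return records


sample = "\n{PDOC00001}\nsome text\n{END}\n   \n\n"
print(len(split_records(sample)))  # prints 1
```

  Whitespace-only lines are dropped only when no record is open, so
blank lines inside a record are kept.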

  I have some cut and paste errors to fix in gobase.py.  Since they are in
the comments they don't cause a failure but I don't want it to be too
obvious that it's a part-time effort. :)

                           Cayte


From jchang at SMI.Stanford.EDU  Tue Nov 14 20:15:40 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] 0.90-d04 coming soon...
In-Reply-To: <001d01c04e20$05ae4180$010a0a0a@cadence.com>
Message-ID: <Pine.GSO.4.21.0011141715230.25475-100000@riboweb.Stanford.EDU>

>   Further investigation of prodoc showed that it choked on TRAILING
> whitespace.  The parser read the first record ok.  pdoc00472.txt had some
> white space that caused the parser to look for another record.  IMHO, white
> space between records should be ignored.

Agreed.  I'll take a look at this soon.

Thanks,
Jeff


From jchang at SMI.Stanford.EDU  Tue Nov 14 20:36:29 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] 0.90-d04 coming soon...
In-Reply-To: <14861.14541.155986.608067@taxus.athen1.ga.home.com>
Message-ID: <Pine.GSO.4.21.0011141734160.25475-100000@riboweb.Stanford.EDU>

> Great! Do you need any help with anything (besides the points below,
> of course :-)? Also, what is the deadline for rolling this?

No real deadline, except for *very soon now*.  I think I've got things
handled for now, but I remember that you promised me to look into rpm's
and windows binaries when the source release is made!  :)

[alignment/substitution code]
> Do you want this to go in the next release? I think the code is
> good to go (it gets the Brad-seal-of-approval :-), but it is up to
> you. Just give me the word and I can check it in.

Yes, if Iddo agrees as well.  Please let me know if it's going in!

Thanks,
Jeff


From chapmanb at arches.uga.edu  Wed Nov 15 14:27:19 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] 0.90-d04 coming soon...
In-Reply-To: <Pine.GSO.4.21.0011141734160.25475-100000@riboweb.Stanford.EDU>
References: <14861.14541.155986.608067@taxus.athen1.ga.home.com>
	<Pine.GSO.4.21.0011141734160.25475-100000@riboweb.Stanford.EDU>
Message-ID: <14866.58263.897534.73422@taxus.athen1.ga.home.com>

Jeff:
>  I remember that you promised me to look into rpm's
> and windows binaries when the source release is made!  :)

Darn, I thought you forgot :-). Seriously, I looked at the
documentation and tried to learn a little about rpms, and it appears
as if you can make rpms using distutils as easily as:

python setup.py bdist_rpm

As far as I can tell (using rpm -qpl the.rpm), the rpm appears to be
complete and in good order.

So, I should have no problem making rpms for linuxppc (which is the
only linux system I have access to) -- hopefully we can get people to
volunteer for other systems as long as we can provide the simple
instructions for them. Maybe we can ask about this on the main list
once the new distribution is out.

Windows will take me a little longer -- there are no docs in
distutils, and I still need to learn myself some python on Windows. I
will work on it though :-)

[should SubsMat go in?]
> Yes, if Iddo agrees as well.  Please let me know if it's going in!

Okee dokee, I just put it in, along with tests and an update on
setup.py. Please let me know if any of the tests fail or if it gives
you any problems.

I'm ccing this to Iddo (not sure if he listens in on the dev list) but 
hopefully he can make a post about it on the main list and announce
that it is in there for people to play with.

Enjoy!

Brad


From idoerg at cc.huji.ac.il  Thu Nov 16 05:29:33 2000
From: idoerg at cc.huji.ac.il (Iddo Friedberg)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] 0.90-d04 coming soon...
In-Reply-To: <14866.58263.897534.73422@taxus.athen1.ga.home.com>
Message-ID: <Pine.GSO.4.30_heb2.09.0011161214430.23191-100000@new-shum>

Hi Brad & Jeff,

OK, here's the announcement. I think it should be cut & pasted to the
general announcement about 0.90-d04. Feel free to make changes in order to
accommodate documentation pointers, or anything else. (The Align module
accepted replacement matrix generator?) If I need to make
a more elaborate announcement, let me know.

---------------------------- CUT HERE ----------------------------------
SubsMat: a module for generating substitution matrices from user data.
Documentation is available on
http://biopython.org/wiki/html/BioPython/SubsMat.html Accepted replacement
matrices (the initial input for a substitution matrix) may be generated
using the Align module. XXX documentation pointer? XXX

FreqTable: a module for generating alphabet (amino-acid/nucleotide)
frequency tables from user data. Documentation is available on:
http://biopython.org/wiki/html/BioPython/FreqTable.html


----------------------------- END --------------------------------------


On Wed, 15 Nov 2000, Brad Chapman wrote:

:
: [should SubsMat go in?]
: > Yes, if Iddo agrees as well.  Please let me know if it's going in!
:
: Okee dokee, I just put it in, along with tests and an update on
: setup.py. Please let me know if any of the tests fail or if it gives
: you any problems.
:
: I'm ccing this to Iddo (not sure if he listens in on the dev list) but
: hopefully he can make a post about it on the main list and announce
: that it is in there for people to play with.
:
: Enjoy!
:
: Brad
:
:


--

/* --- */main(c){float t,x,y,b=-2,a=b;for(;b-=a>2?.1/(a=-2):0,b<2;
/*  |  */putchar(30+c),a+=.0503) for(x=y=c=0;++c<90&x*x+y*y<4;y=2*
/*  |  */x*y+b,x=t)t=x*x-y*y+a;}
/* --- ddo Friedberg */


From jchang at SMI.Stanford.EDU  Sat Nov 18 02:05:03 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] next release closer (?)
Message-ID: <Pine.GSO.4.21.0011172300380.28146-100000@taiyang>

- with Martel 0.35, the alignment stuff now works

- I've fixed the Prosite and Prodoc parsers so that they now ignore
whitespace.  Cayte, do you mind having another look at the
test_prodoc regression test?  Please verify the results and check in the
output file.  There's currently no output/test_prodoc

- gobase is still failing the regression test.  The output/test_gobase
only contains one line, and the regression tests are generating more than 
that.

- I don't remember if I addressed it before, but yes, Brad, we can drop
the dash.  The release will be called 0.90d04.  :)

Once these are fixed, we can go ahead with the release.

Jeff


From katel at worldpath.net  Sat Nov 18 18:11:28 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] next release closer (?)
References: <Pine.GSO.4.21.0011172300380.28146-100000@taiyang>
Message-ID: <000701c051b4$e2e51140$010a0a0a@cadence.com>

----- Original Message -----
From: "Jeffrey Chang" <jchang@SMI.Stanford.EDU>
To: <biopython-dev@biopython.org>
Sent: Friday, November 17, 2000 11:05 PM
Subject: [Biopython-dev] next release closer (?)


> - with Martel 0.35, the alignment stuff now works
>
> - I've fixed the Prosite and Prodoc parsers so that they now ignore
> whitespace.  Cayte, do you mind having another look at the
> test_prodoc regression test?  Please verify the results and check in the
> output file.  There's currently no output/test_prodoc
>
> - gobase is still failing the regression test.  The output/test_gobase
> only contains one line, and the regression tests are generating more than
> that.
>
   Should we change the baseline?  The extra text contains information that
tells whether gobase is providing the information it promised.

                                          Cayte


From katel at worldpath.net  Sat Nov 18 20:10:54 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] next release closer (?)
References: <Pine.GSO.4.21.0011172300380.28146-100000@taiyang>
Message-ID: <001a01c051c5$91744fe0$010a0a0a@cadence.com>

----- Original Message -----
From: "Jeffrey Chang" <jchang@SMI.Stanford.EDU>
To: <biopython-dev@biopython.org>
Sent: Friday, November 17, 2000 11:05 PM
Subject: [Biopython-dev] next release closer (?)


> - with Martel 0.35, the alignment stuff now works
>
> - I've fixed the Prosite and Prodoc parsers so that they now ignore
> whitespace.  Cayte, do you mind having another look at the
> test_prodoc regression test?  Please verify the results and check in the
> output file.  There's currently no output/test_prodoc
>
   Prodoc now passes the standalone test and I committed test_prodoc.  With
my upgrade to Python2, br_regrtest causes this output.

Traceback (most recent call last):
  File "br_regrtest.py", line 36, in ?
    test_support = __import__("test/test_support")
NameError: Case mismatch for module name test/test_support
(filename c:\python20\lib\test_support.py)

  It's puzzling because only lower case is used as far as I can see.  My
environment is:

TMP=c:\windows\TEMP
TEMP=C:\windows\TEMP
PROMPT=$p$g
winbootdir=C:\WINDOWS
COMSPEC=C:\WINDOWS\COMMAND.COM
PATH=C:\JDK1.2.2\BIN;JUNIT3.2;C:\PROGRA~1\CYGNUS~1\ECOS\TOOLS\BIN;C:\BC5\BIN;C:\CYGNUS\CYGWIN~1\H-I586~1\BIN;C:\PROGRA~1\TCL\BIN;C:\PERL\BIN;C:\PYTHON20;C:\WINDOWS;C:\WINDOWS;C:\WINDOWS\COMMAND;C:\PROGRA~1\NETWOR~1\MCAFEE~1;C:\PKWARE;C:\CVS
JAXPHOME=C:\Program Files\JavaSoft\Jaxp1_0-ea1
PYTHONPATH=.;C:\PYTHON20\LIB\;C:\PYTHON20\WXPYTHON\;C:\BIOPYT~1.90-;C:\TEXTTO~1;C:\PYXML-~1.1;C:\BIOPYT~1.90-\TESTS
VSL=C:\MODSOFT\VSL
CLASSPATH=C:\PROGRAM FILES\JAVASOFT\JAXP1_0-EA1\JAXP.JAR;C:\;C:\BIOJAVA;C:\JUNIT3.2\JUNIT.JAR;.
CVSROOT=cvs@cvs.biopython.org
windir=C:\WINDOWS
BLASTER=A240 I5 D1 T4
CMDLINE=python br_regrtest.py


                              Cayte


From chapmanb at arches.uga.edu  Sat Nov 18 17:10:48 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] next release closer (?)
In-Reply-To: <001a01c051c5$91744fe0$010a0a0a@cadence.com>
References: <Pine.GSO.4.21.0011172300380.28146-100000@taiyang>
	<001a01c051c5$91744fe0$010a0a0a@cadence.com>
Message-ID: <14870.65128.625316.93731@taxus.athen1.ga.home.com>

Cayte writes:
> With my upgrade to Python2, br_regrtest causes this output.
> 
> Traceback (most recent call last):
>   File "br_regrtest.py", line 36, in ?
>     test_support = __import__("test/test_support")
> NameError: Case mismatch for module name test/test_support
> (filename c:\python20\lib\test_support.py)
> 
>   It's puzzling because only lower case is used as far as I can see.  My
> environment is:
[windows]

I just noticed this problem, since I was messing around trying to
learn python on windows just this morning! I checked in a fix earlier 
today, so if you 'cvs update' you should get it.

I just changed the offending line to:

from test import test_support

I'm not sure if there are reasons not to do it this way, but it seemed 
to make sense to me. Hopefully Andrew will speak up if there is a good 
reason not to have it this way.
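
For comparison, a minimal sketch of the same distinction in current
Python 3 (not the Python 2.0 under discussion): module names are dotted,
never path-like, so a programmatic import takes a dotted name.  The
module xml.dom is only a stand-in picked because it ships with the
standard library; it has nothing to do with br_regrtest.

```python
import importlib

# A dotted name identifies a submodule; "xml/dom" is not a module name.
dom = importlib.import_module("xml.dom")

# The equivalent static form, analogous to "from test import test_support".
from xml import dom as dom_static

print(dom is dom_static)
```

Both forms return the same module object, so the static import loses
nothing unless the module name has to be computed at run time.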

Brad


From jchang at SMI.Stanford.EDU  Sun Nov 19 03:39:23 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] next release closer (?)
In-Reply-To: <000701c051b4$e2e51140$010a0a0a@cadence.com>
Message-ID: <Pine.GSO.4.21.0011190038080.29179-100000@taiyang>

> > - gobase is still failing the regression test.  The output/test_gobase
> > only contains one line, and the regression tests are generating more than
> > that.
> >
>    Should we change the baseline?  The extra text contains information that
> tells whether gobase is providing the information it promised.

The baseline contains only:
testing G405967.htm


It's pretty uninformative, and it must be incomplete.  Please check in the
hand verified output from the regression tests.

Thanks,
Jeff


From jchang at SMI.Stanford.EDU  Sun Nov 19 03:48:26 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] next release closer (?)
In-Reply-To: <001a01c051c5$91744fe0$010a0a0a@cadence.com>
Message-ID: <Pine.GSO.4.21.0011190044120.29179-100000@taiyang>

>    Prodoc now passes the standalone test and I committed test_prodoc.

I'm having a few problems with this suite of tests:

- br_regrtest saves the name of the regression test in the first line of
the output file.  For example, the first line of output/test_seq is
"test_seq".  This seems to be missing with test_prodoc.

- "python br_regrtest test_prodoc.py" fails because it can't find
"Prosite/Doc/pdoc00472.txt".  That file isn't in the CVS repository and
needs to be added.
test test_prodoc crashed -- exceptions.IOError : [Errno 2] No such file or directory: 'Prosite/Doc/pdoc00472.txt'
1 test failed: test_prodoc       

- The test_prodoc.py output contains the addresses of Reference objects.
references
    <Bio.Prosite.Prodoc.Reference instance at 007FEF2C>
    <Bio.Prosite.Prodoc.Reference instance at 007FD19C>
    <Bio.Prosite.Prodoc.Reference instance at 007FD12C>
This won't work, because the object address is going to be different from
computer to computer.  Instead of the pointer, please print out the
reference, or at least enough of the string to know that it's parsed
correctly.
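
One way to get address-free output could look like the sketch below.
The Reference class here is a hypothetical stand-in with assumed field
names (number, authors, citation), not the actual
Bio.Prosite.Prodoc.Reference, and the sample values are made up.

```python
class Reference:
    """Hypothetical stand-in for a parsed reference record."""

    def __init__(self, number, authors, citation):
        self.number = number
        self.authors = authors
        self.citation = citation

    def summary(self, width=40):
        # Print parsed fields, truncated, instead of the default repr,
        # which embeds a memory address that changes between runs.
        text = "%s|%s|%s" % (self.number, self.authors, self.citation)
        return text[:width]


ref = Reference("1", "Author A.", "Example citation")
print(ref.summary())
```

Because the summary is built only from parsed fields, the same input
file should give the same baseline text on any machine.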

Thanks,
Jeff


From dalke at acm.org  Sun Nov 19 04:21:02 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:54 2005
Subject: solving the newline problem (was Re: [Biopython-dev] Martel-0.3 available)
Message-ID: <001601c0520a$0adbc720$edab323f@josiah>

Let me restore context first.  The question was how to handle different
newline conventions, where native text files on the Mac use '\015', on
unix use '\012' and MS use '\015\012'.  This convention is hidden somewhat
behind the C file I/O layer.  In text mode it translates the local
newline convention to the single character '\n', which in ASCII is
'\012'.  In binary mode the input character stream is not modified.

Martel uses '\n' as the end of line character and converts it to chr(10).
This requires the input be in ASCII, which is a good assumption.  (I
don't expect to run Martel under an IBM 370 any time soon - that being
an EBCDIC machine :)

This means Martel should be able to run under an OS so long as the input
text data has been converted to use the local machine's line ending
convention and the file was opened in text mode, which is the default.
For example, ftps must be done in ASCII mode instead of binary.

Networks make things more complicated.  For example, an http transfer
behaves like the binary mode of ftp, meaning there is no way to negotiate
local newline conventions and automatically convert as needed.  Similarly,
files shared over NFS or SMB are not automatically converted.  (Samba
does have a flag to allow for automatic conversion, but I don't believe
it is used very often.)

On top of that, people are well known for being human - they are
inconsistent.  I had considered a wrapper which would read the first few
characters of a file to determine the newline convention and convert as
needed.  Some time ago Brad pointed out:
> There are times where people have generated files like this in my lab
> (the sequencer is running Windows, but they like to play around on
> the files on a Mac -- I still don't know how they got a mix of line
> breaks -- I think by cutting and pasting between files with different
> line breaks).

As another case, Roger Sayle pointed out to me yesterday that some of the
data files are made by concatenating other files.  For example, by merging
the gbpri* from GenBank into one file.  Suppose some of those files were
downloaded via FTP in ASCII mode and some in binary.  Then the newline
convention changes throughout the merged file.  Since this does happen,
it would be nice to handle this case gracefully.
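
A small sketch of what such a merged file looks like, written in
current Python 3 rather than the Python 2.0 of this thread.  The three
fragments are invented sample data; bytes.splitlines is a standard
method that recognises exactly the three endings discussed here.

```python
# Three fragments, each with a different newline convention, concatenated.
unix_part = b"LOCUS A\n//\n"
dos_part = b"LOCUS B\r\n//\r\n"
mac_part = b"LOCUS C\r//\r"
merged = unix_part + dos_part + mac_part

# bytes.splitlines recognises "\n", "\r\n" and "\r"; keepends=True
# preserves the original terminators.
lines = merged.splitlines(keepends=True)
for line in lines:
    print(repr(line))
```

No single convention describes the merged data, which is why a
detect-once approach cannot work on it.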

Earlier I had outlined a few ways to solve the problem:
>    1) require the input to be converted to the local line ending and
>  provide no support for doing so

Not graceful.  No one likes this solution.

>    2) supply some adapters ("FromMac", "FromUnix", "FromDos") but don't
>  use them; instead leaving the decision up to the client code

As proposed, this wouldn't work because the line convention can change.
Instead, it would need to be a "FromAny" which would allow any of the
three endings.

>    3) provide a tool which autodetects endings and uses the right
>  adapter

My original thought was to read up to the first newline and use that
convention for the rest of the file.  This would not work.  Instead,
the "FromAny" converter would always have to check for all three endings.
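
For readers on current Python 3, the built-in open() already offers
this check-all-three behaviour through its newline parameter; that did
not exist in Python 2.0, so the sketch below only illustrates the idea
and is not something Martel could have used at the time.  The sample
data and temporary file are assumptions for the demonstration.

```python
import os
import tempfile

data = b"one\ntwo\r\nthree\rfour\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
    path = tmp.name

try:
    # newline="" enables universal newline splitting but leaves the
    # terminators untranslated, so no characters are lost.
    with open(path, newline="", encoding="ascii") as handle:
        kept = handle.readlines()
    # newline=None (the default) also splits on all three endings but
    # translates each one to "\n".
    with open(path, newline=None, encoding="ascii") as handle:
        translated = handle.readlines()
finally:
    os.remove(path)

print(kept)
print(translated)
```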

>    4) http://members.nbci.com/_XOOM/meowing/python/index.html

I mentioned this library for two reasons.  First, I had heard it was
faster than Python's readline() method.  This is true, but it is almost
exactly as fast as Python's readlines(), which I had been using so it
offers no performance benefit.

Second, I thought it allowed having all three of "\n", "\r" and "\r\n"
as the newline character.  After investigation I found out that it doesn't.
You can change the end-of-line marker but it must still be a single string.

It turns out that mxTextTools has a linesplit function which takes a
string and splits it into lines, allowing any of the three conventions.
As it is written it is not appropriate for Martel because it strips out the
newlines.  A guarantee in Martel's design is that it must send all the
characters to the ContentHandler's "characters" method so they may be
counted.  This allows indexing by just counting the number of characters
which have gone through.  If the end of line characters are discarded,
it is impossible to know if the tossed text was one character or two.

The mxTextTools function is very easy to implement and this deficiency
is readily remedied.
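
The counting argument can be sketched as follows, in current Python 3
and with bytes.splitlines standing in for a line splitter (it is not
mxTextTools' linesplit).  The record text is invented sample data.

```python
text = b"ID   A\r\nDE   first\r\n//\r\nID   B\nDE   second\n//\n"

# Keeping terminators: a running count gives each line's byte offset.
offsets = []
position = 0
for line in text.splitlines(keepends=True):
    offsets.append(position)
    position += len(line)

# Dropping terminators: the total no longer matches the input length,
# and nothing says whether one or two bytes were discarded per line.
stripped_total = sum(len(line) for line in text.splitlines())

print(position == len(text))
print(len(text) - stripped_total)
```

With the terminators kept, offsets[3] points exactly at the second
record; with them stripped, that position cannot be recovered.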

(An alternate solution is to add a tell method to the parser which
gets mapped down to the file handle's tell method.  This is a problem
because the readers are free to read ahead many characters for faster
reading.  When tell is called, it would have to figure out where the
callback is in the parsing.  This is complicated even more by text mode
file handles on MS where tell works correctly and increases by two for
"\r\n" even though only a single character is returned.)

>    5) define an EOL = Re(r"\n|\r\n?")
>
>  I don't like 5 because people will forget to use it.

Brad liked it because:
> 1. Easy to implement, and isn't very likely to break :-).
>
> 2. Provided the regexp would recognize Mac line breaks (hmmm, I'm not
> positive what those look like) then this could deal with files with
> multiple different types of line breaks without whining.

I ended up having a more serious problem with this option.  Martel allows
what I call "RecordReaders" that are really two parsers in one.  The first
does a simple scan of the input stream to identify records and the second
parses the records into SAX events.  Together they create the same SAX
events as a standard parser but use much less memory.  (They only need
enough memory to parse the most complex record, while the standard parsers
parse the whole file at once and so need roughly 10 times as much RAM as
the input data.)

The input data files are line oriented so my RecordReaders used the file's
"readlines" method with a sizehint to read a large but memory-bounded
number of lines, then scanned those lines to identify the records.  The
lines are joined back together into one string and parsed with the second
stage parser.

This makes file reading about as fast as you can do with native Python.
However, readlines uses the local platform's definition of newline and
there is no way to support all three conventions.  If I had a Mac text
file, which uses '\015', and tried to read a line under unix, I would get
everything in the file as one line since there is no '\012' in the file.
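
The effect is easy to reproduce.  This sketch uses current Python 3 and
io.BytesIO, whose readlines splits only on the linefeed byte, as a
stand-in for a unix readlines; the sample data is invented.

```python
import io

mac_text = b"ID   A\r//\rID   B\r//\r"

# BytesIO.readlines splits on b"\n" only, so carriage-return-terminated
# text comes back as a single "line".
naive_lines = io.BytesIO(mac_text).readlines()
print(len(naive_lines))
```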

So I'm left with the conclusion that I need to write a specialized reader
which understands all three line conventions, rather like the 'FromAny'
mentioned above.  Unlike mxTextTool's linesplit function it would need
to keep the end-of-line identifier.  Unlike my RecordReaders, it couldn't
use the readline or readlines methods but would have to call read directly.

Here's how the data would go through the system.  Create a file object
(open a file, use urllib to create a socket connection or use a StringIO).
Wrap it inside a FromAny object, which uses the file's read() method to
implement its own readlines() method, which supports the different newline
conventions.  The RecordReader uses those lines to find the records then
merges them back into one string for the record parser.

Very complicated, with lots of pure Python code to make things slow.
Hence, I didn't like it.
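
To make the shape of that wrapper concrete, here is a minimal sketch in
current Python 3.  The class name follows the "FromAny" idea above, but
the blocksize parameter, the bytes-only interface and the lack of a
sizehint are assumptions of this sketch, not a description of any
Martel code.

```python
import io


class FromAny:
    """Hypothetical wrapper: returns lines ending in "\\n", "\\r\\n" or
    "\\r", terminators included, using only the wrapped read()."""

    def __init__(self, handle, blocksize=8192):
        self.handle = handle
        self.blocksize = blocksize
        self.buffer = b""

    def readlines(self):
        lines = []
        while True:
            data = self.handle.read(self.blocksize)
            text = self.buffer + data
            pieces = text.splitlines(keepends=True)
            self.buffer = b""
            if data and pieces:
                # Hold back an unterminated tail, or a trailing "\r" that
                # may be the first half of a "\r\n" split across blocks.
                if not pieces[-1].endswith(b"\n"):
                    self.buffer = pieces.pop()
            lines.extend(pieces)
            if not data:
                return lines


source = io.BytesIO(b"one\ntwo\r\nthree\rfour")
print(FromAny(source, blocksize=2).readlines())
```

The held-back tail is the awkward part: a block that ends in "\r"
cannot be classified until the next block arrives.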

As I was looking through the QIO code I came up with an idea, which I
think ultimately arises from the bioperl list.  Bioperl's FASTA parser
works by defining $/ (the line separator) to "\n>".  This pushes the
problem of record identification to Perl and quite simplifies the read
loop.  The QIO interface would allow the same simplification, so
searching for a SWISS-PROT record could be turned into looking for the
string "\nID   ".
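
A rough Python 3 sketch of that idea, using plain str.find rather than
QIO or Perl's $/.  The helper name and the sample records are
hypothetical, and it assumes a single newline convention ("\n"), which
is exactly the limitation discussed next.

```python
def split_records(text, marker="\nID   "):
    """Split text into records, each starting at a line that begins
    with the marker text (hypothetical helper; "\\n" endings only)."""
    records = []
    start = 0
    while True:
        # The marker includes the preceding newline, so "ID   " is only
        # recognised at the start of a line.
        hit = text.find(marker, start)
        if hit == -1:
            records.append(text[start:])
            return records
        end = hit + 1  # keep the newline with the record it terminates
        records.append(text[start:end])
        start = end


sample = "ID   A\nDE   x\n//\nID   B\nDE   y\n//\n"
for record in split_records(sample):
    print(repr(record))
```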

QIO doesn't support all three endings.  I could modify the code, but
then that would require (yet) another C extension.  We're already
including mxTextTools, which does text processing - why not use it?
That's when I dug through the module and found the 'linesplit' function,
which is written in pure Python using the taglist.

I hacked together some test code to try it out.  It is attached.  It
parses SWISS-PROT records by looking for lines matching "//" followed
by "\n", "\r\n" or "\r" and using them as end of record indicators.  After
some tweaking of the tagtable to remove a subtable call, I found out it
was 15% *faster* than the readlines code.  (I haven't yet tested it on
MS to ensure it handles both text and binary reads, but it should. :)

It works on a large block of text at a time rather than splitting them
apart into lines.  The record parser uses a single block of text so
the current RecordReaders need to string.join the lines back into a
block.  This new approach only needs to use a single subslice to get
that text, so overall it should be a bit faster still.
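
For readers without mxTextTools, the block-at-a-time idea can be
approximated with the standard re module.  This Python 3 sketch swaps a
regular expression in for the tagtable; it is not the attached code and
says nothing about relative speed.  The sample text is invented.

```python
import re

# "//" at the start of a line, followed by any of the three terminators.
# re.MULTILINE's "^" only treats "\n" as a line start, so the line start
# is spelled out with a lookbehind instead.
END_OF_RECORD = re.compile(r"(?:\A|(?<=[\r\n]))//(?:\r\n|\r|\n)")


def find_record_ends(text):
    """Return the offset just past each end-of-record line."""
    return [match.end() for match in END_OF_RECORD.finditer(text)]


text = "//\nAndrew Dalke\r\n//\r\nwas //\rhere\r//\r//\n"
ends = find_record_ends(text)
# Each record is a single slice of the block; no joining is needed.
records = [text[a:b] for a, b in zip([0] + ends[:-1], ends)]
print(records)
```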

WHAT DOES THIS GET US?

This new approach makes record identification much faster and allows
the record readers to work on files containing a mix of any of the three
standard line encodings.  This means my objections to option 5 no longer
include parsing performance.

There are still some problems with usability.  In binary mode, or with
foreign text files, the parser can send back "\n", "\r\n" or "\r"
characters as newlines.  The format definition must support them.  The
format definition for newline is simply "\n" which is insufficient.

For example, suppose you just want to read the text of the DE line
in SWISS-PROT.  The current format definition might be:

DE = Group("DE", Re("DE   (?P<description>[^\n]*)\n"))

This would have to be replaced with

DE = Group("DE", Re("DE   (?P<description>[^\n\r]*)(\n|\r\n?)"))

There are two changes: one for "description" from [^\n] to [^\n\r]
and the other from \n to \n|\r\n? .

They are simple mechanical transformations, but they differ enough from
common regular expression use that it would be nice to automate them or
otherwise hide the need for them.
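
The difference between the two patterns can be checked with Python's
own re module.  This Python 3 sketch uses plain re.compile in place of
Martel's Re and Group, and the DE line is invented sample data.

```python
import re

old = re.compile("DE   (?P<description>[^\n]*)\n")
new = re.compile("DE   (?P<description>[^\n\r]*)(\n|\r\n?)")

dos_line = "DE   EXAMPLE PROTEIN.\r\n"
mac_line = "DE   EXAMPLE PROTEIN.\r"

# The old pattern matches the DOS line, but "\r" leaks into the text.
print(repr(old.match(dos_line).group("description")))
# The old pattern does not match the Mac line at all.
print(old.match(mac_line))
# The new pattern handles both and keeps the terminator out.
print(repr(new.match(dos_line).group("description")))
print(repr(new.match(mac_line).group("description")))
```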

I mentioned one possibility - define EOL = Re("\n|\r\n?").  Then the
DE format definition becomes:

DE = Group("DE", Re("DE   (?P<description>[^\n\r]*)") + EOL)

This is simpler to type and less error prone than using the full,
correct definition, but isn't as nice as "\n".  It isn't standard
so I think people will forget to put in the EOL in place of "\n".
Finally, it doesn't fix the need to use [^\n\r].

Here is a solution which appears to make the problem disappear.
If "\n" is ever found outside of a [] then replace it with "\n|\r\n?".
If it is ever found inside of a [], then also include "\r".

The problem is that it violates one of my basic design beliefs.  Things
which act different should not look the same.  Other regular expression
parsers do not support this conversion so I do not want to use it.

(Martel doesn't support backtracking inside of repeats.  You may
justifiably call it a violation of this belief.  On the other hand, any
solution which works in Martel should work using a normal regular
expression engine, so the implementation is really a subset of existing
behaviour and not a new behaviour.)

Here's another possibility.  There are still some letters unused as escape
sequences in both Perl and Python.  What about defining \R to mean
"platform-independent newline character"?  When used outside of []s it
gets turned into "\n|\r\n?" and when used inside of []s is the same as
[\r\n].  I chose \R because \N in perl is used for "named char".

Its use would change the DE definition from
  DE = Group("DE", Re("DE   (?P<description>[^\n]*)\n"))
to
  DE = Group("DE", Re("DE   (?P<description>[^\R]*)\R"))

It is still a non-standard definition, which means it isn't as nice as
I would like for it to be.  However, I haven't found any other regular
expression grammar which supports alternate newline conventions so
there isn't really any standard to be standard to.

The only time it would be used is in the Martel definition.  Converting
the Martel expression back to a regular expression pattern would use the
"\n|\r\n?" or "[\r\n]" descriptions, so the expression itself is still
standard; the \R is simply a shorthand notation, like \n is itself shorthand
for \012.
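
A sketch of that mechanical conversion as a standalone Python 3
function.  The function name is hypothetical, it handles only simple,
non-nested character classes, and it wraps the alternation in a
non-capturing group (an addition of this sketch) so that the expansion
stays one unit inside a larger pattern.  It is not the sre_parse-based
implementation.

```python
import re


def expand_newline_escape(pattern):
    """Rewrite the proposed \\R escape into standard syntax (sketch).

    Outside a character class \\R becomes (?:\\n|\\r\\n?); inside one it
    becomes \\r\\n.  Only simple, non-nested classes are handled.
    """
    out = []
    in_class = False
    i = 0
    while i < len(pattern):
        char = pattern[i]
        if char == "\\" and i + 1 < len(pattern):
            escaped = pattern[i + 1]
            if escaped == "R":
                out.append(r"\r\n" if in_class else r"(?:\n|\r\n?)")
            else:
                out.append(char + escaped)  # copy other escapes untouched
            i += 2
            continue
        if char == "[" and not in_class:
            in_class = True
        elif char == "]" and in_class:
            in_class = False
        out.append(char)
        i += 1
    return "".join(out)


expanded = expand_newline_escape(r"DE   (?P<description>[^\R]*)\R")
print(expanded)
```

The expanded pattern is ordinary regular expression syntax, so any
standard engine can run it.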

The conversion is mechanical and is in most cases a simple text
substitution.  That makes it easy to use, although its existence and
need would need to be carefully documented and enforced with social
pressure.  ("You *do* know that \n doesn't work as well as \R, right?")

In closing, I've come up with a way to increase parsing performance
and in a way which is platform independent and requires few changes in
people's understanding of regular expression syntax.  The first part
(increased performance) does not affect what I consider to be the stable
part of the API.  The second part does change things from their commonly
accepted use so I would like to hear any comments people may have about it.

                    Andrew
                    dalke@acm.org

P.S.
  In retrospect using mxTextTools for the record reading is obvious
and solves quite a few problems I was having.  I hate it when that happens
because it makes me feel dim-witted.  After all, I've been thinking about
this problem for a long time - why was I stuck in the old solution?  But
that's the way things go :)

P.P.S. - and irrelevant
  FSU (my alma mater) beat Florida and will likely be ranked as the number
2 college team.  Miami's complaining that they'll be #3 after FSU even
though they beat FSU.  Of course, they forget '89 when FSU beat Miami
but Miami was ranked #1.


-------------- next part --------------
#  % python read2.py
#  80000 records found with readlines
#  Time for readlines 118.07604301
#  80000 records found with find_record_ends
#  Time for tagtables 100.489358068
#  %

from Martel import Generate
from mx import TextTools as TT

tagtable = (
    # Is the current line the end of record marker?
    (None, TT.Word, "//", +5, +1),

    # Make sure it ends the line
    ("end", TT.Is, '\n', +1, -1),  # matches '\n'
    (None, TT.Is, '\r', +3, +1),
    ("end", TT.Is, '\n', +1, -3),
    ("end", TT.Skip, 0, -4, -4),

    # Not the end of record marker, so read to the end of line
    (None, TT.AllInSet, TT.invset('\r\n'), +1, +1),

    # Check if EOF
    (None, TT.EOF, TT.Here, +1, TT.MatchOk),

    # Not EOF, so scarf any newlines
    (None, TT.AllInSet, TT.set('\r\n'), TT.MatchFail, -7),
    )

def find_record_ends(text):
    result, taglist, pos = TT.tag(text, tagtable)
    ends = []
    for tag in taglist:
        ends.append(tag[2])
    return ends

def test1():
    expect = (
        '//\n',
        'Andrew Dalke\n//\n',
        'was //\nhere\n//\n',
        '//\n'
        )
    text = "//\nAndrew Dalke\n//\nwas //\nhere\n//\n//\n"

    ends = find_record_ends(text)
    assert len(expect) == len(ends), (len(expect), len(ends))
    prev = 0
    for ex, end in map(None, expect, ends):
        s = text[prev:end]
        assert ex == s, (ex, s)
        prev = end
    print "expected lines found"

def test2():
    infile = open("/home/dalke/ftps/swissprot/sprot38.dat")
    s = ""
    count = 0
    while 1:
        data = infile.read(1000000)
        #print "Loop", count, len(s)
        if not data:
            break
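        text = s + data
        ends = find_record_ends(text)
        if not ends:
            # No complete record yet; keep everything and read more
            s = text
            continue
        # Offsets index into text (leftover + new data), not data alone
        s = text[ends[-1]:]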
        count = count + len(ends)
    assert not s, "still have data: %s" % repr(s[:200])
    print count, "records found with find_record_ends"

def test3():
    infile = open("/home/dalke/ftps/swissprot/sprot38.dat")
    count = 0
    while 1:
        lines = infile.readlines(1000000)
        if not lines:
            break
        #print "Loop", count
        for line in lines:
            if line == "//\n":
                count = count + 1
    print count, "records found with readlines"
    
def do_time():
    import time

    t1 = time.time()
    test3()
    t2 = time.time()
    print "Time for readlines", t2-t1

    t1 = time.time()
    test2()
    t2 = time.time()
    print "Time for tagtables", t2-t1
    
if __name__ == "__main__":
    #test1()
    #test2()
    do_time()
From chapmanb at arches.uga.edu  Sun Nov 19 12:30:29 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] Small change to NCBIWWW
Message-ID: <14872.3637.919725.20317@taxus.athen1.ga.home.com>

Hello all;
I was using NCBIWWW.blast() to access the BLAST cgi script this
morning and noticed that the parameters used to restrict the organism
type to BLAST against weren't quite working right. The gi_list
variable was included as a dictionary parameter, but wasn't actually
being passed to the CGI script. 

Attached is a patch which fixes this (oooh, one line fix! Very
impressive :-). and also adds support for the LIST_ORG box in which
you can specify an arbitrary organism to blast against (ie. other
organisms that aren't in their pull down box).

Let me know if anything doesn't seem right with this. Both options now 
seem to work fine for me. Thanks!
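
The failure mode is a general one: a keyword argument that is accepted
but never added to the encoded form data is silently ignored.  A minimal
Python 3 sketch follows; build_query is a hypothetical helper, skipping
None values is an assumption of the sketch, and the value "nr" is an
invented example.  It is not the NCBIWWW implementation.

```python
from urllib.parse import urlencode


def build_query(**fields):
    """Encode form fields, skipping those left as None (hypothetical)."""
    # A field only reaches the server if it is in this mapping; an
    # argument that is never added here is silently dropped.
    present = {name: value for name, value in fields.items()
               if value is not None}
    return urlencode(sorted(present.items()))


query = build_query(DATALIB="nr", GI_LIST="(None)", LIST_ORG=None)
print(query)
```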

Brad


-------------- next part --------------
*** NCBIWWW.py.orig	Thu Oct 19 21:31:54 2000
--- NCBIWWW.py	Sun Nov 19 11:54:59 2000
***************
*** 531,537 ****
  
  def blast(program, datalib, sequence,
            input_type='Sequence in FASTA format',
!           double_window=None, gi_list='(None)', expect='10',
            filter='L', genetic_code='Standard (1)',
            mat_param='PAM30     9       1',
            other_advanced=None, ncbi_gi=None, overview=None,
--- 531,538 ----
  
  def blast(program, datalib, sequence,
            input_type='Sequence in FASTA format',
!           double_window=None, gi_list='(None)',
!           list_org = None, expect='10',
            filter='L', genetic_code='Standard (1)',
            mat_param='PAM30     9       1',
            other_advanced=None, ncbi_gi=None, overview=None,
***************
*** 542,548 ****
            ):
      """blast(program, datalib, sequence,
      input_type='Sequence in FASTA format',
!     double_window=None, gi_list='(None)', expect='10',
      filter='L', genetic_code='Standard (1)',
      mat_param='PAM30     9       1',
      other_advanced=None, ncbi_gi=None, overview=None,
--- 543,550 ----
            ):
      """blast(program, datalib, sequence,
      input_type='Sequence in FASTA format',
!     double_window=None, gi_list='(None)',
!     list_org = None, expect='10',
      filter='L', genetic_code='Standard (1)',
      mat_param='PAM30     9       1',
      other_advanced=None, ncbi_gi=None, overview=None,
***************
*** 570,575 ****
--- 572,579 ----
                'DATALIB' : datalib,
                'SEQUENCE' : sequence,
                'DOUBLE_WINDOW' : double_window,
+               'GI_LIST' : gi_list,
+               'LIST_ORG' : list_org,
                'INPUT_TYPE' : input_type,
                'EXPECT' : expect,
                'FILTER' : filter,
From jchang at SMI.Stanford.EDU  Sun Nov 19 12:43:54 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] Small change to NCBIWWW
In-Reply-To: <14872.3637.919725.20317@taxus.athen1.ga.home.com>
Message-ID: <Pine.GSO.4.21.0011190943370.29498-100000@taiyang>

Good catch!  I've incorporated the fix into the CVS tree.

Thanks,
Jeff


On Sun, 19 Nov 2000, Brad Chapman wrote:

> Hello all;
> I was using NCBIWWW.blast() to access the BLAST cgi script this
> morning and noticed that the parameters used to restrict the organism
> type to BLAST against weren't quite working right. The gi_list
> variable was included as a dictionary parameter, but wasn't actually
> being passed to the CGI script. 
> 
> Attached is a patch which fixes this (oooh, one line fix! Very
> impressive :-). and also adds support for the LIST_ORG box in which
> you can specify an arbitrary organism to blast against (ie. other
> organisms that aren't in their pull down box).
> 
> Let me know if anything doesn't seem right with this. Both options now 
> seem to work fine for me. Thanks!
> 
> Brad
> 
> 
> 


From dalke at acm.org  Mon Nov 20 01:01:10 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:54 2005
Subject: solving the newline problem (was Re: [Biopython-dev] Martel-0.3 available)
Message-ID: <002b01c052b7$48e47b40$fbab323f@josiah>

Me:
>It works on a large block of text at a time rather than splitting them
>apart into lines.  The record parser uses a single block of text so
>the current RecordReaders need to string.join the lines back into a
>block.  This new approach only needs to use a single subslice to get
>that text, so overall it should be a bit faster still.

I've got a first pass at replacing the StartsWith RecordReader.  The
old reader (readlines and string.join) takes about 160 seconds to read
sprot38.dat while the new one takes about 90 seconds.  I also checked
and they return identical results.

>Here's another possibility.  There are still some letters unused as escape
>sequences in both Perl and Python.  What about defining \R to mean
>"platform-independent newline character"?  When used outside of []s it
>gets turned into "\n|\r\n?" and when used inside of []s is the same as
>[\r\n].  I chose \R because \N in perl is used for "named char".

I've got a first pass at this as well.  sre_parse.py is very clean code
to modify.  The result seems to pass my regression tests.  Still need to
try it against real data on a non-unix platform.

But that's all for the next day or so since I've got to get back to
paying work now.

                    Andrew
                    dalke@acm.org


From katel at worldpath.net  Mon Nov 20 05:00:19 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] next release closer (?)
References: <Pine.GSO.4.21.0011190044120.29179-100000@taiyang>
Message-ID: <002b01c052d8$b19ddb60$010a0a0a@cadence.com>

----- Original Message -----
From: "Jeffrey Chang" <jchang@SMI.Stanford.EDU>
To: "Cayte" <katel@worldpath.net>
Cc: <biopython-dev@biopython.org>
Sent: Sunday, November 19, 2000 12:48 AM
Subject: Re: [Biopython-dev] next release closer (?)


> >    Prodoc now passes the standalone test and I committed test_prodoc.
>
> I'm having a few problems with this suite of tests:
>
> - br_regrtest saves the name of the regression test in the first line of
> the output file.  For example, the first line of output/test_seq is
> "test_seq".  This seems to be missing with test_prodoc.

  I'm still investigating this.
>
> - "python br_regrtest test_prodoc.py" fails because it can't find
> "Prosite/Doc/pdoc00472.txt".  That file isn't in the CVS repository and
> needs to be added.
> test test_prodoc crashed -- exceptions.IOError : [Errno 2] No such file or directory: 'Prosite/Doc/pdoc00472.txt'
> 1 test failed: test_prodoc
>

   I checked the file in.
> - The test_prodoc.py output contains the addresses of Reference objects.
> references
>     <Bio.Prosite.Prodoc.Reference instance at 007FEF2C>
>     <Bio.Prosite.Prodoc.Reference instance at 007FD19C>
>     <Bio.Prosite.Prodoc.Reference instance at 007FD12C>
> This won't work, because the object address is going to be different from
> computer to computer.  Instead of the pointer, please print out the
> reference, or at least enough of the string to know that it's parsed
> correctly.
>
   I fixed this.

  I still need to add a baseline for rebase.
                           Cayte


From katel at worldpath.net  Tue Nov 21 02:53:04 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] next release closer (?)
References: <Pine.GSO.4.21.0011190044120.29179-100000@taiyang>
Message-ID: <002a01c05390$15759f80$010a0a0a@cadence.com>

----- Original Message -----
From: "Jeffrey Chang" <jchang@SMI.Stanford.EDU>
To: "Cayte" <katel@worldpath.net>
Cc: <biopython-dev@biopython.org>
Sent: Sunday, November 19, 2000 12:48 AM
Subject: Re: [Biopython-dev] next release closer (?)


> >    Prodoc now passes the standalone test and I committed test_prodoc.
>
> I'm having a few problems with this suite of tests:
>
> - br_regrtest saves the name of the regression test in the first line of
> the output file.  For example, the first line of output/test_seq is
> "test_seq".  This seems to be missing with test_prodoc.
>
   It's also missing from test_seq and  test_Fasta when I run them
standalone.  Is the test name inserted manually into the baseline files?  If
so, I'll also have to add it to test_rebase.

   I get an error from test_prosite.  My OS is Win98.

test test_prosite crashed -- exceptions.TypeError : an integer is required

                                               Cayte


                          Cayte


From jchang at SMI.Stanford.EDU  Mon Nov 20 23:50:54 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] next release closer (?)
In-Reply-To: <002a01c05390$15759f80$010a0a0a@cadence.com>
Message-ID: <Pine.GSO.4.21.0011202046210.3244-100000@taiyang>

[Jeff]
> > - br_regrtest saves the name of the regression test in the first line of
> > the output file.  For example, the first line of output/test_seq is
> > "test_seq".  This seems to be missing with test_prodoc.

[Cayte]
>    It's also missing from test_seq and  test_Fasta when I run them
> standalone.  Is the test name inserted manually into the baseline files?  If
> so, I'll also have to add it to test_rebase.

br_regrtest should do it automatically.  From the biopython/Tests
directory, run:
python br_regrtest -v test_Fasta

and the first line will be 'test_Fasta'.


To generate the file in the output directory, do:
python br_regrtest -g test_Fasta

This will create a file in output/test_Fasta, whose first line will be
'test_Fasta'.  This will need to be verified by hand in order for the
regression tests to be accurate.
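
The baseline layout described here (test name on the first line, then
the captured output) can be pictured with a small Python 3 sketch.
matches_baseline is a hypothetical helper written for illustration; it
is not br_regrtest's actual comparison code.

```python
def matches_baseline(test_name, produced, baseline_text):
    """Compare captured output with a baseline whose first line is the
    test name (hypothetical helper, not br_regrtest's code)."""
    expected = test_name + "\n" + produced
    return expected == baseline_text


baseline = "test_Fasta\nfirst line of output\n"
print(matches_baseline("test_Fasta", "first line of output\n", baseline))
```

A test run standalone prints only its own output, which would explain
why the name line is absent in that case.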

Sorry about the confusion.


>    I get an error from test_prosite.  My OS is Win98.
> 
> test test_prosite crashed -- exceptions.TypeError : an integer is required

I don't know.  Andrew?

One thing you can try, is to run:
python test_prosite.py

and see the full stack dump that's generated.

Jeff


From dalke at acm.org  Tue Nov 21 00:38:24 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] next release closer (?)
Message-ID: <001201c0537d$45c30500$9bac323f@josiah>

Jeff on the error Cayte's getting:
>>    I get an error from test_prosite.  My OS is Win98.
>>
>> test test_prosite crashed -- exceptions.TypeError : an integer is required
>
>I don't know.  Andrew?
>
>One thing you can try, is to run:
>python test_prosite.py
>
>and see the full stack dump that's generated.

I would need to see the stack trace.  I cannot reproduce the error using
the current CVS version.

I don't see the string "an integer is required" anywhere in the Prosite
code, nor in the rest of the biopython distribution.  Looking at the
source code for Python, that only arises during a conversion to int.
So I would need to find out which call to int is failing and the text
that it's trying to convert.

                    Andrew
                    dalke@acm.org


From dalke at acm.org  Tue Nov 21 00:59:33 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:54 2005
Subject: solving the newline problem (was Re: [Biopython-dev] Martel-0.3 available)
Message-ID: <005f01c05380$3a250d80$9bac323f@josiah>

[Continuing the thread]

mxTextTools is really fast, but it's very hard to write raw
tagtables.  It's all one state table with no symbolic jump
labels.  Blech.

I finished up the first drafts of the new StartsWith and EndsWith
RecordReaders.  The new EndsWith parser is about 50% faster than
the readlines based one.  The source code is temporarily at
http://www.biopython.org/~dalke/RecordReader.py for anyone who
wants to review it.  Not much yet in the way of comments, I'm afraid.

                    Andrew


From katel at worldpath.net  Sat Nov 25 02:13:44 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] mxTextTools
Message-ID: <002701c056af$406b1900$010a0a0a@cadence.com>

  Andrew, do you have a Windows compiled version of mxTextTools?  My VC++ CD
disappeared and the old pyd no longer works with Python 2.0.

                  Cayte


From katel at worldpath.net  Sat Nov 25 22:34:06 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] Martel 3.5 recompile
Message-ID: <000b01c05759$bc000e40$010a0a0a@cadence.com>

  My VC++ CD turned up, so I recompiled.  The following stack trace
appeared when I ran my tests.


C:\biopython-0.90-d03\Martel\UnitTests>python RunMartelTestCase.py
Traceback (most recent call last):
  File "RunMartelTestCase.py", line 12, in ?
    import MartelTestCase
  File "MartelTestCase.py", line 23, in ?
    import Martel
  File "c:\biopyt~1.90-\Martel\__init__.py", line 3, in ?
    import Expression
  File "c:\biopyt~1.90-\Martel\Expression.py", line 25, in ?
    import Parser
  File "c:\biopyt~1.90-\Martel\Parser.py", line 34, in ?
    import TextTools
  File "c:\textto~1\TextTools.py", line 230, in ?
    def _replace3(text,what,with,
NameError: There is no variable named 'FS'

  The recompile of mxTextTools.pyd gave 1 warning.
LINK : warning LNK4049: locally defined symbol "_mxBMS_Type" imported


            Cayte


From katel at worldpath.net  Sat Nov 25 22:51:31 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] Martel
Message-ID: <001501c0575c$2abe5ba0$010a0a0a@cadence.com>

  I answered my own question: the FS is in the __init__ file, which was in a
different path.

           Cayte


From katel at worldpath.net  Sat Nov 25 23:50:58 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] Martel Unit Test Cases
Message-ID: <000901c05764$87ca44a0$010a0a0a@cadence.com>

  The UnitTest cases pass now on Martel 3.5, except for the newline test and
a test case involving backslashed backslashes (test_n2).  These also fail in
version 3.0.

                       Cayte


From katel at worldpath.net  Sun Nov 26 23:28:37 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] next release closer (?)
References: <001201c0537d$45c30500$9bac323f@josiah>
Message-ID: <004501c0582a$84228760$010a0a0a@cadence.com>

----- Original Message ----- 
From: "Andrew Dalke" <dalke@acm.org>
To: <biopython-dev@biopython.org>
Sent: Monday, November 20, 2000 9:38 PM
Subject: Re: [Biopython-dev] next release closer (?)


> Jeff on the error Cayte's getting:
> >>    I get an error from test_prosite.  My OS is Win98.
> >>
> >> test test_prosite crashed -- exceptions.TypeError : an integer is required
> >
> >I don't know.  Andrew?
> >
> >One thing you can try, is to run:
> >python test_prosite.py
> >
> >and see the full stack dump that's generated.
> 
> I would need to see the stack trace.  I cannot reproduce the error using
> the current CVS version.
> 
> I don't see the string "an integer is required" anywhere in the Prosite
> code, nor in the rest of the biopython distribution.  Looking at the
> source code for Python, that only arises during a conversion to int.
> So I would need to find out which call to int is failing and the text
> that it's trying to convert.
> 
C:\biopython-0.90-d03\Tests>python test_prosite.py
Patterns: 'A.' 'A' '(A)'
Traceback (most recent call last):
  File "test_prosite.py", line 88, in ?
    m = p.search(Seq.Seq(x))
  File "c:\biopyt~1.90-\Bio\Prosite\Pattern.py", line 168, in search
    m = self.grouped_re.search(buffer(seq.data), pos, endpos)
TypeError: an integer is required

                                     Cayte


From katel at worldpath.net  Mon Nov 27 01:33:07 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] next release closer (?)
References: <001201c0537d$45c30500$9bac323f@josiah> <004501c0582a$84228760$010a0a0a@cadence.com>
Message-ID: <005801c0583b$e8218700$010a0a0a@cadence.com>

----- Original Message -----
From: "Cayte" <katel@worldpath.net>
To: "Andrew Dalke" <dalke@acm.org>; <biopython-dev@biopython.org>
Sent: Sunday, November 26, 2000 8:28 PM
Subject: Re: [Biopython-dev] next release closer (?)


>
> ----- Original Message -----
> From: "Andrew Dalke" <dalke@acm.org>
> To: <biopython-dev@biopython.org>
> Sent: Monday, November 20, 2000 9:38 PM
> Subject: Re: [Biopython-dev] next release closer (?)
>
>
> > Jeff on the error Cayte's getting:
> > >>    I get an error from test_prosite.  My OS is Win98.
> > >>
> > >> test test_prosite crashed -- exceptions.TypeError : an integer is required
> > >
> > >I don't know.  Andrew?
> > >
> > >One thing you can try, is to run:
> > >python test_prosite.py
> > >
> > >and see the full stack dump that's generated.
> >
> > I would need to see the stack trace.  I cannot reproduce the error using
> > the current CVS version.
> >
> > I don't see the string "an integer is required" anywhere in the Prosite
> > code, nor in the rest of the biopython distribution.  Looking at the
> > source code for Python, that only arises during a conversion to int.
> > So I would need to find out which call to int is failing and the text
> > that it's trying to convert.
> >
> C:\biopython-0.90-d03\Tests>python test_prosite.py
> Patterns: 'A.' 'A' '(A)'
> Traceback (most recent call last):
>   File "test_prosite.py", line 88, in ?
>     m = p.search(Seq.Seq(x))
>   File "c:\biopyt~1.90-\Bio\Prosite\Pattern.py", line 168, in search
>     m = self.grouped_re.search(buffer(seq.data), pos, endpos)
> TypeError: an integer is required
>
>                                      Cayte
>
  It's OK with the latest Pattern.py


From dalke at acm.org  Thu Nov 30 00:30:56 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] next release closer (?)
Message-ID: <013b01c05a8e$b7ea10c0$62ac323f@josiah>

Cayte:
>> C:\biopython-0.90-d03\Tests>python test_prosite.py
>> Patterns: 'A.' 'A' '(A)'
>> Traceback (most recent call last):
>>   File "test_prosite.py", line 88, in ?
>>     m = p.search(Seq.Seq(x))
>>   File "c:\biopyt~1.90-\Bio\Prosite\Pattern.py", line 168, in search
>>     m = self.grouped_re.search(buffer(seq.data), pos, endpos)
>> TypeError: an integer is required
>>
>>                                      Cayte
>>
>  It's OK with the latest Pattern.py

I checked in the CVS logs since I wanted to ensure that it was a proper
code fix and not some side effect of perhaps another bug.  Looks like
Brad fixed that on 2000/09/27 with the following:
<         m = self.grouped_re.search(buffer(seq.data), pos, endpos)
---
>         if endpos:
>             m = self.grouped_re.search(buffer(seq.data), pos, endpos)
>         else:
>             m = self.grouped_re.search(buffer(seq.data), pos)
173c176,179
<         m = self.grouped_re.match(buffer(seq.data), pos, endpos)
---
>         if endpos:
>             m = self.grouped_re.match(buffer(seq.data), pos, endpos)
>         else:
>             m = self.grouped_re.match(buffer(seq.data), pos)

This would indeed have caused the problem you identified, and updating
to the newer version properly fixed it.

The base reason for the problem was a difference between Python 1.5.2's
re module and 2.0's sre.  In the first module, the "search" method is
defined in Python as:

  def search(self, string, pos=0, endpos=None):

in the second, it's defined in C as
    int start = 0;
    int end = INT_MAX;
    ...
    if (!PyArg_ParseTupleAndKeywords(args, kw, "O|ii:search", kwlist,
                                     &string, &start, &end))

which when translated into Python is

  def search(self, string, pos=0, endpos=sys.maxint):

There's little anyone could have done to guard against this change in
the underlying Python API.
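
The same behaviour can still be seen in present-day Python 3, where the
compiled pattern's search method also takes endpos as a C-level integer.
A short sketch (the pattern and text are made up for illustration, and the
exact wording of the error message differs from the one in 2.0):

```python
import re

pattern = re.compile("A.")

# Omitting endpos works: the C default (the largest integer) is used.
whole_match = pattern.search("xxAB", 0)

# Passing None for endpos is rejected by the C-level argument parser.
try:
    pattern.search("xxAB", 0, None)
except TypeError as exc:
    error_message = str(exc)
else:
    error_message = None
```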

Also, BTW, when we make the change to Python 2.0, I suggest changing
Pattern.py's Prosite.search so that endpos defaults to sys.maxint
instead of the None it does now.  This keeps it compatible with the
Python API and prevents the if-branches in the code - I don't like
branches since they are harder to test fully.
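
A sketch of that suggestion, assuming present-day Python 3, where
sys.maxsize plays the role that sys.maxint had in 2.0.  PatternSketch is a
hypothetical stand-in for the Prosite pattern class, not the code in
Pattern.py.

```python
import re
import sys


class PatternSketch:
    """Hypothetical stand-in for the Prosite pattern wrapper."""

    def __init__(self, regex):
        self.grouped_re = re.compile(regex)

    def search(self, text, pos=0, endpos=sys.maxsize):
        # The default is already a valid integer, so the arguments can be
        # passed straight through without an if-branch.
        return self.grouped_re.search(text, pos, endpos)


p = PatternSketch("A.")
m = p.search("xxAB")               # default endpos: whole string
limited = p.search("xxAB", 0, 2)   # only "xx" is searched
```

The trade-off is that callers can no longer pass None to mean "no limit".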

                    Andrew


From dalke at acm.org  Thu Nov 30 00:51:10 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] next release closer (?)
Message-ID: <015301c05a91$8b547660$62ac323f@josiah>

>>         if endpos:
>>             m = self.grouped_re.search(buffer(seq.data), pos, endpos)
>>         else:
>>             m = self.grouped_re.search(buffer(seq.data), pos)

Oops.  Just realized this code contains a bug when endpos == 0.  The
test should instead be for

  if endpos is not None:
    ...

Fixed in CVS.
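
To illustrate the difference, here is a small sketch in present-day
Python 3.  The function names are hypothetical and the regex is made up;
only the two forms of the test are taken from the code above.

```python
import re

regex = re.compile("A")


def search_truthy(text, pos=0, endpos=None):
    # Buggy form: endpos == 0 is falsy, so it is treated as "not given".
    if endpos:
        return regex.search(text, pos, endpos)
    return regex.search(text, pos)


def search_not_none(text, pos=0, endpos=None):
    # Fixed form: only None means "not given"; 0 is passed through.
    if endpos is not None:
        return regex.search(text, pos, endpos)
    return regex.search(text, pos)


buggy = search_truthy("A", 0, 0)    # searches the whole string
fixed = search_not_none("A", 0, 0)  # searches an empty range
```

With endpos == 0 nothing should be searched, so the fixed form returns
None while the buggy form still finds the "A".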

                    Andrew


From chapmanb at arches.uga.edu  Thu Nov 30 15:27:07 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] next release closer (?)
In-Reply-To: <013b01c05a8e$b7ea10c0$62ac323f@josiah>
References: <013b01c05a8e$b7ea10c0$62ac323f@josiah>
Message-ID: <14886.47131.653099.144288@taxus.athen1.ga.home.com>

[Cayte's Prosite problem]

Andrew:
> I checked in the CVS logs since I wanted to ensure that it was a proper
> code fix and not some side effect of perhaps another bug.  Looks like
> Brad fixed that on 2000/09/27 with the following:
[change because of a different default argument in python 2.0]

Doh! Sorry, that I didn't say anything about this -- I'd actually
forgotten about this fix and it didn't cross my mind that
Cayte's problem could be related to it. This is my fault, I should
have posted to the dev list about this...

[my "fix"]
> Oops.  Just realized this code contains a bug when endpos == 0.

Double Doh! Thanks for the fix on this. I apologize again, I should
have posted to the list on this -- I was just thinking I was making a
"simple" change, but should have been more careful. Since that time
I've become a lot more paranoid, and started posting patches for other
people's code instead of fixing directly in CVS, and this is a good
reason why I should do this.

This brings up a point -- does anyone think it would be worthwhile to
have CVS commits and log messages sent to the dev list? Bioperl has
this and I think it's very worthwhile -- then for cases like this I
would feel more comfortable going ahead with a small "fix" because I
know Andrew would read the log... Then he could think: "hey, what's
this punk doing messing with my code?" and go in and check up on the
fix, if he feels like it. Just an idea, but maybe posting patches is
better...

I would really like to have bugs sent to the dev list when they come
in -- I just noticed a couple from Iddo that I should have dealt with
(I think that is all fixed now, regardless), but didn't realize were
there. Whadda you all think about this?

> Also, BTW, when we make the change to Python 2.0, I suggest changing
> Pattern.py's Prosite.search so that endpos defaults to sys.maxint
> instead of the None it does now.  This keeps it compatible with the
> Python API and prevents the if-branches in the code - I don't like
> branches since they are harder to test fully.

This is true -- your fix would be better, if you are not worried about 
1.5.2 compatibility. As far as I can tell, we are officially requiring 
2.0 and no one seems to mind, so if you personally aren't worried
about people having to have 2.0 to use Prosite, then I give a big +1
to switching to the more stable code. This way I won't have to stay up 
nights worrying about more bugs in my "fixes" :-).

Brad


From dalke at acm.org  Thu Nov 30 22:43:46 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:54 2005
Subject: [Biopython-dev] next release closer (?)
Message-ID: <005601c05b49$187c1860$73ac323f@josiah>

Brad:
>Doh! Sorry, that I didn't say anything about this -- I'd actually
>forgotten about this fix and it didn't cross my mind that
>Cayte's problem could be related to it. This is my fault, I should
>have posted to the dev list about this...

Couple of points.  I know I tend to forget the details of things
after a couple of months, and I don't expect others to have better
memories.  In this case, my first thought to the problem was that
some string wasn't being converted to an integer - which wasn't
the case - so there wasn't much of a clue to jog your memory.

Secondly, even if you posted to the list two months ago when you
did the fix, the odds of me (or anyone else) remembering is also
pretty low.


>This brings up a point -- does anyone think it would be worthwhile to
>have CVS commits and log messages sent to the dev list? Bioperl has
>this and I think it's very worthwhile

A couple of months ago there was a bug report on the bioperl list (not
the dev list, the general one).  As I recall, someone reported a problem
in BLAST parsing where it didn't understand one of the fasta|id|label
forms.  It turns out the code had been fixed and the problem was that
the person who reported the bug hadn't tried the newer bioperl release.

It took a while for there to be any response regarding the problem.
Part of the reason was the poor bug report, but the other, more major
part was likely that no one remembered that there had been a change/fix.
After all, it had been 6 months previous.   This despite that bioperl
has the CVS email notifications and they have both more developers and
more people using the BLAST parser.

It was much easier just to go to the CVS logs for the appropriate file
and see all the changes at once; which is what I did to track down how
Cayte's problem disappeared.

Therefore, I do not think that CVS email notifications would really
help out for this case.

That's not saying that email notifications don't have other uses.  Two
I can think of are "hey, what's this punk doing messing with my code?"
and status updates.

The first of these can be done with other tools, like looking at which
files changed when doing a cvs update, or using the cvs log to see the
list of changes.

I didn't use the best of phrases for the latter of these.  It's an idea
I picked up from McConnell's "Rapid Development" (a book which I fully
recommend, btw).  He suggests breaking a project up into "mini-milestones",
which are tasks that can be completed within a couple of days.  When the
task is completed, the developer sends out a short email to the group
saying it's done.  It might also point out how to use the new feature or
describe that it's 100 times faster than the older code or ....  The
result helps improve communications, helps the project manager track
the task timelines, and gives everyone a bit of good news that things
are getting done.

I think CVS updates are too fine grained for this level of communications.
They report on the changes done on a per-file basis and not on a per-task
or per-bug basis.  When you read the email notification you need to
reconstruct what's going on.  (You still need to do that when looking at
the cvs log, but then you can use cvs diff to see the actual code changes
and you have the code right there to look through.)

Also, I get enough email as it is now - I don't want to get email for
every bug fix (esp. ones like "Oops, fixed typo in 'protien'").

Therefore, I still don't think that automatic email notification of CVS
changes is all that useful an ability.

>-- then for cases like this I
>would feel more comfortable going ahead with a small "fix" because I
>know Andrew would read the log... Then he could think: [...] and go in
>and check up on the fix, if he feels like it. Just an idea, but maybe
>posting patches is better...

>I would really like to have bugs sent to the dev list when they come
>in -- I just noticed a couple from Iddo that I should have dealt with
>(I think that is all fixed now, regardless), but didn't realize were
>there. Whadda you all think about this?

Bugs are different.  Unless there's someone willing to triage bugs and
pass them on to the right person (and hopefully the person will respond)
it might as well go to everyone.  Plus, as I've said, I don't like having
a lot of email so there's a negative feedback loop to reduce the bug
count :)

So I've no problems with this.  Though in the future if there are both
a lot of bugs and a lot of different development, something will need
to be done to make sure there is some way to direct the right messages
to the right people.  (Improving signal to noise.)

>> Also, BTW, when we make the change to Python 2.0, I suggest changing
>> Pattern.py's Prosite.search so that endpos defaults to sys.maxint
>> instead of the None it does now.  This keeps it compatible with the
>> Python API and prevents the if-branches in the code - I don't like
>> branches since they are harder to test fully.


>As far as I can tell, we are officially requiring 2.0 and no one seems
>to mind,

I thought the switchover to 2.0 wasn't going to occur until after the
next release (the one that's coming closer (?) :)  So I was going to
wait until then - so long as I remember.

> This way I won't have to stay up nights worrying about more bugs
> in my "fixes" :-).

There is an extreme viewpoint to this.  As I understand XP, any desired
behaviour should have a test for it.  This allows people to change the
code and - so long as the tests still pass - assume the changes are valid.

This doesn't work in the most literal sense since I could have code like

  if endpos == 87655:
      endpos = endpos + 8

and there's no way people will write a test for every possible input
combination.  On the other hand, it is a good practice to test boundary
conditions, so there could (should?) be a test for endpos = None and
endpos = 0.  Had they been present, your bug would have been found right
away.

So one way to sleep more comfortably is to add regression tests.  While
you then lose sleep worrying that you aren't testing everything, I've
found I gain more than I lose.
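
As a rough sketch of such boundary tests, written with present-day
Python 3 and the unittest module: the search function below is a
hypothetical stand-in for the Prosite search, not the real code, and the
pattern and text are made up.

```python
import re
import unittest

GROUPED_RE = re.compile("A.")


def search(text, pos=0, endpos=None):
    # Stand-in for the fixed search: only None means "no end position".
    if endpos is not None:
        return GROUPED_RE.search(text, pos, endpos)
    return GROUPED_RE.search(text, pos)


class EndposBoundaryTest(unittest.TestCase):
    def test_endpos_none_searches_whole_string(self):
        self.assertEqual(search("xxAB", 0, None).group(), "AB")

    def test_endpos_zero_searches_nothing(self):
        self.assertIsNone(search("xxAB", 0, 0))

    def test_endpos_at_string_length(self):
        self.assertEqual(search("xxAB", 0, 4).group(), "AB")
```

Saved in a file, these can be run with "python -m unittest".  With the
"if endpos:" form of the test, the endpos = 0 case fails.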

                    Andrew
                    dalke@acm.org