[BioPython] Re: UnitTests

Andrew Dalke dalke@acm.org
Sun, 16 Apr 2000 04:09:47 -0600


I'll start off by emphasizing that Cayte's approach and mine
are, as far as I can tell, the same.  I think I can even make
isomorphic mappings between the two, using the following untested
code.

Mine to Cayte's:
  Start with a file of the form "test/test_*.py".
  Create a UnitTest with a single method:

     def test_program( ... ):
         s = """put original program here"""
         import sys, StringIO
         sio = StringIO.StringIO()
         old_stdout = sys.stdout
         sys.stdout = sio
         try:
             exec s in {}, {}
         finally:
             sys.stdout = old_stdout   # restore stdout before comparing
         s = sio.getvalue()
         self.assert_equal(s, """put golden comparison text here""")

  (There could even be a factory which takes the test name and
   creates the UnitTest after reading the test/test_*.py program and
   test/output/test_* comparison text.)
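
Roughly (and untested -- UnitTest here is Cayte's base class with its
assert_equal method, and the file layout is the test/ and test/output/
one just mentioned), such a factory could look like:

  import sys, StringIO

  def make_unittest(test_name):
      # read the normal test script and its golden output
      program = open("test/" + test_name + ".py").read()
      golden = open("test/output/" + test_name).read()

      # UnitTest is Cayte's base class, assumed already imported
      class GoldenTest(UnitTest):
          # default arguments carry the two texts into the method
          def test_program(self, program = program, golden = golden):
              sio = StringIO.StringIO()
              old_stdout = sys.stdout
              sys.stdout = sio
              try:
                  exec program in {}, {}
              finally:
                  sys.stdout = old_stdout
              self.assert_equal(sio.getvalue(), golden)

      return GoldenTest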

Cayte's to mine:
  Start with a UnitTest, named "*UnitTest*.py"
  Rename it to "test_*UnitTest*.py"
  Create the golden comparison code with:
    python br_regrtest.py --module Bio=.. -g test_*UnitTest*.py


However, the first one (mine to Cayte's) is not very useful since
it only reports that the whole test passed or failed.  A better
solution would be to have the setup run the test and store the
results in some instance variable, then have N methods of
the form test_N and one test named test_last.  The test_N methods
each check that line N of the output matches the golden text, and
test_last makes sure the line counts are the same.
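
An untested sketch of that, with a made-up test name (UnitTest and
assert_equal are Cayte's, and the line-numbered methods would most
likely be generated rather than typed in):

  import sys, StringIO, string

  class PerLineTest(UnitTest):
      def setup(self):
          # re-running the script for every method is wasteful, but it
          # keeps the methods independent of each other
          program = open("test/test_example.py").read()
          golden = open("test/output/test_example").read()
          sio = StringIO.StringIO()
          old_stdout = sys.stdout
          sys.stdout = sio
          try:
              exec program in {}, {}
          finally:
              sys.stdout = old_stdout
          self.lines = string.split(sio.getvalue(), "\n")
          self.golden = string.split(golden, "\n")

      def check_line(self, n):
          # an IndexError here (output too short) is itself a failure
          return self.assert_equal(self.lines[n], self.golden[n])

      def test_0(self): return self.check_line(0)
      def test_1(self): return self.check_line(1)
      # ... one test_N per line of golden output ...

      def test_last(self):
          return self.assert_equal(len(self.lines), len(self.golden))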

And Cayte's to mine isn't the best solution since the order of
the output lines follows the order in which the test_ methods are
found, and that order is determined by Python's hash implementation,
which may change over time.  A better, though less human readable,
solution is to sort the output lines.  The sorted output is
invariant, but it's hard to map back to the original output.
(Or there are some really funky ways to mess around with __getattr__
and __methods__ to enforce the right order.)
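
For instance, a test script that prints a dictionary can sort the
items before printing them (made-up data here):

  counts = {"A": 10, "C": 12, "G": 9, "T": 11}

  # Unstable: iteration order follows the hash implementation, so the
  # output can differ between, say, CPython and JPython.
  #for base, n in counts.items():
  #    print base, n

  # Stable: sorted lines are the same everywhere, though harder to map
  # back to the order the values were actually computed in.
  items = counts.items()
  items.sort()
  for base, n in items:
      print base, n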

Still, they are equivalent.

Me:
>>   1) is the order of method execution guaranteed?  I have test cases
>>   which build on previous results.  It appears either those cases will
>>   have to all be in a single method, or I need a guarantee of order so
>>   I can store results in the instance.


Cayte:
>  No.  Actually, one of the things I like about the Xtreme approach
> is that each test is isolated and you don't have to hunt for hidden
> dependencies.
> IMHO, tests with dependencies should be grouped together.


Now that you mention it, I agree, and I do things that way.  I
hadn't realized that that was a feature of what I've been doing.
The difference is that I think a slightly higher level of
dependency is acceptable.  A code example with an enforced but unneeded
dependency is given below.

Aren't there actually two levels of isolation in what you have?
One level is running the different UnitTests (the .py files can
be run in arbitrary order).  The other is the 'test_' methods
underneath them.  You can push the arbitrariness to either level.

In what I have, there's only one level of isolation, which is
the order the test files are run.  A large test with independent
subcases can be put into several different files.

Granted, in both cases the order of file execution is going
to be constant, but that shouldn't be depended upon.


Me:
>> 3) this is a big one ...  What advantage does this framework give me
>> over the, perhaps, naive version I've been working with?

Cayte:

>  1.  Each test is isolated, so I only have to look for dependencies
> in setup and the local test function.


At some point there has to be a set of dependencies, even if only
two lines.  I put several small but related tests in a single
test_*.py file even though there are no dependencies between them.
I believe they are small enough and related enough that it's okay
to put them together.

Actually, this reminds me of some of the conversations a few decades
ago when studies found that the bug rate increases non-linearly
with function size.  As a result, there were code guidelines which
mandated that functions could be no more than (say) 45 lines long.

Some people reacted by making a lot of short functions, to the
detriment of performance.  Others made every function about 40
lines long but arbitrarily chopped up code into the functions
along with a slew of input parameters.

The real answer was to get a good idea of when to partition code
into functions and remove the fixed maximum limit.  I suspect
the same is true here.

> 2.  This isolation also makes it easier for someone reading the code.
> You may remember the sequence but s/he doesn't.


That's where code comments and printed output comments are useful.
If the test code isn't easy to follow, then it shouldn't have passed
the (putative, alas) code review.

> 3. I can create suites of just a few test cases.  If only three tests
> fail, I don't have to rerun everything.


There are two `test cases' in what you have.  One is the UnitTest
class, and the other is the `test_' methods.  The UnitTests are
roughly equivalent to one of my test_*.py files, and both can easily be
pulled out and tested independently.

To test only two of the scripts:
   br_regrtest --module Bio=.. test_this.py test_that.py
(the --module can be replaced with a proper PYTHONPATH).

Your test_ methods can also be run independently, but that calls
for making either a new driver or a new wrapper class.  My code is
harder to break up into smaller units, though I think they should
be designed such that there is no need to break them up.
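
Such a driver could be pretty small.  An untested sketch, assuming
only the setup/teardown/test_* conventions used further down in this
message, and assuming a failed assertion raises an exception (the
function and class names are made up):

  def run_some_tests(unittest_class, method_names):
      # run just the named test_ methods of one UnitTest class
      for name in method_names:
          test = unittest_class()
          test.setup()
          try:
              try:
                  getattr(test, name)()
                  print name, "passed"
              except:
                  print name, "FAILED"
          finally:
              test.teardown()

  # e.g. run_some_tests(SeqTest, ["test_negative", "test_out_of_range"])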

> 4.  I can separate the design of test cases from the mechanics of
> implementing.  Because, I can use a rote mechanical method to
> implement test cases, almost like filling out a form, at least for
> me, it frees me to think more about how to test.


I'm afraid I didn't follow this.

>5. If there's a failure, I know exactly where to look in the code.


I think there's no difference.  If one of my test cases fails,
I can look at the output file and use it to find the actual test
case and even context about which code and parameters caused the
fault, although this requires adding explicit statements like
"testing spam..." instead of getting it automatically from the
method name like you do.

> I use the diff function when it comes to comparing output files,
> but I'm not sure it's appropriate for every situation.

I pointed out above that the order will depend on Python's
hash implementation.  Probably JPython's implementation will
differ, so diff is indeed not always appropriate.

> I think a list of passed and failed test cases provides a
> useful summary and if mnemonic names are used, give you an idea
> of what was covered.


Agreed, but I'm only really interested in them at the UnitTest
level.  In the regression suites we had at Bioreason, if one of
those failed, we then ran the test directly and diff'ed the outputs.
Because of their design, that gave a lot more information on
what was going on and what went wrong.


So let me describe a couple of the things I like about the
br_regrtest code.

Most importantly, the regression test code is very similar to
the code I write for debugging.  If done right, the debug script
dumps enough information to stdout so someone can easily follow
what's going on.  Once verified, the stdout text becomes golden.
If there's ever a bug, then scanning the new output text, along
with diff, points out exactly where the changes are located and
what it's supposed to be testing.

Secondly, it's easy to describe:  "The driver runs this normal
python script and compares the output to the known, good output.
If they differ, then there's a bug."  The scripts really are
standard python scripts.
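
Stripped way down to that core idea (the real br_regrtest does more;
the file layout and the gold-generation switch here are just for
illustration, assuming the driver runs from the test directory):

  import sys, os, StringIO

  def run_test(filename, generate = 0):
      # filename is e.g. "test_this.py"; gold lives in output/test_this
      golden_file = os.path.join("output", filename[:-3])
      sio = StringIO.StringIO()
      old_stdout = sys.stdout
      sys.stdout = sio
      try:
          execfile(filename, {})
      finally:
          sys.stdout = old_stdout
      output = sio.getvalue()

      if generate:
          # (re)write the gold text -- easy to do, and just as easy to
          # clobber, hence the version control caveat below
          open(golden_file, "w").write(output)
          return 1
      if output != open(golden_file).read():
          print "FAILED:", filename, "-- diff the new output against gold"
          return 0
      print "passed:", filename
      return 1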

Third, if things change, it's easy to make the new gold text.
(On the other hand, it's easy to overwrite gold, so using version
control is a good thing, as well as having some tests in addition
to the output comparison.)  This also means I end up writing more
tests, because they are so easy to generate/verify.

Fourth, the test code is smaller than ones based on UnitTest.
This is because:
  o  the checks are not extra code inside the tests
    (Cayte's tests all require a "self.assert_something(...)");
    the test code is just the print statement.

  o  test code can be placed in the middle of loops, instead of
    the outermost level (Cayte's tests are usually written
    "return self.assert_something(...)")

  o I allow the subtests to occur in order, even if there are
   no dependencies between them.  (Cayte's approach puts the different
   subtests in different methods, so each needs its own method
   definition.)
 

As a comparison example, how would you implement the following
test, which makes sure sequence item access works in the expected
manner:

s = "ABCDEF"
print "initial string is", s
seq = Seq(s)

print "Checking in-range, positive indicies"
for i in range(len(s)):
  print i, s[i], "should be equal to", seq[i]
print "Checking in-range, negative indicies"
for i in range(1, len(s)):
  print i, s[-i], "should be equal to", seq[-i]

print "Checking out-of-range indicies"
n = len(s)
for i in (n, n+1, n+100, -n-1, -n-2, -n-100):
  try:
    seq[i]
  except IndexError:
    print i, "is out of range, good."
  else:
    print "** should not allow access of", i

(Note: I'm not sure the current Seq code works this way, since
I don't recall how it treats negative numbers.  It should work
like this.)

I think the corresponding UnitTest code would look something like:

  def setup( ... ):
    self.s = "ABCDE"
    self.seq = Seq(s)
  def teardown( ... ):
    del self.s, self.seq

  def test_positive( ... ):
    results = []
    expected = []
    for i in range(len(self.s)):
      results.append(self.seq[i])
      expected.append(self.s[i])
    return self.assert_equal(results, expected)
  def test_negative( ... ):
    results = []
    expected = []
    for i in range(1, len(self.s)):
      results.append(self.seq[-i])
      expected.append(self.s[-i])
    return self.assert_equal(results, expected)
  def test_out_of_range( ... ):
    results = []
    n = len(self.s)
    for i in (n, n+1, n+100, -n-1, -n-2, -n-100):
      try:
        self.seq[i]
        results.append(0)
      except IndexError:
        results.append(1)
    self.assert_equal(results, [1] * n)

I find the first easier to understand, and only 20 lines long
compared to 30.  (Both can be made shorter by merging the two
in-range tests using "range(-len(s), len(s))", but that's not
the point.)

If there is an error, then looking at the output pins down
the location almost exactly.

Sincerely,

                    Andrew
                    dalke@acm.org