This document describes the operation and configuration of the test scheduling
framework in the pounder2 package.  This document reflects pounder21 as of
2004-12-20, though the scheduler isn't likely to change much.

Author: Darrick Wong <djwong@us.ibm.com>
Copyright (C) 2004 IBM.

Overview
========
The scheduler in the original pounder release was too simplistic--it would kick
off every test simultaneously.  There was no attempt to ramp up the
machine's stress levels test by test, or to run only certain combinations, or
even run the tests one by one before beginning the real load testing.

In addition, the test scripts had a very simple pass/fail mechanism--failure
was defined by a kernel panic/oops/bug, and passing was defined by the lack of
that condition.  There was no attempt to find soft failures--situations where
a test program would fail, but without bringing the machine down.  The test
suite would not alert the user that these failures had occurred.

Consequently, Darrick Wong rewrote the test scheduling framework to achieve
several goals: to separate the test automation code from the tests themselves,
to allow for more intelligent scheduling of tests, to give better summary
reports of what passed (and what didn't), and to improve the load testing that
this suite could do.

Configuration File
==================
The test suite's configuration file is $POUNDER_HOME/libpounder.sh.  That file
defines several variables that are referenced throughout the rest of this
document.  These variables should be customized for your particular machine's
setup.

Files
=====
Each test has to provide, at a bare minimum, two files:

build_scripts/<testname> and test_scripts/<testname>.

Temporary files should go in $POUNDER_TMPDIR; source and binaries should go
in $POUNDER_OPTDIR.

Build Scripts
=============
As the name implies, the script in build_scripts is in charge of downloading
and building whatever bits of code are necessary to make the test run.  The
variable $POUNDER_CACHE defines a web-accessible directory containing cached
tarballs of whatever it is you're trying to build.

Should there be a failure in the build script that is essential to the ability
to run a test, the build script should return a nonzero exit code to halt the
main build process immediately.
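
As a rough illustration of the above, here is a minimal sketch of a build
script.  The test name "mytest", the tarball name "mytest-1.0.tar.gz", the use
of wget and the make-based build are all assumptions made for the example; only
$POUNDER_CACHE and $POUNDER_OPTDIR come from this document.  Each essential
step stops the build with a nonzero status, the usual shell convention for
failure.

```shell
#!/bin/sh
# Hypothetical build_scripts/mytest.  The names "mytest" and
# "mytest-1.0.tar.gz" are invented for this sketch.

build_mytest() (
    # The body runs in a subshell, so cd and exit do not affect the caller.
    tarball="mytest-1.0.tar.gz"

    cd "$POUNDER_OPTDIR" || exit 1

    # Download the cached tarball only if it is not already present.
    if [ ! -f "$tarball" ]; then
        wget "$POUNDER_CACHE/$tarball" || exit 1
    fi

    # Every remaining step is essential, so any failure aborts the build.
    tar -xzf "$tarball" || exit 1
    cd mytest-1.0 || exit 1
    make || exit 1
)

# A real build script would end with:
#     build_mytest || exit 1
```

Wrapping the steps in a subshell function keeps the working directory of the
calling script unchanged, which is convenient if one script builds several
components.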

Also, be aware that distributing pre-built binary tarballs is not always a good
idea because distros are not always good at ABI/library path compatibility,
despite the efforts of LSB, FHS, etc.  These incompatibilities will go away in
the (very) long term, however.

Anatomy of a Test Script
========================
The requirements on test scripts are pretty light.  First, the building of the
test ought to go in the build script unless it's absolutely necessary to build
a test component at run time.

Second, the script must catch SIGTERM and clean up after itself.  SIGTERM is
used by the test scheduler to stop tests.
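
A minimal sketch of the SIGTERM requirement follows.  The working directory
name, the "sleep 1" stand-in workload and the exit status 143 (128 + 15, a
common shell convention for "terminated by SIGTERM") are assumptions for the
example; this document does not say which status a terminated test should
return.

```shell
#!/bin/sh
# Hypothetical test script skeleton; "mytest" and the workload are invented.
workdir="${POUNDER_TMPDIR:-/tmp}/mytest.$$"
child=""

cleanup() {
    # Stop the workload if it is still running, then remove temporary files.
    [ -n "$child" ] && kill "$child" 2>/dev/null
    rm -rf "$workdir"
}

# Clean up and leave when the scheduler sends SIGTERM.
# 143 is an assumed status, not something the scheduler is known to require.
trap 'cleanup; exit 143' TERM

mkdir -p "$workdir" || exit 255

# Run the workload in the background and wait for it.  A shell only runs
# its trap handler after the current foreground command returns, so
# waiting on a background job lets the script react to SIGTERM promptly.
sleep 1 &
child=$!
wait "$child"
rc=$?
child=""

cleanup
# A real script would finish with: exit "$rc"
```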

The third requirement is much more stringent: return codes.  The script should
return 0 to indicate success, 1-254 to indicate failure (the common use is to
signify the number of failures), and -1 or 255 to indicate that there was a
failure that cannot be fixed.

Note: If a test is being run in a timed or infinite loop, returning -1 or 255
has the effect of cancelling all subsequent loops.

Quick map of return codes to what gets reported:
0             = "PASS"
-1            = "ABORT"
255           = "ABORT"
anything else = "FAIL"

Also note: If a test is killed by an unhandled signal, the test is reported as
failing.
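
The mapping above can be sketched as a small shell function.  The function
names report_status and run_and_report are hypothetical and are not taken from
the pounder sources.  Note that a parent process only sees the low eight bits
of an exit status, so a script that returns -1 is observed as 255; the sketch
therefore needs only one "ABORT" case.

```shell
#!/bin/sh
# Hypothetical helpers; not the scheduler's actual code.

# Translate an exit status (0-255) into the word that gets reported.
report_status() {
    case "$1" in
        0)   echo "PASS" ;;
        255) echo "ABORT" ;;   # -1 also arrives here, as 255
        *)   echo "FAIL" ;;    # includes statuses above 128 (killed by signal)
    esac
}

# Run a command and print its name together with the reported result.
run_and_report() {
    rc=0
    "$@" || rc=$?
    printf '%s: %s\n' "$1" "$(report_status "$rc")"
}

run_and_report true
run_and_report false
```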

Scheduling Tests
================
The current test scheduler borrows a System V rc script-like structure for
specifying how and when tests should be run.  Everything under the tests/
directory is used for scheduling purposes; files under that tree should have
names of the following form:

   [type][sequence number][name]

"type" is the type of test.  Currently, there are two types, 'D' and 'T'.  'T'
signifies a test, which means that the scheduler starts the test, waits for the
test to complete, and reports on its exit status.  'D' signifies a daemon
"test", which is to say that the scheduler will start the test, let it run in
the background, and kill it once all the tests in that directory have
finished.

The "sequence number" dictates the order in which the tests are run.  00 goes
first, 99 goes last.  Tests with the same number are started simultaneously,
regardless of the type.

"name" is just a convenient mnemonic to distinguish between tests.

File system objects under the tests/ directory can be nearly anything--
directories, symbolic links, or files.  The test scheduler will not run
anything that doesn't have the execute bit set.  If a FS object is a
directory, then the contents of the directory are executed sequentially.
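
To make the naming scheme concrete, here is a small sketch that splits an
entry name into its three parts.  The function name parse_entry and the
entry_* variables are hypothetical.  The sketch assumes that the sequence
number is always exactly two digits and that the name part is not empty;
neither is stated outright in this document.

```shell
#!/bin/sh
# Hypothetical helper; not part of the pounder sources.

# Split "[type][sequence number][name]" into entry_type, entry_seq and
# entry_name.  Returns nonzero if the name does not fit the pattern.
parse_entry() {
    entry=$(basename "$1")
    case "$entry" in
        [DT][0-9][0-9]?*) ;;      # D or T, two digits, then a name
        *) return 1 ;;
    esac
    entry_type=$(printf '%s' "$entry" | cut -c1)
    entry_seq=$(printf '%s' "$entry" | cut -c2-3)
    entry_name=$(printf '%s' "$entry" | cut -c4-)
}

if parse_entry tests/T02dir; then
    echo "type=$entry_type seq=$entry_seq name=$entry_name"
fi
```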

Running Tests Repeatedly
========================
Two helper programs are provided to run tests repeatedly:

    timed_loop duration command [arguments]

This program will run "command" repeatedly until the number of seconds given
as "duration" has passed.

    infinite_loop command [arguments]

This program runs "command" repeatedly until sent SIGTERM.
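
The behaviour of a timed loop can be approximated by the sketch below.  This is
not the source of timed_loop; the function name timed_loop_sketch, the "loops"
counter and the choice to return the status of the last run are assumptions.
It does follow the rule given earlier that a return code of 255 cancels all
subsequent loops.  The sketch relies on "date +%s", which is widely available
but not required by POSIX.

```shell
#!/bin/sh
# Hypothetical approximation of a timed loop; not the real timed_loop.

# usage: timed_loop_sketch duration command [arguments]
timed_loop_sketch() {
    duration=$1
    shift
    deadline=$(( $(date +%s) + duration ))
    loops=0
    rc=0
    # A new run starts only while the deadline has not yet passed.
    while [ "$(date +%s)" -lt "$deadline" ]; do
        rc=0
        "$@" || rc=$?
        loops=$((loops + 1))
        # 255 (or -1) means "cannot be fixed": cancel the remaining loops.
        if [ "$rc" -eq 255 ]; then
            break
        fi
    done
    return "$rc"
}

timed_loop_sketch 1 true
echo "ran $loops time(s) in about one second"
```

A scheduler entry such as tests/T01foo could then be a short script that calls
the real helper, for example "timed_loop 900 some_test", where some_test stands
for whatever test script is to be repeated.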

Examples
========

Let's examine the following test scheduler hierarchy:

tests/
    D00stats
    T01foo
    T01bar
    T02dir/
        T00gav -> ../../test_scripts/gav
        T01hic -> ../../test_scripts/hic
    T03lat

Let's see how the tests are run.  The test scheduler will start off by scanning
the tests/ directory.  First it spawns D00stats and lets it run in the
background.  Next, T01foo and T01bar are launched at the same time; the
scheduler will wait for both of them to complete before proceeding.  Since T01foo
is a file and not just a symbolic link, there is a fair chance that T01foo runs
some test in a loop for a certain amount of time.  In any case, the scheduler
next sees T02dir and proceeds into it.

In T02dir/, we find two test scripts.  First T00gav runs, followed by
T01hic.  Now there are no more tests to run in T02dir/, so the scheduler heads
back up to the parent directory.  T03lat is forked and allowed to run to
completion, after which D00stats is killed, and the test suite exits.
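
The walk described above can be reproduced with a sketch that builds the same
hierarchy in a temporary directory and prints the entries in the order the
scheduler would visit them.  The function print_plan is hypothetical and only
prints names; it does not start, wait for, or kill anything.  The placeholder
scripts simply exit 0.

```shell
#!/bin/sh
# Hypothetical sketch; print_plan is not part of the pounder sources.

# Print the executable entries of directory $1 in sequence-number order,
# descending into subdirectories.  $2 is the indentation prefix.
print_plan() {
    for tens in 0 1 2 3 4 5 6 7 8 9; do
        for ones in 0 1 2 3 4 5 6 7 8 9; do
            for path in "$1"/[DT]"$tens$ones"*; do
                # Skips non-executable entries and unmatched patterns.
                [ -x "$path" ] || continue
                entry=${path##*/}
                if [ -d "$path" ]; then
                    echo "$2$entry/"
                    # Recurse in a subshell so loop variables are preserved.
                    ( print_plan "$path" "$2  " )
                else
                    echo "$2$entry"
                fi
            done
        done
    done
}

# Build the example hierarchy with placeholder scripts.
root=$(mktemp -d)
mkdir -p "$root/test_scripts" "$root/tests/T02dir"
for name in gav hic; do
    printf '#!/bin/sh\nexit 0\n' > "$root/test_scripts/$name"
    chmod +x "$root/test_scripts/$name"
done
for name in D00stats T01foo T01bar T03lat; do
    printf '#!/bin/sh\nexit 0\n' > "$root/tests/$name"
    chmod +x "$root/tests/$name"
done
ln -s ../../test_scripts/gav "$root/tests/T02dir/T00gav"
ln -s ../../test_scripts/hic "$root/tests/T02dir/T01hic"

print_plan "$root/tests" ""
# Remove the temporary tree with: rm -rf "$root"
```

Entries that share a sequence number, such as T01foo and T01bar, appear next to
each other in the output; in the real scheduler they are started at the same
time.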
