The problem of experimental verification
As an expert on statistics and machine learning you are asked to
supply a method for model selection to some new problems:
A number of deep sea mining robots have been lost due
to system failure. Given the immense cost of the robots, the mining
company wants you to predict the risk of a loss as accurate as
possible. Up to now about a hundred of the machines have been lost,
under very different conditions. Decennia of experience with deep sea
mining have taught the mining company to use some sophisticated risk
evaluation models that you have to apply. Can you recommend MDL?
A deep sea mining robot has got stuck between some rocks. The standard
behaviors that were supposed to free the robot have failed. Some of
the movements it made have won it partial freedom, others have made
the situation only worse. So far, about a hundred movements have been
tried and the outcomes have been carefully recorded. To choose the
next action sequence, can you recommend MDL?
Before we are going to use MDL on problems that are new to us, we want
to be sure that it is indeed a strong theory, valid even without the
need of any preprocessing of the data or other optimizations that are
common if a method is repeatedly applied to the same domain. In the
case of MDL there are countless ways of expanding and compressing data
and sooner or later we will find one that matches good models to short
descriptions in a specific domain. But this does not convince us that
it can be universally applied.
To experimentally verify that MDL would be a good choice to solve
the problems above, we want to
- test it on a broad variety of problems
- prove that we use the shortest description possible
The Statistical Data Viewer
There are two factors that limit the type of data that is usually used
for statistical experiments: availability and programming constraints.
Data availability
A major problem in machine learning and in statistics in general is
the availability of appropriate data. One of the basic principles of
statistics says that a sample that has been used to verify one
hypothesis cannot be used to verify another one. And one and the same
sample can never be used to both select and to prove a hypothesis. If
we would do so, we would simply optimize on the sample in question.
This applies to model selection as well. Not only are the individual
models hypotheses in the statistical sense, a general method for model
selection can itself be viewed as a hypothesis. If we want to
experimentally verify methods for model selection we need a large
supply of unspoiled problems and data.
Mathematical programming
A major obstacle to an objective diversity of the data is the way the
common mathematical packages work. They are based on scripting
languages that make heavy use of predefined function calls. This
integrates nicely with most mathematical concepts, which are usually
defined as functions. But it is not the preferred way to handle
complex data structures or to conduct sophisticated experiments. It
also severely reduces the quality of the outcome of an experiment. It
is quite difficult and tiring to manipulate the graphical
representations of data by using function calls.
Another problem of a function oriented approach is that it is the
responsibility of the user to keep the definitions and the results of
an experiment together. Users are often reluctant to play with the
settings of complex experiments as that requires an extensive version
control of scripts and respective results. And instead of spending a
lot of time on writing a lot of different scripts for a lot of
different experiments, researchers tend to do only minor changes to an
experiment once it works correctly, and repeat it on large amounts of
similar data. It is easier to try a method ten thousand times on the
same problem space than to try it a few times on two different problem
spaces.
A visual approach
To guarantee diversity of the learning problems in an objective way we
developed the Statistical Data Viewer. It is a modular graphical user
interface that can visualize all the abstract difficulties of model
selection. At the same time it has very simple programming interfaces
that give the user all the freedom of his favorite programming
language. The definitions and results of the experiments are made
available through a sophisticated editor. Experiments can be set up in
a fast and efficient way. Problems can visually be selected for
diversity. The performance of selection methods can be analyzed in a
number of interactive plots. All graphical representations are fully
integrated into the control structure of the application, allowing the
user to change views and to select and manipulate anything they show.
To develop a working application within reasonable limits of time, the
first version of the application is limited to the model family of
polynomials and to two-dimensional regression problems, which are
easier to visualize. Polynomials are used widely throughout science
and their mathematics are well understood. They suffer badly from
overfitting.
Great care was taken to give uninitiated students and interested lay
persons access to the theory as well. No scripting language is needed
to actually set up an experiment. The predefined mathematical
objects---problems, samples, models and selection methods---are all
available as visible objects. They can be specified and combined
through the interactive editor which is extremely easy to use. The
essential versions of MDL are implemented. They can be analyzed and
compared with another independent and successful method for model
selection, cross-validation. In addition, the user can try his or her
own penalization term for two-part MDL.
The progress of an experiment is observable and the execution can be
disrupted at any moment. If possible, the graphs in the plots show the
state of an experiment during its execution. It is possible to
reproduce everything with little effort. All the data are saved in a
standardized well documented format that allows for verification and
further processing by other programs. The graphs are of scientific
quality and available for print.
Available learning problems
To provide the user with a broad range of interesting learning
problems, a number of predefined processes are available. They include
- smooth functions which can be approximated well by
polynomials, e.g. the sinus function
- functions that are particularly hard to approximate like the
step function
- fractals
- time series
- polynomials
When taking a sample from a process it can be distorted by noise from
different distributions: Gaussian, uniform, exponential and Cauchy,
which generates extreme outliers. The distribution over the support
of a process can also be manipulated. It can be the uniform
distribution or a number of overlapping Gaussian distributions, to
simulate local concentrations and gaps.
Samples can also be loaded from a file or drawn by hand on a canvas,
to pinpoint particular problems.
Objective generalization analysis
To map the performance in minimizing the generalization error in an
objective way it is possible to check every model against a test
sample drawn from the same source as the training sample. This has drawn
some criticism from experts as the conventional way to measure the
generalization error seems to be to check a model against the original
source if it is known. I opted against this because:
- When speaking of generalization error the check against a test
sample gives a more intuitive picture of the real world
performance.
- For data drawn by hand or loaded from a file the original
distribution might be unknown. In this case all we can do is
to set aside part of the data as a test sample for evaluation.
Depending on whether the source is known or unknown we now
would have two different standards, the original function for
known sources and a test set for unknown sources.
- If the source is known the test sample can be made big enough
to faithfully portray the original distribution (by the law
of large numbers) and give as fair a picture of the
generalization error as a check against the original process.
- When using the original source the correlation between model
and source has to be weighted against the distribution over
the support set of the source. Especially in the case of
multiple Gaussian distributions over the support this can be
extremely difficult to compute.
- Although we concentrate on i.i.d. samples, this method allows
us also to train a model on a sample with Gaussian noise and
to measure the generalization error on a test sample that was
polluted by Cauchy noise (extreme outliers) and vice versa or
to use different distributions over the support set.
An independent method to compare
Comparing the predictions of MDL with the generalization analysis on
an i.i.d. test set does not show whether MDL is a better method for
model selection than any other method. For this reason the application
includes an implementation of cross-validation. Cross-validation is a
successful method for model selection that is not based on complexity
theory. It can be seen as a randomized algorithm where we select the
model that is most likely to have a low generalization error. The
method divides the training sample a couple of times into a training
set and a test set. For each partition it fits the model on the
training set and evaluates the result against the respective test set
in the same way as the generalization analysis uses an independent
test sample. The results of all partitions are combined to select the
best model.