Summary of The Paradox of Overfitting

Theory.

Model selection plays an important part in machine learning and in artificial intelligence in general, and is used in many advanced applications. A central problem in model selection is overfitting, which can be measured as the generalization error on test samples. But minimizing the generalization error is not the only definition of a good model: resemblance to the original function and minimum randomness deficiency are others. While we are not interested in resemblance to the original function, we do want to know whether minimum randomness deficiency and minimization of the generalization error can be combined.

Kolmogorov complexity is a powerful mathematical tool and a basic concept in statistics and computer science. It cannot be computed, but it is highly useful for proving bounds and existence results for weaker notions of complexity. One of the motives that brought about the conception of Kolmogorov complexity was data prediction, which eventually resulted in the Minimum Description Length (MDL) principle for model selection. MDL selects the model that minimizes the combined complexity of model and data, known as the two-part code. Nikolai Vereshchagin and Paul Vitányi have proven that MDL minimizes randomness deficiency.
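As a concrete illustration of the two-part code idea, the sketch below scores polynomial models by a combined code length: a parameter cost plus the bits needed for the residuals. The specific choices here (Rissanen's classic (k/2) log n parameter cost and a Gaussian coding of the residuals) are illustrative assumptions, not the thesis's exact encoding.

```python
import numpy as np

def two_part_code_length(x, y, degree):
    """Crude two-part MDL score for a polynomial model.

    L(model): (k/2) * log2(n) parameter cost (an assumption, following
    Rissanen's classic penalty, not the thesis's exact encoding).
    L(data | model): bits for the residuals under a Gaussian noise model.
    """
    n = len(y)
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    rss = float(np.sum(residuals ** 2)) + 1e-12   # guard against log(0)
    k = degree + 1                                # number of parameters
    model_bits = 0.5 * k * np.log2(n)
    data_bits = 0.5 * n * np.log2(rss / n)        # Gaussian code length, up to a constant
    return model_bits + data_bits

def select_degree(x, y, max_degree=8):
    """Pick the degree that minimizes the two-part code length."""
    return min(range(max_degree + 1),
               key=lambda d: two_part_code_length(x, y, d))
```

On data drawn from a low-degree polynomial plus noise, the penalty term keeps the selected degree close to the true one instead of chasing the noise with ever-higher degrees.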

Early attempts to reliably estimate model complexity were not very successful. Mixture MDL abandons the two-part approach altogether and concentrates on the complexity of the data given the degree of the model, rather than a particular model. A. Barron and F. Liang have combined this version of MDL with the minimax approach, looking at data complexity from a worst-case perspective.

Experimental verification.

One of the prominent problems in statistics is the availability of appropriate data. Another is the limitation that scripting imposes on the choice of experiments. To map the performance of MDL efficiently across as broad a selection of problems as possible, we introduced our own application, the Statistical Data Viewer. While intended for the advanced scientist, its interface is easy enough to let uninitiated students explore and comprehend the problems of model selection. Its interactive plots and sophisticated editor allow for fast and efficient setup and execution of controlled experiments.

All the relevant aspects of model selection are implemented. The problem space consists of two-dimensional regression problems, and the models are polynomials. To measure the generalization error objectively, we use i.i.d. test samples. To compare MDL to an independent, successful method for model selection, we implemented cross-validation. The application can easily be extended to other problems, models, and methods. This thesis includes a manual and some details of the implementation.
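The cross-validation baseline can be sketched as follows: split the training sample into folds, fit each candidate polynomial degree on all folds but one, and score it on the held-out fold. The fold count and selection rule here are illustrative assumptions, not the application's exact configuration.

```python
import numpy as np

def cv_error(x, y, degree, folds=5, seed=0):
    """Mean squared prediction error of a polynomial fit, estimated by k-fold CV."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    errors = []
    for fold in np.array_split(idx, folds):
        train = np.setdiff1d(idx, fold)           # indices not in the held-out fold
        coeffs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coeffs, x[fold])
        errors.append(np.mean((y[fold] - pred) ** 2))
    return float(np.mean(errors))

def cv_select_degree(x, y, max_degree=8):
    """Pick the degree with the smallest cross-validated prediction error."""
    return min(range(max_degree + 1), key=lambda d: cv_error(x, y, d))
```

Unlike MDL, this baseline never encodes the model at all; it estimates the generalization error directly by simulating prediction on held-out data.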

Already in the early experiments a number of critical points emerged that are essential to the evaluation of a method for model selection:

  1. the origin or 0-degree model,
  2. the optimum model,
  3. the region of good generalization: better than halfway between the origin and the optimum,
  4. false minima, and
  5. the point of real overfitting, where the generalization error becomes forever worse than at the origin.

The experiments described in this thesis include

  1. a single-frequency sine wave with Gaussian noise,
  2. the step function with Gaussian noise,
  3. a non-uniform support for the Lorenz attractor, plus Gaussian noise,
  4. normalization of the same problem to the support interval [-1, 1],
  5. the Thom map plus different types of noise---Gaussian, uniform, exponential, and Cauchy---and
  6. two noisy pendula plus Gaussian noise to simulate multiple sources.
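A benchmark of this kind pairs a noisy training sample with an i.i.d. test sample from the same source, so the generalization error can be measured objectively. The sketch below generates the first problem in the list; the support interval, frequency, and noise level are illustrative assumptions, not the thesis's exact settings.

```python
import numpy as np

def sine_problem(n_train=100, n_test=1000, freq=3.0, sigma=0.2, seed=0):
    """Training and i.i.d. test sample from a single-frequency sine wave
    with additive Gaussian noise (all parameters are illustrative)."""
    rng = np.random.default_rng(seed)

    def sample(n):
        x = rng.uniform(-1.0, 1.0, n)                       # uniform support on [-1, 1]
        y = np.sin(freq * np.pi * x) + rng.normal(0.0, sigma, n)
        return x, y

    return sample(n_train), sample(n_test)
```

Swapping the noise generator (uniform, exponential, Cauchy) or making the support non-uniform yields the other problem families in the list.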

Conclusions.

The original version of MDL as proposed by Rissanen failed for almost all problems. Mixture MDL, on the other hand, proved to be almost as strong as cross-validation. The only problem where it consistently failed was a non-uniform distribution over the support. Cross-validation showed that the samples contained enough information for correct predictions, and we attribute the failure to a weakness of the theory that will hopefully be cured.

Another point of concern was smoothing. Cross-validation was of little use without it. The generalization analysis would also have benefited from it, as it would have removed the false minima and resulted in more reliable criteria for the other methods to fulfill. But it would have obscured the excellent quality of the per-model predictions of mixture MDL for low-degree models. Rissanen's original version and mixture MDL did not need it.

The minimax version of mixture MDL proved to be largely indifferent to the type of noise. If we assume that the training and test samples are i.i.d., this falls short of the optimum. On the other hand, if we relax the assumption that training and test samples are i.i.d., we are presented with an even stronger theory: mixture MDL is capable of good prediction even if the distribution of the measurement noise changes.

Where MDL was able to prevent overfitting, we observed that the per-model predictions accurately reflected the per-model generalization error. This led us to formulate the hypothesis that mixture code length is a good approximation of the expected log generalization error on samples drawn i.i.d. from the same distribution as the training sample.
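The per-model quantity that the hypothesis compares against can be estimated as follows: the average negative log-likelihood, in bits per sample, that a fitted model assigns to an i.i.d. test sample. The Gaussian noise model and the plug-in variance estimate are assumptions for illustration; the mixture code length itself is not computed here.

```python
import numpy as np

def log_loss_bits(x_train, y_train, x_test, y_test, degree):
    """Per-sample log generalization error (in bits) of a polynomial model,
    estimated on an i.i.d. test sample under a Gaussian noise assumption."""
    coeffs = np.polyfit(x_train, y_train, degree)
    # Plug-in noise variance from the training residuals (an assumption here).
    sigma2 = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2) + 1e-12
    resid = y_test - np.polyval(coeffs, x_test)
    # Gaussian negative log-likelihood per test point, in nats.
    nll = 0.5 * np.log(2 * np.pi * sigma2) + resid ** 2 / (2 * sigma2)
    return float(np.mean(nll) / np.log(2))      # convert nats to bits
```

Plotting this quantity per degree next to the per-degree mixture code length is one way to examine how closely the two curves track each other.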

The experiments of this thesis and some tests by potential users have shown that applications like the Statistical Data Viewer are a real alternative to the common mathematical packages. There is a general need for modern tools in statistical research, and there are concrete plans to extend the application. The biggest challenge is to convince the research community to cooperate, since researchers working on fundamental theory tend to view such visualization tools as non-theoretic.
