The Paradox of Overfitting

I wrote my master's thesis at the Dutch National Research Institute for Mathematics and Informatics (CWI) in Amsterdam. The subject was the problem of overfitting in machine learning and how to prevent it.

Complexity theory and statistics are often considered very abstract and difficult to study. A large part of my work was dedicated to writing a comprehensible theoretical introduction and to building a graphical tool that visualizes the problems involved. Apart from that, I also experimentally verified the correctness of a novel approach to preventing overfitting: mixture MDL.
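To give a feel for the idea, here is a minimal sketch of MDL-style model selection. It is not the mixture MDL studied in the thesis, but the simpler two-part variant: pick the polynomial degree that minimizes the bits needed to encode the residuals plus a penalty for the parameters. The data set, noise level, and BIC-style parameter cost are my own illustrative choices, not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from a cubic: high-degree polynomials will chase the noise.
x = np.linspace(-1, 1, 30)
y = x**3 - x + rng.normal(scale=0.1, size=x.size)

def mdl_score(degree):
    """Crude two-part MDL score: bits to encode the residuals under a
    Gaussian noise model, plus a fixed cost per polynomial coefficient."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    sigma2 = max(np.mean(residuals**2), 1e-12)
    # Code length of the data given the model (Gaussian entropy, in bits).
    data_bits = 0.5 * x.size * np.log2(2 * np.pi * np.e * sigma2)
    # BIC-style cost of describing the model itself.
    model_bits = 0.5 * (degree + 1) * np.log2(x.size)
    return data_bits + model_bits

scores = {d: mdl_score(d) for d in range(1, 11)}
best = min(scores, key=scores.get)
print("selected degree:", best)
```

A degree that is too low cannot compress the data well (large residuals), while a degree that is too high pays more in parameter bits than it saves in residual bits; the minimum typically lands near the true cubic.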

Screenshot of the Statistical Data Viewer

For an overview of my thesis, read the summary. The thesis or parts of it can be downloaded below. The theoretical part is especially interesting to students: it addresses a number of basic topics that are usually neglected in undergraduate courses. Among them are the different definitions of a good model and of overfitting, as well as a number of key mathematical concepts like Kolmogorov complexity, individual randomness and typicality.

My master's thesis (PDF, 1.6 MB):
The theoretical background to the thesis (PDF):
The Statistical Data Viewer, an open source application that allows you to experiment with the problems of overfitting. You need to have KDE 3 or Qt 3 installed.

Download the source files and compile them on any Linux/Unix system (about 2 minutes compile time on my 1.3 GHz machine; zip, 223 KB).