Taking a sample

There are five different ways to take a sample:
  1. loading a sample from a file,
  2. taking it from one of the predefined processes,
  3. taking it from a polynomial,
  4. drawing it by hand and
  5. merging two or more samples into a new sample.
The method attribute defines which one will be used.

Loading a sample from a file. If the file method is selected, the user has to select a file through the standard file browser. The file has to be in ASCII format. Once the file is selected, a new window pops up and shows the contents of the file. Now there are two ways to select the data: by regular expression or by column. The options can be selected by radio buttons. You should have some experience with regular expressions when using this option. An introduction and a list of special characters are here available.

If the regex option is chosen, a regular expression has to be entered for both the x and the y values. All lines of the file are parsed. For each line that is matched by both expressions a point is added to the sample. In order to work correctly, the expression usually has to match more than only the intended value. The number of the opening parenthesis that encloses the portion of the expression that actually contains the value must be entered in a separate input field. You can check whether the expressions capture the correct values by pressing the apply button. It will show the captured portions of the text in different colors. If you are satisfied, press the OK button.

Example: we want to study the number of sold products as a function of the consumer price. The lines of our input file have the following format:

consumer price: 543.21, production cost: 342.23, items sold: 4300


A regular expression that will match the x-value is

^\D+(\d+\.?\d*)


and the first opening parenthesis captures the value. The y-value will be matched by the expression

^(\D+\d+\.?\d*){2,2}\D+(\d+\.?\d*)


where opening parenthesis number three captures the value.

The other option is to use columns. Every line of the text is divided into a number of columns by a regular expression. This expression is usually very simple, e.g. white space or a colon. The column numbers that specify the x and the y-value have to be entered. For each line of the file that has real values in both columns a point is added to the sample.

Taking a sample from a process. In order to use this option, a process must already have been specified. It can then be selected as an attribute of the sample. A distribution over the measurement noise should also be selected. There are four distributions available: Gaussian, uniform, exponential and Cauchy. The Cauchy distribution is important to simulate far outliers. Like the other distributions the Cauchy distribution is symmetric around the mean and non-increasing. But it has no calculable variance. Its density function is


P(x) = s / (π( ( x-μ )2 + s2 )




where μ is the mean and s the distance between the mean and the point that has half the probability of the mean.

The points will be taken at random over the support, which is the range along the x-axis for which the process is defined. The grains attribute specifies the distribution over the support. It can either be uniform or a number between one and nine. If there is one grain, the distribution will be Gaussian around the center of the support. If there are two grains, there will be a center at one third and another at two thirds of the support range. The distribution over the support will be the joint distribution of two Gaussian distributions. Three grains will have a center at 1/4, 2/4 and 3/4 of the range etc. With nine grains the joint distribution of nine Gaussian distributions is almost uniform.

After the desired size of the sample has been entered at the size attribute, a click on the make action will take the sample from the process. To guarantee that all experiments can be repeated, values of the seed attribute other than zero will always produce the same sample under the same conditions.

Drawing a sample. Drawing a sample is straightforward. When this option is chosen and the sample is in the editor, moving the mouse over the main plot and pressing the right mouse button will produce points on the plot. Points can also be drawn after a sample has already been loaded from a file or been taken from one of the predefined processes. In this way particular problems can be pinpointed, e.g. far outliers. You also can simulate noise along the entire range of the y-axis for a limited range of x. In the example 3,000 points were taken from a pendulum in the usual way. After that, 1,000 points were added by hand to simulate corruption of the measurements at x=[4.6, 5.2] and x=[8.4, 9.3]



Pendulum with corrupted measurement


Taking a sample from a polynomial. Taking a sample from a polynomial is very similar to taking a sample from a process. The only difference is that a polynomial must be specified in a different way. Entering the parameters of a polynomial by hand usually does not lead to the desired results. Instead, the following procedure should be followed: first, draw a line by hand on the plot that has the outline of the desired polynomial. Then, fit a polynomial of the desired degree on this hand drawn line. How to do this will be explained in the section on polynomials. This polynomial should then be used to take other samples from it. The original hand drawn line can be deleted.

If you still insist on entering the parameters by hand, this can be done as well. Save a project that contains a polynomial of the desired degree. All projects are saved in XML format and can easily be edited by hand. Replace the parameters of the saved polynomial by your own values and reload the experiment.

Merging samples. When the merge method is selected, up to four different samples can be specified that will be merged into a new sample. As usual, the take action will create the new sample from the specified source samples. The source samples are not changed or deleted.

Splitting samples. A sample can also be split in two. This is useful when a sample has been taken from a file or drawn by hand and we need to split it into a training and a test sample. Splitting is not a method of a sample. You need to select the split operator from the main menu.

Once created, the split operator is loaded into the editor. The sample that will be split has to be selected and the number or percentage of points that will go into subsample A must be entered. Clicking on the split action will split the selected sample at random. The specified number of points will go to subsample A and the rest will go to subsample B. The new samples are visible on the main plot and are available from the list box. The selected sample is not changed or deleted.
back