Prepare Data Tab

In this view, Formulize provides several options for preprocessing data: you can smooth the data, handle missing values, remove outliers, normalize scale and offset, and apply a filter.

These processes can be applied to a single variable or simultaneously to any combination of variables. To select a variable for processing, click on that variable in the "Variables" window in the upper left. To select multiple variables, drag across or Ctrl-click on the desired variables.

Check the box next to any of the following preprocessing options and the necessary controls will appear.

[Figure: PrepareData.png]

Smoothing data

Smoothing can greatly improve both the speed of the search and the likelihood of finding accurate solutions. However, you should smooth your data only if you have reason to believe that the process that produced the data is reasonably continuous, so that nearby points are expected to have similar values.

To smooth one or more variables, do the following:

  1. Select the variable or variables you want to smooth. (The data will be plotted in the lower window.)
  2. Select an independent variable to smooth along, or just smooth across rows.
  3. (Optional) If you want non-uniform smoothing, where some data points are given more weight than others, you can select (or type in) a variable or expression, and that variable or expression will determine the weight given to each row. Details are on the Row Weight page.
  4. Set the desired smoothing level or let Formulize choose it automatically. (Formulize will choose the setting that gives the best smoothing, as determined by generalized cross-validation among cubic B-splines.)
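
To make the idea of weighted smoothing across rows concrete, here is a minimal Python sketch. It uses a weighted moving average, which is a simpler technique swapped in for illustration; it is not the cubic B-spline smoother with generalized cross-validation that Formulize uses. The function name smooth_rows and the half_window parameter are hypothetical.

```python
def smooth_rows(values, weights=None, half_window=2):
    """Weighted moving average across rows.

    Illustration only: this is NOT Formulize's spline smoother.
    Rows with a larger weight pull the smoothed value toward themselves.
    """
    n = len(values)
    if weights is None:
        weights = [1.0] * n  # uniform smoothing
    smoothed = []
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        total_weight = sum(weights[lo:hi])
        if total_weight == 0:
            smoothed.append(values[i])  # no weighted neighbors: leave as-is
            continue
        weighted_sum = sum(w * v for w, v in zip(weights[lo:hi], values[lo:hi]))
        smoothed.append(weighted_sum / total_weight)
    return smoothed


noisy = [1.0, 2.0, 9.0, 4.0, 5.0]
print(smooth_rows(noisy, half_window=1))
# Give the spike in the third row zero weight so it no longer affects its neighbors.
print(smooth_rows(noisy, weights=[1, 1, 0, 1, 1], half_window=1))
```

A larger half_window plays a role loosely comparable to a higher smoothing level: more neighboring rows contribute to each smoothed value.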

Handling missing values

When a row has values for some variables but an empty cell for one or more others, Formulize can handle the situation in a variety of ways. Choose from among the following options in the drop-down menu labeled "method":

  • Ignore the entire row.
  • Copy value from the previous row.
  • Copy value from the most similar row.
  • Interpolate between rows. (Inserts the mean of the value in the previous row and the value in the next row.)
  • Estimate using other variables. (Linear regression is used to model the variable in terms of the other variables. Missing values are then filled in based on that model. For example, in a data set with variables x and y where y has missing values, y will be modeled as a*x + b, and, in each row with a missing value, the missing value will be filled in by evaluating that expression using the values in that row.)
  • Set to the mean value. (Inserts the mean of all values of that variable.)
  • Set to the median value. (Inserts the median of all values of that variable.)
  • Set to zero.
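
Several of the methods above can be sketched in a few lines of Python. This is an illustration of the ideas, not Formulize's implementation: missing cells are represented by None, all function names are hypothetical, and the "ignore the entire row" and "most similar row" options are left out.

```python
from statistics import mean, median


def fill_previous(col):
    """Copy the value from the previous row (a leading gap stays None)."""
    out, last = [], None
    for v in col:
        if v is not None:
            last = v
        out.append(last)
    return out


def fill_interpolate(col):
    """Insert the mean of the previous and next rows when both are present."""
    out = list(col)
    for i, v in enumerate(col):
        if v is None and 0 < i < len(col) - 1:
            prev_v, next_v = col[i - 1], col[i + 1]
            if prev_v is not None and next_v is not None:
                out[i] = (prev_v + next_v) / 2
    return out


def fill_constant(col, how="mean"):
    """Fill gaps with the mean, the median, or zero."""
    present = [v for v in col if v is not None]
    fill = {"mean": mean, "median": median, "zero": lambda _: 0.0}[how](present)
    return [fill if v is None else v for v in col]


def fill_regression(x, y):
    """Fit y = a*x + b on complete rows, then evaluate it where y is missing."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    mx = mean(xi for xi, _ in pairs)
    my = mean(yi for _, yi in pairs)
    a = (sum((xi - mx) * (yi - my) for xi, yi in pairs)
         / sum((xi - mx) ** 2 for xi, _ in pairs))
    b = my - a * mx
    return [a * xi + b if yi is None else yi for xi, yi in zip(x, y)]


y = [1.0, None, 3.0, 8.0]
print(fill_previous(y))
print(fill_interpolate(y))
print(fill_constant(y, "median"))
print(fill_regression([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, None, 9.0]))
```

The regression example uses a single predictor x, matching the a*x + b case described above; with more variables the same idea applies with a multivariate linear fit.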

Normalizing scale and offset

You can normalize by entering an expression of your own, but several common normalizing options can be found in the drop-down box. You can choose to normalize offset by subtracting the mean, subtracting the median, or subtracting the interquartile mean. You can choose to adjust the scale by dividing by the standard deviation, dividing by the interquartile range, or dividing by 10^3, 10^6, or 10^9.

Careful normalization can greatly improve the chances of finding a simple formula that fits your data. You'll find detailed advice in the blog post entitled Normalizing data variables.
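
The offset and scale options in this section amount to computing (value - offset) / scale for every row. The Python sketch below shows one way to do that. The names are hypothetical, and some details are assumptions: it uses the sample standard deviation, the default quartile method of Python's statistics module, and an interquartile mean that drops the lowest and highest quarter of the rows. Formulize's exact definitions may differ.

```python
from statistics import mean, median, quantiles, stdev


def interquartile_mean(values):
    """Mean of the middle half: drop the lowest and highest quarter of rows."""
    ordered = sorted(values)
    k = len(ordered) // 4
    return mean(ordered[k:len(ordered) - k])


def interquartile_range(values):
    """Distance between the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


OFFSETS = {"mean": mean, "median": median, "iqm": interquartile_mean}
SCALES = {
    "std": stdev,                # sample standard deviation (assumption)
    "iqr": interquartile_range,
    "1e3": lambda _: 1e3,
    "1e6": lambda _: 1e6,
    "1e9": lambda _: 1e9,
}


def normalize(values, offset="mean", scale="std"):
    """Subtract the chosen offset, then divide by the chosen scale."""
    shift = OFFSETS[offset](values)
    divisor = SCALES[scale](values)
    return [(v - shift) / divisor for v in values]


print(normalize([1000.0, 2000.0, 3000.0], offset="mean", scale="1e3"))
print(normalize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
```

The median and interquartile options are less sensitive to a few extreme rows than the mean and standard deviation, which is one reason to prefer them for data with outliers.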

Filtering data

To ignore rows that don't meet certain requirements, enter the requirements in the box. Here are some examples based on a data set containing variables x and y:

  • x > 0 keeps only rows in which x is positive; rows in which x is zero or negative are filtered out.
  • (x > 0) & (y > 0) filters out rows in which either x or y is zero or negative.
  • (x = 0) | (abs(x-y) > 42) keeps only rows in which x is 0 or the absolute difference between x and y is greater than 42; all other rows are filtered out.
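
As a rough illustration, the Python sketch below applies the three example requirements to a small made-up data set. It assumes that the requirement is the condition a row must meet to be kept, so that rows failing it are ignored. The apply_filter helper and the sample rows are hypothetical, and the requirements are written as Python expressions rather than in Formulize's own expression syntax.

```python
rows = [
    {"x": 3.0, "y": 1.0},
    {"x": -2.0, "y": 5.0},
    {"x": 0.0, "y": 50.0},
    {"x": 4.0, "y": -1.0},
]


def apply_filter(rows, requirement):
    """Keep rows for which the requirement holds; all other rows are ignored."""
    return [row for row in rows if requirement(row)]


# x > 0
positive_x = apply_filter(rows, lambda r: r["x"] > 0)

# (x > 0) & (y > 0)
both_positive = apply_filter(rows, lambda r: r["x"] > 0 and r["y"] > 0)

# (x = 0) | (abs(x-y) > 42)
zero_or_far = apply_filter(
    rows, lambda r: r["x"] == 0 or abs(r["x"] - r["y"]) > 42
)

print(positive_x)
print(both_positive)
print(zero_or_far)
```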

Need more?

If your data could benefit from more sophisticated preprocessing, you may want to process it in another application and then transfer the resulting data into Formulize.