Modeling Binary Output

Binary classification attempts to predict a variable that has only two possible outcomes - for example, true or false, or buy or don't buy. This post describes how Eureqa can be used to model a boolean decision or classification value.

Binary classification is also one of the most widely studied problems in machine learning, and there are many optimized approaches for prediction (e.g. neural nets, support vector machines, etc.). Using Eureqa (or symbolic regression in general) for classification has a few advantages:

  • finding models requires less data
  • models can often extrapolate extremely well
  • resulting models are simple to analyze, refit, and reuse
  • the structure of the models gives insight into the classification problem

The last point is the most important - not only can you predict, but you can also learn something about how the classification works, as in the example below. This isn't possible with most other methods, but it comes at the cost of increased time to find an analytical solution, if one exists. Here's how to do it in Eureqa.

Squash Method:

The key to this method is to tell Eureqa to search for equations that tend to be negative when the output is false, and positive when true. We then put solutions inside a step function to obtain outputs of either 1 (true) or 0 (false).

Step 1: Eureqa works with numerical values, so define true outcomes to have value 1, and false outcomes to have value 0. Then enter the boolean variable into Eureqa as a column of 0 and 1 values.

Step 2: We want to find a formula that predicts 0 and 1 values. One way to do this is to tell Eureqa to search for an equation that goes inside a step function before comparing with the boolean value. For example, we could enter "z = step(f(x,y))" into the search relationship setting, where z is the boolean value we want to model, x and y are other variables in the data set, and f(x,y) is the formula that Eureqa attempts to find. The step function is a built-in function in Eureqa that outputs 1 if the input is positive, and 0 otherwise. In other words, we are telling Eureqa to find equations that tend to be negative when z is 0 (false), and positive when z is 1 (true).

Step 3: Start a Eureqa search as normal. Eureqa reports equations for f(x,y) only, that is, for the part inside the step function. To use these solutions to predict the boolean value outside of Eureqa, we need to substitute the formula back into the search relationship. In other words, remember to place the reported solutions back into a step function to obtain the final model.
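The three steps can be sketched outside of Eureqa as follows. This is a minimal Python illustration, not Eureqa's own code: the step function follows the description above (1 for positive input, 0 otherwise), and the formula f is a hypothetical placeholder standing in for whatever solution Eureqa reports.

```python
def step(value):
    """Return 1 if value is positive and 0 otherwise, as described in Step 2."""
    return 1 if value > 0 else 0


def f(x, y):
    # Hypothetical placeholder: replace this body with a formula reported by Eureqa.
    return x - y


def predict(x, y):
    # Substitute the formula back into the search relationship z = step(f(x, y)).
    return step(f(x, y))
```

Calling predict(2, 1) gives 1 (true) because f is positive there, while predict(1, 2) gives 0 (false).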

Example:

Let's say we collected the following data, where x and y are two input variables, and z is a boolean outcome that we want to model (red = true, green = false):

boolean_ellipse_data.png

We enter the search relationship "z = step( f(x,y) )" and then start the Eureqa search. After a few minutes, Eureqa identifies a very accurate solution:

f(x,y) = 1.98 + 2.02*x*y - 3.05*y*y - x*x

You may recognize the boundary f(x,y) = 0 as a tilted ellipse (the x*y term is what tilts it relative to the axes). Plotting this solution on the data makes this clear:

boolean_ellipse_soln.png

Here, we used Eureqa to identify a boolean model of whether a data point would be red or green based on the 2D location of x and y. The resulting solution shows that the data can be separated by an ellipse.
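To use this solution for prediction, the reported formula goes back inside a step function. The sketch below does this in Python; the test points are hypothetical and are not taken from the original data set. It also checks the sign of the conic discriminant, which is negative for an ellipse.

```python
def step(value):
    """Return 1 if value is positive and 0 otherwise."""
    return 1 if value > 0 else 0


def f(x, y):
    # The solution reported in the example above.
    return 1.98 + 2.02 * x * y - 3.05 * y * y - x * x


def predict(x, y):
    # Final model: the reported formula placed back inside the step function.
    return step(f(x, y))


# Hypothetical points chosen for illustration only.
points = [(0.0, 0.0), (1.0, 0.5), (3.0, 0.0), (0.0, 1.0)]
labels = [predict(x, y) for x, y in points]

# For a conic a*x^2 + b*x*y + c*y^2 + ... = 0, a negative b^2 - 4*a*c
# indicates an ellipse; the nonzero b term is what tilts it.
a, b, c = -1.0, 2.02, -3.05
discriminant = b * b - 4.0 * a * c
```

Points where f is positive, such as the origin, are classified as 1 (true); points where f is negative or zero are classified as 0 (false).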

Advanced:

Another type of squashing function is the logistic function, which varies smoothly between 0 and 1. It provides a better search gradient than the step function, which has almost none. For example, we could instead enter the search relationship as:

z = logistic( f(x,y) )

A side effect is that logistic(f(x,y)) can produce intermediate values, such as 0.77 or 0.001. Therefore, we need to threshold this value to get final 0 or 1 outputs. A simple way to threshold at 0.5 is to replace the logistic with a step function when making final predictions of the boolean value, since logistic(f(x,y)) exceeds 0.5 exactly when f(x,y) is positive.
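The equivalence between thresholding the logistic at 0.5 and applying a step function can be checked with a short Python sketch. It assumes the standard logistic 1 / (1 + exp(-v)); the helper names and sample values are hypothetical and chosen only for illustration.

```python
import math


def logistic(value):
    """Standard logistic function, varying smoothly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-value))


def step(value):
    """Return 1 if value is positive and 0 otherwise."""
    return 1 if value > 0 else 0


def threshold(probability, cutoff=0.5):
    # Turn an intermediate logistic output into a final 0 or 1 prediction.
    return 1 if probability > cutoff else 0


# Hypothetical values of f(x, y), kept small to avoid overflow in math.exp.
samples = [-3.0, -0.1, 0.0, 0.1, 3.0]
agree = all(threshold(logistic(v)) == step(v) for v in samples)
```

Here agree is True: since logistic(0) is exactly 0.5 and the function is increasing, the thresholded logistic and the step function give the same 0 or 1 output for each sample.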