Feature Construction with Genetic Programming

In working with LIGO supernova data composed of noise-triggers (glitches) and supernova-candidates (synthetic injections), we are trying to press beyond a fitness ceiling as measured by Precision-Recall. No matter the depth of the GP tree or the number of generations evolved, the existing features are not enabling the level of classification we desire.

Therefore, I am working to construct a new set of features. My first effort will be to use Karoo GP to evolve a small, multivariate expression which retains the value of its P-R score. In theory, when this constructed feature is introduced back into the feature list, GP is able to start from it and build upon its inherent fitness score, thereby achieving a higher P-R value.

So, if GP evolves an expression which incorporates three of a dozen available features, and that expression scores 80% Precision-Recall, then when evaluated against real data, row-by-row, its single output value itself provides an 80% P-R score, without the need to evaluate those features in that combination again. In other words, if an evolved multivariate expression provides an 80% differentiation of classes, its single, solved numeric value is just as effective as the collection of features it was built from.

Here is an expression evolved by Karoo GP:

bw1 - 2*low + rh1 + vol/d0

Here is the equivalent expression, with the original feature names replaced by their column positions in the dataset when it is laid out as a spreadsheet:

G2 - 2*E2 + A2 + B2/C2
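For anyone who would rather script this than use a spreadsheet, here is a minimal Python sketch of the same row-by-row evaluation, with the result appended as a new column. The feature names come from the expression above; the column name gp1, the helper functions, and the sample values are made up for illustration, and the sketch assumes d0 is never zero.

```python
def constructed_feature(row):
    """Evaluate bw1 - 2*low + rh1 + vol/d0 for one row (a dict of feature values).

    Assumes row["d0"] is non-zero; real data may need a guard here.
    """
    return row["bw1"] - 2 * row["low"] + row["rh1"] + row["vol"] / row["d0"]


def add_constructed_feature(rows, name="gp1"):
    """Return copies of the rows with the evolved expression added as a new feature.

    The column name "gp1" is hypothetical, chosen only for this example.
    """
    return [{**row, name: constructed_feature(row)} for row in rows]


# Illustrative values only, not taken from the LIGO dataset.
rows = [
    {"rh1": 1.0, "vol": 4.0, "d0": 2.0, "low": 0.5, "bw1": 3.0, "label": 1},
    {"rh1": 0.5, "vol": 1.0, "d0": 4.0, "low": 2.0, "bw1": 1.0, "label": 0},
]

augmented = add_constructed_feature(rows)
for row in augmented:
    print(row["label"], row["gp1"])
```

The augmented rows can then be written back out and handed to GP as a dataset with one additional feature.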

In a scatter plot of this expression's output for 2000 noise-triggers and 2000 candidate-events, roughly 80% of the data points are in fact split across the x-axis, such that class 0 (noise) falls below and class 1 (SN event) falls above.
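Splitting at the x-axis amounts to thresholding the expression's output at zero. As a rough sketch of how that split could be scored, the Python below counts true positives, false positives, and false negatives, then reports precision and recall separately. The function name and sample values are hypothetical, and how the two numbers get combined into a single P-R score is left open here.

```python
def precision_recall(values, labels, threshold=0.0):
    """Label a row class 1 when its value is above the threshold, then score it.

    Returns (precision, recall) for class 1. Either is 0.0 when undefined.
    """
    predictions = [1 if v > threshold else 0 for v in values]
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)  # events found
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)  # noise called an event
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)  # events missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative values only: outputs of the expression and their true classes.
values = [5.0, -2.25, 1.5, -0.5, 0.75]
labels = [1, 0, 0, 1, 1]

precision, recall = precision_recall(values, labels)
print(precision, recall)
```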

Maybe this will stimulate some ideas, or give a graduate student something to do over the weekend :)

kai