Feature Construction with Genetic Programming
In working with LIGO supernova data composed of noise-triggers (glitches) and supernova candidates (synthetic injections), we are pressing against a fitness ceiling measured by Precision-Recall (P-R). No matter the depth of the GP trees or the number of generations evolved, the existing features do not enable the level of classification we desire.
Therefore, I am working to construct a new set of features. My first effort is to use Karoo GP to evolve a small, multivariate expression whose output, treated as a single feature, retains the value of its P-R score. In theory, when this constructed feature is introduced back into the feature list, GP can start from it, build upon its inherent fitness score, and thereby achieve a higher P-R value.
So, if GP evolves an expression which incorporates three of a dozen available features, and that expression scores 80% Precision-Recall, then when it is evaluated against the real data, row by row, its single output value alone carries that 80% P-R score, with no need to evaluate those three features in combination again. In other words, if an evolved multivariate expression provides an 80% differentiation of the classes, its single, solved numeric value is just as effective as the collection of features from which it was built.
Here is an expression evolved by Karoo GP:
bw1 - 2*low + rh1 + vol/d0
Here is the equivalent expression, with the original feature names replaced by their column positions in the dataset, represented as a spreadsheet:
G2 - 2*E2 + A2 + B2/C2
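If you would like to experiment, here is a minimal sketch of evaluating that constructed feature row by row and appending it as a new column. The file name and column names are my own assumptions, not the actual dataset layout:

import pandas as pd

# hypothetical file; assumes columns named bw1, low, rh1, vol, d0
df = pd.read_csv("ligo_triggers.csv")

# evaluate the evolved expression for every row (vectorised)
df["gp_feature"] = df["bw1"] - 2*df["low"] + df["rh1"] + df["vol"]/df["d0"]

# the single solved value can now be handed back to Karoo GP as a new feature
df.to_csv("ligo_triggers_plus_gp.csv", index=False)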
Roughly 80% of the data points are in fact split across the x-axis, such that class 0 (noise) falls below and class 1 (SN event) falls above; the scatter plot contains 2000 noise-triggers and 2000 candidate events.
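That split is easy to check. A quick sketch, continuing from the one above and assuming the split sits at zero (the x-axis) and a class column I am calling "label", holding 0 (noise) or 1 (SN event):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("ligo_triggers_plus_gp.csv")

# fraction of each class on the correct side of zero
noise_below = (df.loc[df["label"] == 0, "gp_feature"] < 0).mean()
sn_above = (df.loc[df["label"] == 1, "gp_feature"] > 0).mean()
print(f"noise below zero: {noise_below:.1%}  sn events above zero: {sn_above:.1%}")

# scatter of the solved value per row, coloured by class
plt.scatter(df.index, df["gp_feature"], c=df["label"], s=4, cmap="coolwarm")
plt.axhline(0, color="grey", linewidth=0.5)
plt.xlabel("row")
plt.ylabel("gp_feature")
plt.show()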
Maybe this will stimulate some ideas, or give a graduate student something to do over the weekend :)
kai