(email to my fellow researchers)
This week has come and gone quickly. I have been to SKA four times, including a week ago Friday. The pace of my work has changed: my data runs take a minimum of 5-6 hours, and more recently 50+ hours, as I push GP to 50 generations of 100-200 Trees against 10,000 lines of features.
I arrive. Log on. Run each accomplished tree against the TEST data. Save the results in my diary. Archive the trees. Mod parameters. Start a new run. Two hours in commute for an hour of work. Sounds like living in San Fran, not Muizenberg.
So far, very good! No glitches in my software. Not a single crash (except when Nadeem accidentally killed Karoo GP 35 gens into a 50 gen run, trying to kill zombies at my request. Silly us! You can’t kill zombies, they’re already dead! I know, old UNIX joke, but it’s still funny :) ). The multi-core support is solid and scales linearly on the 40-core box. The server version (configuration file + single-line execution) works well for repeat runs.
I have conducted four full runs, with the fifth now in progress. I am keeping a diary of the results, including the Precision / Recall against the TREE ID and its polynomial expression. What’s more, every tree is saved in a .csv file at the end of each Generation. Even when Karoo was terminated accidentally, nothing was lost.
Now I need to write a script which loads a .csv and runs with it as a total population seed (common according to the literature). The continue function is already in place, so I just need to slip a loaded list of arrays into population_a and cont.
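A minimal sketch of such a loader, under assumptions that may not match the real archive: the file layout (one tree per block of rows, blocks separated by a blank line, equal-length rows within a tree) and the function name load_population are hypothetical, not Karoo GP's actual format or API.

```python
import csv
import io

import numpy as np


def load_population(csv_file):
    """Read trees from an open CSV file into a list of 2-D string arrays.

    Assumed layout (hypothetical): each tree is a block of rows with the
    same number of cells, and a blank row marks the end of a tree.
    """
    population, rows = [], []
    for row in csv.reader(csv_file):
        if any(cell.strip() for cell in row):
            rows.append(row)  # still inside the current tree
        elif rows:
            population.append(np.array(rows))  # blank row closes the tree
            rows = []
    if rows:  # last tree when the file has no trailing blank row
        population.append(np.array(rows))
    return population


# Tiny in-memory stand-in for a saved population file.
sample = "a,b\nc,d\n\ne,f\n"
seed_population = load_population(io.StringIO(sample))
print(len(seed_population), "trees loaded")
```

The returned list is the kind of "loaded list of arrays" that could then be assigned to population_a before continuing the run.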
Consistently, I am seeing 82-86% Precision (in a 50/50 dual-class feature set) with Recall just a few points below. I need to look at AUC and one other analysis (rcm by Thuso; can’t recall the name) to get a full understanding of how Karoo is doing.
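For reference, Precision and Recall on a two-class set follow directly from the counts of true positives, false positives, and false negatives. The sketch below uses made-up labels purely for illustration (not results from these runs), and the function name precision_recall is hypothetical.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative labels only: 3 positives and 3 negatives (a 50/50 split).
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision answers "of the rows a tree labelled positive, how many were right?", while Recall answers "of the rows that really were positive, how many did the tree find?".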
Ok. Back to work …