This proposal outlines a runtime performance module for measuring the performance of Mahout algorithms in three major areas: clustering, regression, and classification. The module will be a spray/Scala/Akka application that can be run against any current or new Mahout algorithm and will produce a CSV file and a set of Zeppelin plots covering the performance criteria. The goal is that each new Mahout build runs a set of tests for each algorithm, so that benchmarks can be compared from one release to another.
GitHub repo: https://github.com/skanjila/mahout. I will send a pull request once one algorithm is operational.
The runtime performance application will run on top of spray/Scala and Akka and will make async API calls into the various Mahout algorithms to generate a CSV file containing the runtime performance measurements for each algorithm of interest, as well as a set of Zeppelin plots displaying some of these results. The spray/Scala architecture will use the Zeppelin server to create the visualizations. The discussion below centers on the two types of algorithms to be addressed by the application: clustering and regression.
The application will consist of a set of REST APIs to do the following:
a) A method to load and execute the runtime perf module. It takes as inputs the name of the algorithm (kmeans, fuzzy kmeans), the location of a set of files containing data sets of various sizes, and a set of values for the number of clusters to use for each data set size:
/algorithm=clustering/fileLocation=/path/to/files/of/different/datasets/clusters=12,20,30,40
The above API call will return a runId, which the client program can then use to monitor the module.
b) A method to monitor the application to ensure that it is making progress towards generating the Zeppelin plots.
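As an illustration of the request format in (a), here is a minimal Scala sketch of how the example path might be parsed and a run ID issued. The names RunRequest, RunRequestParser, and newRunId are hypothetical and not part of the proposal, and the UUID-based run ID is an assumption, since the proposal does not specify a format. The sketch uses only the standard library rather than spray routing.

```scala
import java.util.UUID

// Hypothetical request model; the fields mirror the example path in (a).
final case class RunRequest(algorithm: String, fileLocation: String, clusters: Seq[Int])

object RunRequestParser {
  // Matches /algorithm=<name>/fileLocation=<path>/clusters=<n1,n2,...>
  // The file location may itself contain slashes, so it is matched greedily
  // up to the final "/clusters=" segment.
  private val Pattern =
    """^/algorithm=([^/]+)/fileLocation=(.+)/clusters=(\d+(?:,\d+)*)$""".r

  // Returns None when the path does not follow the expected layout.
  def parse(path: String): Option[RunRequest] = path match {
    case Pattern(algorithm, location, clusters) =>
      Some(RunRequest(algorithm, location, clusters.split(',').map(_.toInt).toSeq))
    case _ => None
  }

  // Assumption: run IDs are random UUIDs.
  def newRunId(): String = UUID.randomUUID().toString
}
```

A client would call RunRequestParser.parse on the incoming path and, on success, hand back the value of newRunId() for later monitoring calls.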
The load-and-execute method (a) will run asynchronously, calling into the Mahout kmeans (or fuzzy kmeans) clustering implementations, and will generate Zeppelin plots showing normalized time on the y axis and the number of clusters on the x axis. The spray/Scala/Akka framework will allow the client application to receive a callback when the runtime performance calculations have completed. For now, the runtime performance calculations will contain: a) the ratio of the number of points clustered correctly to the total number of points, and b) the total time taken for the algorithm to run. These items will be shown in separate Zeppelin plots.
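The asynchronous run, the completion callback, and the two clustering measurements could be sketched as follows, using plain Scala Futures instead of the full spray/Akka stack. Everything named here (ClusteringResult, ClusteringBenchmark, the cluster function standing in for the Mahout call) is hypothetical, and dividing each time by the slowest run is only one assumed way to normalize, since the proposal does not define the normalization.

```scala
import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Success}

// Hypothetical result of one clustering run for a given number of clusters.
final case class ClusteringResult(clusters: Int, correctPoints: Long,
                                  totalPoints: Long, elapsedMillis: Long) {
  // Measurement (a): fraction of points clustered correctly.
  def accuracy: Double =
    if (totalPoints == 0) 0.0 else correctPoints.toDouble / totalPoints
}

object ClusteringBenchmark {
  // Assumed normalization: each run time divided by the slowest run time.
  def normalizedTimes(results: Seq[ClusteringResult]): Seq[(Int, Double)] = {
    val slowest = if (results.isEmpty) 0L else results.map(_.elapsedMillis).max
    results.map { r =>
      r.clusters -> (if (slowest == 0) 0.0 else r.elapsedMillis.toDouble / slowest)
    }
  }

  // Times a single run. `cluster` stands in for the call into Mahout and
  // returns the number of correctly clustered points.
  def timed(k: Int, totalPoints: Long)(cluster: Int => Long): ClusteringResult = {
    val start = System.nanoTime()
    val correct = cluster(k)
    ClusteringResult(k, correct, totalPoints, (System.nanoTime() - start) / 1000000L)
  }

  // Runs every cluster count asynchronously and invokes `onDone` once all
  // runs have finished successfully.
  def runAll(ks: Seq[Int], totalPoints: Long)(cluster: Int => Long)
            (onDone: Seq[ClusteringResult] => Unit)
            (implicit ec: ExecutionContext): Future[Seq[ClusteringResult]] = {
    val all = Future.sequence(ks.map(k => Future(timed(k, totalPoints)(cluster))))
    all.onComplete {
      case Success(results) => onDone(results)
      case Failure(error)   => Console.err.println(s"run failed: ${error.getMessage}")
    }
    all
  }
}
```

The pairs returned by normalizedTimes map directly onto the plot described above (number of clusters on the x axis, normalized time on the y axis), while accuracy would feed the separate correctness plot.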
a) For regression, the runtime performance module will run the likelihood ratio test with a different set of features in every run. We will introduce a REST API to run the likelihood ratio test and return the results; this will once again be an async call through the spray/Akka stack.
b) The runtime performance module will report the following metrics for every algorithm: 1) CPU usage, 2) memory usage, and 3) time taken for the algorithm to converge and run to completion. These metrics will be shown on top of the Zeppelin graphs for both the regression and the different clustering algorithms mentioned above.
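For reference, the likelihood ratio test in (a) compares two nested models, one fitted with a subset of the other's features. The statistic is twice the difference of their log-likelihoods, and it is asymptotically chi-squared with degrees of freedom equal to the difference in parameter counts. The sketch below computes only the statistic and the degrees of freedom; the names Fit, LrResult, and LikelihoodRatio are hypothetical, and how the log-likelihoods are obtained from Mahout is left open.

```scala
object LikelihoodRatio {
  // Hypothetical summary of a fitted model: its maximized log-likelihood
  // and the number of free parameters (features plus intercept, for example).
  final case class Fit(logLikelihood: Double, numParameters: Int)

  final case class LrResult(statistic: Double, degreesOfFreedom: Int)

  // `reduced` must be nested in `full`, i.e. use a subset of its features.
  def test(reduced: Fit, full: Fit): LrResult = {
    require(full.numParameters > reduced.numParameters,
      "the full model must have more parameters than the reduced model")
    LrResult(
      statistic = 2.0 * (full.logLikelihood - reduced.logLikelihood),
      degreesOfFreedom = full.numParameters - reduced.numParameters)
  }
}
```

A p-value would additionally need a chi-squared distribution function, which the Scala standard library does not provide; one option is ChiSquaredDistribution from Apache Commons Math.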
How does the application get run? The runtime performance application will be invoked from the command line. Eventually it would be worthwhile to hook it into some sort of integration test suite to certify the different Mahout releases.
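A command-line entry point that gathers the three per-algorithm metrics (CPU, memory, time) and prints them as CSV might look roughly like this. PerfCli, RunMetrics, and the CSV column names are hypothetical, and the summation is only a placeholder for a real Mahout call. The CPU figure covers the calling thread only, so work done on other threads or on a cluster would need a different measurement.

```scala
import java.lang.management.ManagementFactory

// Hypothetical record of the three metrics named in the proposal.
final case class RunMetrics(algorithm: String, cpuMillis: Long,
                            usedMemoryBytes: Long, wallMillis: Long) {
  def toCsvRow: String =
    Seq(algorithm, cpuMillis, usedMemoryBytes, wallMillis).mkString(",")
}

object PerfCli {
  // Assumed column names; the proposal does not fix a CSV layout.
  val CsvHeader = "algorithm,cpu_ms,used_memory_bytes,wall_ms"

  // Runs `body` once and records CPU time of the current thread, heap in use
  // after the run, and wall-clock time.
  def measure[A](algorithm: String)(body: => A): (A, RunMetrics) = {
    val threads = ManagementFactory.getThreadMXBean
    val runtime = Runtime.getRuntime
    val cpuStart = threads.getCurrentThreadCpuTime // nanoseconds; -1 if disabled
    val wallStart = System.nanoTime()
    val result = body
    val wallMillis = (System.nanoTime() - wallStart) / 1000000L
    val cpuMillis =
      if (cpuStart < 0) -1L
      else (threads.getCurrentThreadCpuTime - cpuStart) / 1000000L
    val usedMemory = runtime.totalMemory() - runtime.freeMemory()
    (result, RunMetrics(algorithm, cpuMillis, usedMemory, wallMillis))
  }

  def main(args: Array[String]): Unit = {
    val algorithm = args.headOption.getOrElse("kmeans")
    // Placeholder workload; a real run would call into the Mahout algorithm here.
    val (_, metrics) = measure(algorithm)((1 to 1000000).map(_.toLong).sum)
    println(CsvHeader)
    println(metrics.toCsvRow)
  }
}
```

Run with the algorithm name as the first argument, this prints a header line followed by one CSV row, which an integration test suite could collect per release.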