Summary

The transform method of predictors in spark.ml has a relatively high latency when predicting single instances or small batches, which is mainly due to the overhead introduced by DataFrame operations. For a text classification task on the RCV1 datatset, changing the access level of the low-level "predict" method from protected to public and using it to make predictions reduced the latency of single predictions by three to four folds and that of batches by 50%. While the transform method is flexible and sufficient for general usage, exposing the low-level predict method to the public API can benefit many applications that require low-latency response.

Experiment

I performed an experiment to measure the latency of single instance predictions in Spark and some other popular ML toolkits. Specifically, I'm looking at the the time it takes to predict or classify a feature vector residing in memory after the model is trained.

For each toolkit in the table below, logistic regression was trained on the Reuters RCV1 dataset which contains 697,641 documents and 47,236 features stored in LIBSVM format along with binary labels. Then the wall-clock time required to classify each document in a sample of 100,000 documents is measured, and the 50th, 90th, and 99th percentiles and the maximum time are reported.

All toolkits were tested on a desktop machine with an i7-6700 processor and 16 GB memory, running Ubuntu 14.04 and OpenBLAS. The wall clock resolution is 80ns for Python and 20ns for Scala.

Results

The table below shows the latency of predictions for single instances in milliseconds, sorted by P90. Spark and Spark 2 refer to versions 1.6.1 and 2.0.0-SNAPSHOT (on master), respectively. In Spark (Modified) and Spark 2 (Modified), I changed the access level of the predict method from protected to public and used it to perform the predictions instead of transform.

Toolkit	API	P50	P90	P99	Max
Spark	MLLIB (Scala)	0.0002	0.0015	0.0028	0.0685
Spark 2 (Modified)	ML (Scala)	0.0004	0.0031	0.0087	0.3979
Spark (Modified)	ML (Scala)	0.0013	0.0061	0.0632	0.4924
Spark	MLLIB (Python)	0.0065	0.0075	0.0212	0.1579
Scikit-Learn	Python	0.0341	0.0460	0.0849	0.2285
LIBLINEAR	Python	0.0669	0.1484	0.2623	1.7322
Spark	ML (Scala)	2.3713	2.9577	4.6521	511.2448
Spark 2	ML (Scala)	8.4603	9.4352	13.2143	292.8733
BIDMach (CPU)	Scala	5.4061	49.1362	102.2563	12040.6773
BIDMach (GPU)	Scala	471.3460	477.8214	485.9805	807.4782

The results show that spark.mllib has the lowest latency among all other toolkits and APIs, and this can be attributed to its low-level prediction function that operates directly on the feature vector. However, spark.ml has a relatively high latency which is in the order of 3ms for Spark 1.6.1 and 10ms for Spark 2.0.0. Profiling the transform method of logistic regression in spark.ml showed that only 0.01% of the time is being spent in doing the dot product and logit transformation, while the rest of the time is dominated by the DataFrame operations (mostly the “withColumn” operation that appends the predictions column(s) to the input DataFrame). The results of the modified versions of spark.ml, which directly use the predict method, validate this observation as the latency is reduced by three to four folds.

Since Spark splits batch predictions into a series of single-instance predictions, reducing the latency of single predictions can lead to lower latencies in batch predictions. I tried batch predictions in spark.ml (1.6.1) using testing_features.map(x => model.predict( x)).collect() instead of model.transform(testing_dataframe).select(“prediction”).collect(), and the former had roughly 50% less latency for batches of size 1000, 10,000, and 100,000.

Although the experiment is constrained to logistic regression, other predictors in the classification, regression, and clustering modules can suffer from the same problem as it is being caused by the overhead due to DataFrames and not by the model itself. Therefore, changing the access level of the predict method in all predictors to public, can benefit applications requiring low-latency and add more flexibility to programmers.

Attachments

Issue Links

duplicates

SPARK-10884 Support prediction on single instance for regression and classification related models

Resolved

links to

[Github] Pull Request #13980 (husseinhazimeh)

Activity

People

Assignee:: Unassigned

Reporter:: Hussein Hazimeh

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 24/Jun/16 22:02

Updated:: 05/Jul/16 09:48

Resolved:: 05/Jul/16 09:48