  Spark / SPARK-16431

Add a unified method that accepts single instances to feature transformers and predictors



    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ML
    • Labels:


      Transformers in spark.ml currently operate only on DataFrames and have no method that accepts a single instance. A typical transformer defines a user-defined function (udf) inside its transform method, and that udf holds the operations applied to the features of a single instance:

      val column_operation = udf {operations on single instance}

      It would be useful to add a method that operates directly on a single instance (called, for example, transformInstance) and to call it from the udf instead:

      def transformInstance(features: featureType): OutputType = {operations on single instance}
      val column_operation = udf {transformInstance}
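
As a minimal sketch of this pattern, the class below keeps the per-instance logic in transformInstance and has transform wrap it in a udf. LowerCaseTokenizer is a hypothetical class written for illustration, not an existing spark.ml transformer; the sketch assumes Spark SQL is on the classpath and that the input column holds non-null strings.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical transformer for illustration only; not part of spark.ml.
class LowerCaseTokenizer(inputCol: String, outputCol: String) extends Serializable {

  // Operations on a single instance, callable without any DataFrame.
  // Assumes `text` is non-null.
  def transformInstance(text: String): Seq[String] =
    text.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

  // DataFrame-level method: the udf simply delegates to transformInstance.
  def transform(df: DataFrame): DataFrame = {
    val columnOperation = udf((text: String) => transformInstance(text))
    df.withColumn(outputCol, columnOperation(col(inputCol)))
  }
}
```

With this layout, a caller serving single requests can invoke transformInstance directly, while batch jobs keep using transform unchanged.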

      Predictors likewise have no public method for predicting on a single instance. transformInstance can easily be added to predictors as a wrapper around the internal predict method, which takes the features as input.
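
The wrapper idea for predictors can be sketched in plain Scala as follows. SimplePredictionModel and ThresholdModel are hypothetical stand-ins invented for this example; they only mimic the shape of a model whose predict method is not public, and they do not reproduce the actual spark.ml class hierarchy.

```scala
// Hypothetical stand-in for a predictor; names are illustrative only.
abstract class SimplePredictionModel[FeaturesType] {
  // Internal single-instance prediction, hidden from callers.
  protected def predict(features: FeaturesType): Double

  // Proposed public entry point: a thin wrapper around predict.
  def transformInstance(features: FeaturesType): Double = predict(features)
}

// Toy linear classifier used only to exercise the wrapper.
class ThresholdModel(weights: Array[Double], intercept: Double)
    extends SimplePredictionModel[Array[Double]] {

  override protected def predict(features: Array[Double]): Double = {
    require(features.length == weights.length, "feature/weight length mismatch")
    val margin = weights.zip(features).map { case (w, x) => w * x }.sum + intercept
    if (margin > 0.0) 1.0 else 0.0
  }
}
```

The wrapper adds no logic of its own, so the batch path and the single-instance path would share one implementation.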

      This simple change has (at least) three benefits.

      1. Providing a low-latency transformation/prediction method to support machine learning applications that require real-time predictions. The current transform method has relatively high latency when transforming single instances or small batches because of the overhead introduced by DataFrame operations. I measured the latency of classifying a single instance from the 20 Newsgroups dataset using the current transform method and the proposed transformInstance. The ML pipeline contains a tokenizer, stopword remover, TF hasher, IDF, scaler, and Logistic Regression. The table below shows the latency percentiles in milliseconds, measured over the classification of 700 documents.
        Transformation method   P50 (ms)   P90 (ms)   P99 (ms)   Max (ms)
        transform                  31.44      39.43      67.75     126.97
        transformInstance           0.16       0.38       1.16        3.2

        transformInstance is roughly 200 times faster at the median (31.44 ms vs. 0.16 ms) and classifies a document in less than a millisecond at least 90% of the time. Profiling transform shows that every transformer in the pipeline spends about 5 milliseconds on average in DataFrame-related operations when transforming a single instance, which is consistent with the roughly 31 ms median for this six-stage pipeline. This implies that latency increases linearly with the number of pipeline stages, which can be problematic.

      2. Increasing code readability and allowing easier debugging as operations on rows are now combined into a function that can be tested independently of the higher-level transform method.
      3. Adding flexibility to create new models: for example, check this comment on supporting new ensemble methods.
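
The benchmark code behind the latency table in item 1 is not included in the issue. The sketch below shows one way such percentiles could be collected, assuming nearest-rank percentiles and wall-clock timing with System.nanoTime; LatencyPercentiles and its methods are hypothetical helpers, and a careful measurement would also need JVM warm-up runs.

```scala
object LatencyPercentiles {

  // Nearest-rank percentile over a non-empty array sorted in ascending order.
  def percentile(sorted: Array[Double], p: Double): Double = {
    require(sorted.nonEmpty && p > 0.0 && p <= 100.0, "need data and 0 < p <= 100")
    val rank = math.ceil(p / 100.0 * sorted.length).toInt
    sorted(math.max(rank, 1) - 1)
  }

  // Times each call to f and returns the latencies in milliseconds, sorted ascending.
  def measureMillis[A, B](inputs: Seq[A])(f: A => B): Array[Double] = {
    val times = inputs.map { x =>
      val start = System.nanoTime()
      f(x)
      (System.nanoTime() - start) / 1e6
    }.toArray
    java.util.Arrays.sort(times)
    times
  }
}
```

For example, passing the documents and a function that calls transformInstance to measureMillis, then reading percentile(times, 50), percentile(times, 90), percentile(times, 99) and times.last, would give the four columns of the table for that method.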





              • Assignee:
                hazimeh Hussein Hazimeh
              • Votes:
                1
              • Watchers:
                6


                • Created: