
[SPARK-4328] Python serialization updates make Python ML API more brittle to types


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.2.0
    • Fix Version/s: None
    • Component/s: MLlib, PySpark
    • Labels:
      None

      Description

      In Spark 1.1, you could create a LabeledPoint whose label was an integer and then use it with LinearRegressionWithSGD. The Python serialization updates since then broke this. For example, the following code runs on the 1.1 branch but fails on the current master:

      # Run in the pyspark shell, where sc is already defined.
      from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD
      import numpy

      features = numpy.zeros(3)  # any float feature vector
      data = sc.parallelize([LabeledPoint(1, features)])  # note the integer label
      LinearRegressionWithSGD.train(data)  # fails with a ClassCastException
      

      Recommendation: Accept integer labels from Python, coercing them to doubles during serialization.
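
      One way to do that is to coerce on the Python side when the LabeledPoint is constructed. A minimal sketch, assuming the cast lives in LabeledPoint.__init__ and reusing the existing _convert_to_vector helper (hypothetical code, not an actual patch):

      # Hypothetical Python-side fix: cast the label to float in the
      # constructor so the pickled value always unboxes to java.lang.Double.
      from pyspark.mllib.linalg import _convert_to_vector

      class LabeledPoint(object):
          def __init__(self, label, features):
              self.label = float(label)  # accepts int, long, numpy scalars
              self.features = _convert_to_vector(features)

          def __repr__(self):
              return "LabeledPoint(%s, %s)" % (self.label, self.features)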

      The error message you get is:

      py4j.protocol.Py4JJavaError: An error occurred while calling o55.trainLinearRegressionModelWithSGD.
      : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 3.0 failed 1 times, most recent failure: Lost task 7.0 in stage 3.0 (TID 15, localhost): java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Double
      	at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119)
      	at org.apache.spark.mllib.api.python.SerDe$LabeledPointPickler.construct(PythonMLLibAPI.scala:727)
      	at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
      	at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
      	at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
      	at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
      	at org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:804)
      	at org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:803)
      	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
      	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
      	at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1309)
      	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:910)
      	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:910)
      	at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223)
      	at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223)
      	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
      	at org.apache.spark.scheduler.Task.run(Task.scala:56)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:745)
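
      The trace shows the failure point: SerDe's LabeledPointPickler unboxes the label with BoxesRunTime.unboxToDouble, which throws when the pickled label arrives as a java.lang.Integer. Until a coercion lands, an explicit cast in user code sidesteps the problem (a workaround sketch, not from the original report):

      # Workaround: hand the label over as a Python float so the JVM
      # receives a java.lang.Double instead of an Integer.
      data = sc.parallelize([LabeledPoint(float(1), features)])
      LinearRegressionWithSGD.train(data)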
      

              People

              • Assignee:
                Unassigned
              • Reporter:
                Joseph K. Bradley (josephkb)
