  1. Spark
  2. SPARK-12157

Support numpy types as return values of Python UDFs

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 1.5.2
    • Fix Version/s: None
    • Component/s: PySpark, SQL

    Description

      Currently, if I have a Python UDF that returns a numpy value:

      import pyspark.sql.types as T
      import pyspark.sql.functions as F
      from pyspark.sql import Row
      import numpy as np
      
      # np.argmax returns a numpy integer scalar (e.g. numpy.int64), not a Python int
      argmax = F.udf(lambda x: np.argmax(x), T.IntegerType())
      
      df = sqlContext.createDataFrame([Row(array=[1,2,3])])
      df.select(argmax("array")).count()
      

      I get an exception that is fairly opaque:

      Caused by: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
              at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
              at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:701)
              at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:171)
              at net.razorvine.pickle.Unpickler.load(Unpickler.java:85)
              at net.razorvine.pickle.Unpickler.loads(Unpickler.java:98)
              at org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:404)
              at org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:403)
      

      Numpy scalar types such as np.int64 and np.float64 should automatically be converted to the corresponding Python types, so that the returned value matches the UDF's declared return type.
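
      Until such a conversion exists, one possible workaround is to convert the result to a built-in Python type inside the UDF, before it is pickled and sent back to the JVM. The sketch below is only an illustration: to_python and returns_python are hypothetical helper names, not part of PySpark or numpy, and the helpers rely on the assumption that the returned object exposes numpy's tolist() method.

```python
import functools


def to_python(value):
    """Return a plain Python equivalent of a numpy scalar or array.

    numpy scalars and arrays both expose ``tolist()``, which yields
    built-in Python objects; anything else is passed through unchanged.
    (Hypothetical helper, not a PySpark or numpy API.)
    """
    if hasattr(value, "tolist"):
        return value.tolist()
    return value


def returns_python(func):
    """Wrap ``func`` so that its result is converted with ``to_python``."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return to_python(func(*args, **kwargs))
    return wrapper
```

      With these helpers, the UDF from the description could be declared as F.udf(returns_python(np.argmax), T.IntegerType()). For a single scalar result, F.udf(lambda x: int(np.argmax(x)), T.IntegerType()) should have the same effect.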

            People

              Assignee: Unassigned
              Reporter: Justin Uang (justin.uang)
              Votes: 1
              Watchers: 12