Spark / SPARK-12157

Support numpy types as return values of Python UDFs


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 1.5.2
    • Fix Version/s: None
    • Component/s: PySpark, SQL

    Description

      Currently, if I have a Python UDF that returns a NumPy scalar:

      import pyspark.sql.types as T
      import pyspark.sql.functions as F
      from pyspark.sql import Row
      import numpy as np
      
      argmax = F.udf(lambda x: np.argmax(x), T.IntegerType())
      
      df = sqlContext.createDataFrame([Row(array=[1,2,3])])
      df.select(argmax("array")).count()
      

      then I get a fairly opaque exception:

      Caused by: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
              at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
              at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:701)
              at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:171)
              at net.razorvine.pickle.Unpickler.load(Unpickler.java:85)
              at net.razorvine.pickle.Unpickler.loads(Unpickler.java:98)
              at org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:404)
              at org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:403)
      

      NumPy scalar types like np.int64 and np.float64 should automatically be cast to the corresponding Spark SQL types.
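      Until such automatic casting exists, a common workaround is to convert the NumPy scalar to a builtin Python type inside the UDF before it is pickled back to the JVM. A minimal sketch, reusing the names from the snippet above:

      ```python
      import numpy as np

      def argmax_native(x):
          # np.argmax returns a NumPy scalar (e.g. np.int64); the Java-side
          # unpickler cannot reconstruct it, so cast to a builtin int first.
          return int(np.argmax(x))
      ```

      The wrapper can then be registered in place of the original lambda, e.g. F.udf(argmax_native, T.IntegerType()), so only plain Python ints ever cross the Python/JVM boundary.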

      People

        Assignee: Unassigned
        Reporter: Justin Uang (justin.uang)
        Votes: 1
        Watchers: 13
