Description
```python
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

schema = StructType([StructField("node", StringType())])
rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
rdd = rdd.map(lambda obj: {'node': obj})
df_node = spark.createDataFrame(rdd, schema=schema)

df_fname = spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
pd_fname = df_fname.select('fname').toPandas()

@pandas_udf(IntegerType(), PandasUDFType.SCALAR)
def udf_match(word: pd.Series) -> pd.Series:
    my_Series = pd_fname.squeeze()  # DataFrame to Series
    num = int(my_Series.str.contains(word.array[0]).sum())
    return pd.Series(num)

df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
```
Hi, I have two DataFrames, and when I run the code above I get this error:
```
RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1
```
Any help would be greatly appreciated.
PS: I don't think the method itself is the problem: I built sample data in the same shape and it ran successfully. The error only appears with the real data. I've inspected the data and can't figure out the difference.

Does anyone know what causes this?
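For reference, here is my understanding of the contract the error message refers to: a `SCALAR` `pandas_udf` receives a batch of values as a `pd.Series` and must return one result per input value, i.e. a Series of the same length as the batch. This is a minimal plain-pandas sketch (Spark-free, with made-up filenames, so the data here is purely hypothetical) of a batch function that satisfies that contract by computing a count for every word in the batch instead of only the first one:

```python
import pandas as pd

# Hypothetical stand-in for pd_fname (the real data lives in fnames.parquet)
pd_fname = pd.DataFrame({"fname": ["alpha_file", "beta_file", "alphabet"]})
my_series = pd_fname.squeeze()  # single-column DataFrame -> Series

def match_batch(words: pd.Series) -> pd.Series:
    # One count per input word, so len(result) == len(words) --
    # the length Spark checks when it validates the UDF's result vector.
    # regex=False treats each word as a literal substring.
    return words.apply(
        lambda w: int(my_series.str.contains(w, regex=False).sum())
    )

batch = pd.Series(["alpha", "beta", "gamma"])  # a batch as Spark would pass it
counts = match_batch(batch)
# counts has 3 entries, one per word in the batch
```

If I understand correctly, my original `udf_match` returns `pd.Series(num)` of length 1 regardless of the batch size, which would explain "expected 100, got 1" once Spark feeds it batches larger than one row.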