Description
```python
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

schema = StructType([StructField("node", StringType())])
rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
rdd = rdd.map(lambda obj: {'node': obj})
df_node = spark.createDataFrame(rdd, schema=schema)

df_fname = spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
pd_fname = df_fname.select('fname').toPandas()

@pandas_udf(IntegerType(), PandasUDFType.SCALAR)
def udf_match(word: pd.Series) -> pd.Series:
    my_Series = pd_fname.squeeze()  # DataFrame to Series
    num = int(my_Series.str.contains(word.array[0]).sum())
    return pd.Series(num)

df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
```
Hi, I have two DataFrames, and when I run the code above I get this error:
```
RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1
```
Any help would be greatly appreciated.
PS: I don't think the method itself is the problem: I built sample data in the same shape and it ran successfully. The error only appears with the real data. I've inspected the data and can't figure out the difference.

Does anyone know what causes this?
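For reference, here is my understanding of the contract the error message refers to: a `SCALAR` `pandas_udf` receives a batch of values as a `pd.Series` and must return one result per input value, i.e. a Series of the same length as the batch. This is a minimal plain-pandas sketch (Spark-free, with made-up filenames, so the data here is purely hypothetical) of a batch function that satisfies that contract by computing a count for every word in the batch instead of only the first one:

```python
import pandas as pd

# Hypothetical stand-in for pd_fname (the real data lives in fnames.parquet)
pd_fname = pd.DataFrame({"fname": ["alpha_file", "beta_file", "alphabet"]})
my_series = pd_fname.squeeze()  # single-column DataFrame -> Series

def match_batch(words: pd.Series) -> pd.Series:
    # One count per input word, so len(result) == len(words) --
    # the length Spark checks when it validates the UDF's result vector.
    # regex=False treats each word as a literal substring.
    return words.apply(
        lambda w: int(my_series.str.contains(w, regex=False).sum())
    )

batch = pd.Series(["alpha", "beta", "gamma"])  # a batch as Spark would pass it
counts = match_batch(batch)
# counts has 3 entries, one per word in the batch
```

If I understand correctly, my original `udf_match` returns `pd.Series(num)` of length 1 regardless of the batch size, which would explain "expected 100, got 1" once Spark feeds it batches larger than one row.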