Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-37325

Result vector from pandas_udf was not the required length

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Invalid
    • 3.2.0
    • None
    • PySpark
    • None
    • 1

    Description

       

      schema = StructType([
      StructField("node", StringType())
      ])
      rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
      rdd = rdd.map(lambda obj: {'node': obj})
      df_node = spark.createDataFrame(rdd, schema=schema)
      
      df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
      pd_fname = df_fname.select('fname').toPandas()
      
      @pandas_udf(IntegerType(), PandasUDFType.SCALAR)
      def udf_match(word: pd.Series) -> pd.Series:
        my_Series = pd_fname.squeeze() # dataframe to Series
        num = int(my_Series.str.contains(word.array[0]).sum())
        return pd.Series(num)
      
      df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
      

       

      Hi, I have two dataframe, and I try above method, however, I get this

      RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1

      it will be really thankful, if there is any helps

       

      PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the real data error came. I checked the data, can't figure out,

      does anyone know what it cause?

      Attachments

        Activity

          People

            Unassigned Unassigned
            biao liu
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: