Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-22003

vectorized reader does not work with UDF when the column is array

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 2.2.0
    • 2.3.0
    • SQL
    • None

    Description

      The UDF needs to deserialize the UnsafeRow. When the column type is Array, the `get` method from the ColumnVector, which is used by the vectorized reader, is called, but this method is not implemented, unfortunately.

      Code to reproduce the issue:

      val fileName = "testfile"
      val str = """{ "choices": ["key1", "key2", "key3"] }"""
      val rdd = sc.parallelize(Seq(str))
      val df = spark.read.json(rdd)
      df.write.mode("overwrite").parquet(s"file:///tmp/$fileName ")
      
      
      import org.apache.spark.sql._
      import org.apache.spark.sql.functions._
      spark.udf.register("acf", (rows: Seq[Row]) => Option[String](null))
      spark.read.parquet(s"file:///tmp/$fileName ").select(expr("""acf(choices)""")).show
      

      Attachments

        Activity

          People

            fengliu@databricks.com Feng Liu
            liufeng.ee@gmail.com Feng Liu
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: