Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-22003

vectorized reader does not work with UDF when the column is array

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.3.0
    • Component/s: SQL
    • Labels:
      None

      Description

      The UDF needs to deserialize the UnsafeRow. When the column type is Array, the `get` method from the ColumnVector, which is used by the vectorized reader, is called, but this method is not implemented, unfortunately.

      Code to reproduce the issue:

      val fileName = "testfile"
      val str = """{ "choices": ["key1", "key2", "key3"] }"""
      val rdd = sc.parallelize(Seq(str))
      val df = spark.read.json(rdd)
      df.write.mode("overwrite").parquet(s"file:///tmp/$fileName ")
      
      
      import org.apache.spark.sql._
      import org.apache.spark.sql.functions._
      spark.udf.register("acf", (rows: Seq[Row]) => Option[String](null))
      spark.read.parquet(s"file:///tmp/$fileName ").select(expr("""acf(choices)""")).show
      

        Attachments

          Activity

            People

            • Assignee:
              fengliu@databricks.com Feng Liu
              Reporter:
              liufeng.ee@gmail.com Feng Liu
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: