Spark / SPARK-35461

Error when reading dictionary-encoded Parquet int column when read schema is bigint

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.2.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      When a Parquet file contains a dictionary-encoded int column and the user specifies a read schema that declares that column as bigint, Spark currently fails with the following exception:

      java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
      	at org.apache.parquet.column.Dictionary.decodeToLong(Dictionary.java:49)
      	at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToLong(ParquetDictionary.java:50)
      	at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:364)
      	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
      	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
      	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
      	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:344)
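
The top two frames suggest the mechanism: the base `Dictionary.decodeToLong` throws, and the integer dictionary does not override it. The sketch below models that with simplified stand-in classes. `ModelDictionary`, `ModelIntDictionary`, and `UpcastingDictionary` are hypothetical names introduced here for illustration; they are not Parquet or Spark classes, and the widening wrapper is only one possible way to handle the int-to-long case, not a description of any actual fix.

```scala
// Stand-in for a dictionary base class whose decode methods are
// unsupported unless a subclass overrides them (an assumption modelled
// on the stack trace, where the exception message is the class name).
abstract class ModelDictionary {
  def decodeToInt(id: Int): Int =
    throw new UnsupportedOperationException(getClass.getName)
  def decodeToLong(id: Int): Long =
    throw new UnsupportedOperationException(getClass.getName)
}

// Stand-in for an int dictionary: only decodeToInt is implemented,
// so decodeToLong falls through to the throwing default.
final class ModelIntDictionary(values: Array[Int]) extends ModelDictionary {
  override def decodeToInt(id: Int): Int = values(id)
}

// Hypothetical wrapper that serves long reads by widening the int value.
final class UpcastingDictionary(underlying: ModelDictionary) extends ModelDictionary {
  override def decodeToInt(id: Int): Int = underlying.decodeToInt(id)
  override def decodeToLong(id: Int): Long = underlying.decodeToInt(id).toLong
}
```

In this model, calling `decodeToLong` on a `ModelIntDictionary` throws `UnsupportedOperationException`, while the same call through `UpcastingDictionary` returns the widened value.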
      

      To reproduce:

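          import org.apache.spark.sql.types.{LongType, StructField, StructType}

          // 11 distinct ints, each repeated 10 times: low cardinality, so dictionary encoding applies
          val data = (0 to 10).flatMap(n => Seq.fill(10)(n)).map(i => (i, i.toString))
          // withParquetFile is the ParquetTest helper from Spark's test sources
          withParquetFile(data) { path =>
            // Column "_1" is stored as int; request it as bigint (LongType)
            val readSchema = StructType(Seq(StructField("_1", LongType)))
            spark.read.schema(readSchema).parquet(path).first()
          }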
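
Until the reader handles this case, one possible workaround (not part of the original report) is to read the file with its own schema and widen the column afterwards with a cast. The sketch below assumes Spark is on the classpath; `ReadIntAsLong` and `readWidened` are hypothetical names introduced for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType

object ReadIntAsLong {
  // Hypothetical helper: reads the Parquet file using the schema stored in
  // the file (so the int column is decoded as int), then replaces the named
  // column with a bigint version of itself. Other columns are kept as-is.
  def readWidened(spark: SparkSession, path: String, column: String): DataFrame =
    spark.read.parquet(path).withColumn(column, col(column).cast(LongType))
}
```

For the data above, `ReadIntAsLong.readWidened(spark, path, "_1")` would return `_1` as bigint alongside the unchanged `_2` column, at the cost of not pushing the desired type into the read schema.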
      

            People

              Assignee: Unassigned
              Reporter: Chao Sun (csun)
              Votes: 0
              Watchers: 4
