Spark / SPARK-34897

Support reconcile schemas based on index after nested column pruning

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.2, 3.1.1, 3.2.0
    • Fix Version/s: 3.0.3, 3.1.2, 3.2.0
    • Component/s: SQL
    • Labels:
      None

      Description

      How to reproduce this issue:

      spark.sql(
        """
          |CREATE TABLE `t1` (
          |  `_col0` INT,
          |  `_col1` STRING,
          |  `_col2` STRUCT<`c1`: STRING, `c2`: STRING, `c3`: STRING, `c4`: BIGINT>,
          |  `_col3` STRING)
          |USING orc
          |PARTITIONED BY (_col3)
          |""".stripMargin)
      
      spark.sql("INSERT INTO `t1` values(1, '2', null, '2021-02-01')")
      
      spark.sql("SELECT _col2.c1, _col0 FROM `t1` WHERE _col3 = '2021-02-01'").show
      

      Error message:

      java.lang.AssertionError: assertion failed: The given data schema struct<_col0:int,_col2:struct<c1:string>> has less fields than the actual ORC physical schema, no idea which columns were dropped, fail to read. Try to disable 
      	at scala.Predef$.assert(Predef.scala:223)
      	at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.requestedColumnIds(OrcUtils.scala:159)
      	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$3(OrcFileFormat.scala:180)
      	at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2620)
      	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$1(OrcFileFormat.scala:178)
      	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:117)
      	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:165)
      	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:94)
      	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
      	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
      	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
      	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756)
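
      Why the assertion fires, and what index-based reconciliation could look like: the query only needs `_col0` and `_col2.c1`, so after nested column pruning the data schema has two top-level fields, while the ORC file still holds three data columns (`_col0`, `_col1`, `_col2`; the partition column `_col3` is not stored in the file). Because the physical names are positional-looking (`_colN`), the reader cannot tell which column was dropped from the pruned schema alone. The following is a minimal, self-contained sketch of the idea in the issue title, assuming the full table schema is still available to look positions up in. The object and method names are hypothetical and this is not Spark's `OrcUtils` code.

```scala
// Hypothetical, simplified model of index-based reconciliation.
// It is NOT Spark's implementation; names here are illustrative only.
object ReconcileByIndexSketch {

  // Files whose top-level names are all of the form _colN carry no usable
  // names, so columns have to be matched by position instead.
  def isPositional(physicalNames: Seq[String]): Boolean =
    physicalNames.nonEmpty && physicalNames.forall(_.startsWith("_col"))

  // For each required (pruned) top-level field, look up its position in the
  // full table data schema. None means the table schema has no such field.
  def requestedColumnIds(
      tableFields: Seq[String],
      requiredFields: Seq[String]): Seq[Option[Int]] =
    requiredFields.map { name =>
      val idx = tableFields.indexOf(name)
      if (idx >= 0) Some(idx) else None
    }

  def main(args: Array[String]): Unit = {
    // Data columns of t1 from the reproduction (partition column excluded).
    val tableFields = Seq("_col0", "_col1", "_col2")
    // Top-level fields left after pruning for SELECT _col2.c1, _col0.
    val required = Seq("_col0", "_col2")

    println(isPositional(tableFields))
    println(requestedColumnIds(tableFields, required))
  }
}
```

      With the positions taken from the full table schema, the pruned field `_col2` resolves to physical column 2 instead of being mistaken for the second column of the file.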
      

              People

              • Assignee: Yuming Wang (yumwang)
              • Reporter: Yuming Wang (yumwang)