[SPARK-34897] Support reconcile schemas based on index after nested column pruning - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.0.2, 3.1.1, 3.2.0
Fix Version/s: 3.0.3, 3.1.2, 3.2.0
Component/s: SQL
Labels: None

Description

How to reproduce this issue:

spark.sql(
  """
    |CREATE TABLE `t1` (
    |  `_col0` INT,
    |  `_col1` STRING,
    |  `_col2` STRUCT<`c1`: STRING, `c2`: STRING, `c3`: STRING, `c4`: BIGINT>,
    |  `_col3` STRING)
    |USING orc
    |PARTITIONED BY (_col3)
    |""".stripMargin)

spark.sql("INSERT INTO `t1` values(1, '2', null, '2021-02-01')")

spark.sql("SELECT _col2.c1, _col0 FROM `t1` WHERE _col3 = '2021-02-01'").show

Error message:

java.lang.AssertionError: assertion failed: The given data schema struct<_col0:int,_col2:struct<c1:string>> has less fields than the actual ORC physical schema, no idea which columns were dropped, fail to read. Try to disable 
	at scala.Predef$.assert(Predef.scala:223)
	at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.requestedColumnIds(OrcUtils.scala:159)
	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$3(OrcFileFormat.scala:180)
	at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2620)
	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$1(OrcFileFormat.scala:178)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:117)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:165)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:94)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756)
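
The trace points at an assertion in OrcUtils.requestedColumnIds, and the message says the pruned data schema has fewer fields than the physical ORC schema. A minimal, Spark-free sketch of that kind of check follows. It assumes that field names starting with `_col` are treated as positional placeholders and matched by index. The object and method names are illustrative only and are not Spark's actual code.

```scala
// Simplified model of an index-based schema reconciliation check.
// Assumption: every name here is hypothetical; this is not Spark's implementation.
object IndexBasedReconcileSketch {

  // Placeholder-style names ("_col0", "_col1", ...) carry no real column
  // information, so the reader falls back to matching columns by position.
  def looksPositional(orcFieldNames: Seq[String]): Boolean =
    orcFieldNames.forall(_.startsWith("_col"))

  // For each required field, return the index of the physical column to read.
  def requestedColumnIds(
      orcFieldNames: Seq[String],
      dataFieldNames: Seq[String],
      requiredFieldNames: Seq[String]): Seq[Int] = {
    if (looksPositional(orcFieldNames)) {
      // Positional matching is only safe while the data schema still lists
      // every physical column; a pruned schema breaks that assumption.
      assert(
        orcFieldNames.length <= dataFieldNames.length,
        s"data schema has ${dataFieldNames.length} fields, " +
          s"physical schema has ${orcFieldNames.length}")
      requiredFieldNames.map(name => dataFieldNames.indexOf(name))
    } else {
      // Real field names are available, so match by name.
      requiredFieldNames.map(name => orcFieldNames.indexOf(name))
    }
  }

  def main(args: Array[String]): Unit = {
    // The file for table t1 stores _col0, _col1, _col2 (_col3 is the partition column).
    val physical = Seq("_col0", "_col1", "_col2")
    // After pruning, only _col0 and _col2 remain in the data schema.
    val pruned = Seq("_col0", "_col2")
    try {
      requestedColumnIds(physical, pruned, pruned)
    } catch {
      case e: AssertionError => println(s"read fails: ${e.getMessage}")
    }
  }
}
```

In this model the query from the reproduction hits the assertion because the table's own column names happen to start with `_col`, which makes the file look as if it had no real field names.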

Issue Links

is duplicated by

SPARK-35010 nestedSchemaPruning causes issue when reading hive generated Orc files (Resolved)

SPARK-35190 all columns are read even if column pruning applies when spark3.0 read table written by spark2.2 (Resolved)

SPARK-35191 all columns are read even if column pruning applies when spark3.0 read table written by spark2.2 (Resolved)

is related to

HIVE-4243 Fix column names in FileSinkOperator (Closed)

links to

[Github] Pull Request #31993 (wangyum)

[Github] Pull Request #32279 (wangyum)

[Github] Pull Request #32310 (wangyum)

People

Assignee: Yuming Wang
Reporter: Yuming Wang
Votes: 0
Watchers: 2

Dates

Created: 29/Mar/21 08:51
Updated: 24/Apr/21 12:20
Resolved: 24/Apr/21 12:18