SPARK-26709: OptimizeMetadataOnlyQuery does not correctly handle files with zero records

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.1.3, 2.2.3, 2.3.2, 2.4.0
    • Fix Version/s: 2.3.3, 2.4.1, 3.0.0
    • Component/s: SQL
    • Labels:

      Description

      // Uses Spark SQL test helpers: withSQLConf/withTempPath (SQLTestUtils) and checkAnswer (QueryTest).
      import org.apache.hadoop.fs.Path
      import org.apache.spark.sql.Row
      import org.apache.spark.sql.functions.lit
      import org.apache.spark.sql.internal.SQLConf

      withSQLConf(SQLConf.OPTIMIZER_METADATA_ONLY.key -> "true") {
        withTempPath { path =>
          val tabLocation = path.getAbsolutePath
          val partLocation = new Path(path.getAbsolutePath, "partCol1=3")
          // Write a zero-record DataFrame into a single partition directory (partCol1=3).
          val df = spark.emptyDataFrame.select(lit(1).as("col1"))
          df.write.parquet(partLocation.toString)
          val readDF = spark.read.parquet(tabLocation)
          // With no records, both aggregates must return null; the metadata-only
          // optimization wrongly answers max(partCol1) from the partition value 3.
          checkAnswer(readDF.selectExpr("max(partCol1)"), Row(null))
          checkAnswer(readDF.selectExpr("max(col1)"), Row(null))
        }
      }
      

      OptimizeMetadataOnlyQuery has a correctness bug when handling files with zero records in partitioned tables. The test above fails on 2.4, whose write path can produce such an empty file, but the underlying issue in the read path also exists in 2.3, 2.2, and 2.1.
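
      For reference, the sketch below (not from this report) reproduces the same scenario outside the test harness: it builds a local SparkSession, writes the same zero-record partition, and runs max(partCol1) with the metadata-only rule toggled through spark.sql.optimizer.metadataOnly, the key behind SQLConf.OPTIMIZER_METADATA_ONLY. On affected versions the first query is expected to return 3 instead of null; disabling the config forces a normal scan and can serve as a workaround. The object name and temp-directory handling are illustrative assumptions.

      import java.nio.file.Files

      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions.lit

      // Illustrative standalone reproduction sketch for SPARK-26709 (names are hypothetical).
      object MetadataOnlyZeroRecordRepro {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder()
            .master("local[1]")
            .appName("SPARK-26709 repro")
            .getOrCreate()

          val tabLocation = Files.createTempDirectory("spark_26709").toString
          val partLocation = s"$tabLocation/partCol1=3"

          // A zero-record DataFrame write still produces a footer-only Parquet file on 2.4.
          spark.emptyDataFrame.select(lit(1).as("col1")).write.parquet(partLocation)

          val readDF = spark.read.parquet(tabLocation)

          // Enable the metadata-only optimization to exercise the buggy path:
          // affected versions answer max(partCol1) from the partition value (3).
          spark.conf.set("spark.sql.optimizer.metadataOnly", "true")
          readDF.selectExpr("max(partCol1)").show()

          // Workaround on affected versions: disable the rule so the aggregate
          // scans the empty data files and correctly returns null.
          spark.conf.set("spark.sql.optimizer.metadataOnly", "false")
          readDF.selectExpr("max(partCol1)").show()

          spark.stop()
        }
      }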

            People

            • Assignee: Gengliang Wang
            • Reporter: Xiao Li
            • Votes: 0
            • Watchers: 5
