Details
Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Versions: 2.1.0, 2.1.3, 2.2.3, 2.3.2, 2.4.0
Description
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.internal.SQLConf

withSQLConf(SQLConf.OPTIMIZER_METADATA_ONLY.key -> "true") {
  withTempPath { path =>
    val tabLocation = path.getAbsolutePath
    val partLocation = new Path(path.getAbsolutePath, "partCol1=3")
    // Write a Parquet file with zero records into a single partition directory.
    val df = spark.emptyDataFrame.select(lit(1).as("col1"))
    df.write.parquet(partLocation.toString)
    val readDF = spark.read.parquet(tabLocation)
    // Both aggregates should return NULL, since the table contains no rows.
    checkAnswer(readDF.selectExpr("max(partCol1)"), Row(null))
    checkAnswer(readDF.selectExpr("max(col1)"), Row(null))
  }
}
OptimizeMetadataOnlyQuery has a correctness bug when handling files that contain no records in partitioned tables: it answers the aggregate from partition metadata alone, even though the partition holds zero rows. The test above fails in 2.4, which can generate such an empty file, but the underlying issue in the read path also exists in 2.3, 2.2, and 2.1.
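For context, here is a minimal standalone sketch of the same scenario that runs outside Spark's test framework. The object name and the /tmp scratch path are placeholders, and spark.sql.optimizer.metadataOnly is the config key behind SQLConf.OPTIMIZER_METADATA_ONLY; disabling it forces a real scan and sidesteps the wrong answer until a fixed release is used.

// Hypothetical standalone reproduction (object name and /tmp path are
// placeholders); the in-suite test above uses Spark's test helpers instead.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object Spark26709Repro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SPARK-26709 repro")
      .getOrCreate()

    val tabLocation = "/tmp/spark-26709"          // assumed scratch location
    val partLocation = s"$tabLocation/partCol1=3"

    // A zero-row Parquet file written into a single partition directory.
    spark.emptyDataFrame.select(lit(1).as("col1"))
      .write.mode("overwrite").parquet(partLocation)

    val readDF = spark.read.parquet(tabLocation)

    // With the metadata-only optimization enabled, max(partCol1) is answered
    // from partition metadata alone and wrongly returns 3 instead of NULL.
    spark.conf.set("spark.sql.optimizer.metadataOnly", "true")
    readDF.selectExpr("max(partCol1)").show()

    // Disabling the rule forces a scan of the data and yields the correct NULL.
    spark.conf.set("spark.sql.optimizer.metadataOnly", "false")
    readDF.selectExpr("max(partCol1)").show()

    spark.stop()
  }
}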
Issue Links
- is caused by: SPARK-15752 "Optimize metadata only query that has an aggregate whose children are deterministic project or filter operators" (Resolved)
- is duplicated by: SPARK-26996 "Scalar Subquery not handled properly in Spark 2.4" (Closed)
- is related to: SPARK-34194 "Queries that only touch partition columns shouldn't scan through all files" (Resolved)