SPARK-26709: OptimizeMetadataOnlyQuery does not correctly handle files with zero records

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.1.3, 2.2.3, 2.3.2, 2.4.0
    • Fix Version/s: 2.3.3, 2.4.1, 3.0.0
    • Component/s: SQL
    • Labels:

      Description

      // Uses Spark SQL test helpers: withSQLConf/withTempPath (SQLTestUtils) and checkAnswer (QueryTest).
      import org.apache.hadoop.fs.Path
      import org.apache.spark.sql.Row
      import org.apache.spark.sql.functions.lit
      import org.apache.spark.sql.internal.SQLConf

      withSQLConf(SQLConf.OPTIMIZER_METADATA_ONLY.key -> "true") {
        withTempPath { path =>
          val tabLocation = path.getAbsolutePath
          val partLocation = new Path(path.getAbsolutePath, "partCol1=3")
          // Write a zero-record DataFrame into a single partition directory (partCol1=3).
          val df = spark.emptyDataFrame.select(lit(1).as("col1"))
          df.write.parquet(partLocation.toString)
          val readDF = spark.read.parquet(tabLocation)
          // With no records, both aggregates must return null; the metadata-only
          // optimization wrongly answers max(partCol1) from the partition value 3.
          checkAnswer(readDF.selectExpr("max(partCol1)"), Row(null))
          checkAnswer(readDF.selectExpr("max(col1)"), Row(null))
        }
      }
      

      OptimizeMetadataOnlyQuery has a correctness bug when handling files with zero records in partitioned tables. The test above fails on 2.4, whose write path can produce such an empty file, but the underlying issue in the read path also exists in 2.3, 2.2, and 2.1.
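
      For reference, the sketch below (not from this report) reproduces the same scenario outside the test harness: it builds a local SparkSession, writes the same zero-record partition, and runs max(partCol1) with the metadata-only rule toggled through spark.sql.optimizer.metadataOnly, the key behind SQLConf.OPTIMIZER_METADATA_ONLY. On affected versions the first query is expected to return 3 instead of null; disabling the config forces a normal scan and can serve as a workaround. The object name and temp-directory handling are illustrative assumptions.

      import java.nio.file.Files

      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions.lit

      // Illustrative standalone reproduction sketch for SPARK-26709 (names are hypothetical).
      object MetadataOnlyZeroRecordRepro {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder()
            .master("local[1]")
            .appName("SPARK-26709 repro")
            .getOrCreate()

          val tabLocation = Files.createTempDirectory("spark_26709").toString
          val partLocation = s"$tabLocation/partCol1=3"

          // A zero-record DataFrame write still produces a footer-only Parquet file on 2.4.
          spark.emptyDataFrame.select(lit(1).as("col1")).write.parquet(partLocation)

          val readDF = spark.read.parquet(tabLocation)

          // Enable the metadata-only optimization to exercise the buggy path:
          // affected versions answer max(partCol1) from the partition value (3).
          spark.conf.set("spark.sql.optimizer.metadataOnly", "true")
          readDF.selectExpr("max(partCol1)").show()

          // Workaround on affected versions: disable the rule so the aggregate
          // scans the empty data files and correctly returns null.
          spark.conf.set("spark.sql.optimizer.metadataOnly", "false")
          readDF.selectExpr("max(partCol1)").show()

          spark.stop()
        }
      }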

            People

            • Assignee: Gengliang Wang
            • Reporter: Xiao Li
            • Votes: 0
            • Watchers: 5
