
HUDI-7017: Prevent full schema evolution from wrongly falling back to OOB



    Description

For MOR tables that have these two configurations enabled:

       

      hoodie.schema.on.read.enable=true
      hoodie.datasource.read.extract.partition.values.from.path=true
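For context, a minimal way to enable both flags on an active session before reading the table (a sketch only; the app name and table path are placeholders):

import org.apache.spark.sql.SparkSession

// Minimal sketch: a local session with both flags enabled (path is a placeholder)
val spark = SparkSession.builder()
  .appName("hudi-7017-repro")
  .master("local[*]")
  .getOrCreate()

spark.sql("set hoodie.schema.on.read.enable=true")
spark.sql("set hoodie.datasource.read.extract.partition.values.from.path=true")

// Any snapshot read of a partitioned MOR table that has undergone a column
// type change can then reach the code path described below.
val df = spark.read.format("hudi").load("/path/to/mor/table")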

       

       

BaseFileReader will use a requiredSchemaReader when reading some of the parquet files. This BaseFileReader has an empty internalSchemaStr, causing Spark3XLegacyHoodieParquetFileFormat to fall back to out-of-the-box (OOB) schema evolution.
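The fallback decision effectively hinges on whether the serialized internal schema is present. A minimal sketch of that check (illustrative only, not the actual Hudi source; the function name is made up):

// Illustrative sketch, not the actual Hudi source: the file format only stays
// on Hudi full schema evolution when a serialized InternalSchema is supplied.
def usesHudiFullSchemaEvolution(internalSchemaStr: String): Boolean =
  internalSchemaStr != null && internalSchemaStr.nonEmpty

// The bug: for MOR tables with the two configs above, internalSchemaStr
// arrives empty for the requiredSchemaReader, so this returns false and the
// reader silently falls back to OOB schema evolution.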

       

Although safeguards were added in HUDI-5400 to force the code execution path to use Hudi full schema evolution, we should still fix this so that future changes that deprecate Spark3XLegacyHoodieParquetFileFormat will not cause issues.

       

A sample test that reproduces this (it relies on helpers from Hudi's Spark SQL test base):

      test("Test wrong fallback to OOB schema evolution") {
        withRecordType()(withTempDir { tmp =>
          Seq("mor").foreach { tableType =>
            val tableName = generateTableName
            val tablePath = s"${new Path(tmp.getCanonicalPath, tableName).toUri.toString}"
            if (HoodieSparkUtils.gteqSpark3_1) {
              spark.sql("set " + SPARK_SQL_INSERT_INTO_OPERATION.key + "=upsert")
              spark.sql("set hoodie.schema.on.read.enable=true")
              spark.sql("hoodie.datasource.read.extract.partition.values.from.path=true")
              // NOTE: This is required since as this tests use type coercions which were only permitted in Spark 2.x
              //       and are disallowed now by default in Spark 3.x
              spark.sql("set spark.sql.storeAssignmentPolicy=legacy")
              createAndPreparePartitionTable(spark, tableName, tablePath, tableType)
              // date -> string
              spark.sql(s"alter table $tableName alter column col6 type String")
              checkAnswer(spark.sql(s"select col6 from $tableName where id = 1").collect())(
                Seq("2021-12-25")
              )
            }
          }
        })
      } 

       

      Debugger snapshots:

As can be seen, requiredSchema (used as the pruning input) has an internalSchema string, but requiredDataSchema has a null internalSchema string.

       

As a result, the internalSchemaStr that is passed into Spark3XLegacyHoodieParquetFileFormat is null, which should not be the case.
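To make the discrepancy concrete, a sketch of the two schema holders at this point (TableSchemaInfo is an illustrative stand-in, not Hudi's actual class):

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Illustrative stand-in for the schema holder handed to the file format
case class TableSchemaInfo(structTypeSchema: StructType,
                           internalSchemaStr: Option[String])

val prunedStruct = StructType(Seq(StructField("col6", StringType)))

// What the debugger shows: the pruning input still carries the serialized
// internal schema...
val requiredSchema = TableSchemaInfo(prunedStruct, Some("<serialized InternalSchema>"))

// ...but the schema actually handed to Spark3XLegacyHoodieParquetFileFormat
// has lost it, which is what triggers the OOB fallback.
val requiredDataSchema = TableSchemaInfo(prunedStruct, None)

// A fix along these lines would propagate internalSchemaStr when deriving
// requiredDataSchema, e.g. requiredSchema.copy(structTypeSchema = prunedStruct).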

       

People

    Assignee: voonhous voon
    Reporter: voonhous voon
