
HUDI-7017: Prevent full schema evolution from wrongly falling back to OOB



    Description

For MOR tables that have these two configurations enabled:

       

      hoodie.schema.on.read.enable=true
      hoodie.datasource.read.extract.partition.values.from.path=true
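For context, a minimal way to enable both flags on an active session before reading the table (a sketch only; the app name and table path are placeholders):

import org.apache.spark.sql.SparkSession

// Minimal sketch: a local session with both flags enabled (path is a placeholder)
val spark = SparkSession.builder()
  .appName("hudi-7017-repro")
  .master("local[*]")
  .getOrCreate()

spark.sql("set hoodie.schema.on.read.enable=true")
spark.sql("set hoodie.datasource.read.extract.partition.values.from.path=true")

// Any snapshot read of a partitioned MOR table that has undergone a column
// type change can then reach the code path described below.
val df = spark.read.format("hudi").load("/path/to/mor/table")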

       

       

BaseFileReader will use a requiredSchemaReader when reading some of the parquet files. This BaseFileReader has an empty internalSchemaStr, causing Spark3XLegacyHoodieParquetFileFormat to fall back to out-of-the-box (OOB) schema evolution.
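The fallback decision effectively hinges on whether the serialized internal schema is present. A minimal sketch of that check (illustrative only, not the actual Hudi source; the function name is made up):

// Illustrative sketch, not the actual Hudi source: the file format only stays
// on Hudi full schema evolution when a serialized InternalSchema is supplied.
def usesHudiFullSchemaEvolution(internalSchemaStr: String): Boolean =
  internalSchemaStr != null && internalSchemaStr.nonEmpty

// The bug: for MOR tables with the two configs above, internalSchemaStr
// arrives empty for the requiredSchemaReader, so this returns false and the
// reader silently falls back to OOB schema evolution.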

       

Although safeguards were added in HUDI-5400 to force the code execution path to use Hudi full schema evolution, we should still fix this so that future changes that deprecate Spark3XLegacyHoodieParquetFileFormat will not cause issues.

       

A sample test that reproduces this (it relies on helpers from Hudi's Spark SQL test base):

      test("Test wrong fallback to OOB schema evolution") {
        withRecordType()(withTempDir { tmp =>
          Seq("mor").foreach { tableType =>
            val tableName = generateTableName
            val tablePath = s"${new Path(tmp.getCanonicalPath, tableName).toUri.toString}"
            if (HoodieSparkUtils.gteqSpark3_1) {
              spark.sql("set " + SPARK_SQL_INSERT_INTO_OPERATION.key + "=upsert")
              spark.sql("set hoodie.schema.on.read.enable=true")
              spark.sql("hoodie.datasource.read.extract.partition.values.from.path=true")
              // NOTE: This is required since as this tests use type coercions which were only permitted in Spark 2.x
              //       and are disallowed now by default in Spark 3.x
              spark.sql("set spark.sql.storeAssignmentPolicy=legacy")
              createAndPreparePartitionTable(spark, tableName, tablePath, tableType)
              // date -> string
              spark.sql(s"alter table $tableName alter column col6 type String")
              checkAnswer(spark.sql(s"select col6 from $tableName where id = 1").collect())(
                Seq("2021-12-25")
              )
            }
          }
        })
      } 

       

      Debugger snapshots:

As can be seen, requiredSchema (used as the pruning input) has an internalSchema string, but requiredDataSchema has a null internalSchema string.

       

As a result, the internalSchemaStr that is passed into Spark3XLegacyHoodieParquetFileFormat is null, which should not be the case.
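To make the discrepancy concrete, a sketch of the two schema holders at this point (TableSchemaInfo is an illustrative stand-in, not Hudi's actual class):

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Illustrative stand-in for the schema holder handed to the file format
case class TableSchemaInfo(structTypeSchema: StructType,
                           internalSchemaStr: Option[String])

val prunedStruct = StructType(Seq(StructField("col6", StringType)))

// What the debugger shows: the pruning input still carries the serialized
// internal schema...
val requiredSchema = TableSchemaInfo(prunedStruct, Some("<serialized InternalSchema>"))

// ...but the schema actually handed to Spark3XLegacyHoodieParquetFileFormat
// has lost it, which is what triggers the OOB fallback.
val requiredDataSchema = TableSchemaInfo(prunedStruct, None)

// A fix along these lines would propagate internalSchemaStr when deriving
// requiredDataSchema, e.g. requiredSchema.copy(structTypeSchema = prunedStruct).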

       

People

    Assignee: voonhous voon
    Reporter: voonhous voon
