Uploaded image for project: 'Apache Hudi'
  1. Apache Hudi
  2. HUDI-915

Partition Columns missing in files upserted after Metadata Bootstrap




      This issue happens in when the source data is partitioned using hive-style partitioning which is also the default behavior of spark when it writes the data. With this partitioning, the partition column/schema is never stored in the files but instead retrieved on the fly from the file paths which have partition folder in the form partition_key=partition_value.

      Now, during metadata bootstrap we store only the metadata columns in the hudi table folder. Also the bootstrap schema we are computing directly reads schema from the source data file which does not have the partition column schema in it. Thus it is not complete.

      All this manifests into issues when we ultimately do upserts on these bootstrapped files and they are fully bootstrapped. During upsert time the schema evolves because the upsert dataframe needs to have partition column in it for performing upserts. Thus ultimately the upserted rows have the correct partition column value stored, while the other records which are simply copied over from the metadata bootstrap file have missing partition column in them. Thus, we observe a different behavior here with bootstrapped vs non-bootstrapped tables.

      While this is not at the moment creating issues with Hive because it is able to determine the partition columns becuase of all the metadata it stores, however it creates a problem with other engines like Spark where the partition columns will show up as null when the upserted files are read.

      Thus, the proposal is to fix the following issues:

      • When performing bootstrap, figure out the partition schema and store it in the bootstrap schema in the commit metadata file. This would provide the following benefits:
        • From a completeness perspective this is good so that there is no behavioral changes between bootstrapped vs non-bootstrapped tables.
        • In spark bootstrap relation and incremental query relation where we need to figure out the latest schema, once can simply get the accurate schema from the commit metadata file instead of having to determine whether or not partition column is present in the schema obtained from the metadata file and if not figure out the partition schema everytime and merge (which can be expensive).
      • When doing upsert on files that are metadata bootstrapped, the partition column values should be correctly determined and copied to the upserted file to avoid missing and null values.
        • Again this is consistent behavior with non-bootstrapped tables and even though Hive seems to somehow handle this, we should consider other engines like Spark where it cannot be automatically handled.
        • Without this it will be significantly more complicated to be able to provide the partition value on read side in spark, to be able to determine everytime whether partition value is null and somehow filling it in.
        • Once the table is fully bootstrapped at some point in future, and the bootstrap commit is say cleaned up and spark querying happens through parquet datasource instead of new bootstrapped datasource, the parquet datasource will return null values wherever it find the missing partition values. In that case, we have no control over the parquet datasource as it is simply reading from the file. 


        Issue Links



              alexey.kudinkin Alexey Kudinkin
              uditme Udit Mehrotra
              0 Vote for this issue
              1 Start watching this issue