Description
I have parquet data that I load into a DataFrame and then save as a partitioned parquet table by doing
df.write.partitionBy("partition").parquet(tablePath)
In the table, each partition is a directory labeled like partition=2022%2F1%2F25
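For context, a minimal sketch of how that layout arises, using a hypothetical DataFrame with key, ts, and a partition column holding values like 2022/1/25 (the real data has the same shape, just more rows); Spark percent-encodes the "/" when it builds the partition directories:

import spark.implicits._

// Hypothetical sample rows; the real table just has more data.
val df = Seq(
  ("k1", 1L, "2022/1/24"),
  ("k2", 2L, "2022/1/25")
).toDF("key", "ts", "partition")

// The "/" in the partition value gets escaped, so the directories come out as
// partition=2022%2F1%2F24, partition=2022%2F1%2F25, ...
df.write.partitionBy("partition").parquet(tablePath)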
I then do a bootstrap by doing
import org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider
import org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector
import org.apache.hudi.{DataSourceWriteOptions, HoodieDataSourceHelpers}
import org.apache.hudi.config.{HoodieBootstrapConfig, HoodieWriteConfig}
import org.apache.hudi.keygen.SimpleKeyGenerator
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.types._

val srcPath = "/Users/jon/Documents/bootstrap_testing/partitioned-parquet-table-fixed"
val basePath = "/Users/jon/Documents/bootstrap_testing/tables/test8"

val bootstrapDF = spark.emptyDataFrame
bootstrapDF.write
  .format("hudi")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie_test")
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.BOOTSTRAP_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "key")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partition")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "ts")
  .option(HoodieBootstrapConfig.BOOTSTRAP_BASE_PATH_PROP, srcPath)
  .option(HoodieBootstrapConfig.BOOTSTRAP_KEYGEN_CLASS, classOf[SimpleKeyGenerator].getName)
  .option(HoodieBootstrapConfig.BOOTSTRAP_MODE_SELECTOR, classOf[BootstrapRegexModeSelector].getName)
  .option(HoodieBootstrapConfig.BOOTSTRAP_MODE_SELECTOR_REGEX, "2022/1/2[4-8]")
  .option(HoodieBootstrapConfig.BOOTSTRAP_MODE_SELECTOR_REGEX_MODE, "METADATA_ONLY")
  .option(HoodieBootstrapConfig.FULL_BOOTSTRAP_INPUT_PROVIDER, classOf[SparkParquetBootstrapDataProvider].getName)
  .mode(SaveMode.Overwrite)
  .save(basePath)
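The HoodieDataSourceHelpers import is not actually used in the snippet above; if useful, the resulting commit can be sanity-checked with it, roughly like this (my assumption, not part of the repro steps):

import org.apache.hadoop.fs.FileSystem

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
// Latest completed commit on the new Hudi table; the bootstrap commit should show up here.
HoodieDataSourceHelpers.latestCommit(fs, basePath)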
That does not mark anything as METADATA_ONLY, because the regex is matched against the directory name, not the partition path value; this should be clarified in the configs. I then change the regex to
partition=2022%2F1%2F2[4-8]
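Concretely, the only change from the snippet above is the mode-selector regex, so that it matches the escaped directory names:

  .option(HoodieBootstrapConfig.BOOTSTRAP_MODE_SELECTOR_REGEX, "partition=2022%2F1%2F2[4-8]")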
This works properly, but there is an issue.
Inside the Hudi table, the top-level directories are:
2022
partition=2022%2F1%2F24
partition=2022%2F1%2F25
partition=2022%2F1%2F26
partition=2022%2F1%2F27
partition=2022%2F1%2F28
The 2022 directory contains the FULL_BOOTSTRAP partitions, but the METADATA_ONLY partitions are in the other partition= directories.
Maybe that is OK, so I try to read from the Hudi table. This file contains the output from my attempt: scala_output_bootstrap1.txt
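The read itself is just the standard Hudi datasource load; the query below is illustrative, the actual queries and their output are in the attached file:

spark.read.format("hudi").load(basePath).createOrReplaceTempView("test_table")
spark.sql("select count(*) from test_table").show()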
I go back to my parquet table, make a copy, and move the partitions into the Hudi-style directory structure shown below (a sketch of the move follows the listing):
2022->1->24
2022->1->25
...
2022->1->31
2022->2->1
....
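A rough sketch of how that move can be done, decoding each partition= directory name and renaming it into nested year/month/day directories via the Hadoop FileSystem API (this illustrates the mechanics, not the exact commands I ran; the path here stands in for the copied table):

import java.net.URLDecoder
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Turn partition=2022%2F1%2F24 into nested 2022/1/24 directories, in place.
fs.listStatus(new Path(srcPath))
  .filter(_.getPath.getName.startsWith("partition="))
  .foreach { st =>
    val decoded = URLDecoder.decode(st.getPath.getName.stripPrefix("partition="), "UTF-8")
    val target = new Path(srcPath, decoded)
    fs.mkdirs(target.getParent)
    fs.rename(st.getPath, target)
  }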
I change the regex back to how it was originally and run the bootstrap again. This time, the Hudi table directory contains 2022, which holds the METADATA_ONLY partitions, but there is another directory, __HIVE_DEFAULT_PARTITION__, that contains the FULL_BOOTSTRAP files.
When I attempt to read from the Hudi table I get:
scala> spark.read.format("hudi").load(basePath).createOrReplaceTempView("test_table")

scala> spark.sql("select * from test_table where _hoodie_partition_path=2022/1/29").count
23/03/02 15:11:42 WARN HFileBootstrapIndex: No value found for partition key (__HIVE_DEFAULT_PARTITION__)
23/03/02 15:11:42 WARN HFileBootstrapIndex: No value found for partition key (__HIVE_DEFAULT_PARTITION__)
res16: Long = 0

scala> spark.sql("select * from test_table where _hoodie_partition_path=2022/1/24").count
23/03/02 15:11:51 WARN HFileBootstrapIndex: No value found for partition key (__HIVE_DEFAULT_PARTITION__)
res17: Long = 0