  1. Apache Hudi
  2. HUDI-5871

Bootstrap does not work with partitions with /


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: bootstrap, spark
    • Labels: None

    Description

      I have Parquet data that I load into a DataFrame and then save as a partitioned Parquet table by doing

       

      df.write.partitionBy("partition").parquet(tablePath) 

      In the table, each partition is a directory named like partition=2022%2F1%2F25, that is, the partition value 2022/1/25 with each / escaped as %2F.

       

      I then run a bootstrap with

       

      import org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider
      import org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector
      import org.apache.hudi.{DataSourceWriteOptions, HoodieDataSourceHelpers}
      import org.apache.hudi.config.{HoodieBootstrapConfig, HoodieWriteConfig}
      import org.apache.hudi.keygen.SimpleKeyGenerator
      import org.apache.spark.sql.SaveMode
      import org.apache.spark.sql.types._

      val srcPath = "/Users/jon/Documents/bootstrap_testing/partitioned-parquet-table-fixed"
      val basePath = "/Users/jon/Documents/bootstrap_testing/tables/test8"
      val bootstrapDF = spark.emptyDataFrame

      bootstrapDF.write
        .format("hudi")
        .option(HoodieWriteConfig.TABLE_NAME, "hoodie_test")
        .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.BOOTSTRAP_OPERATION_OPT_VAL)
        .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "key")
        .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partition")
        .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "ts")
        .option(HoodieBootstrapConfig.BOOTSTRAP_BASE_PATH_PROP, srcPath)
        .option(HoodieBootstrapConfig.BOOTSTRAP_KEYGEN_CLASS, classOf[SimpleKeyGenerator].getName)
        .option(HoodieBootstrapConfig.BOOTSTRAP_MODE_SELECTOR, classOf[BootstrapRegexModeSelector].getName)
        .option(HoodieBootstrapConfig.BOOTSTRAP_MODE_SELECTOR_REGEX, "2022/1/2[4-8]")
        .option(HoodieBootstrapConfig.BOOTSTRAP_MODE_SELECTOR_REGEX_MODE, "METADATA_ONLY")
        .option(HoodieBootstrapConfig.FULL_BOOTSTRAP_INPUT_PROVIDER, classOf[SparkParquetBootstrapDataProvider].getName)
        .mode(SaveMode.Overwrite)
        .save(basePath)
      

      That does not create any METADATA_ONLY partitions, because the regex is matched against the directory name, not the partition path value. This should be clarified in the config documentation. I then change the regex to

      partition=2022%2F1%2F2[4-8] 
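
The difference between the two regex settings can be checked outside Hudi with plain string matching. This is only a sketch: it assumes the selector requires the regex to match the whole directory name, which is consistent with the behaviour described here but is not taken from the Hudi source. The names `dirNames`, `valueRegex`, `dirRegex`, and `selected` are made up for illustration.

```scala
// Directory names as they appear under the source table (pattern taken from this issue).
val dirNames: Seq[String] = (24 to 29).map(d => s"partition=2022%2F1%2F$d")

val valueRegex = "2022/1/2[4-8]"               // original setting: written against the partition value
val dirRegex   = "partition=2022%2F1%2F2[4-8]" // revised setting: written against the directory name

// Assumption: a directory is selected only if the whole name matches the regex.
def selected(regex: String): Seq[String] = dirNames.filter(_.matches(regex))

println(selected(valueRegex)) // no directory name contains a literal "/", so nothing is selected
println(selected(dirRegex))   // the directories for days 24 through 28
```

Under that assumption, the original regex selects nothing, so no partition is put into METADATA_ONLY mode.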

      This works properly, but there is an issue.

      Inside the Hudi table, the directories are

      2022			partition=2022%2F1%2F24	partition=2022%2F1%2F25	partition=2022%2F1%2F26	partition=2022%2F1%2F27	partition=2022%2F1%2F28 

      The 2022 directory contains the FULL_BOOTSTRAP partitions, but the METADATA_ONLY partitions are in the other partition=... directories.

      Maybe that is OK, so I try to read from the Hudi table. The attached file contains the output from my attempt: scala_output_bootstrap1.txt

      I go back to my Parquet table, make a copy, and move the partitions into the nested layout that Hudi uses, where

      2022->1->24

      2022->1->25

      ...

      2022->1->31

      2022->2->1

      ....

      is the directory structure. I change the regex back to the original 2022/1/2[4-8] and run the bootstrap again. This time, the Hudi directory contains 2022, which has the partitions that are METADATA_ONLY, but there is another directory, __HIVE_DEFAULT_PARTITION__, that contains the FULL_BOOTSTRAP files.
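
The restructuring step described above can be sketched as follows. It is a minimal illustration, not the exact procedure used for this report: `nestedRelativePath` and `restructure` are hypothetical helpers, the partition column is assumed to be named `partition`, and the code should only be run against a copy of the table.

```scala
import java.net.URLDecoder
import java.nio.file.{Files, Path, Paths}
import scala.collection.mutable.ListBuffer

// Hypothetical helper: maps "partition=2022%2F1%2F25" to the relative path 2022/1/25.
def nestedRelativePath(dirName: String, column: String = "partition"): Path = {
  val decoded = URLDecoder.decode(dirName.stripPrefix(s"$column="), "UTF-8")
  val parts = decoded.split("/")
  Paths.get(parts.head, parts.tail: _*)
}

// Hypothetical helper: moves every "partition=..." directory directly under
// copyRoot to its nested location under the same root.
def restructure(copyRoot: Path): Unit = {
  // Collect the directories first so the listing is closed before anything is moved.
  val dirs = ListBuffer.empty[Path]
  val stream = Files.newDirectoryStream(copyRoot, "partition=*")
  try {
    val it = stream.iterator()
    while (it.hasNext) dirs += it.next()
  } finally stream.close()

  dirs.foreach { dir =>
    val target = copyRoot.resolve(nestedRelativePath(dir.getFileName.toString))
    Files.createDirectories(target.getParent) // e.g. <copyRoot>/2022/1
    Files.move(dir, target)                   // e.g. <copyRoot>/2022/1/25
  }
}
```

Calling `restructure` on the root of the copied table turns `partition=2022%2F1%2F25` into `2022/1/25`. The moved Parquet files no longer have a `partition=` directory name above them.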

      When I attempt to read from the Hudi table, I get

      scala> spark.read.format("hudi").load(basePath).createOrReplaceTempView("test_table")
      
      scala> spark.sql("select * from test_table where _hoodie_partition_path=2022/1/29").count
      23/03/02 15:11:42 WARN HFileBootstrapIndex: No value found for partition key (__HIVE_DEFAULT_PARTITION__)
      23/03/02 15:11:42 WARN HFileBootstrapIndex: No value found for partition key (__HIVE_DEFAULT_PARTITION__)
      res16: Long = 0
      
      scala> spark.sql("select * from test_table where _hoodie_partition_path=2022/1/24").count
      23/03/02 15:11:51 WARN HFileBootstrapIndex: No value found for partition key (__HIVE_DEFAULT_PARTITION__)
      res17: Long = 0 
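
One thing worth checking in the queries above: the partition value is not quoted, so Spark SQL most likely parses 2022/1/29 as arithmetic (2022 divided by 1 divided by 29) rather than as the string 2022/1/29. A filter on that number would return a count of 0 whatever the table contains. The sketch below shows the value the unquoted form evaluates to and builds the quoted form of the same predicate; `partitionFilter` is a hypothetical helper, and `test_table` is the view name from the transcript.

```scala
// What the unquoted predicate compares against: fractional division, not a path.
val unquotedLiteral: Double = 2022.0 / 1 / 29 // roughly 69.72

// Hypothetical helper: the same query with the partition path as a quoted string literal.
def partitionFilter(path: String): String =
  s"select * from test_table where _hoodie_partition_path = '$path'"

val query = partitionFilter("2022/1/29")
println(unquotedLiteral)
println(query)
```

In the same spark-shell session, `spark.sql(query).count` would run the quoted version. Whether it returns rows still depends on how the bootstrap laid out the partitions. The __HIVE_DEFAULT_PARTITION__ warnings are a separate symptom.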

      Attachments

        1. scala_output_bootstrap1.txt
          65 kB
          Jonathan Vexler


          People

            Assignee: Unassigned
            Reporter: Jonathan Vexler (jonvex)