Apache Hudi / HUDI-3214

Optimize auto partition in Spark

    Description

      Currently, if a partition value has a format like "pt1=xxxx/pt2=yyyy/pt3=zzzz", i.e. segments separated by slashes, Hudi partitions on it automatically, so the table directory ends up with a multi-level partition structure.

      I think this is unpredictable, so I am creating this umbrella task to optimize auto partition and make the behavior more reasonable.

      Also, in Hudi 0.8 the schema holds `pt1`, `pt2`, `pt3`, but in 0.9+ it does not.
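
      To make the behavior described above concrete, here is a minimal sketch in plain Scala (no Hudi or Spark dependency) of how one slash-separated partition value breaks down into directory levels and Hive-style key/value pairs. The names `AutoPartitionSketch`, `partitionLevels` and `inferredColumns` are illustrative assumptions, not Hudi APIs, and the parsing only approximates what a Hive-style partition discovery would do.

```scala
object AutoPartitionSketch {
  // Split a partition value on "/" into the directory levels it would produce.
  def partitionLevels(partitionValue: String): Seq[String] =
    partitionValue.split("/").filter(_.nonEmpty).toSeq

  // Parse Hive-style "key=value" levels into (key, value) pairs.
  // Levels that contain no "=" are skipped.
  def inferredColumns(partitionValue: String): Seq[(String, String)] =
    partitionLevels(partitionValue).flatMap { level =>
      level.split("=", 2) match {
        case Array(key, value) => Some(key -> value)
        case _                 => None
      }
    }

  def main(args: Array[String]): Unit = {
    val value = "pt1=xxxx/pt2=yyyy/pt3=zzzz"
    // One line per directory level: pt1=xxxx, pt2=yyyy, pt3=zzzz
    partitionLevels(value).foreach(println)
    // The column names a Hive-style reader could derive: pt1, pt2, pt3
    inferredColumns(value).foreach { case (k, v) => println(s"$k -> $v") }
  }
}
```

      A single column value therefore yields three directory levels, and possibly three extra columns on read, which is the part that seems hard to predict from the write config alone.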

      There are a few sub-tasks:

      • add a flag that controls whether auto partition is enabled, so that the default behavior is reasonable.
      • implement a new key generator designed specifically for this scenario.
      • fix the bug where the schema differs depending on whether hoodie.file.index.enable is enabled in this case.

       

      Test code:

      // Run in spark-shell with the Hudi Spark bundle on the classpath;
      // spark-shell already provides `spark` and the $"..." column syntax.
      import org.apache.hudi.QuickstartUtils._
      import scala.collection.JavaConversions._
      import org.apache.spark.sql.SaveMode._
      import org.apache.spark.sql.functions.regexp_replace
      import org.apache.hudi.DataSourceReadOptions._
      import org.apache.hudi.DataSourceWriteOptions._
      import org.apache.hudi.config.HoodieWriteConfig._
      
      val tableName = "hudi_trips_cow"
      val basePath = "file:///tmp/hudi_trips_cow"
      val dataGen = new DataGenerator
      val inserts = convertToStringList(dataGen.generateInserts(10))
      
      val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
      // Rewrite the three-level partition path "a/b/c" as
      // "continent=a/country=b/city=c" to trigger auto partition.
      val newDf = df.withColumn("partitionpath", regexp_replace($"partitionpath", "(.*)(\\/){1}(.*)(\\/){1}", "continent=$1$2country=$3$4city="))
      
      // Write the table, partitioned by the rewritten partitionpath column.
      newDf.write.format("hudi").
        options(getQuickstartWriteConfigs).
        option(PRECOMBINE_FIELD_OPT_KEY, "ts").
        option(RECORDKEY_FIELD_OPT_KEY, "uuid").
        option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
        option(TABLE_NAME, tableName).
        mode(Overwrite).
        save(basePath)
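
      As a quick check of what the `regexp_replace` call in the test code produces, the same pattern and replacement can be applied with plain `String.replaceAll`, which relies on the same Java regular expression syntax. This is only a sketch: `PartitionPathRewriteSketch` and `rewrite` are illustrative names, and the sample inputs are assumed to resemble the three-level paths that the quickstart `DataGenerator` emits.

```scala
object PartitionPathRewriteSketch {
  // Same pattern and replacement strings as in the test code.
  val pattern = "(.*)(\\/){1}(.*)(\\/){1}"
  val replacement = "continent=$1$2country=$3$4city="

  // Rewrite "a/b/c" into "continent=a/country=b/city=c".
  def rewrite(partitionPath: String): String =
    partitionPath.replaceAll(pattern, replacement)

  def main(args: Array[String]): Unit = {
    // Assumed sample values; the real ones come from DataGenerator.
    val samples = Seq("americas/united_states/san_francisco", "asia/india/chennai")
    samples.foreach(p => println(s"$p => ${rewrite(p)}"))
    // americas/united_states/san_francisco => continent=americas/country=united_states/city=san_francisco
    // asia/india/chennai => continent=asia/country=india/city=chennai
  }
}
```

      The rewritten value still occupies a single column, `partitionpath`, yet it has the "key=value/key=value" shape that triggers the multi-level layout discussed in the description.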

            People

              biyan900116@gmail.com Yann Byron
              biyan900116@gmail.com Yann Byron
              Raymond Xu
                Time Tracking

                  Original Estimate: 2h
                  Remaining Estimate: 2h
                  Time Spent: Not Specified