Details
- Type: Improvement
- Status: Open
- Priority: Critical
- Resolution: Unresolved
Description
Currently, if a partition value has a format like "pt1=xxxx/pt2=yyyy/pt3=zzzz", with segments separated by slashes, Hudi partitions on each segment automatically, so the table directory ends up with a multi-level partition structure.
I think this behavior is unpredictable, so I am creating this umbrella task to optimize auto-partitioning and make the behavior more reasonable.
Also, in Hudi 0.8 the schema holds `pt1`, `pt2`, and `pt3`, but in 0.9+ it does not.
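To make the multi-level structure concrete, here is a minimal plain-Scala sketch of how such a value breaks into directory levels and column/value pairs, in the way Hive-style partition discovery reads them. `PartitionPathSketch` and `parseLevels` are hypothetical names used only for illustration; they are not Hudi or Spark APIs.

```scala
// Hypothetical helper, not part of Hudi or Spark: splits a partition path value
// on "/" and reads each segment as a Hive-style "column=value" directory level.
object PartitionPathSketch {
  def parseLevels(partitionPath: String): Seq[(String, String)] =
    partitionPath.split("/").toSeq.filter(_.nonEmpty).map { segment =>
      segment.split("=", 2) match {
        case Array(column, value) => column -> value
        case _                    => "" -> segment // level without a column name
      }
    }
}

// One string value yields three directory levels, hence three candidate columns.
val levels = PartitionPathSketch.parseLevels("pt1=xxxx/pt2=yyyy/pt3=zzzz")
levels.foreach { case (column, value) => println(s"$column -> $value") }
```

A single `partitionpath` field therefore turns into three nested directories on disk, which is why the resulting layout is surprising when the user only configured one partition field.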
There are a few sub-tasks:
- Add a flag that controls whether auto-partitioning is enabled, so that the default behavior is reasonable.
- Implement a new key generator designed specifically for this scenario.
- Fix the bug where the schema differs depending on whether hoodie.file.index.enable is enabled in this case.
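As a rough illustration of the first sub-task, the sketch below shows one way a flag could switch between the two behaviors. It is only an assumption about a possible design, not the proposed implementation: `AutoPartitionFlagSketch`, `directoryFor`, and the `autoPartition` parameter are hypothetical, and URL-encoding is just one conceivable way to keep the value in a single directory level.

```scala
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

// Hypothetical sketch: `autoPartition` stands in for the flag proposed above.
object AutoPartitionFlagSketch {
  def directoryFor(partitionValue: String, autoPartition: Boolean): String =
    if (autoPartition) partitionValue // slashes kept: one directory per segment
    else URLEncoder.encode(partitionValue, StandardCharsets.UTF_8.name()) // slashes escaped: one directory
}

val value = "pt1=xxxx/pt2=yyyy/pt3=zzzz"
val multiLevel = AutoPartitionFlagSketch.directoryFor(value, autoPartition = true)
val singleLevel = AutoPartitionFlagSketch.directoryFor(value, autoPartition = false)
println(multiLevel)
println(singleLevel)
```

With the flag off, the escaped value contains no path separator, so the table would get a single partition level regardless of what the field value looks like.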
Test code:
// Run in spark-shell with the Hudi bundle on the classpath.
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.spark.sql.functions.regexp_replace
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._

val tableName = "hudi_trips_cow"
val basePath = "file:///tmp/hudi_trips_cow"
val dataGen = new DataGenerator

// Generate 10 sample trip records and load them into a DataFrame.
val inserts = convertToStringList(dataGen.generateInserts(10))
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))

// Rewrite partition paths of the form "a/b/c" as "continent=a/country=b/city=c".
val newDf = df.withColumn("partitionpath", regexp_replace($"partitionpath", "(.*)(\\/){1}(.*)(\\/){1}", "continent=$1$2country=$3$4city="))

// Write the table using the single "partitionpath" field as the partition path.
newDf.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  mode(Overwrite).
  save(basePath)
Issue Links
- relates to HUDI-3065: spark auto partition discovery does not work from 0.9.0 (Closed)