[HUDI-3214] Optimize auto partition in spark - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Critical
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: spark, writer-core
Labels:
- pull-request-available

Story Points:
1

Description

Currently, if a partition value has the form "pt1=xxxx/pt2=yyyy/pt3=zzzz", that is, segments separated by slashes, Hudi partitions the table automatically: the table directory ends up with a multi-level partition structure.

This behavior is unpredictable, so this umbrella task was created to optimize auto partitioning and make the behavior more reasonable.

Also, in Hudi 0.8 the schema contains `pt1`, `pt2`, and `pt3`, but in 0.9+ it does not.
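To make the shape of such a partition value concrete, here is a minimal plain-Scala sketch of how a slash-separated value breaks down into one directory level (and one candidate column) per `name=value` segment. `parseHiveStylePath` is a hypothetical helper written for illustration only; it is not part of Hudi and does not reproduce Hudi's actual parsing.

```scala
// Hypothetical helper, not a Hudi API: split a partition value on "/" and
// treat every "name=value" segment as one partition level.
def parseHiveStylePath(path: String): Seq[(String, String)] =
  path.split("/").toSeq.filter(_.nonEmpty).map { segment =>
    segment.split("=", 2) match {
      case Array(name, value) => name -> value
      case _ =>
        throw new IllegalArgumentException(s"not a name=value segment: $segment")
    }
  }

// The example value from the description yields three levels.
val levels = parseHiveStylePath("pt1=xxxx/pt2=yyyy/pt3=zzzz")

// Print one line per partition level.
levels.foreach { case (name, value) => println(s"$name -> $value") }
```

Each tuple corresponds to one nested directory under the table base path, which is why a single partition field can produce a three-level layout.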

There are a few sub-tasks:

- Add a flag that controls whether auto partitioning is enabled, so that the default behavior is reasonable.
- Implement a new key generator designed specifically for this scenario.
- Fix the bug where the schema differs depending on whether hoodie.file.index.enable is enabled.

Test code:

import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.spark.sql.functions.regexp_replace
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
// Needed for the $"col" syntax; spark-shell already imports this.
import spark.implicits._

val tableName = "hudi_trips_cow"
val basePath = "file:///tmp/hudi_trips_cow"
val dataGen = new DataGenerator
val inserts = convertToStringList(dataGen.generateInserts(10))

val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
// Rewrite "a/b/c" into "continent=a/country=b/city=c" so that the partition
// value has the slash-separated name=value form described above.
val newDf = df.withColumn("partitionpath", regexp_replace($"partitionpath", "(.*)(\\/){1}(.*)(\\/){1}", "continent=$1$2country=$3$4city="))

// Write the table with the single field "partitionpath" as the partition path.
newDf.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  mode(Overwrite).
  save(basePath)
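
To observe the schema difference mentioned in the third sub-task, the table written by the test code could be read back twice, once with hoodie.file.index.enable set to true and once set to false, and the field names compared. This is only a sketch under assumptions: it expects a spark-shell session with a Hudi bundle on the classpath, `schemaWith` is a hypothetical helper, and whether loading the base path without a glob pattern works when the file index is disabled may depend on the Hudi version.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

// Assumption: the table has already been written to this path by the test code.
val basePath = "file:///tmp/hudi_trips_cow"

// Hypothetical helper: load the table with the file index switched on or off
// and return the schema Spark reports for it.
def schemaWith(spark: SparkSession, fileIndexEnabled: Boolean): StructType =
  spark.read.format("hudi").
    option("hoodie.file.index.enable", fileIndexEnabled.toString).
    load(basePath).
    schema

val enabled = schemaWith(spark, fileIndexEnabled = true).fieldNames.toSet
val disabled = schemaWith(spark, fileIndexEnabled = false).fieldNames.toSet

// Print the columns that appear under only one of the two settings.
println(s"only with file index enabled:  ${(enabled -- disabled).mkString(", ")}")
println(s"only with file index disabled: ${(disabled -- enabled).mkString(", ")}")
```

If both lines print nothing after the colon, the two settings report the same columns for that Hudi version.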

Issue Links

relates to: HUDI-3065 spark auto partition discovery does not work from 0.9.0 (Closed)

links to: GitHub Pull Request #5009

People

Assignee: Yann Byron

Reporter: Yann Byron

Reviewers: Shiyan Xu

Votes: 0

Watchers: 4

Dates

Created: 11/Jan/22 14:32

Updated: 10/Mar/23 01:50

Time Tracking

Estimated, Remaining, Logged: Not Specified