Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Won't Fix
Description
https://medium.com/@shrechak/leveraging-custom-accumulators-in-apache-spark-2-0-f4fef23f19f1
In Hudi, there are two places where we need to obtain statistics on the input data:
- HoodieBloomIndex: to know which partitions need to be loaded and checked against (whether this is still needed with the timeline server enabled is a separate question)
- Workload profile: to get a sense of the number of updates and inserts to each partition/file group

Today, both of these issue their own groupBy/shuffle computation. This could be avoided by collecting the statistics with an accumulator during a pass over the data that already happens.
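To illustrate the idea, here is a minimal sketch in plain Python of the add/merge semantics such an accumulator would need (the class name `WorkloadStatAccumulator` and the record shape are hypothetical; in Spark this would extend `AccumulatorV2` and be registered with the `SparkContext` so that per-task instances are merged on the driver):

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class WorkloadStatAccumulator:
    """Hypothetical accumulator collecting per-partition update/insert
    counts, mirroring Spark's AccumulatorV2 add/merge/value contract."""
    updates: Counter = field(default_factory=Counter)
    inserts: Counter = field(default_factory=Counter)

    def add(self, record):
        # record is a (partition_path, is_update) pair for illustration
        partition, is_update = record
        (self.updates if is_update else self.inserts)[partition] += 1

    def merge(self, other):
        # Driver-side merge of per-task accumulators
        self.updates.update(other.updates)
        self.inserts.update(other.inserts)

    @property
    def value(self):
        partitions = set(self.updates) | set(self.inserts)
        return {p: (self.updates[p], self.inserts[p]) for p in partitions}


# Simulate two Spark tasks accumulating locally, then a driver-side merge
task_a = WorkloadStatAccumulator()
for rec in [("2019/01/01", True), ("2019/01/01", False)]:
    task_a.add(rec)

task_b = WorkloadStatAccumulator()
task_b.add(("2019/01/02", False))

driver = WorkloadStatAccumulator()
driver.merge(task_a)
driver.merge(task_b)
print(driver.value)
```

Because the counts piggyback on an existing action over the input records, no dedicated groupBy stage is needed just to build the workload profile.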