Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Won't Fix
Description
https://medium.com/@shrechak/leveraging-custom-accumulators-in-apache-spark-2-0-f4fef23f19f1
In Hudi, there are two places where we need to obtain statistics on the input data:
- HoodieBloomIndex: to know which partitions need to be loaded and checked against (whether this is still needed with the timeline server enabled is a separate question)
- Workload profile: to get a sense of the number of updates and inserts to each partition/file group

Today, both of these issue their own groupBy/shuffle computation. This could be avoided by collecting the statistics with an accumulator during a pass over the data that already happens.
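To illustrate the idea, here is a minimal sketch in plain Python of the add/merge semantics such an accumulator would need (the class name `WorkloadStatAccumulator` and the record shape are hypothetical; in Spark this would extend `AccumulatorV2` and be registered with the `SparkContext` so that per-task instances are merged on the driver):

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class WorkloadStatAccumulator:
    """Hypothetical accumulator collecting per-partition update/insert
    counts, mirroring Spark's AccumulatorV2 add/merge/value contract."""
    updates: Counter = field(default_factory=Counter)
    inserts: Counter = field(default_factory=Counter)

    def add(self, record):
        # record is a (partition_path, is_update) pair for illustration
        partition, is_update = record
        (self.updates if is_update else self.inserts)[partition] += 1

    def merge(self, other):
        # Driver-side merge of per-task accumulators
        self.updates.update(other.updates)
        self.inserts.update(other.inserts)

    @property
    def value(self):
        partitions = set(self.updates) | set(self.inserts)
        return {p: (self.updates[p], self.inserts[p]) for p in partitions}


# Simulate two Spark tasks accumulating locally, then a driver-side merge
task_a = WorkloadStatAccumulator()
for rec in [("2019/01/01", True), ("2019/01/01", False)]:
    task_a.add(rec)

task_b = WorkloadStatAccumulator()
task_b.add(("2019/01/02", False))

driver = WorkloadStatAccumulator()
driver.merge(task_a)
driver.merge(task_b)
print(driver.value)
```

Because the counts piggyback on an existing action over the input records, no dedicated groupBy stage is needed just to build the workload profile.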