Apache Hudi / HUDI-315

Reimplement statistics/workload profile collected during writes using Spark 2.x custom accumulators


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Labels: performance, writer-core

    Description

      https://medium.com/@shrechak/leveraging-custom-accumulators-in-apache-spark-2-0-f4fef23f19f1

      In Hudi, there are two places where we need to obtain statistics on the input data:

      • HoodieBloomIndex: to know which partitions need to be loaded and checked against (whether this is still needed with the timeline server enabled is a separate question)
      • Workload profile: to get a sense of the number of updates and inserts to each partition/file group

      Both of them issue their own groupBy or shuffle computation today. This can be avoided using an accumulator, as sketched below.
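
      A minimal sketch of the idea, assuming Spark's AccumulatorV2 API (Spark 2.x): a custom accumulator tallies record counts per partition path while an already-scheduled action runs, so no extra groupBy/shuffle is needed just to build the profile. The class name PartitionStatsAccumulator and the RDD recordPartitionPaths are illustrative, not part of Hudi; the real workload profile would additionally need to split counts into updates vs. inserts per file group.

        import java.util.HashMap;
        import java.util.Map;
        import org.apache.spark.util.AccumulatorV2;

        // Illustrative accumulator: per-partition record counts collected on
        // executors and merged on the driver, replacing a dedicated shuffle.
        public class PartitionStatsAccumulator
            extends AccumulatorV2<String, Map<String, Long>> {

          private final Map<String, Long> counts = new HashMap<>();

          @Override public boolean isZero() { return counts.isEmpty(); }

          @Override public AccumulatorV2<String, Map<String, Long>> copy() {
            PartitionStatsAccumulator copy = new PartitionStatsAccumulator();
            copy.counts.putAll(counts);
            return copy;
          }

          @Override public void reset() { counts.clear(); }

          // Called on executors once per record's partition path.
          @Override public void add(String partitionPath) {
            counts.merge(partitionPath, 1L, Long::sum);
          }

          // Called on the driver to combine per-task copies.
          @Override public void merge(AccumulatorV2<String, Map<String, Long>> other) {
            other.value().forEach((k, v) -> counts.merge(k, v, Long::sum));
          }

          @Override public Map<String, Long> value() { return counts; }
        }

        // Usage sketch (jsc is a JavaSparkContext; recordPartitionPaths is a
        // hypothetical JavaRDD<String> of partition paths from the input records):
        PartitionStatsAccumulator stats = new PartitionStatsAccumulator();
        jsc.sc().register(stats, "perPartitionRecordCounts");
        recordPartitionPaths.foreach(stats::add);   // piggybacks on this action
        Map<String, Long> profile = stats.value();  // read on the driver afterwards

      One caveat: Spark guarantees exactly-once accumulator updates only for updates made inside actions; updates made in transformations can be re-applied on stage retries and double-count, which is a real constraint on using this approach for a correctness-sensitive workload profile.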

          People

            Assignee: Yanjia Gary Li (garyli1019)
            Reporter: Vinoth Chandar (vinoth)
            Votes: 0
            Watchers: 3
