Apache Hudi / HUDI-6553

Speedup column stats and bloom index creation on large datasets


Details

    Description

      During initialization of the column_stats and bloom_filter partitions of the metadata table (MDT), the code that creates the records for these partitions works as follows:

      1. Create a Map of partitionName -> List of files in partition
      2. Parallelize over the entries of the above Map
      3. Each executor handles a single partition

      For large datasets, this design has the following limitations:

      1. Each executor handles a whole partition, so adding executors beyond the number of partitions does not speed up initialization.
      2. If one partition has many more files than the others, a single executor becomes the bottleneck for initialization while the other executors sit idle.
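
      The two limitations can be illustrated with a small scheduling model. This is only a sketch, not Hudi code: the partition names, file counts, the `makespan` helper and the assumption that every file costs one unit of work are all hypothetical.

```python
import heapq

def makespan(task_costs, num_executors):
    """Greedy scheduling: each task goes to the executor that frees up first.
    Returns the time at which the last executor finishes."""
    if not task_costs:
        return 0
    finish_times = [0] * num_executors
    heapq.heapify(finish_times)
    for cost in task_costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)

# Hypothetical skewed table: partition name -> list of files in that partition.
files_by_partition = {
    "2023/01": [f"f{i}.parquet" for i in range(1000)],
    "2023/02": [f"f{i}.parquet" for i in range(10)],
    "2023/03": [f"f{i}.parquet" for i in range(10)],
}

# Partition-level parallelism: one task per partition.
# Assumption: a task costs one unit of work per file it contains.
partition_tasks = [len(files) for files in files_by_partition.values()]

for executors in (3, 30):
    print(executors, makespan(partition_tasks, executors))
# prints "3 1000" and then "30 1000"
```

      In this model the largest partition alone determines the completion time, so going from 3 to 30 executors changes nothing.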

       

      In this enhancement, I am changing the parallelism to be at the file level.
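
      A minimal sketch of what file-level parallelism could look like, assuming the partition -> files Map is flattened into one unit of work per file. The helper names, the sample data and the round-robin split are hypothetical and are not taken from the actual change.

```python
def file_level_tasks(files_by_partition):
    """Flatten partition -> files into one (partition, file) pair per file."""
    return [
        (partition, file_name)
        for partition, files in files_by_partition.items()
        for file_name in files
    ]

def split_evenly(tasks, num_executors):
    """Deal the tasks round-robin across the executors."""
    return [tasks[i::num_executors] for i in range(num_executors)]

# Hypothetical skewed table: one large partition and two small ones.
files_by_partition = {
    "2023/01": [f"f{i}.parquet" for i in range(1000)],
    "2023/02": [f"f{i}.parquet" for i in range(10)],
    "2023/03": [f"f{i}.parquet" for i in range(10)],
}

pairs = file_level_tasks(files_by_partition)
loads = [len(chunk) for chunk in split_evenly(pairs, 30)]
print(len(pairs), max(loads), min(loads))
# prints "1020 34 34"
```

      Because the unit of work is a single file, the load per executor no longer depends on how the files are spread across partitions, and executors can be added up to the total number of files.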

       

          People

            pwason Prashant Wason