Details
-
Improvement
-
Status: Open
-
Major
-
Resolution: Unresolved
-
None
-
None
-
None
Description
During initialization of column_stats and bloom_filter MDT partitions, the code which creates the records for these partitions is written as such:
- Create a Map of partitionName -> List of files in partition
- Parallelize the above Map
- Each executor handles a single partition
For large datasets the above design cause the following limitations:
- Each executor handles a single partition. So we cannot speed up by throwing more executors.
- If one partitions has much larger number of files than other partitions, then a single executor would be the bottleneck for the initialization completion and other executors would be idle.
In this enhancement I am changing the parallelism to be at a file-level.