Details
Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Reviewed
Description
Prior to pipelined shuffle, Tez merged all spilled data into a single file, producing one index file and one output file. In that context, TaskCounter.ADDITIONAL_SPILL_COUNT referred to the number of additional spills, and no separate counter was needed to track the number of merges.
With pipelined shuffle, there is no final merge, so ADDITIONAL_SPILL_COUNT is misleading: each spill is itself a final output file that is fetched directly by the consumers.
It would be good to have the following:
- ADDITIONAL_SPILL_COUNT: represents the spills that are needed by the task to generate the final merged output
- TOTAL_SPILLS: represents the total number of shuffle directories (index + output files) that were created by the end of processing.
For example, assume the sorter generated 5 spills in an attempt:
Without pipelining:
==============
ADDITIONAL_SPILL_COUNT = 5 <-- Additional spills involved in sorting
TOTAL_SPILLS = 1 <-- Final merged output
With pipelining:
============
ADDITIONAL_SPILL_COUNT = 0 <-- Additional spills involved in sorting
TOTAL_SPILLS = 5 <-- All spills are final output
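
The proposed semantics can be sketched in a few lines of Java. This is only an illustration of the example above, not Tez code: the class SpillCounterSketch, the holder SpillCounters, and the method compute are hypothetical names, and the sketch assumes the simple rule implied by the description (without pipelining every spill counts as an additional spill and one merged output is produced; with pipelining every spill is a final output). Edge cases, such as an attempt that never spills, are not covered.

```java
public class SpillCounterSketch {

    /** Hypothetical holder for the two counter values discussed in this issue. */
    static final class SpillCounters {
        final long additionalSpillCount; // would map to ADDITIONAL_SPILL_COUNT
        final long totalSpills;          // would map to the proposed TOTAL_SPILLS

        SpillCounters(long additionalSpillCount, long totalSpills) {
            this.additionalSpillCount = additionalSpillCount;
            this.totalSpills = totalSpills;
        }
    }

    /**
     * Derives both counters from the number of spills the sorter generated.
     * Assumption: mirrors only the worked example in the description.
     */
    static SpillCounters compute(int spillsGenerated, boolean pipelined) {
        if (spillsGenerated < 0) {
            throw new IllegalArgumentException("spillsGenerated must be >= 0");
        }
        if (pipelined) {
            // No final merge: each spill is consumed directly as final output.
            return new SpillCounters(0, spillsGenerated);
        }
        // Final merge: all spills are intermediate, one merged output remains.
        return new SpillCounters(spillsGenerated, 1);
    }

    public static void main(String[] args) {
        SpillCounters merged = compute(5, false);
        SpillCounters pipelined = compute(5, true);
        System.out.println("Without pipelining: ADDITIONAL_SPILL_COUNT="
                + merged.additionalSpillCount + ", TOTAL_SPILLS=" + merged.totalSpills);
        System.out.println("With pipelining: ADDITIONAL_SPILL_COUNT="
                + pipelined.additionalSpillCount + ", TOTAL_SPILLS=" + pipelined.totalSpills);
    }
}
```

Keeping the two values separate lets a reader of the counters tell sorting overhead (spills that had to be merged) apart from the number of outputs a consumer has to fetch.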