Description
Task Execution Summary
------------------------------------------------------------------------------------------------------------------------------
VERTICES    TOTAL_TASKS  FAILED_ATTEMPTS  KILLED_TASKS  DURATION(ms)  CPU_TIME(ms)  GC_TIME(ms)  INPUT_RECORDS  OUTPUT_RECORDS
------------------------------------------------------------------------------------------------------------------------------
Map 1               167                0             0      17640.00     2,109,200       23,068    150,000,004      11,995,136
Map 11                5                0             0      10559.00        71,960          633      4,023,690         799,900
Map 13                1                0             0       2244.00         6,090           29             25               3
Map 3                 1                0             0       2849.00         7,080           99             25               3
Map 5               271                0             0      55834.00    12,934,890      358,376  1,500,000,001   1,500,000,161
Map 7               241                0             0      91243.00     5,020,860       71,182  1,827,250,341     652,413,443
Reducer 10            1                0             0       1010.00         1,900            0              4               0
Reducer 12            1                0             0       3854.00         1,320            0        799,900               1
Reducer 14            1                0             0       1420.00         3,790           45              3               1
Reducer 2             1                0             0       9720.00         6,220          122     11,995,136               1
Reducer 4             1                0             0        810.00         2,100          105              3               1
Reducer 6             1                0             0      24863.00         3,260            5  1,500,000,161               1
Reducer 8           412                0             0      88215.00    17,106,440      184,524  2,165,208,640           1,864
Reducer 9             2                0             0      29752.00         3,980            0          1,864               4
------------------------------------------------------------------------------------------------------------------------------
Seeing this on queries that use runtime filtering. The INPUT_RECORDS values look incorrect for the reducers responsible for aggregating the min/max/bloomfilter (Reducers 12, 14, 2, 6). For example, Reducer 2 shows about 12M input records (11,995,136), which is exactly the OUTPUT_RECORDS value for Map 1. However, the task logs for Reducer 2 show only 167 input records, which matches the number of Map 1 tasks.
It looks like Map 1 has 2 different output vertices (Reducer 2 and Reducer 8), but the total output rows for Map 1, rather than only the rows going to each specific vertex, are being counted in the input rows for both Reducer 2 and Reducer 8.
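As a rough arithmetic check of this hypothesis, the sketch below recomputes each reducer's INPUT_RECORDS as the sum of the total OUTPUT_RECORDS of its upstream map vertices and compares it with the reported value. The record counts are copied from the summary above. The `assumed_upstream` edge map is an assumption made for illustration: only the Map 1 to Reducer 2 / Reducer 8 edges are stated in this report, and the other edges are guessed from matching counts, not taken from the query plan.

```python
# OUTPUT_RECORDS per map vertex, copied from the Task Execution Summary.
map_output = {
    "Map 1": 11_995_136,
    "Map 11": 799_900,
    "Map 5": 1_500_000_161,
    "Map 7": 652_413_443,
}

# INPUT_RECORDS per reducer, as reported in the Task Execution Summary.
reported_input = {
    "Reducer 2": 11_995_136,
    "Reducer 12": 799_900,
    "Reducer 6": 1_500_000_161,
    "Reducer 8": 2_165_208_640,
}

# ASSUMPTION (hypothetical edge map): only Map 1 -> Reducer 2 and
# Map 1 -> Reducer 8 are stated in the report; the rest are guesses.
assumed_upstream = {
    "Reducer 2": ["Map 1"],
    "Reducer 12": ["Map 11"],
    "Reducer 6": ["Map 5"],
    "Reducer 8": ["Map 1", "Map 11", "Map 5", "Map 7"],
}


def input_from_total_outputs(reducer):
    """Input count if every upstream vertex's TOTAL output were counted."""
    return sum(map_output[m] for m in assumed_upstream[reducer])


# True where the reported counter equals the sum of upstream total outputs.
matches = {
    r: input_from_total_outputs(r) == reported_input[r] for r in reported_input
}

for reducer, ok in matches.items():
    print(reducer, reported_input[reducer], "matches" if ok else "differs")
```

Under the assumed edges, every reported value equals the sum of the upstream vertices' total output, including Reducer 8. That is consistent with the counter using per-vertex totals rather than per-edge row counts, although it does not by itself confirm where the miscount happens.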