Apache Tez / TEZ-2198

Fix sorter spill counts

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0-alpha, 0.7.1
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      Prior to pipelined shuffle, Tez merged all spilled data into a single file, producing one index file and one output file. In that context, TaskCounter.ADDITIONAL_SPILL_COUNT referred to the number of additional spills, and no separate counter was needed to track the number of merges.

      With pipelined shuffle there is no final merge, so ADDITIONAL_SPILL_COUNT would be misleading: the spills are themselves the output files, which are consumed directly by the consumers.

      It would be good to have the following counters:

      • ADDITIONAL_SPILL_COUNT: represents the spills that are needed by the task to generate the final merged output
      • TOTAL_SPILLS: represents the total number of shuffle directories (index + output files) that got created at the end of processing.

      For example, assume the sorter generated 5 spills in an attempt.

      Without pipelining:
      ===================
      ADDITIONAL_SPILL_COUNT = 5 <-- Additional spills involved in sorting
      TOTAL_SPILLS = 1 <-- Final merged output

      With pipelining:
      ================
      ADDITIONAL_SPILL_COUNT = 0 <-- Additional spills involved in sorting
      TOTAL_SPILLS = 5 <-- All spills are final output
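
      The example above can be written as a small Java sketch. The class and method names (SpillCounterSketch, additionalSpillCount, totalSpills) are hypothetical and are not Tez APIs, and this is not the code from the attached patches. It only models the proposed counter semantics, assuming an attempt that produced at least one spill.

```java
/**
 * Hypothetical helper (not part of Tez): models the counter semantics
 * proposed in this issue for an attempt with at least one sorter spill.
 */
final class SpillCounterSketch {

    private SpillCounterSketch() {
        // Static helpers only.
    }

    /** Spills the task itself must merge to produce its final output. */
    static int additionalSpillCount(int spills, boolean pipelinedShuffle) {
        requireAtLeastOneSpill(spills);
        // With pipelining there is no final merge, so no spill is "additional".
        return pipelinedShuffle ? 0 : spills;
    }

    /** Index + output file sets that exist at the end of processing. */
    static int totalSpills(int spills, boolean pipelinedShuffle) {
        requireAtLeastOneSpill(spills);
        // Without pipelining everything is merged into a single final output.
        return pipelinedShuffle ? spills : 1;
    }

    private static void requireAtLeastOneSpill(int spills) {
        // Assumption: the zero-spill case is out of scope for this sketch.
        if (spills < 1) {
            throw new IllegalArgumentException("spills must be >= 1, got " + spills);
        }
    }
}
```

      Calling both methods with spills = 5 gives the two pairs of values listed above, one pair for each setting of the pipelinedShuffle flag.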

      Attachments

        1. no_additional_spills_eg_pipelined_shuffle.png
          61 kB
          Rajesh Balamohan
        2. TEZ-2198.1.patch
          22 kB
          Rajesh Balamohan
        3. TEZ-2198.2.patch
          30 kB
          Rajesh Balamohan
        4. TEZ-2198.3.patch
          31 kB
          Rajesh Balamohan
        5. TEZ-2198.4.patch
          32 kB
          Rajesh Balamohan
        6. TEZ-2198.5.patch
          32 kB
          Rajesh Balamohan
        7. TEZ-2198.6.patch
          32 kB
          Rajesh Balamohan
        8. TEZ-2198.branch-0.7.patch
          32 kB
          Rajesh Balamohan
        9. with_additional_spills.png
          64 kB
          Rajesh Balamohan

          People

            Assignee: Rajesh Balamohan (rajesh.balamohan)
            Reporter: Rajesh Balamohan (rajesh.balamohan)
            Votes: 0
            Watchers: 4
