Apache Tez / TEZ-1649

ShuffleVertexManager auto reduce parallelism can cause jobs to hang indefinitely (with ScatterGather edges)

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.5.2
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      Consider the following DAG
      M1, M2 --> R1
      M2, M3 --> R2
      R1 --> R2

      All edges are Scatter-Gather.
      1. Set the min/max settings of R1 (parallelism 1000) to 0.25f and 0.5f
      2. Set the min/max settings of R2 (parallelism 21) to 0.2f and 0.3f
      3. Let M1 send some data read from HDFS (test.txt)
      4. Let M2 (parallelism 50) generate some data and send it to R2
      5. Let M3 (parallelism 500) generate some data and send it to R2

      • Because R2's min/max settings can be satisfied by events from M3 alone, R2 changes its parallelism sooner than R1 does.
      • In the meantime, R1 changes its parallelism from 1000 to 20. This change is not propagated to R2, which keeps waiting.
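
      A quick check of the first point, using the parallelisms listed in the steps above. This is only a sketch: it assumes the min/max fractions are measured against the total number of source tasks across all of R2's incoming Scatter-Gather edges, and the helper name fraction_completed is hypothetical, not a Tez API.

```python
# Parallelism of each vertex feeding R2, taken from the issue description.
r2_sources = {"M2": 50, "M3": 500, "R1": 1000}

# R2's min/max settings from step 2.
R2_MIN_FRACTION = 0.2
R2_MAX_FRACTION = 0.3


def fraction_completed(completed_tasks, source_parallelism):
    """Share of all source tasks that have completed.

    Assumption: the denominator is the sum over every source vertex,
    not the task count of a single edge.
    """
    total_tasks = sum(source_parallelism.values())
    return completed_tasks / total_tasks


# Suppose only M3's 500 tasks have finished: 500 of 1550 source tasks.
m3_only = fraction_completed(r2_sources["M3"], r2_sources)
print(f"fraction after M3 alone: {m3_only:.3f}")
print("min reached:", m3_only >= R2_MIN_FRACTION)
print("max reached:", m3_only >= R2_MAX_FRACTION)
```

      Under that assumption, M3 alone accounts for roughly 32% of R2's source tasks, which already exceeds R2's max setting of 0.3, so R2 can settle its parallelism before any R1 task has finished.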

      Tested this on a small-scale (20-node) cluster; the hang occurs consistently.
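
      The hang can be illustrated with a toy model. It is one possible reading of the description, not Tez's scheduling code: it assumes R2 waits for output from as many R1 tasks as it believed existed when it was configured, and the function r2_can_finish is hypothetical.

```python
R1_ORIGINAL_PARALLELISM = 1000  # R1's parallelism when R2 was configured
R1_REDUCED_PARALLELISM = 20     # R1's parallelism after auto-reduce


def r2_can_finish(expected_r1_tasks, finished_r1_tasks):
    """R2 finishes only after hearing from every R1 task it expects."""
    return finished_r1_tasks >= expected_r1_tasks


# After the reduction, only 20 R1 tasks exist, so at most 20 can ever finish.
finished = R1_REDUCED_PARALLELISM

# Stale view: R2 was never told about the change and still expects 1000.
hangs_with_stale_view = not r2_can_finish(R1_ORIGINAL_PARALLELISM, finished)

# Updated view: the new parallelism was propagated to R2.
finishes_with_update = r2_can_finish(R1_REDUCED_PARALLELISM, finished)

print("stale view hangs:", hangs_with_stale_view)
print("updated view finishes:", finishes_with_update)
```

      In this model the stale count can never be reached, which matches the indefinite wait reported above; propagating R1's new parallelism to R2 removes the mismatch.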

      Attachments

        1. TEZ-1649.1.patch
          13 kB
          Rajesh Balamohan
        2. TEZ-1649.2.patch
          13 kB
          Rajesh Balamohan
        3. TEZ-1649.3.patch
          37 kB
          Rajesh Balamohan
        4. TEZ-1649.4.patch
          52 kB
          Rajesh Balamohan
        5. TEZ-1649.png
          88 kB
          Rajesh Balamohan

            People

              Assignee: Rajesh Balamohan
              Reporter: Rajesh Balamohan