Apache Tez / TEZ-3239

ShuffleVertexManager recovery issue when auto parallelism is enabled


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid

    Description

      Repro:

      • Enable tez.shuffle-vertex-manager.enable.auto-parallel.
      • Kill the Tez AM container after the job has reached the point where the vertex manager (VM) has reconfigured the edge.
      • The new Tez AM attempt fails with the following error:
      org.apache.tez.dag.api.TezUncheckedException: Atleast 1 bipartite source should exist
      at org.apache.tez.dag.library.vertexmanager.ShuffleVertexManager.onVertexStarted(ShuffleVertexManager.java:497)
      at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEventOnVertexStarted.invoke(VertexManager.java:589)
      at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEvent$1.run(VertexManager.java:658)
      at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEvent$1.run(VertexManager.java:653)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:415)
      

      This happens because the edge's data movement type changes to DataMovementType.CUSTOM after reconfiguration, so on recovery the check below no longer counts any SCATTER_GATHER source edge. Allowing DataMovementType.CUSTOM in the following check seems to fix the issue.

            if (entry.getValue().getDataMovementType() == DataMovementType.SCATTER_GATHER) {
              bipartiteSources++;
            }
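
      For illustration only, the relaxed check suggested above could look roughly like the following self-contained sketch. It does not use the real Tez classes: the enum is a local stand-in mirroring the constants of Tez's DataMovementType, and the class name, method name, and vertex names are hypothetical. Since this issue was resolved as Invalid, this should be read as the reporter's idea, not as an accepted fix.

```java
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class BipartiteSourceCheck {
    // Local stand-in for Tez's DataMovementType (assumption: same constant names).
    enum DataMovementType { ONE_TO_ONE, BROADCAST, SCATTER_GATHER, CUSTOM }

    // Counts the source edges whose movement type is in the accepted set.
    // Hypothetical helper; the real check lives in ShuffleVertexManager.onVertexStarted.
    static int countBipartiteSources(Map<String, DataMovementType> sourceEdges,
                                     Set<DataMovementType> accepted) {
        int bipartiteSources = 0;
        for (Map.Entry<String, DataMovementType> entry : sourceEdges.entrySet()) {
            if (accepted.contains(entry.getValue())) {
                bipartiteSources++;
            }
        }
        return bipartiteSources;
    }

    public static void main(String[] args) {
        // State after reconfiguration: the former SCATTER_GATHER edge is now CUSTOM.
        Map<String, DataMovementType> edges = new LinkedHashMap<>();
        edges.put("mapVertex", DataMovementType.CUSTOM);
        edges.put("sideInputVertex", DataMovementType.BROADCAST);

        // Original check: only SCATTER_GATHER counts, so nothing is found.
        int strict = countBipartiteSources(edges,
                EnumSet.of(DataMovementType.SCATTER_GATHER));
        // Suggested check: CUSTOM is also accepted.
        int relaxed = countBipartiteSources(edges,
                EnumSet.of(DataMovementType.SCATTER_GATHER, DataMovementType.CUSTOM));

        System.out.println("strict=" + strict + " relaxed=" + relaxed);
        // prints "strict=0 relaxed=1"
    }
}
```

      With the strict set the count is zero, which corresponds to the condition that raises the exception in the trace above; with CUSTOM accepted, the reconfigured edge is counted again.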
      

          People

            Assignee: Unassigned
            Reporter: Ming Ma (mingma)