PIG-1734

Pig needs a more efficient DAG execution

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      The current code uses Hadoop's JobControl to execute one stage at a time. The first stage includes all jobs with no dependencies, the second stage includes the jobs that depend only on jobs completed in the first stage, the third stage contains the jobs that depend on jobs from stages 1 and 2, and so on.

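      The problem with this simplistic approach is that each stage starts only when the previous stage has finished, which means that some branches of the DAG are unnecessarily blocked. For example, a job whose only dependency finished early still has to wait for the slowest job in that dependency's stage.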
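
To make the blocking concrete, here is a minimal sketch that compares finish times under stage-at-a-time execution with finish times when each job starts as soon as its own dependencies are done. It is an illustration only, not Pig's or Hadoop's code: the function names, the four-job DAG, and the durations are all hypothetical, and it assumes unlimited parallel slots.

```python
from collections import defaultdict


def stages(deps):
    """Group jobs into stages: a job's stage is 1 + the highest stage of its dependencies."""
    level = {}

    def visit(job):
        if job not in level:
            level[job] = 1 + max((visit(d) for d in deps[job]), default=0)
        return level[job]

    for job in deps:
        visit(job)
    grouped = defaultdict(list)
    for job, lvl in level.items():
        grouped[lvl].append(job)
    return [sorted(grouped[lvl]) for lvl in sorted(grouped)]


def staged_finish_times(deps, duration):
    """Finish time of each job when a stage starts only after the previous stage ends."""
    finish, clock = {}, 0
    for stage in stages(deps):
        for job in stage:
            finish[job] = clock + duration[job]
        clock = max(finish[job] for job in stage)  # wait for the slowest job in the stage
    return finish


def dependency_finish_times(deps, duration):
    """Finish time of each job when it starts as soon as its own dependencies finish."""
    finish = {}

    def visit(job):
        if job not in finish:
            start = max((visit(d) for d in deps[job]), default=0)
            finish[job] = start + duration[job]
        return finish[job]

    for job in deps:
        visit(job)
    return finish


# Hypothetical DAG with two independent branches: A -> C and B -> D.
deps = {"A": [], "B": [], "C": ["A"], "D": ["B"]}
duration = {"A": 1, "B": 10, "C": 10, "D": 1}

print(stages(deps))                             # [['A', 'B'], ['C', 'D']]
print(staged_finish_times(deps, duration))      # C finishes at 20
print(dependency_finish_times(deps, duration))  # C finishes at 11
```

In this example C depends only on A, yet under staged execution it cannot start until B (the slow job in stage 1) is done, so the whole DAG takes 20 time units instead of 11.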

      We would need to do our own DAG management to solve this issue, which would be a pretty significant undertaking. It is something we should look at in the future.
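
As a rough idea of what "our own DAG management" could mean, the sketch below submits each job the moment all of its dependencies have completed, instead of waiting for a whole stage. It is an assumption-laden toy, not a proposal for the actual implementation: `run_dag` and `run_job` are hypothetical names, jobs are plain Python callables run on a thread pool rather than Hadoop jobs, and there is no retry or restart handling.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_dag(deps, run_job, max_workers=4):
    """Run run_job(name) for every job, starting each one as soon as its dependencies finish.

    deps maps a job name to the names it depends on. Returns the completion order.
    """
    remaining = {job: set(d) for job, d in deps.items()}  # unmet dependencies per job
    running = {}                                          # future -> job name
    order = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while remaining or running:
            # Submit every job whose dependencies are all satisfied.
            ready = [job for job, unmet in remaining.items() if not unmet]
            for job in ready:
                del remaining[job]
                running[pool.submit(run_job, job)] = job
            if not running:
                raise ValueError("cycle or unknown dependency in DAG")
            # Block until at least one running job finishes, then unblock its children.
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                job = running.pop(future)
                future.result()  # re-raise the job's exception, if any
                order.append(job)
                for unmet in remaining.values():
                    unmet.discard(job)
    return order


if __name__ == "__main__":
    # Hypothetical DAG with two independent branches: A -> C and B -> D.
    example = {"A": [], "B": [], "C": ["A"], "D": ["B"]}
    print(run_dag(example, lambda name: None))
```

With this shape, C becomes eligible as soon as A completes, regardless of how long B is still running.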

          Activity

          Ahmed Radwan made changes -
          Link This issue is related to MAPREDUCE-4495 [ MAPREDUCE-4495 ]
          Olga Natkovich made changes -
          Fix Version/s 0.10 [ 12316246 ]
          Olga Natkovich made changes -
          Fix Version/s 0.10 [ 12316246 ]
          Arun C Murthy added a comment -

          +1 on a more efficient DAG execution engine, and for exploring common infrastructure between Pig and Hive.

          It's hard to keep this in sync with HIVE-549, but I'll try.

          Jeff and I came up with some requirements:

          1. A way to serialize and exchange this DAG (e.g. Avro, JSON, XML)
          2. A service to execute the DAG and ensure it runs to completion
          3. Ability to modify the DAG on the fly, potentially in reaction to execution of parents of the nodes.
          4. Maybe shared infrastructure for ability to restart the necessary components of the DAG etc.

          Given the above, I do not believe Oozie is the right answer. I'd agree with Zheng (https://issues.apache.org/jira/browse/HIVE-1107?focusedCommentId=12805351&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12805351) that enhancing JobControl would probably be the sweet spot; this way Pig, Hive and even Oozie can use it.

          Russel Jurney has similar views against using Oozie too: https://issues.apache.org/jira/browse/HIVE-1107?focusedCommentId=12888870&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12888870

          Jeff Hammerbacher added a comment -

          "However note that the use of the workflow execution engine should not be enforced but should be optional."

          Certainly agree that we shouldn't disrupt existing users.

          Santhosh Srinivasan added a comment -

          +1 on the proposal to move to an external workflow execution engine. However note that the use of the workflow execution engine should not be enforced but should be optional.

          Jeff Hammerbacher made changes -
          Link This issue is related to HIVE-1107 [ HIVE-1107 ]
          Jeff Hammerbacher added a comment -

          Some work in this direction has been done by the Hive team (HIVE-549). There has also been a proposal for Pig and Hive to unify their plan execution frameworks (HIVE-1107), potentially using Oozie.

          Jeff Hammerbacher made changes -
          Link This issue is related to HIVE-549 [ HIVE-549 ]
          Olga Natkovich created issue -

            People

            • Assignee: Unassigned
            • Reporter: Olga Natkovich
            • Votes: 1
            • Watchers: 17
