HIVE-1107

Generic parallel execution framework for Hive (and Pig, and ...)

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Query Processor
    • Labels:
      None
    • Tags:
      plan execution parallel workflow

      Description

      Pig and Hive each have their own libraries for handling plan execution. As we prepare to invest more time improving Hive's plan execution mechanism we should also start to consider ways of building a generic plan execution mechanism that is capable of supporting the needs of Hive and Pig, as well as other Hadoop data flow programming environments.

        Issue Links

          Activity

          Keren Ouaknine added a comment -

          Thanks Bertrand. Once we have the MR job out of Algebricks, we can add YARN compatibility at some point. The advantage is that there is one plan to support rather than one for each query language.

          We implemented Pig2Algebricks for a join query and also wrote up the design for Algebricks2MR. Design details are at: http://www.kereno.com/gp/2012/09/09/convert-algebricks-to-mr-design-and-implementation/
          We are currently working on its implementation.

          Ahmed, thanks for the link to MAPREDUCE-4495. I will keep tracking this JIRA.

          Ahmed Radwan added a comment -

          I wanted to point out that there is work on YARN to add a new Workflow Application Master (see MAPREDUCE-4495). It would be helpful for the Hive community to look into that and provide early feedback. I envision that such a Workflow Application Master could be used by projects like Hive, Pig, etc., so they would share a common workflow engine.

          Bertrand Dechoux added a comment -

          Wouldn't "a port of Hive and Pig to YARN" be a way to explain the project?
          As far as I know (and I don't know much), YARN support in Pig/Hive is not mature, and your project could provide a common ground for their migration (or at least a demonstration that it is possible, and of the effort-to-benefit ratio).

          Keren Ouaknine added a comment -

          Hello, I am blogging progress on the project on:
          http://www.kereno.com/gp/

          All comments are welcome, thank you!

          Milind Bhandarkar added a comment -

          I have requested Keren to look at Hyracks (http://code.google.com/p/hyracks/) and Asterix. This is built by Mike Carey's research group at UC Irvine, and recently ex-Yahoo! researchers (Marcus Weimer & Tyson Condie) produced some interesting results using Hyracks. Please note that this is exploratory work right now, and just a proposed approach. Not a final choice.

          Carl Steinbach added a comment -

          @Milind: Which open-source execution engine are you using?

          Milind Bhandarkar added a comment -

          @jco, @squarecog et al.

          I have had initial discussions with Keren about this project. I agree that full production-quality unification is a lot of work, and would require superhuman effort to get it done in 4 months. So, Keren's internship with Greenplum will focus on producing a prototype. The approach I have proposed to her is to take an open-source execution engine that has a superset of these operators and is already proven to work with Hadoop (in fact, it already has Hive implemented on top of it), and to make sure a subset of Pig operators is ported to it.

          Jonathan Coveney added a comment -

          Pig committers coming out of the woodwork

          Keren: I really like this idea in the abstract, and have talked with many people about it. It's on everyone's mind.

          That said, I agree completely with Dmitriy. Proving that you can pipe one random unified operator through Pig and Hive isn't going to prove very much. The hard part is going to be creating a system generic enough to handle the diverse object models and extension APIs (UDFs, load funcs, etc.), as well as decoupling highly Pig- or Hive-specific code from their respective logical plans. Obviously, if you just do a "load + foreach", it's going to be much easier than building a system that can handle the extensibility people count on.

          Godspeed. I'll definitely read anything you guys propose. CC the Pig listserv.

          Dmitriy V. Ryaboy added a comment -

          I don't think comparing implementations of individual operators is going to lead to anything useful – one, because they work in the larger context of their respective frameworks, and two, because the real value is in converging on a single optimizer, not on "best" implementations of filters or joins.

          Edward Capriolo added a comment -

          It would be interesting to see a multi-stage Hive job simply turn into an Oozie DAG.

          Ashutosh Chauhan added a comment -

          Thanks, Keren, for picking this up!
          I think there is some interest in exploring this idea, and there is certainly value in prototyping it. If you have any more design details in addition to the PDF you linked, do share; I am sure interested folks will provide you with useful feedback.

          Keren Ouaknine added a comment -

          Thanks Edward. This project is merely a prototype for two operators. With input from the community, we intend to propose a design and a couple of implementation examples. If you have time, I would be happy to discuss it with you (I am more familiar with Pig's layers than Hive's, and your help would be much appreciated).

          Edward Capriolo added a comment -

          That seems like an incredible amount of work to be done in 4 months.

          Keren Ouaknine added a comment -

          Hello, I would like to unify the execution layers of Pig and Hive. Greenplum hired me for this project, and I will be working on it during this summer (June to September).
          As part of this project, we would also like to integrate MRv2 logic and advantages into the MR plans generated by these query languages.

          The idea is to choose a couple of operators, compare their execution, and merge them (or simply choose one). The advantage of this project is to have Pig and Hive developers working on a single compiler and optimizer of the MR plan. I would love to hear your comments.

          For more details on this project:
          http://kereno.com/Plans_for_unification_of_Pig_and_Hive.pdf

          Carl Steinbach added a comment -

          Hive/Pig might want to change the #reduces of job3 in the pipeline after looking at the output of job1 and job2 - this is not necessarily changing the DAG itself, but changing the components of the DAG.

          Oozie already supports this:
          http://yahoo.github.com/oozie/releases/2.0.0/WorkflowFunctionalSpec.html#DecisionNode
          http://yahoo.github.com/oozie/releases/2.0.0/WorkflowFunctionalSpec.html#WorkflowELSupport

          Arun C Murthy added a comment -

          Adaptive query optimization is indeed a noble goal. Oozie seems to think at the level of workflow rather than dataflow, so as you say, it may not be an appropriate layer for performing these optimizations. I'm not sure if it detracts from the ability of Hive or Pig to perform adaptive query optimization though, either.

          Anyways, thanks for the discussion. We're certainly thinking through these issues as well.

          Yep, this is a fun discussion, thanks to you too.

          A simple example:

          Hive/Pig might want to change the #reduces of job3 in the pipeline after looking at the output of job1 and job2 - this is not necessarily changing the DAG itself, but changing the components of the DAG.
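This kind of runtime decision can be sketched as a small heuristic. Everything below is an illustrative assumption: the 1 GiB-per-reducer target, the cap, and the function name are invented for the example, not taken from Hive's or Pig's planners.

```python
# Hypothetical heuristic: pick job3's reducer count from the observed
# output sizes of its upstream jobs (job1 and job2), targeting roughly
# 1 GiB of input per reducer. The target and cap are illustrative values.
BYTES_PER_REDUCER = 1 << 30
MAX_REDUCERS = 999

def reducers_for_next_job(upstream_output_bytes):
    """Choose a reducer count after upstream jobs finish, instead of
    fixing it in the compiled plan."""
    total = sum(upstream_output_bytes)
    # Round up so any nonzero input gets at least one reducer.
    wanted = max(1, (total + BYTES_PER_REDUCER - 1) // BYTES_PER_REDUCER)
    return min(wanted, MAX_REDUCERS)

# job1 produced 3 GiB and job2 produced 500 MiB, so job3 gets 4 reducers.
print(reducers_for_next_job([3 << 30, 500 << 20]))  # 4
```

The arithmetic itself is nothing new; the point is only that it could run between DAG stages, after the real output sizes are known, rather than at compile time.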

          As you point out, Oozie is at the level of workflow, not dataflow, and thus might be cumbersome to deal with for these purposes. Sure, you could support this in Oozie, but I'm not sure it is the right thing to do.


          In some way, my sense is that we need a common DAG-execution library for Pig and Hive, not a DAG-execution framework. Thoughts?

          Jeff Hammerbacher added a comment -

          Gah, can't edit, but of course I meant "objections", not "objects".

          Jeff Hammerbacher added a comment -

          Okay, thanks. Let me try to pull apart the issues so that I can understand them:

          Oozie is more complex than Pig and HIVE put together. Compare their manuals, both in terms of length and readability.

          Oozie is (nearly?) Turing-complete XML, not an easily human-readable script, and scheduling one job takes far too much of it.

          Also, there is no need to force Oozie either; people can use Azkaban etc. for workflow.

          Each of these objects seems moot, given that Oozie would be targeted by the Hive and Pig developers, not the Hive and Pig users. No Hive or Pig user would be required to write Oozie: the configuration files would be generated by the Hive and Pig query planners, as I understand it.

          I believe, mid-to-long term, that Pig/Hive will get significantly smarter about the way they construct MR jobs - they will want to run some of the nodes in the DAG, wait for their output (e.g. a sampler) and then make ever more complicated decisions to modify the DAG. I believe Oozie isn't the right tool to be using for this purpose.

          Adaptive query optimization is indeed a noble goal. Oozie seems to think at the level of workflow rather than dataflow, so as you say, it may not be an appropriate layer for performing these optimizations. I'm not sure if it detracts from the ability of Hive or Pig to perform adaptive query optimization though, either.

          Anyways, thanks for the discussion. We're certainly thinking through these issues as well.

          Arun C Murthy added a comment -

          I think Russell did a good job explaining it.

          I'll add some more:

          I believe, mid-to-long term, that Pig/Hive will get significantly smarter about the way they construct MR jobs - they will want to run some of the nodes in the DAG, wait for their output (e.g. a sampler) and then make ever more complicated decisions to modify the DAG. I believe Oozie isn't the right tool to be using for this purpose. Also, there is no need to force Oozie either; people can use Azkaban etc. for workflow.

          Jeff Hammerbacher added a comment -

          I agree with Russell that Oozie seems too complicated for this task.

          Could you provide more color here? What aspects of Oozie make it too complicated for this task?

          Arun C Murthy added a comment -

          +1 on the direction to get Pig and Hive to use common infrastructure for DAG execution.

          1) A way to serialize and exchange this DAG (e.g. Avro, JSON, XML)
          2) A service to execute the DAG and ensure it runs to completion

          +1

          Some more:

          1. The ability to modify the DAG on the fly, potentially in reaction to the execution of a node's parents.
          2. Perhaps shared infrastructure for restarting the necessary components of the DAG, etc.

          I agree with Russell that Oozie seems too complicated for this task.

          Potentially, as Zheng suggested, a generalized form of JobControl from Map-Reduce could be the answer; it could be something that Pig, Hive, and potentially even Oozie can co-opt.

          Carl Steinbach added a comment -

          The only simple interface to Oozie is a proprietary GUI.

          Which Oozie GUI are you talking about? Can you provide a link? I'd really like to check this out.

          Jeff Hammerbacher added a comment -

          Russell,

          Let's not focus too hard on the name of the particular workflow execution engine.

          The idea here is that a program of some sort (Hive query or set of Pig statements) must be processed and a physical plan of MapReduce operators produced. Once you have a DAG of operators to carry out, you need:

          1) A way to serialize and exchange this DAG (e.g. Avro, JSON, XML)
          2) A service to execute the DAG and ensure it runs to completion

          Of course, things aren't this simple; for example, we need a consistent way to handle side data generated by an operator.

          The goal of this proposal was to encourage Hive and Pig to target the same plan serialization format so that a single plan execution engine could be used. That way, work that is done on monitoring, capturing metadata from, and ensuring the reliability of multi-stage DAGs of MapReduce can be reused rather than reimplemented in each system.
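As a minimal sketch of those two pieces, consider the toy plan format and executor below. The JSON schema, names, and functions are invented for illustration; this is not an actual Hive or Pig plan format.

```python
import json

# Hypothetical serialized plan: each node lists the jobs it depends on.
PLAN_JSON = """
{
  "jobs": [
    {"name": "job1", "deps": []},
    {"name": "job2", "deps": []},
    {"name": "job3", "deps": ["job1", "job2"]}
  ]
}
"""

def execution_order(plan):
    """Topologically sort the DAG so each job runs after its dependencies.
    (No cycle detection in this sketch.)"""
    deps = {j["name"]: j["deps"] for j in plan["jobs"]}
    order, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for d in deps[name]:
            visit(d)
        order.append(name)

    for name in deps:
        visit(name)
    return order

def execute(plan, submit):
    """The 'service' half: run every job to completion, in order.
    'submit' stands in for launching a real MapReduce job."""
    for name in execution_order(plan):
        submit(name)

execute(json.loads(PLAN_JSON), submit=print)  # job1, job2, job3
```

Anything that reads this format could monitor, retry, or record metadata for the DAG once, on behalf of every engine that emits it.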

          Some arguments against this idea: component modularity can introduce inefficiencies, may make the overall system feel more complex, and does not deliver user-visible features despite the large effort required for implementation.

          I believe the convergence of Pig and Hive on this front would be beneficial to the larger Hadoop community, but it's a large undertaking, and each organization has its own goals for its infrastructure.

          Later,
          Jeff

          Russell Jurney added a comment -

          At Jeff's suggestion, my comments on this ticket for Hive and Pig follow.

          Oozie has been suggested as a solution to this ticket, but in my opinion it is far too complex to be appropriate for Pig or HIVE. A scheduler should not be more complex than the language it schedules, and Oozie is more complex than Pig and HIVE put together. Compare their manuals, both in terms of length and readability. Furthermore, Oozie is (nearly?) Turing-complete XML, not an easily human-readable script, and scheduling one job takes far too much of it.

          Pig and HIVE aim to deliver simplicity and accessibility. In time Oozie may mature, but it is not there yet. The features are present, but the open-source interface is extremely raw. The only simple interface to Oozie is a proprietary GUI. Perhaps the next major release will be an improvement.

          A tight binding between these projects would cause LinkedIn problems, as we use Azkaban to schedule Pig jobs. Scheduling a job in Azkaban consists of creating a zip file of your job's content, inserting a very brief config (typically 3-6 lines), and issuing a one-line command. The web interface to Azkaban is free. This makes it a more appropriate choice for this ticket than Oozie, but making Azkaban tightly bound to Pig would be a terrible idea too.

          We should be very careful about adding enterprise baggage to these tools that is simply not needed for the vast majority of users. Convention over configuration is at the core of Pig and HIVE. Let's not spoil that.

          Zheng Shao added a comment -

          Hadoop has the JobControl classes, which could be generalized to support our needs.

          The current major limitations of JobControl are:
          1. No way to add jobs that are not MapReduce jobs. Hive has a lot of other job types as well, including MoveTask, etc.
          2. No way to serialize the jobs and resume progress at a later time.
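A generalized JobControl addressing both limitations might look like the sketch below. The class and method names are invented for illustration; a real version would drive actual Hadoop jobs rather than bare Python callables.

```python
import json

class Task:
    """A generalized unit of work: a MapReduce job, a MoveTask, or any
    callable at all - addressing limitation 1 (non-MapReduce jobs)."""
    def __init__(self, name, action, deps=()):
        self.name, self.action, self.deps = name, action, list(deps)

class GeneralJobControl:
    """Runs tasks in dependency order. The completed set can be
    checkpointed to JSON and fed back in later - addressing
    limitation 2 (serialize and resume)."""
    def __init__(self, tasks, completed=()):
        self.tasks = {t.name: t for t in tasks}
        self.completed = set(completed)

    def _ready(self):
        # Tasks whose dependencies are all done and which haven't run yet.
        return [t for t in self.tasks.values()
                if t.name not in self.completed
                and all(d in self.completed for d in t.deps)]

    def run(self):
        while len(self.completed) < len(self.tasks):
            ready = self._ready()
            if not ready:
                raise RuntimeError("cycle or unsatisfiable dependency")
            for t in ready:
                t.action()
                self.completed.add(t.name)

    def checkpoint(self):
        return json.dumps({"completed": sorted(self.completed)})

log = []
jc = GeneralJobControl([
    Task("mr1", lambda: log.append("mr1")),
    Task("move_output", lambda: log.append("move_output"), deps=["mr1"]),
])
jc.run()
print(jc.checkpoint())  # {"completed": ["move_output", "mr1"]}
```

Resuming is just `GeneralJobControl(same_tasks, completed=json.loads(cp)["completed"])`: only the unfinished tasks run.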


            People

            • Assignee:
              Unassigned
            • Reporter:
              Carl Steinbach
            • Votes:
              2
            • Watchers:
              38

              Dates

              • Created:
                Updated:

                Development