Pig / PIG-120

Support Hadoop map reduce in local mode

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.1.0
    • Component/s: None
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      Currently Pig supports mapreduce and local as execution modes. LocalExecutionEngine is used for local and HExecutionEngine for map reduce. HExecutionEngine always expects Hadoop to run as a cluster, with a name node and a jobtracker listening on a port.
      However, Hadoop can also run in a local mode (LocalJobRunner), which would give several advantages.
      First, it would speed up the test suite significantly. Second, it would make it possible to debug map reduce plans easily.
      For example, we were able to reproduce and debug PIG-110 with this method.

      1. PIG-120_v_1.patch
        5 kB
        Stefan Groschupf

          Activity

          Stefan Groschupf added a comment -

          A patch that allows running Pig in mapreduce mode while using the Hadoop LocalJobRunner. This is certainly not the most elegant solution, but it is a starting point. As mentioned in PIG-121, I guess HExecutionEngine and related classes need a cleanup anyhow.
          This patch would be very useful for phase 1 of PIG-119. It would be great if we could get this into trunk as a start, since I guess PIG-121 will require some more discussion and work.

          Alan Gates added a comment -

          Based on your comments in https://issues.apache.org/jira/browse/PIG-119, my understanding is that you want to be able to run the existing pig tests in hadoop's local mode. But this patch provides a different test that runs in local mode. Is this not a step in the wrong direction?

          Stefan Groschupf added a comment -

          Alan, sorry, I'm not sure I follow you. In general I see three ways to run Pig: LocalExecutionEngine; HadoopExecutionEngine with map reduce on a Hadoop cluster; and HadoopExecutionEngine with map reduce using Hadoop's LocalJobRunner.
          The third is very helpful for debugging and profiling, since the HadoopExecutionEngine is used but everything runs in the same JVM.
          This patch makes it possible by not using a port when the name node and job tracker are "local", and by not opening a remote proxy to the jobtracker in that case.

          Does that make sense?
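
For illustration only, the behavior described in this comment (no port and no remote proxy when the name node and job tracker are "local") might be sketched as below. This is not taken from PIG-120_v_1.patch: the class name, method names, host name, and port number are all hypothetical.

```java
/** Hypothetical helper; names are illustrative and not from the PIG-120 patch. */
public class LocalModeCheck {
    static final String LOCAL = "local";

    /** True when both the job tracker and the name node are set to "local". */
    public static boolean isLocalMode(String jobTracker, String nameNode) {
        return LOCAL.equals(jobTracker) && LOCAL.equals(nameNode);
    }

    /** Appends a port only to a real host name; "local" is returned unchanged. */
    public static String withPort(String host, int port) {
        if (LOCAL.equals(host)) {
            return host; // local mode: nothing listens on a port
        }
        return host.contains(":") ? host : host + ":" + port;
    }

    /** A remote proxy to the jobtracker is only needed outside local mode. */
    public static boolean needsRemoteProxy(String jobTracker, String nameNode) {
        return !isLocalMode(jobTracker, nameNode);
    }

    public static void main(String[] args) {
        // The port value 9001 and the host name are assumptions for the example.
        System.out.println(withPort("local", 9001));
        System.out.println(withPort("jobtracker.example.org", 9001));
        System.out.println(needsRemoteProxy("local", "local"));
    }
}
```

The point of the sketch is that a single check on the configured addresses decides both whether to append a port and whether to open a remote connection, so the rest of the engine can stay unchanged.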

          Alan Gates added a comment -

          If someone says:

          ant -Dtest.mode=local

          what happens? Does it run all the same tests as usual, only using local mode hadoop? Or does it run only tests that are specific to local mode hadoop? I was envisioning the former, but your inclusion in the patch of a test (TestLocalMapReduce) that is specific to local mode made me think you were suggesting the latter.

          Your changes to HExecutionEngine would support either I think.

          Stefan Groschupf added a comment -

          I'm very sorry for the confusion.
          In general, the same tests run in every case; we just switch execution engines and execution engine configurations.
          ant -Dtest.mode=excelLocal -> runs the Pig local execution engine
          ant -Dtest.mode=mapredLocal -> runs the Hadoop execution engine using Hadoop's LocalJobRunner. This should be the default, since the test suite would run in less than 50% of the time
          ant -Dtest.mode=mapredCluster -> runs the Hadoop execution engine with the minicluster.

          My test case only tests whether it is possible to set "local" as nameNode and jobtracker, nothing else.

          I guess we can find better names for the test modes.
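
As a rough sketch of how the three proposed test modes could be mapped to an execution setup, assuming the build forwards `test.mode` to the test JVM as a system property (the thread does not say how this is wired). The class and enum names are hypothetical; only the mode strings come from the comment above.

```java
/** Hypothetical selector for the proposed test modes; not part of the patch. */
public class TestModeSelector {
    /** The three execution setups named in the comment above. */
    public enum Engine { PIG_LOCAL, HADOOP_LOCAL_JOB_RUNNER, HADOOP_MINI_CLUSTER }

    /** Maps a test.mode value to an engine; an unset mode falls back to mapredLocal. */
    public static Engine select(String testMode) {
        if (testMode == null) {
            // Proposed default: Hadoop engine with the LocalJobRunner.
            return Engine.HADOOP_LOCAL_JOB_RUNNER;
        }
        switch (testMode) {
            case "excelLocal":
                return Engine.PIG_LOCAL;
            case "mapredLocal":
                return Engine.HADOOP_LOCAL_JOB_RUNNER;
            case "mapredCluster":
                return Engine.HADOOP_MINI_CLUSTER;
            default:
                throw new IllegalArgumentException("Unknown test.mode: " + testMode);
        }
    }

    public static void main(String[] args) {
        // Reads -Dtest.mode=... if present; null otherwise.
        System.out.println(select(System.getProperty("test.mode")));
    }
}
```

With a single selector like this, every test would obtain its engine from one place, which matches the stated goal that the same tests run in every mode.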

          Pi Song added a comment -

          +1 on the concept. Allowing Hadoop local execution mode will be very beneficial for testing with a subset of data before going into production. In theory, outputs from Pig local and Hadoop MapReduce should be exactly the same, but I have sometimes found that they differ. In such cases, I would trust local Hadoop more than Pig local for my development.

          So, with this patch, if I want to run Pig on local Hadoop, I just have to set the cluster and nameNode properties to "local", right?

          Stefan Groschupf added a comment -

          Yes, just set the cluster name to local.
          This patch is a beginning; I would love to support this more explicitly in the future.
          A related problem, for example, is that today we cannot define a jobtracker and a namenode on different hosts.

          The configuration patch will solve some basic problems here. Please vote for it, for better Hadoop local mode support in the future.

          Alan Gates added a comment -

          Fix checked in as revision 633652. Thanks Stefan.


            People

            • Assignee:
              Stefan Groschupf
              Reporter:
              Stefan Groschupf
            • Votes:
              0
              Watchers:
              0
