Pig / PIG-506

Does pig need a NATIVE keyword?

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: impl
    • Labels:
    • Hadoop Flags: Reviewed

      Description

      Assume a user has a job that breaks naturally into three pieces. Further assume that pieces one and three are easily expressible in pig, but that piece two needs to be written in map reduce for some reason (performance, something that pig cannot easily express, a legacy job that is too important to change, etc.). Today the user would have to either use map reduce for the entire job or manually stitch the pig and map reduce jobs together. What if pig instead provided a NATIVE keyword that allowed the script to pass the data stream off to the underlying system (in this case map reduce)? The semantics of NATIVE would vary by underlying system. In the map reduce case, we would assume that it indicates a collection of one or more fully contained map reduce jobs, so that pig would store the data, invoke the map reduce jobs, and then read the resulting data to continue. It might look something like this:

      A = load 'myfile';
      X = load 'myotherfile';
      B = group A by $0;
      C = foreach B generate group, myudf(B);
      D = native (jar=mymr.jar, infile=frompig, outfile=topig);
      E = join D by $0, X by $0;
      ...
      

      This differs from streaming in that it allows the user to insert an arbitrary amount of native processing, whereas streaming allows the insertion of one binary. It also differs in that, for streaming, data is piped directly into and out of the binary as part of the pig pipeline. Here the pipeline would be broken, data written to disk, and the native block invoked, then data read back from disk.
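The store, invoke, read-back sequence described above can be sketched in plain Java. This is only an illustration under stated assumptions: the `NativeJob` interface, the `runNative` helper, and the toy upper-casing job are hypothetical names, the local filesystem stands in for the cluster's storage, and the `frompig`/`topig` names are borrowed from the example script. It is not how pig implements the feature.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NativeHandoffSketch {
    /** Hypothetical stand-in for the native block: reads one location, writes another. */
    interface NativeJob {
        void run(Path input, Path output) throws IOException;
    }

    /** Stores the upstream rows, invokes the job, then loads whatever the job wrote. */
    static List<String> runNative(List<String> upstream, NativeJob job) throws IOException {
        Path dir = Files.createTempDirectory("native-handoff");
        Path fromPig = dir.resolve("frompig");
        Path toPig = dir.resolve("topig");
        Files.write(fromPig, upstream);    // the pipeline is broken here: data goes to disk
        job.run(fromPig, toPig);           // control passes to the native block
        return Files.readAllLines(toPig);  // results are read back so the script can continue
    }

    /** Toy "job" used only for the demonstration: upper-cases every line. */
    static void upperCaseJob(Path input, Path output) throws IOException {
        List<String> result = new ArrayList<>();
        for (String line : Files.readAllLines(input)) {
            result.add(line.toUpperCase());
        }
        Files.write(output, result);
    }

    public static void main(String[] args) throws IOException {
        List<String> rows = runNative(Arrays.asList("a", "b"), NativeHandoffSketch::upperCaseJob);
        System.out.println(rows); // prints [A, B]
    }
}
```

The point of the sketch is the contrast with streaming: nothing is piped between the two sides, and the only contract is the pair of locations on disk.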

      Another alternative is to say this is unnecessary because the user can do the coordination from java, using the PigServer interface to run pig and calling the map reduce job explicitly. The advantages of the native keyword are that the user need not worry about coordination between the jobs, since pig will take care of it, and that the user can make use of existing java applications without being a java programmer.

      1. NativeImplInitial.patch
        40 kB
        Aniket Mokashi
      2. NativeMapReduceFinale1.patch
        53 kB
        Aniket Mokashi
      3. NativeMapReduceFinale2.patch
        65 kB
        Aniket Mokashi
      4. NativeMapReduceFinale3.patch
        66 kB
        Aniket Mokashi
      5. PIG-506.2.patch
        42 kB
        Thejas M Nair
      6. PIG-506.3.patch
        46 kB
        Thejas M Nair
      7. PIG-506.patch
        68 kB
        Thejas M Nair
      8. TestWordCount.jar
        3 kB
        Aniket Mokashi

        Activity

        Thejas M Nair added a comment -

        patch PIG-506.3.patch with changes suggested by Daniel committed to trunk.

        Daniel Dai added a comment -

        Patch looks good. One minor comment: PlanHelper.LoadStoreFinder might be better named PlanHelper.LoadStoreNativeFinder.

        Thejas M Nair added a comment -

        Updated patch; the earlier patch was missing src/org/apache/pig/newplan/logical/relational/LONative.java.

        test-patch and core tests are successful.

        [exec] +1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 8 new or modified tests.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
        [exec]
        [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

        Thejas M Nair added a comment -

        PIG-506.2.patch contains:

        • Changes to get the mapreduce operator working with the new logical plan.
        • Changes to the LO/PO Native operators. The store and load for the operator are no longer within it; they are part of the plan. As a result, several changes made in visitors for handling the load/store within LONative have been reverted.
        • A fix for reporting failure when the MR job corresponding to the native operator fails.
        • Removal of TestNativeMapReduce from the exclude list in the ant target.

        Some issues are still to be fixed, which I will address in new JIRAs:

        • PIG-1570 The code path for handling failure in the MR job corresponding to native MR is different and does not have the same behavior.
        • PIG-1571 If the output file for native MR exists, the query does not fail at compile time; it fails only at runtime. This file is loaded in the nested load of the native MR operator, so it should be possible to check for it.
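As a rough illustration of the kind of up-front check PIG-1571 asks for, the sketch below rejects an already existing output location before any job is launched. It is an assumption-laden stand-in: `OutputCheckSketch` and `checkOutputAbsent` are hypothetical names, and the local filesystem is used in place of whatever file system layer pig would actually consult.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class OutputCheckSketch {
    /** Fails fast if the location the native job is expected to write to already exists. */
    static void checkOutputAbsent(Path loadLocation) {
        if (Files.exists(loadLocation)) {
            throw new IllegalStateException("Output location already exists: " + loadLocation);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("output-check");
        checkOutputAbsent(dir.resolve("fresh-output")); // passes: nothing is there yet
        try {
            checkOutputAbsent(dir);                     // the directory itself already exists
        } catch (IllegalStateException e) {
            System.out.println("rejected before launch: " + e.getMessage());
        }
    }
}
```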
        Thejas M Nair added a comment -

        Unit tests passed, and I committed the changes. However, the tests fail with the latest changes that switch to the new logical plan, so I have added the test cases to the exclude list in build.xml.
        Keeping the jira open until this is fixed.

        Thejas M Nair added a comment -

        The new patch addresses my comments.
        test-patch results:
        [exec] -1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 10 new or modified tests.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
        [exec]
        [exec] -1 release audit. The applied patch generated 433 release audit warnings (more than the trunk's current 425 warnings).

        The release audit warnings are for the javadoc HTML files.
        I will commit once all unit tests pass.

        Thejas M Nair added a comment -

        Additional changes are also required in SchemaAliasVisitor and InputOutputFileVisitor. I will submit a new patch addressing this and the other comments.
        Created PIG-1553 to address the handling of the parallel keyword.
        (Assigning back to Aniket since the changes I am making are relatively minor.)

        Aniket Mokashi added a comment -

        The current patch doesn't handle the parallel keyword. We can fix this by adding -D mapred.reduce.tasks=n to the params of RunJar. (ToDo)

        Thejas M Nair added a comment -

        Additional comment on the patch:

        • In PigStats.java, the null check on js has been removed. Is it no longer necessary?
        Thejas M Nair added a comment -

        Comments on the patch:

        • In a lot of places there is code like "mro.getClass() == NativeMapReduceOper.class". It is better to use instanceof; the code will be easier to maintain if we decide to extend NativeMapReduceOper in the future. (It is also slightly more readable.)
        • In MapReduceLauncher.launchPig(), the code that calculates the progress and notifies it appears in two places; it can be moved to a function.
        • In TestNativeMapReduce.java, a comment mentioning the source of the jar file would be useful.
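The difference between the two checks in the first point shows up as soon as a subclass exists. The sketch below uses simplified stand-in classes that only borrow the names from the review comment; they are not the pig sources, and `TunedNativeOper` is a hypothetical future subclass invented for the illustration.

```java
final class InstanceofSketch {
    // Simplified stand-ins for the operator classes named in the review; not the pig sources.
    static class MapReduceOper {}
    static class NativeMapReduceOper extends MapReduceOper {}
    // Hypothetical future subclass, the case the review comment is concerned about.
    static class TunedNativeOper extends NativeMapReduceOper {}

    /** Exact class comparison: matches NativeMapReduceOper only, not its subclasses. */
    static boolean isNativeByClass(MapReduceOper mro) {
        return mro.getClass() == NativeMapReduceOper.class;
    }

    /** Type test: matches NativeMapReduceOper and anything that extends it. */
    static boolean isNativeByInstanceof(MapReduceOper mro) {
        return mro instanceof NativeMapReduceOper;
    }

    public static void main(String[] args) {
        MapReduceOper sub = new TunedNativeOper();
        System.out.println(isNativeByClass(sub));      // prints false
        System.out.println(isNativeByInstanceof(sub)); // prints true
    }
}
```

With the exact class comparison, every call site would need revisiting when a subclass is added; with instanceof the existing checks keep working.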
        Aniket Mokashi added a comment -

        Submitting the updated patch

        Aniket Mokashi added a comment -

        We also need to add this jar in lib to get tests working.
        ToDo- CreateTestJarAtRuntime

        Aniket Mokashi added a comment -

        Attaching the final patch-
        Includes - MR changes, optimizer related changes, test cases for basic mr.
        ToDo- Test cases for optimizer

        Aniket Mokashi added a comment -

        Wiki page explaining details of specification and implementation has been uploaded at - http://wiki.apache.org/pig/NativeMapReduce

        Aniket Mokashi added a comment -

        The attached patch has an initial implementation of this feature.
        Dump, store, and explain work fine. PigStats are generated properly.

        ToDos-
        Check for multiquery optimization related tests
        Add test cases

        Usage-
        A = load 'dict.txt';
        B = mapreduce 'hadoop-0.20.2-examples.jar' Store A into 'input' Load 'output' `wordcount input output`;

        Aniket Mokashi added a comment -

        For better reuse of the parser code with less distortion of the syntax, we can use:

        B = NATIVE ('mymr.jar' [, 'other.jar' ...]) STORE A INTO 'storeLocation' USING storeFunc LOAD 'loadLocation' USING loadFunc [params, ... ];
        

        params is needed because some map reduce jobs take parameters.
        Also, we assume that mymr.jar has a main method responsible for setting up the required jobconf.

        Alternatively,
        mymr.jar can expose a (documented) getJobConf() hook for pig, so that pig can take the JobConf from the mymr job, add more settings if needed, and run the job.
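The getJobConf() alternative could look roughly like the sketch below. Everything here is an assumption made for illustration: `JobConfProvider` and `prepare` are hypothetical names, `java.util.Properties` stands in for Hadoop's JobConf so the example stays self-contained, and the `example.*` keys are made up.

```java
import java.util.Properties;

final class JobConfHookSketch {
    /** Hypothetical hook a native jar could expose; Properties stands in for JobConf. */
    interface JobConfProvider {
        Properties getJobConf();
    }

    /** Starts from the job's own configuration and layers the caller's settings on top. */
    static Properties prepare(JobConfProvider job, String input, String output) {
        Properties conf = new Properties();
        conf.putAll(job.getJobConf());              // copy, so the job's own conf is untouched
        conf.setProperty("example.input", input);   // hypothetical key for the store location
        conf.setProperty("example.output", output); // hypothetical key for the load location
        return conf;
    }

    public static void main(String[] args) {
        Properties jobConf = new Properties();
        jobConf.setProperty("job.name", "wordcount");
        Properties merged = prepare(() -> jobConf, "storeLocation", "loadLocation");
        System.out.println(merged.getProperty("job.name"));      // prints wordcount
        System.out.println(merged.getProperty("example.input")); // prints storeLocation
    }
}
```

Compared with relying on the jar's main method, a hook of this kind would let the caller adjust the configuration before the job is submitted, at the cost of requiring the jar to implement an agreed interface.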

        Aniket Mokashi added a comment -

        Revised syntax, as in the proposal document (plus a few changes):

        B = native ('mymr.jar' [, 'other.jar' ...]) A store into 'storeLocation' using storeFunc load 'loadLocation' using loadFunc;
        

        mymr.jar contains the MR code the user wants to run.
        storeLocation is the location where the user's code expects to find the data.
        storeFunc is the storage function Pig will use to store the data (from A).
        loadLocation is where the user's code will write the result data.
        loadFunc is the load function Pig will use to reload the data (into B).
        other.jar stands for additional jars to be shipped, such as an InputFormat or OutputFormat for custom handling of mapreduce jobs.

        ashitosh added a comment -

        Sorry for the earlier link.
        I have uploaded my proposal to Google Docs:
        http://docs.google.com/View?id=dxzmjgh_2hbmr5zf9

        Thanks

        Ashutosh Chauhan added a comment -

        Ashitosh,

        When I click on that link, I get:

        You do not have the required role. 
        

        Do you need to set permissions for it to be world-readable? (if that is what you are intending to do)

        ashitosh added a comment -

        I have published my proposal on the GSoC application site:
        http://socghop.appspot.com/gsoc/student_proposal/private/google/gsoc2010/ashitosh/t127081039065
        Any feedback is more than welcome.

        Daniel Dai added a comment -

        Marking this as a candidate project for the "Google Summer of Code 2010" program.

        David Ciemiewicz added a comment -

        Alan,

        This seems a much cleaner way to set up native Hadoop map-reduce jobs than the command line interfaces people use today. It might be worth it for that alone.

        I think you'd need to gather some examples from non-Pig users and prototype them as Pig/NATIVE scripts to demonstrate what the advantages would be.

        For me, as primarily a Pig user, there is some appeal because I could benefit from borrowing others' code.


          People

          • Assignee: Aniket Mokashi
          • Reporter: Alan Gates
          • Votes: 3
          • Watchers: 6
