Pig
  1. Pig
  2. PIG-627

PERFORMANCE: multi-query optimization

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.2.0
    • Fix Version/s: 0.3.0
    • Component/s: None
    • Labels:
      None

      Description

      Currently, if your Pig script contains multiple stores and some shared computation, Pig will execute several independent queries. For instance:

      A = load 'data' as (a, b, c);
      B = filter A by a > 5;
      store B into 'output1';
      C = group B by b;
      store C into 'output2';

      This script will result in map-only job that generated output1 followed by a map-reduce job that generated output2. As the resuld data is read, parsed and filetered twice which is unnecessary and costly.

      1. doc-fix.patch
        5 kB
        Gunther Hagleitner
      2. error_handling_0415.patch
        27 kB
        Gunther Hagleitner
      3. error_handling_0416.patch
        27 kB
        Gunther Hagleitner
      4. file_cmds-0305.patch
        33 kB
        Gunther Hagleitner
      5. fix_store_prob.patch
        26 kB
        Gunther Hagleitner
      6. merge_741727_HEAD__0324_2.patch
        595 kB
        Gunther Hagleitner
      7. merge_741727_HEAD__0324.patch
        591 kB
        Gunther Hagleitner
      8. merge_trunk_to_branch.patch
        13 kB
        Gunther Hagleitner
      9. merge-041409.patch
        21 kB
        Gunther Hagleitner
      10. multiquery_0223.patch
        110 kB
        Gunther Hagleitner
      11. multiquery_0224.patch
        146 kB
        Gunther Hagleitner
      12. multiquery_0306.patch
        32 kB
        Richard Ding
      13. multiquery_explain_fix.patch
        3 kB
        Gunther Hagleitner
      14. multiquery-phase2_0313.patch
        86 kB
        Richard Ding
      15. multiquery-phase2_0323.patch
        88 kB
        Richard Ding
      16. multiquery-phase3_0423.patch
        77 kB
        Richard Ding
      17. multi-store-0303.patch
        77 kB
        Gunther Hagleitner
      18. multi-store-0304.patch
        78 kB
        Gunther Hagleitner
      19. non_reversible_store_load_dependencies_2.patch
        90 kB
        Gunther Hagleitner
      20. non_reversible_store_load_dependencies.patch
        76 kB
        Gunther Hagleitner
      21. noop_filter_absolute_path_flag_0401.patch
        125 kB
        Gunther Hagleitner
      22. noop_filter_absolute_path_flag.patch
        88 kB
        Gunther Hagleitner
      23. streaming-fix.patch
        10 kB
        Gunther Hagleitner

        Activity

        Hide
        Olga Natkovich added a comment -

        The proposal is as follows:

        (1) No changes in interactive mode. Each store or dump starts a job. Interactive mode is not about efficiency but about ease of use
        (2) In batch mode, a connected set of stores/dumps is executed together. Processing continues on Failures: after the run you can get response that some of the queries in the batch succeeded and some failed. (We need to figure out how to communicate this information to the users. Done files is one of the options.)

        The nice feature of this approach is that it requires no changes on the part of the user. After the change is implemented, some queries will just run faster.

        Show
        Olga Natkovich added a comment - The proposal is as follows: (1) No changes in interactive mode. Each store or dump starts a job. Interactive mode is not about efficiency but about ease of use (2) In batch mode, a connected set of stores/dumps is executed together. Processing continues on Failures: after the run you can get response that some of the queries in the batch succeeded and some failed. (We need to figure out how to communicate this information to the users. Done files is one of the options.) The nice feature of this approach is that it requires no changes on the part of the user. After the change is implemented, some queries will just run faster.
        Hide
        Alan Gates added a comment -

        I propose to implement this as follows.

        Currently split works by dumping all of its input to disk, and then starting MR jobs for each of it's outputs. So if you have a script like:

        A = load ...
        split A into B, beta ...
        C = filter B ...
        D = group C ...
        E = foreach D ...
        store E
        gamma = filter beta ...
        delta = group gamma ...
        epsilon = foreach delta ...
        store epsilon
        

        then A will be loaded and immediately stored by the split. This output will then be loaded before C, and run to the store E. Then the output will again be
        loaded for gamma and run to epsilon.

        If instead split was changed to have inner plans like foreach, the the above could be executed as A is loaded and the input passed to split. Each tuple it
        received it would run through a pipeline that contained C and a separate pipeline that contained gamma. Separate map reduce jobs would then be started, one to
        handle D-E and one delta-epsilon. This turns 3 reads of the data into one plus two partials (depending on how selective the two filters are).

        The relevance to the current issue is that queries like:

        A = load ..
        B = filter A ...
        store B ...
        C = group B ...
        D = foreach C ...
        store D;
        

        would be implicitly converted to:

        A = load ..
        B = filter A ...
        split B into B1, B2;
        store B1 ...
        C = group B2 ...
        D = foreach C ...
        store D;
        

        Changes needed to accomplish this:

        • Add an optimization pass that takes a plan with splits and rearranges it to be contained within the splits plus any subsequent MR jobs. This may need to be split up between the logical to physical translator and the MR compiler. It also needs to be able to handle diamonds in the plan, where split data comes back together, either as part of the same MR job or in a later job.
        • Implement a split operator that can contain inner plans. This is basically a foreach without a generate, and so hopefully much of the code from foreach could be shared or at least stolen. It will be somewhat different in that it will be able to contain any non-MR boundary forcing task (filter, foreach, dump, store) and not be able to contain distinct or order by.
        Show
        Alan Gates added a comment - I propose to implement this as follows. Currently split works by dumping all of its input to disk, and then starting MR jobs for each of it's outputs. So if you have a script like: A = load ... split A into B, beta ... C = filter B ... D = group C ... E = foreach D ... store E gamma = filter beta ... delta = group gamma ... epsilon = foreach delta ... store epsilon then A will be loaded and immediately stored by the split. This output will then be loaded before C, and run to the store E. Then the output will again be loaded for gamma and run to epsilon. If instead split was changed to have inner plans like foreach, the the above could be executed as A is loaded and the input passed to split. Each tuple it received it would run through a pipeline that contained C and a separate pipeline that contained gamma. Separate map reduce jobs would then be started, one to handle D-E and one delta-epsilon. This turns 3 reads of the data into one plus two partials (depending on how selective the two filters are). The relevance to the current issue is that queries like: A = load .. B = filter A ... store B ... C = group B ... D = foreach C ... store D; would be implicitly converted to: A = load .. B = filter A ... split B into B1, B2; store B1 ... C = group B2 ... D = foreach C ... store D; Changes needed to accomplish this: Add an optimization pass that takes a plan with splits and rearranges it to be contained within the splits plus any subsequent MR jobs. This may need to be split up between the logical to physical translator and the MR compiler. It also needs to be able to handle diamonds in the plan, where split data comes back together, either as part of the same MR job or in a later job. Implement a split operator that can contain inner plans. This is basically a foreach without a generate, and so hopefully much of the code from foreach could be shared or at least stolen. It will be somewhat different in that it will be able to contain any non-MR boundary forcing task (filter, foreach, dump, store) and not be able to contain distinct or order by.
        Hide
        Alan Gates added a comment -

        See http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification for a more concrete proposal that incorporates and extends the thoughts in the last comment.

        Show
        Alan Gates added a comment - See http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification for a more concrete proposal that incorporates and extends the thoughts in the last comment.
        Hide
        Gunther Hagleitner added a comment -

        This is for the multiquery branch. It's phase 1. It contains a lot of infrastructural work to be able to look at entire scripts during evaluation (batch mode). It will look at a script plan and insert splits whenever there is a shared sequence of operations. The split execution is still the same as it was before (load-store bridge).

        Show
        Gunther Hagleitner added a comment - This is for the multiquery branch. It's phase 1. It contains a lot of infrastructural work to be able to look at entire scripts during evaluation (batch mode). It will look at a script plan and insert splits whenever there is a shared sequence of operations. The split execution is still the same as it was before (load-store bridge).
        Hide
        Gunther Hagleitner added a comment -

        This patch includes the multiquery unit test cases.

        Show
        Gunther Hagleitner added a comment - This patch includes the multiquery unit test cases.
        Hide
        Olga Natkovich added a comment -

        I committed the latest patch. Ran the unit tests and they all passed.

        Couple of issues that need to be addressed:

        (1) PigServer.openIterator, in batch mode, always returns an empty iterator. That will not work if a script has a dump in it.
        (2) PigSever.getStorePlan assumes that each alias maps to a single store. In case of multiple queries that might not be true.

        Thanks Richard and Gunther for your contribution!

        Show
        Olga Natkovich added a comment - I committed the latest patch. Ran the unit tests and they all passed. Couple of issues that need to be addressed: (1) PigServer.openIterator, in batch mode, always returns an empty iterator. That will not work if a script has a dump in it. (2) PigSever.getStorePlan assumes that each alias maps to a single store. In case of multiple queries that might not be true. Thanks Richard and Gunther for your contribution!
        Hide
        Gunther Hagleitner added a comment -

        This patch introduces the functionality to support multiple stores in a single MR job. It's for the multiquery branch and it is needed to unblock concurrent dev on the split operator.

        There aren't enough unit tests in this patch yet. They will be provided once the split operator can use multi stores (right now, nothing actually uses these stores, so testing is difficult). In order to test the patch, I had temporarily turned multi store on for all queries (even if they only have one store) and then ran all the unit tests. All tests passed.

        Show
        Gunther Hagleitner added a comment - This patch introduces the functionality to support multiple stores in a single MR job. It's for the multiquery branch and it is needed to unblock concurrent dev on the split operator. There aren't enough unit tests in this patch yet. They will be provided once the split operator can use multi stores (right now, nothing actually uses these stores, so testing is difficult). In order to test the patch, I had temporarily turned multi store on for all queries (even if they only have one store) and then ran all the unit tests. All tests passed.
        Hide
        Gunther Hagleitner added a comment -

        Same as the other one except:

        • Documented the createStoreFunction method some more.
        • Removed unnecessary fields in the path parsing
        • Moved tear down of stores below extra streaming run (in PigMapBase's, PigMapReduce's close function)
        Show
        Gunther Hagleitner added a comment - Same as the other one except: Documented the createStoreFunction method some more. Removed unnecessary fields in the path parsing Moved tear down of stores below extra streaming run (in PigMapBase's, PigMapReduce's close function)
        Hide
        Pradeep Kamath added a comment -

        I committed multi-store-0304.patch into the multi-query branch after reviewing the changes.

        Show
        Pradeep Kamath added a comment - I committed multi-store-0304.patch into the multi-query branch after reviewing the changes.
        Hide
        Gunther Hagleitner added a comment -

        This patch is for the multi query branch again. It mostly fixes the problem with certain commands in the script that require immediate execution (in batch mode).

        So if you do stuff like:

        ...
        store a into 'tmp_foo';
        ...
        rm tmp_foo
        ...

        The rm will trigger execution and the file will be there for you to delete, copyToLocal, move, etc. You can also use the "exec" statement without params in a script now, to force execution of what we've seen so far.

        This patch also contains a minor fix with the computation of progress in MR jobs (which I screwed up in the last patch).

        Show
        Gunther Hagleitner added a comment - This patch is for the multi query branch again. It mostly fixes the problem with certain commands in the script that require immediate execution (in batch mode). So if you do stuff like: ... store a into 'tmp_foo'; ... rm tmp_foo ... The rm will trigger execution and the file will be there for you to delete, copyToLocal, move, etc. You can also use the "exec" statement without params in a script now, to force execution of what we've seen so far. This patch also contains a minor fix with the computation of progress in MR jobs (which I screwed up in the last patch).
        Hide
        Gunther Hagleitner added a comment -

        Oh. I also took out the restriction of the openIterator in batch mode. That was no longer needed.

        Show
        Gunther Hagleitner added a comment - Oh. I also took out the restriction of the openIterator in batch mode. That was no longer needed.
        Hide
        Olga Natkovich added a comment -

        A am reviewing this patch

        Show
        Olga Natkovich added a comment - A am reviewing this patch
        Hide
        Olga Natkovich added a comment -

        Looks like the patch has been committed but I will add my 2 cents anyways:

        (1) Looks like the test cases only test for success or failure but not for the correctness of results.
        (2) I was not quite sure what we need to executeBatch in grant for every dfs command. We treat those commands differently from pig commands anyways.

        Show
        Olga Natkovich added a comment - Looks like the patch has been committed but I will add my 2 cents anyways: (1) Looks like the test cases only test for success or failure but not for the correctness of results. (2) I was not quite sure what we need to executeBatch in grant for every dfs command. We treat those commands differently from pig commands anyways.
        Hide
        Richard Ding added a comment - - edited

        This patch contains the enhanced split operator to support multi-store queries. It instroduces a new MROperPlan adjuster that merges single-load mapper-only MapReduceOper to its predecesor based on the (implicit) split boundary. The goal is to reduce the total number of MR jobs for a given multi-query task.

        This patch is for the multi query branch.

        Show
        Richard Ding added a comment - - edited This patch contains the enhanced split operator to support multi-store queries. It instroduces a new MROperPlan adjuster that merges single-load mapper-only MapReduceOper to its predecesor based on the (implicit) split boundary. The goal is to reduce the total number of MR jobs for a given multi-query task. This patch is for the multi query branch.
        Hide
        Pradeep Kamath added a comment -

        multiquery_0306.patch seems to have a lot of code from the earlier patch ( multi-store-0304.patch). Richard, can you svn up your code base and regenerate the patch with only the changes you intended?

        Show
        Pradeep Kamath added a comment - multiquery_0306.patch seems to have a lot of code from the earlier patch ( multi-store-0304.patch). Richard, can you svn up your code base and regenerate the patch with only the changes you intended?
        Hide
        Richard Ding added a comment -

        The multiquery_0306.patch is the right one and doesn't need to regenerate. Pradeep will review it today.

        Show
        Richard Ding added a comment - The multiquery_0306.patch is the right one and doesn't need to regenerate. Pradeep will review it today.
        Hide
        Pradeep Kamath added a comment -

        Sorry about the misunderstanding, I think I looked at a different patch. After reviewing the right patch, here are some comments:

        The patch throws Java Exceptions like IllegalStateException. This should be replaced with the appropriate Exception class (like MRCompilerException) as specified in http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification. The exception should be created with the error code, error source and error message constructor. New error codes should be introduced if one of the existing ones in http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification#head-9f71d78d362c3307711f98ec9db3ee12b55e92f6 cannot be used. If new codes are introduced, the wiki table should be updated.

        The following can be used to check for file existence in BinStorage.determineSchema() - only in the case where the file does not exist, null should be returned

         public static boolean fileExists(String filename, DataStorage store)
                    throws IOException {
                ElementDescriptor elem = store.asElement(filename);
                return elem.exists() || globMatchesFiles(elem, store);
            }
         

        Instead of introducing a rootsFirst attribute in DependencyOrderWalker, I wonder if we should have a ReverseDependencyOrderWalker since that is what the rootsFirst == false case will be. If we are not visiting roots to leaf, we really are not visiting in a dependency order - so the meaning of dependency order is no longer honored - this can be confusing I think. By explicitly naming the walker ReverseDependencyOrderWalker, the intent of walking from leaves to roots is more clear I think.

        In POSplit currently there is a PhysicalPlan representing the merged inner plans (where all plans are mutually exclusive) and there is also a List<PhysicalPlan> which has the same information in the form of a List. In the rest of pig code, inner plans have always been modelled as List<PhysicalPlan>. For consistency, it is better to just have a List<PhysicalPlan> to represent the inner plans.

        Show
        Pradeep Kamath added a comment - Sorry about the misunderstanding, I think I looked at a different patch. After reviewing the right patch, here are some comments: The patch throws Java Exceptions like IllegalStateException. This should be replaced with the appropriate Exception class (like MRCompilerException) as specified in http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification . The exception should be created with the error code, error source and error message constructor. New error codes should be introduced if one of the existing ones in http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification#head-9f71d78d362c3307711f98ec9db3ee12b55e92f6 cannot be used. If new codes are introduced, the wiki table should be updated. The following can be used to check for file existence in BinStorage.determineSchema() - only in the case where the file does not exist, null should be returned public static boolean fileExists( String filename, DataStorage store) throws IOException { ElementDescriptor elem = store.asElement(filename); return elem.exists() || globMatchesFiles(elem, store); } Instead of introducing a rootsFirst attribute in DependencyOrderWalker, I wonder if we should have a ReverseDependencyOrderWalker since that is what the rootsFirst == false case will be. If we are not visiting roots to leaf, we really are not visiting in a dependency order - so the meaning of dependency order is no longer honored - this can be confusing I think. By explicitly naming the walker ReverseDependencyOrderWalker, the intent of walking from leaves to roots is more clear I think. In POSplit currently there is a PhysicalPlan representing the merged inner plans (where all plans are mutually exclusive) and there is also a List<PhysicalPlan> which has the same information in the form of a List. In the rest of pig code, inner plans have always been modelled as List<PhysicalPlan>. For consistency, it is better to just have a List<PhysicalPlan> to represent the inner plans.
        Hide
        Richard Ding added a comment -

        Thanks for reviewing the patch. These are excellent suggestions. I'll make sure that the changes you proposed will be included in the next patch.

        On exceptions, when do we use runtime exceptions? I'm trying to use runtime exceptions to indicate programming errors such as precondition violations or internal state errors.

        Show
        Richard Ding added a comment - Thanks for reviewing the patch. These are excellent suggestions. I'll make sure that the changes you proposed will be included in the next patch. On exceptions, when do we use runtime exceptions? I'm trying to use runtime exceptions to indicate programming errors such as precondition violations or internal state errors.
        Hide
        Pradeep Kamath added a comment -

        Committed patch per previous comment that the review comments will be addressed in the next patch - thanks Richard for the contribution.

        In general from Pig code we always want to throw known PigExceptions even for programming errors or internal state errors - in these cases, we just use the source of the Exception as PigExcetion.BUG. RuntimeException should be used when we want to throw an exception in a function which cannot throw any exceptions (like in methods from Hadoop API which we are implementing which do not throw any Exception)

        Show
        Pradeep Kamath added a comment - Committed patch per previous comment that the review comments will be addressed in the next patch - thanks Richard for the contribution. In general from Pig code we always want to throw known PigExceptions even for programming errors or internal state errors - in these cases, we just use the source of the Exception as PigExcetion.BUG. RuntimeException should be used when we want to throw an exception in a function which cannot throw any exceptions (like in methods from Hadoop API which we are implementing which do not throw any Exception)
        Hide
        Richard Ding added a comment - - edited

        This patch completes the phase 2 development as sepecified in the document http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification:

        1. Allow multiple stores in single job
        2. Merge multiple plans into the split operator
        3. Terminate all but one with stores

        As an example, the Pig script

        A = load 'data' as (a, b, c);
        B = filter A by a > 5;
        store B into 'output1';
        C = group B by b;
        store C into 'output2';
        

        in the introduction of this bug now results in a single map-reduce job.

        This patch is for the multi query branch.

        Show
        Richard Ding added a comment - - edited This patch completes the phase 2 development as sepecified in the document http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification: Allow multiple stores in single job Merge multiple plans into the split operator Terminate all but one with stores As an example, the Pig script A = load 'data' as (a, b, c); B = filter A by a > 5; store B into 'output1'; C = group B by b; store C into 'output2'; in the introduction of this bug now results in a single map-reduce job. This patch is for the multi query branch.
        Hide
        Gunther Hagleitner added a comment -

        Fixes three issues with explain:

        a) Ceci n'est pas un bug. Splits in interactive mode still need this branch.
        b) explain needs to discard batch iff it was loading a script
        c) Split is now a nested operator (and explain needs to know)

        This patch doesn't have any overlapped files with Richards last patch.

        Show
        Gunther Hagleitner added a comment - Fixes three issues with explain: a) Ceci n'est pas un bug. Splits in interactive mode still need this branch. b) explain needs to discard batch iff it was loading a script c) Split is now a nested operator (and explain needs to know) This patch doesn't have any overlapped files with Richards last patch.
        Hide
        Pradeep Kamath added a comment -

        Comments for Richard's patch - multiquery-phase2_0313.patch

        In MultiQueryOptimizer:

        • what about mr not being map only and with mr splittee? - is this not handled for now?
        • Is the single mapper case and the single map-reduce case when the script has an explicit store 'file' and load 'file' - if this is so, then in
          mergeOnlyMapperSplittee() and mergeOnlyMapReduceSplittee(), the store is removed - shouldn't the store remain?
        • There is common code in mergeOnlyMapperSplittee() and meregOnlyMapReduceSplittee() which should be moved to a function to reduce the code duplication.

        Just want to confirm that the multi query optimization is only for map reduce mode - since the optimizer is being called in MapReduceLauncher

        In POForEach when there is POStatus.STATUS_ERR, it is returned to the caller. I noticed that in POSplit, it causes an exception - I think it should return the error whhic would later be caught in the map() or reduce() - a test to make sure errors do get caught and cause failures would be good.

        spawnChildWalker() of ReverseDependencyOrderWalker should return an instance of ReverseDependencyWalker.

        The following comment in BinStorage needs to be clarified:

                if (!FileLocalizer.fileExists(fileName, storage)) {
                    // At compile time in batch mode, the file may not exist
                    // (such as intermediate file). Just return null - the
                    // same way as we could's get a valid record from the input. --> does this actually mean "the same way as we would if we did not get a valid record" ?
                    return null;
                }
                
        
        
        Show
        Pradeep Kamath added a comment - Comments for Richard's patch - multiquery-phase2_0313.patch In MultiQueryOptimizer: what about mr not being map only and with mr splittee? - is this not handled for now? Is the single mapper case and the single map-reduce case when the script has an explicit store 'file' and load 'file' - if this is so, then in mergeOnlyMapperSplittee() and mergeOnlyMapReduceSplittee(), the store is removed - shouldn't the store remain? There is common code in mergeOnlyMapperSplittee() and meregOnlyMapReduceSplittee() which should be moved to a function to reduce the code duplication. Just want to confirm that the multi query optimization is only for map reduce mode - since the optimizer is being called in MapReduceLauncher In POForEach when there is POStatus.STATUS_ERR, it is returned to the caller. I noticed that in POSplit, it causes an exception - I think it should return the error whhic would later be caught in the map() or reduce() - a test to make sure errors do get caught and cause failures would be good. spawnChildWalker() of ReverseDependencyOrderWalker should return an instance of ReverseDependencyWalker. The following comment in BinStorage needs to be clarified: if (!FileLocalizer.fileExists(fileName, storage)) { // At compile time in batch mode, the file may not exist // (such as intermediate file). Just return null - the // same way as we could's get a valid record from the input. --> does this actually mean "the same way as we would if we did not get a valid record" ? return null; }
        Hide
        Pradeep Kamath added a comment -

        +1 on Gunther's patch - multiquery_explain_fix.patch. Patch has been committed to the multiquery branch - thanks for the fix Gunther!

        Show
        Pradeep Kamath added a comment - +1 on Gunther's patch - multiquery_explain_fix.patch. Patch has been committed to the multiquery branch - thanks for the fix Gunther!
        Hide
        Richard Ding added a comment -

        Thanks for reviewing the patch.

        In MultiQueryOptimizer:

        • what about mr not being map only and with mr splittee? - is this not handled for now?

        Yeah. There are two cases where splittees will not be merged into splitter: (1) splitter is not map only and splittee has reducer, and (2) splittee has multiple roots (loads)

        • Is the single mapper case and the single map-reduce case when the script has an explicit store 'file' and load 'file' - if this is so, then in mergeOnlyMapperSplittee() and mergeOnlyMapReduceSplittee(), the store is removed - shouldn't the store remain?

        Explicit store/load combination in a script is transformed into an implicit split, hence the store remains

        • There is common code in mergeOnlyMapperSplittee() and meregOnlyMapReduceSplittee() which should be moved to a function to reduce the code duplication.

        Fixed

        Just want to confirm that the multi query optimization is only for map reduce mode - since the optimizer is being called in MapReduceLauncher

        Yes

        In POForEach when there is POStatus.STATUS_ERR, it is returned to the caller. I noticed that in POSplit, it causes an exception - I think it should return the error whhic would later be caught in the map() or reduce() - a test to make sure errors do get caught and cause failures would be good.

        Fixed

        spawnChildWalker() of ReverseDependencyOrderWalker should return an instance of ReverseDependencyWalker.

        Fixed

        Show
        Richard Ding added a comment - Thanks for reviewing the patch. In MultiQueryOptimizer: what about mr not being map only and with mr splittee? - is this not handled for now? Yeah. There are two cases where splittees will not be merged into splitter: (1) splitter is not map only and splittee has reducer, and (2) splittee has multiple roots (loads) Is the single mapper case and the single map-reduce case when the script has an explicit store 'file' and load 'file' - if this is so, then in mergeOnlyMapperSplittee() and mergeOnlyMapReduceSplittee(), the store is removed - shouldn't the store remain? Explicit store/load combination in a script is transformed into an implicit split, hence the store remains There is common code in mergeOnlyMapperSplittee() and meregOnlyMapReduceSplittee() which should be moved to a function to reduce the code duplication. Fixed Just want to confirm that the multi query optimization is only for map reduce mode - since the optimizer is being called in MapReduceLauncher Yes In POForEach when there is POStatus.STATUS_ERR, it is returned to the caller. I noticed that in POSplit, it causes an exception - I think it should return the error whhic would later be caught in the map() or reduce() - a test to make sure errors do get caught and cause failures would be good. Fixed spawnChildWalker() of ReverseDependencyOrderWalker should return an instance of ReverseDependencyWalker. Fixed
        Hide
        Pradeep Kamath added a comment -

        +1 on Richard's patch - multiquery-phase2_0323.patch, patch committed to multiquery branch - thanks for the contribution Richard.

        A general comment for the multiquery work is to introduce some negative test cases (which return POStatus.STATUS_ERR from some operator in the map or reduce plan affected by the multiQuqeryOptimizer) at some point.

        Show
        Pradeep Kamath added a comment - +1 on Richard's patch - multiquery-phase2_0323.patch, patch committed to multiquery branch - thanks for the contribution Richard. A general comment for the multiquery work is to introduce some negative test cases (which return POStatus.STATUS_ERR from some operator in the map or reduce plan affected by the multiQuqeryOptimizer) at some point.
        Hide
        Gunther Hagleitner added a comment -

        Merge of trunk (741727:HEAD) into multiquery branch. Aka merge from hell

        I ran all unit tests, the multiquery tests and the nightly tests and everything looks fine (no errors).

        Show
        Gunther Hagleitner added a comment - Merge of trunk (741727:HEAD) into multiquery branch. Aka merge from hell I ran all unit tests, the multiquery tests and the nightly tests and everything looks fine (no errors).
        Hide
        Gunther Hagleitner added a comment -

        Seems like the last merge patch didn't correctly contain the entire new TestFinish.java file. Well, this one does.

        Show
        Gunther Hagleitner added a comment - Seems like the last merge patch didn't correctly contain the entire new TestFinish.java file. Well, this one does.
        Hide
        Pradeep Kamath added a comment -

        +1 - committed patch by Gunther to merge changes in trunk to multiquery branch - thanks for the contribution Gunther.

        Show
        Pradeep Kamath added a comment - +1 - committed patch by Gunther to merge changes in trunk to multiquery branch - thanks for the contribution Gunther.
        Hide
        Gunther Hagleitner added a comment -

        This patch addresses an issue with the way we deal with scripts that do:
        {{

        { ... store a into 'foo'; a = load 'foo'; ... }

        }}

        In the logical plan this will end up as a split with one branch storing into 'foo' and the other continuing the processing after the load. The actual load is removed.

        This works well but has an unfortunate side effect. If the store/load mark the boundary between two map-reduce jobs the MRCompiler has to insert a tmp store-load bridge - which means that we now end up with two stores.

        This fix detects this case in the optimizing phase after the compilation. It removes the unnecessary store and loads from the other one.

        Show
        Gunther Hagleitner added a comment - This patch addresses an issue with the way we deal with scripts that do: {{ { ... store a into 'foo'; a = load 'foo'; ... } }} In the logical plan this will end up as a split with one branch storing into 'foo' and the other continuing the processing after the load. The actual load is removed. This works well but has an unfortunate side effect. If the store/load mark the boundary between two map-reduce jobs the MRCompiler has to insert a tmp store-load bridge - which means that we now end up with two stores. This fix detects this case in the optimizing phase after the compilation. It removes the unnecessary store and loads from the other one.
        Hide
        Gunther Hagleitner added a comment -

        This patch contains three items:

        • Removes the noop stores as described above
        • Makes load and store paths absolute and canonical
        • Introduces a flag that turns multiquery on and off (default is off)
        Show
        Gunther Hagleitner added a comment - This patch contains three items: Removes the noop stores as described above Makes load and store paths absolute and canonical Introduces a flag that turns multiquery on and off (default is off)
        Hide
        Gunther Hagleitner added a comment -

        This one is the same as before, but:

        • Added some comments
        • Reversed the multiquery flag (on by default)
        • HBase stuff works without the "hbase://" but will print warning
        • Fixed problem in NoopStoreRemover
        Show
        Gunther Hagleitner added a comment - This one is the same as before, but: Added some comments Reversed the multiquery flag (on by default) HBase stuff works without the "hbase://" but will print warning Fixed problem in NoopStoreRemover
        Hide
        Pradeep Kamath added a comment -

        +1, patch committed - thanks for the contribution Gunther.

        Show
        Pradeep Kamath added a comment - +1, patch committed - thanks for the contribution Gunther.
        Hide
        Gunther Hagleitner added a comment -

        This patch takes care of two things:

        • Cases where in a script you have a store followed by load where the Load/StoreFunc is either not reversible or they are different functions.
        • PlanSetter for physical plans in the JobControlCompiler (right now only the outermost plan's elements are set)
        Show
        Gunther Hagleitner added a comment - This patch takes care of two things: Cases where in a script you have a store followed by load where the Load/StoreFunc is either not reversible or they are different functions. PlanSetter for physical plans in the JobControlCompiler (right now only the outermost plan's elements are set)
        Hide
        Gunther Hagleitner added a comment -

        Same as above plus:

        • Fix for explain when a script has execution points inside.

        Like:

        {{

        { a = load ... ... store a exec; b = load ... ... }

        }}

        This will run explain once for each execution block.

        Show
        Gunther Hagleitner added a comment - Same as above plus: Fix for explain when a script has execution points inside. Like: {{ { a = load ... ... store a exec; b = load ... ... } }} This will run explain once for each execution block.
        Hide
        Pradeep Kamath added a comment -

        +1, patch committed. Thanks for the contribution Gunther!

        Show
        Pradeep Kamath added a comment - +1, patch committed. Thanks for the contribution Gunther!
        Hide
        Gunther Hagleitner added a comment -

        Merge latest trunk changes to branch

        Show
        Gunther Hagleitner added a comment - Merge latest trunk changes to branch
        Hide
        Olga Natkovich added a comment -

        patch reviewed and committed; thanks, Gunther.

        Show
        Olga Natkovich added a comment - patch reviewed and committed; thanks, Gunther.
        Hide
        Gunther Hagleitner added a comment -

        Some fixes in the patch "streaming-fix.patch":

        • The split operator wasn't always playing nicely with the way we run the pipeline one extra time in the mapper's or reducer's close function if there's a stream operator present
        • Moved the MR optimizer that sets "stream in map" and "stream in reduce" to the end of the queue.
        • PhyPlanVisitor forgets to pop some walkers it pushed on the stack. That can result in the NoopFilterRemoval stage failing, because it's looking in the wrong plan.
        • Setting the jobname by default to the scriptname came in through the last merge, but didn't work anymore
        Show
        Gunther Hagleitner added a comment - Some fixes in the patch "streaming-fix.patch": The split operator wasn't always playing nicely with the way we run the pipeline one extra time in the mapper's or reducer's close function if there's a stream operator present Moved the MR optimizer that sets "stream in map" and "stream in reduce" to the end of the queue. PhyPlanVisitor forgets to pop some walkers it pushed on the stack. That can result in the NoopFilterRemoval stage failing, because it's looking in the wrong plan. Setting the jobname by default to the scriptname came in through the last merge, but didn't work anymore
        Hide
        Pradeep Kamath added a comment -

        +1, Patch committed, thanks Gunther!

        Show
        Pradeep Kamath added a comment - +1, Patch committed, thanks Gunther!
        Hide
        Gunther Hagleitner added a comment -

        merge-041409.patch contains the latest merge from trunk to branch.

        Show
        Gunther Hagleitner added a comment - merge-041409.patch contains the latest merge from trunk to branch.
        Hide
        Gunther Hagleitner added a comment -

        javadoc changes only. doc-fix.patch contains "fixes" to silence javadoc warnings.

        Show
        Gunther Hagleitner added a comment - javadoc changes only. doc-fix.patch contains "fixes" to silence javadoc warnings.
        Hide
        Gunther Hagleitner added a comment -

        This patch contains:

        • Error codes/msg
        • Javadoc changes
        • fix the merge error in parser ("aliases" cmd)
        • updated golden files
        Show
        Gunther Hagleitner added a comment - This patch contains: Error codes/msg Javadoc changes fix the merge error in parser ("aliases" cmd) updated golden files
        Hide
        Gunther Hagleitner added a comment -

        Fixed some issues with the error handling patch (0415):

        • Duplicated error code 2129
        • Unclear string "splitter"
        • Added native exception message to error msg in store operator.
        Show
        Gunther Hagleitner added a comment - Fixed some issues with the error handling patch (0415): Duplicated error code 2129 Unclear string "splitter" Added native exception message to error msg in store operator.
        Hide
        Pradeep Kamath added a comment -

        reviewed doc-fix.patch and validated that javadoc warnings are fixed. +1 - committed.

        Show
        Pradeep Kamath added a comment - reviewed doc-fix.patch and validated that javadoc warnings are fixed. +1 - committed.
        Hide
        Pradeep Kamath added a comment -

        reviewed error_handling_0416.patch for additional changes per comment: https://issues.apache.org/jira/browse/PIG-627?focusedCommentId=12699990&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12699990. +1, committed after removing the javadoc related changes which were already committed in the previous commit.

        Show
        Pradeep Kamath added a comment - reviewed error_handling_0416.patch for additional changes per comment: https://issues.apache.org/jira/browse/PIG-627?focusedCommentId=12699990&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12699990 . +1, committed after removing the javadoc related changes which were already committed in the previous commit.
        Hide
        Pradeep Kamath added a comment -

        All the work till now (phase 1 and phase2) has now been committed to trunk. A tag (pre-multiquery-phase2) was created prior to commiting the multi query work since this a significantly big patch. The tag will serve as a baseline to trace down regressions.

        Show
        Pradeep Kamath added a comment - All the work till now (phase 1 and phase2) has now been committed to trunk. A tag (pre-multiquery-phase2) was created prior to commiting the multi query work since this a significantly big patch. The tag will serve as a baseline to trace down regressions.
        Hide
        Richard Ding added a comment -

        This patch completes the phase 3 development which merges multiple map-reduce aplittees into a splitter.

        As an example, the Pig script

        A = load ...
        split A into B, beta ...
        C = filter B ...
        D = group C ...
        E = foreach D ...
        store E
        gamma = filter beta ...
        delta = group gamma ...
        epsilon = foreach delta ...
        store epsilon
        

        discussed earlier in this bug now results in a single map-reduce job.

        This patch is for the multi query branch.

        Show
        Richard Ding added a comment - This patch completes the phase 3 development which merges multiple map-reduce aplittees into a splitter. As an example, the Pig script A = load ... split A into B, beta ... C = filter B ... D = group C ... E = foreach D ... store E gamma = filter beta ... delta = group gamma ... epsilon = foreach delta ... store epsilon discussed earlier in this bug now results in a single map-reduce job. This patch is for the multi query branch.
        Hide
        Alan Gates added a comment -

        Checked in multiquery-phase3_0423.patch to multiquery branch.

        Show
        Alan Gates added a comment - Checked in multiquery-phase3_0423.patch to multiquery branch.
        Hide
        Olga Natkovich added a comment -

        The code has been merged to the trunk. Thanks Richard and Gunther for contributing this complex and very useful feature!

        Show
        Olga Natkovich added a comment - The code has been merged to the trunk. Thanks Richard and Gunther for contributing this complex and very useful feature!

          People

          • Assignee:
            Gunther Hagleitner
            Reporter:
            Olga Natkovich
          • Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development