Pig
  1. Pig
  2. PIG-2693

LoadFunc.setLocation should be called before LoadMetadata.getStatistics

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.11
    • Component/s: None
    • Labels:
      None

      Description

      We ran into a bug with Pig/HCatalog integration on the trunk. The issue is that JobControlCompiler calls the adjustNumReducers method just before it calls setLocation on all of the LoadFuncs. This causes problems, since some loaders (i.e., HCatLoader) need setLocation to be called before it can respond to getStatistics with it's data size.

      1. PIG-2693.1.patch
        1 kB
        Bill Graham
      2. PIG-2693.2.patch
        2 kB
        Bill Graham
      3. PIG-2693.3.patch
        2 kB
        Julien Le Dem

        Activity

        Bill Graham created issue -
        Hide
        Bill Graham added a comment -

        When HCatalog blows up, the stack trace looks like this:

        2012-05-05 00:43:09,684 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Couldn't get statistics from LoadFunc: com.twitter.twadoop.dal.pig.DALPigLoader@2a2096d7
        java.io.IOException: java.lang.NullPointerException
                at org.apache.hcatalog.pig.HCatLoader.getStatistics(HCatLoader.java:194)
                at com.twitter.twadoop.dal.pig.DALPigLoader.getStatistics(DALPigLoader.java:135)
                at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getInputSizeFromLoader(JobControlCompiler.java:866)
                at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getInputSize(JobControlCompiler.java:824)
                at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.estimateNumberOfReducers(JobControlCompiler.java:805)
                at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.adjustNumReducers(JobControlCompiler.java:745)
                at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:378)
                at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:264)
                at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:149)
                at org.apache.pig.PigServer.launchPlan(PigServer.java:1265)
        ...
        
        Show
        Bill Graham added a comment - When HCatalog blows up, the stack trace looks like this: 2012-05-05 00:43:09,684 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Couldn't get statistics from LoadFunc: com.twitter.twadoop.dal.pig.DALPigLoader@2a2096d7 java.io.IOException: java.lang.NullPointerException at org.apache.hcatalog.pig.HCatLoader.getStatistics(HCatLoader.java:194) at com.twitter.twadoop.dal.pig.DALPigLoader.getStatistics(DALPigLoader.java:135) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getInputSizeFromLoader(JobControlCompiler.java:866) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getInputSize(JobControlCompiler.java:824) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.estimateNumberOfReducers(JobControlCompiler.java:805) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.adjustNumReducers(JobControlCompiler.java:745) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:378) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:264) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:149) at org.apache.pig.PigServer.launchPlan(PigServer.java:1265) ...
        Hide
        Bill Graham added a comment -

        Attaching a patch that fixes the issue by calling setLocation before estimating the number of reducers. This passes unit tests and doesn't seem like it would have adverse effects, but please comment if you know a reason why it would.

        If this approach seems sound I can write up a unit test.

        Show
        Bill Graham added a comment - Attaching a patch that fixes the issue by calling setLocation before estimating the number of reducers. This passes unit tests and doesn't seem like it would have adverse effects, but please comment if you know a reason why it would. If this approach seems sound I can write up a unit test.
        Bill Graham made changes -
        Field Original Value New Value
        Attachment PIG-2693.1.patch [ 12526443 ]
        Hide
        Travis Crawford added a comment -

        Looking into this, I confirmed the reducer estimation worked as of https://github.com/apache/pig/commit/826f4333a2546a72adebf93e5d045420d0413c69 but throws the NPE on trunk.

        I'm looking through the changes since to see what's changed. Since there's no contract between LoadFunc and LoadMetadata whatever's changed is not invalid per-se, but it may be worth restoring the old behavior.

        Show
        Travis Crawford added a comment - Looking into this, I confirmed the reducer estimation worked as of https://github.com/apache/pig/commit/826f4333a2546a72adebf93e5d045420d0413c69 but throws the NPE on trunk. I'm looking through the changes since to see what's changed. Since there's no contract between LoadFunc and LoadMetadata whatever's changed is not invalid per-se, but it may be worth restoring the old behavior.
        Hide
        Travis Crawford added a comment -

        This patch is causing the issue:

        https://github.com/apache/pig/commit/d49f903e0d041ce7d9e8e2d0a9066cd666f99da7

        Looking into what's changed.

        commit d49f903e0d041ce7d9e8e2d0a9066cd666f99da7
        Author: Dmitriy V. Ryaboy <dvryaboy@apache.org>
        Date:   Fri Apr 27 23:34:03 2012 +0000
        
            PIG-2652:  Skew join and order by dont trigger reducer estimation (dvryaboy)
            
            git-svn-id: https://svn.apache.org/repos/asf/pig/trunk@1331637 13f79535-47bb-0310-9956-ffa450edef68
        
        Show
        Travis Crawford added a comment - This patch is causing the issue: https://github.com/apache/pig/commit/d49f903e0d041ce7d9e8e2d0a9066cd666f99da7 Looking into what's changed. commit d49f903e0d041ce7d9e8e2d0a9066cd666f99da7 Author: Dmitriy V. Ryaboy <dvryaboy@apache.org> Date: Fri Apr 27 23:34:03 2012 +0000 PIG-2652: Skew join and order by dont trigger reducer estimation (dvryaboy) git-svn-id: https: //svn.apache.org/repos/asf/pig/trunk@1331637 13f79535-47bb-0310-9956-ffa450edef68
        Hide
        Bill Graham added a comment -

        Thanks Travis, that's great sleuthing work. The patch shows the issue in the JobControlCompiler. The reducer estimation got moved up about 300 lines from around 600 to 380, which put that operation before the setLocation calls. I think the proposed patch seems valid then, which is to move it back down a bit, below setLocation.

        I aslo want to document this contract in LoadMetadata, which is that setLocation will be called first.

        Show
        Bill Graham added a comment - Thanks Travis, that's great sleuthing work. The patch shows the issue in the JobControlCompiler . The reducer estimation got moved up about 300 lines from around 600 to 380, which put that operation before the setLocation calls. I think the proposed patch seems valid then, which is to move it back down a bit, below setLocation . I aslo want to document this contract in LoadMetadata , which is that setLocation will be called first.
        Hide
        Travis Crawford added a comment -

        Confirmed the patch above fixes this issue - thanks Bill! It didn't apply cleanly but is quite small and was easy to reproduce against trunk. After updating the patch I can test again with the final version.

        Show
        Travis Crawford added a comment - Confirmed the patch above fixes this issue - thanks Bill! It didn't apply cleanly but is quite small and was easy to reproduce against trunk. After updating the patch I can test again with the final version.
        Hide
        Bill Graham added a comment -

        Here's a second patch generated against the current trunk. It also contains changes to the javadocs of LoadMetadata. Let me know if this one applies ok for you.

        Show
        Bill Graham added a comment - Here's a second patch generated against the current trunk. It also contains changes to the javadocs of LoadMetadata. Let me know if this one applies ok for you.
        Bill Graham made changes -
        Attachment PIG-2693.2.patch [ 12527531 ]
        Hide
        Travis Crawford added a comment -

        +1 (nonbinding). Thanks Bill! I just tested with this patch against trunk and reducer estimation works correctly with HCatLoader.

        Show
        Travis Crawford added a comment - +1 (nonbinding). Thanks Bill! I just tested with this patch against trunk and reducer estimation works correctly with HCatLoader.
        Hide
        Julien Le Dem added a comment -

        +1 this looks good to me.
        Dmitriy said he will re run the check for the reducer estimator to make sure there's no unexpected side effect.

        Show
        Julien Le Dem added a comment - +1 this looks good to me. Dmitriy said he will re run the check for the reducer estimator to make sure there's no unexpected side effect.
        Hide
        Julien Le Dem added a comment -

        Based on Dmitriy's patch the following tests should be run to check there is no regression:

        • TestEvalPipeline2
        • TestJobSubmission
        • TestPigRunner
        • TestPigStats

        Also the patch needs to be rebased

        Show
        Julien Le Dem added a comment - Based on Dmitriy's patch the following tests should be run to check there is no regression: TestEvalPipeline2 TestJobSubmission TestPigRunner TestPigStats Also the patch needs to be rebased
        Hide
        Bill Graham added a comment -

        I verified that all of test-commit passes.

        Show
        Bill Graham added a comment - I verified that all of test-commit passes.
        Hide
        Julien Le Dem added a comment -

        those tests are not part of it. That's why I'm mentioning it.

        Show
        Julien Le Dem added a comment - those tests are not part of it. That's why I'm mentioning it.
        Hide
        Julien Le Dem added a comment -

        PIG-2693.3.patch rebased patch
        all 4 tests pass:

        • TestEvalPipeline2
        • TestJobSubmission
        • TestPigRunner
        • TestPigStats
        Show
        Julien Le Dem added a comment - PIG-2693 .3.patch rebased patch all 4 tests pass: TestEvalPipeline2 TestJobSubmission TestPigRunner TestPigStats
        Julien Le Dem made changes -
        Attachment PIG-2693.3.patch [ 12528160 ]
        Julien Le Dem made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Fix Version/s 0.11 [ 12318878 ]
        Resolution Fixed [ 1 ]
        Bill Graham made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Transition Time In Source Status Execution Times Last Executer Last Execution Date
        Open Open Resolved Resolved
        7d 22h 27m 1 Julien Le Dem 18/May/12 22:21
        Resolved Resolved Closed Closed
        279d 7h 32m 1 Bill Graham 22/Feb/13 04:53

          People

          • Assignee:
            Bill Graham
            Reporter:
            Bill Graham
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development