[PIG-2610] GC errors on using FILTER within nested FOREACH

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.9.1
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      A user has reported running into GC overhead errors while using FILTER within a nested FOREACH and aggregating the filtered field. Here is the sample Pig Latin script provided by the user that reproduces the issue.

      raw = LOAD 'input' using MyCustomLoader();
      
      searches = FOREACH raw GENERATE
                     day, searchType,
                     FLATTEN(impBag) AS (adType, clickCount)
                 ;
      
      groupedSearches = GROUP searches BY (day, searchType) PARALLEL 50;
      counts = FOREACH groupedSearches{
                     type1 = FILTER searches BY adType == 'type1';
                     type2 = FILTER searches BY adType == 'type2';
                     GENERATE
                         FLATTEN(group) AS (day, searchType),
                         COUNT(searches) AS numSearches,
                         SUM(clickCount) AS clickCountPerSearchType,
                         SUM(type1.clickCount) AS type1ClickCount,
                         SUM(type2.clickCount) AS type2ClickCount;
             };
      

      Pig should be able to handle this case.

        Activity

        Daniel Dai added a comment -

        Can you post exception stack?

        Rohini added a comment -

        Here is the stack trace

        2012-03-21 19:19:59,346 FATAL org.apache.hadoop.mapred.Child (main): Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:387)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:95)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:406)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:570)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.PORelationToExprProject.getNext(PORelationToExprProject.java:107)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:570)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:248)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:316)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:159)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:184)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:293)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:453)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:324)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:159)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:184)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:281)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:324)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:459)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:427)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:407)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:261)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:662)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:425)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)

        Dmitriy V. Ryaboy added a comment -

        OK, so the Jira I meant to ask to be opened on this wasn't about the GC error (just push the filter above the group), but about the fact that the optimizer can do this rewrite automatically, with a little bit of trickiness (the filters need to be turned into generates, and the counts into sums).
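
        A hedged sketch of the rewrite described here, by hand, reusing the relation and field names from the user's script above (the intermediate alias `tagged` and its field names are made up for illustration): each nested FILTER becomes a conditional expression in a GENERATE before the GROUP, and the per-type SUMs become plain aggregations afterwards, so no bags need to be filtered on the reduce side.

        -- Sketch only: per-type click counts computed without nested FILTERs.
        -- The bincond ((cond ? v : 0)) tags each row before grouping.
        tagged = FOREACH searches GENERATE
                     day, searchType, clickCount,
                     ((adType == 'type1') ? clickCount : 0) AS type1Clicks,
                     ((adType == 'type2') ? clickCount : 0) AS type2Clicks;

        groupedSearches = GROUP tagged BY (day, searchType) PARALLEL 50;

        counts = FOREACH groupedSearches GENERATE
                     FLATTEN(group) AS (day, searchType),
                     COUNT(tagged) AS numSearches,
                     SUM(tagged.clickCount) AS clickCountPerSearchType,
                     SUM(tagged.type1Clicks) AS type1ClickCount,
                     SUM(tagged.type2Clicks) AS type2ClickCount;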

        Daniel Dai added a comment -

        Yes, we should open a Jira for the new rule. For now, you can manually optimize the script by moving the filter before the group and projecting only the necessary columns before the group. The GC exception is not from the bag but from POProject; my suspicion is that the Hadoop shuffle/sort uses too much memory, leaving no memory for Pig to work with.
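
        A minimal sketch of this manual workaround, assuming the relation and field names from the user's script above (the aliases `type1Only`, `g1`, and `type1Counts` are made up for illustration): when only the per-type aggregate is needed, the FILTER can run before the GROUP, so the reducer never holds the unfiltered bags.

        -- Sketch only: filter pushed above the group for one ad type.
        type1Only = FILTER searches BY adType == 'type1';
        g1 = GROUP type1Only BY (day, searchType) PARALLEL 50;
        type1Counts = FOREACH g1 GENERATE
                          FLATTEN(group) AS (day, searchType),
                          SUM(type1Only.clickCount) AS type1ClickCount;

        The other aggregates (the overall count and the type2 sum) would be computed in their own pipelines and joined back on (day, searchType), which is more verbose but keeps reduce-side memory small.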

        Prashant Kommireddi added a comment -

        How is this case different (from Pig Latin basics page)?

        A = LOAD 'data' AS (url:chararray,outlink:chararray);
        
        DUMP A;
        (www.ccc.com,www.hjk.com)
        (www.ddd.com,www.xyz.org)
        (www.aaa.com,www.cvn.org)
        (www.www.com,www.kpt.net)
        (www.www.com,www.xyz.org)
        (www.ddd.com,www.xyz.org)
        
        B = GROUP A BY url;
        
        DUMP B;
        (www.aaa.com,{(www.aaa.com,www.cvn.org)})
        (www.ccc.com,{(www.ccc.com,www.hjk.com)})
        (www.ddd.com,{(www.ddd.com,www.xyz.org),(www.ddd.com,www.xyz.org)})
        (www.www.com,{(www.www.com,www.kpt.net),(www.www.com,www.xyz.org)})
        
        
        X = FOREACH B {
                FA= FILTER A BY outlink == 'www.xyz.org';
                PA = FA.outlink;
                DA = DISTINCT PA;
                GENERATE group, COUNT(DA);
        }
        
        DUMP X;
        (www.aaa.com,0)
        (www.ccc.com,0)
        (www.ddd.com,1)
        (www.www.com,1)
        
        

          People

          • Assignee: Unassigned
          • Reporter: Prashant Kommireddi
          • Votes: 0
          • Watchers: 1
