Pig / PIG-1815: pig task retains used instances of PhysicalPlan

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.0
    • Fix Version/s: 0.8.1
    • Component/s: None
    • Labels: None

      Description

      Map tasks of a pig query ran out of memory because there were too many (thousands of) instances of the combiner PhysicalPlan in memory. Each physical plan (except the last?) was linked to an older one, as shown in the yourkit snapshot that I am attaching.

      This problem was noticed with 0.8 because of the split combination feature, which resulted in each map having larger inputs. The query also had a large physical plan because of multi-query: it had 17 MR jobs merged into one during the multi-query optimization phase.

      1. PIG-1815.1.patch
        1.0 kB
        Thejas M Nair
      2. yourkit_combiner_hprof.jpg
        188 kB
        Thejas M Nair

        Issue Links

          is duplicated by PIG-1803

          Activity

          Transition                      Time In Source Status   Execution Times   Last Executer    Last Execution Date
          Open -> Patch Available         17h 36m                 1                 Thejas M Nair    21/Jan/11 17:29
          Patch Available -> Resolved     3h 40m                  1                 Thejas M Nair    21/Jan/11 21:10
          Resolved -> Closed              93d 2h 52m              1                 Daniel Dai       25/Apr/11 01:02
          Daniel Dai made changes -
          Fix Version/s 0.8.1 [ 12316393 ]
          Fix Version/s 0.8.0 [ 12314562 ]
          Fix Version/s 0.9.0 [ 12315191 ]
          Daniel Dai made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Thejas M Nair made changes -
          Link This issue is duplicated by PIG-1803 [ PIG-1803 ]
          Scott Carey added a comment -

          I haven't followed the exact call sequence that leads to it, but it looks like a PhysicalPlan instance is created with a reference to a copy of the previous Reducer$Context, and since this is an inner class of PigCombiner$Combine (a subclass of Reducer), it has a reference (this$0) to it.

          I don't know if this is new info to anyone reading this, but there are two types of inner classes: those with the 'this' reference to the enclosing class and those without it. The ones that don't hold a reference to their enclosing class are the ones marked 'static'. If this inner class can be made static, that would remove the reference (and make the object a little smaller).

          In general, you can help the GC and memory footprint by making sure inner classes are 'static' whenever they don't need the implicit 'this' reference.
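
          As a minimal, hypothetical illustration of the difference (made-up class names, not Pig code): a non-static inner class carries a hidden this$0 field that keeps its enclosing instance reachable, while a static nested class does not.

              // Hypothetical example, not from the Pig code base.
              public class Outer {
                  // Something expensive that we would not want to retain by accident.
                  private final byte[] big = new byte[64 * 1024 * 1024];

                  // Non-static inner class: each instance holds an implicit this$0
                  // reference to the enclosing Outer, so 'big' stays reachable for
                  // as long as the Inner instance is reachable.
                  class Inner { }

                  // Static nested class: no hidden reference to Outer, so keeping a
                  // Nested instance alive does not keep Outer (or 'big') alive.
                  static class Nested { }

                  Inner newInner() { return new Inner(); }
                  static Nested newNested() { return new Nested(); }
              }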

          Thejas M Nair made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Thejas M Nair added a comment -

          Patch committed to 0.8 branch and trunk.

          Ashutosh Chauhan added a comment -

          Thanks for the description, Thejas.
          +1

          Thejas M Nair added a comment -

          The references in the heap dump show that the POUserFunc in a plan (except the oldest one?) has a reference to Reducer$Context (POUserFunc -> udf object -> ProgressableReporter -> Reducer$Context). But the Reducer$Context object has a reference to PigCombiner$Combine, which has a reference to another (previously created?) PhysicalPlan. So every combiner PhysicalPlan instance that has been created in the map task remains referenced and can't be freed by the GC.

          I haven't followed the exact call sequence that leads to it, but it looks like a PhysicalPlan instance is created with a reference to a copy of the previous Reducer$Context, and since this is an inner class of PigCombiner$Combine (a subclass of Reducer), it has a reference (this$0) to it. And this older PigCombiner$Combine has references to the old physical plan. The old physical plan has references to the older Reducer$Context, and so on.
          To break this chain, in this patch I clear the references to the PhysicalPlan in PigCombiner$Combine when the cleanup method is called.

          I had a look at the hadoop mapreduce code that does the sort-and-spill of map output (org.apache.hadoop.mapred$MapOutputBuffer$SpillThread.sortAndSpill()), and it looks like one combiner class instance is created for every partition (i.e. reducer).
          In the case of the query whose map tasks ran out of memory, mapred.reduce.tasks was set to 300, i.e. 300 instances of the combiner class, and therefore 300 instances of the physical plan are created for every spill. The query in 0.8 also had several spills (10+), which means there were more than 3000 instances of PhysicalPlan lying around. The physical plans in this case were also large because it was a 'multi-query': 17 MR jobs were merged into 1.

          In other words, the failure can happen in any query which uses the combiner. There just needs to be a large number of physical plan instances, where number of physical plan instances = number-of-reducers * number-of-spills. If the PhysicalPlan is large, fewer instances are needed to hit the failure.
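
          To put numbers on it: 300 reducers * 10+ spills gives at least 300 * 10 = 3000 retained plans. The sketch below shows the shape of the change described above (clearing the plan reference when cleanup runs); it is a simplified illustration, not the actual PIG-1815.1.patch, and the field name 'cp' and the generic type parameters are assumptions.

              import java.io.IOException;
              import org.apache.hadoop.mapreduce.Reducer;

              // Simplified sketch of the fix described in the comment above; not the
              // actual PIG-1815.1.patch. The field name 'cp' and the generic type
              // parameters are assumptions for illustration.
              public class CombinerCleanupSketch {
                  public static class Combine<K, V> extends Reducer<K, V, K, V> {
                      // Stands in for the combine PhysicalPlan that setup() would deserialize.
                      private Object cp = new Object();

                      @Override
                      protected void cleanup(Context context) throws IOException, InterruptedException {
                          // Each Combine instance stays reachable from a later plan's
                          // Reducer$Context (via the implicit this$0), so clearing the plan
                          // reference here breaks the chain and lets older plans be collected.
                          cp = null;
                      }
                  }
              }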

          Ashutosh Chauhan added a comment -

          Great find.
          I couldn't quite follow when and what may cause this. If it's easy enough for you, can you give a small pig query snippet which may result in this, and explain why? And also how the patch fixes the issue.

          Thejas M Nair made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Thejas M Nair made changes -
          Attachment PIG-1815.1.patch [ 12468991 ]
          Thejas M Nair added a comment -

          I have tested the patch using the query that was running out of memory. The patch does not have any unit tests, as I can't think of a good/easy way to test the leak in a unit test.
          All unit tests except TestScriptLanguage passed. TestScriptLanguage is failing even without the patch.

          [exec] -1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] -1 tests included. The patch doesn't appear to include any new or modified tests.
          [exec] Please justify why no tests are needed for this patch.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

          Thejas M Nair made changes -
          Attachment yourkit_combiner_hprof.jpg [ 12468925 ]
          Thejas M Nair created issue -

            People

             • Assignee: Thejas M Nair
             • Reporter: Thejas M Nair
             • Votes: 0
             • Watchers: 1
