Pig
  1. Pig
  2. PIG-975

Need a databag that does not register with SpillableMemoryManager and spill data pro-actively

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.4.0
    • Fix Version/s: 0.6.0
    • Component/s: None
    • Labels:
      None

      Description

      POPackage uses DefaultDataBag during reduce process to hold data. It is registered with SpillableMemoryManager and prone to OutOfMemoryException. It's better to pro-actively managers the usage of the memory. The bag fills in memory to a specified amount, and dump the rest the disk. The amount of memory to hold tuples is configurable. This can avoid out of memory error.

      1. PIG-975.patch
        10 kB
        Ying He
      2. PIG-975.patch2
        10 kB
        Ying He
      3. PIG-975.patch3
        10 kB
        Ying He
      4. internalbag.xls
        80 kB
        Ying He
      5. PIG-975.patch4
        11 kB
        Ying He

        Issue Links

          Activity

          Ying He created issue -
          Ying He made changes -
          Field Original Value New Value
          Link This issue is a clone of PIG-636 [ PIG-636 ]
          Ying He made changes -
          Description Currently whenever Combiner is used in pig, in the map, the POPrecombinerLocalRearrange operator puts the single "value" tuple corresponding to a key into a DataBag and passes this to the foreach which is being combined. This will generate as many bags as there are input records. These bags all will have a single tuple and hence are small and should not need to be spilt to disk. However since the bags are created through the BagFactory mechanism, each bag creation is registered with the SpillableMemoryManager and a weak reference to the bag is stored in a linked list. This linked list grows really big over time causing unnecessary Garbage collection runs. This can be avoided by having a simple lightweight implementation of the DataBag interface to store the single tuple in a bag. Also these SingleTupleBags should be created without registering with the spillableMemoryManager. Likewise the bags created in POCombinePackage are supposed to fit in Memory and not spill. Again a NonSpillableDataBag implementation of DataBag interface which does not register with the SpillableMemoryManager would help.
          POPackage uses DefaultDataBag during reduce process to hold data. It is registered with SpillableMemoryManager and prone to OutOfMemoryException. It's better to pro-actively managers the usage of the memory. The bag fills in memory to a specified amount, and dump the rest the disk. The amount of memory to hold tuples is configurable. This can avoid out of memory error.
          Hide
          Ying He added a comment -

          implement a new bag and use it in POPackage

          Show
          Ying He added a comment - implement a new bag and use it in POPackage
          Ying He made changes -
          Attachment PIG-975.patch [ 12420482 ]
          Hide
          Ying He added a comment -

          remove System.out.println

          Show
          Ying He added a comment - remove System.out.println
          Ying He made changes -
          Attachment PIG-975.patch2 [ 12420486 ]
          Olga Natkovich made changes -
          Assignee Pradeep Kamath [ pkamath ] Ying He [ yinghe ]
          Olga Natkovich made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hide
          Olga Natkovich added a comment -

          Couple of questions comments on the patch:

          • Why do we need to synchronize in add. Who else is accessing the bag since it is no longer managed by spillable manager?
          • Memory fraction should be a java property so that users can control it they choose so
          • Why do we have limit of only 100 tuples in memory since we already have memory limit? Also, if we do need it, shouldn't it be configurable?
          Show
          Olga Natkovich added a comment - Couple of questions comments on the patch: Why do we need to synchronize in add. Who else is accessing the bag since it is no longer managed by spillable manager? Memory fraction should be a java property so that users can control it they choose so Why do we have limit of only 100 tuples in memory since we already have memory limit? Also, if we do need it, shouldn't it be configurable?
          Hide
          Ying He added a comment -

          Answer to Olga's questions:

          1. The synchronization can be removed.
          2. Memory fraction is configurable. the property name is pig.cachedbag.memusage, default value is 0.5
          3. The first 100 tuples are used to calculate tuple size in memory to determine how many tuples can fit into the configured memusage. It's not the number of tuples kept in memory

          Show
          Ying He added a comment - Answer to Olga's questions: 1. The synchronization can be removed. 2. Memory fraction is configurable. the property name is pig.cachedbag.memusage, default value is 0.5 3. The first 100 tuples are used to calculate tuple size in memory to determine how many tuples can fit into the configured memusage. It's not the number of tuples kept in memory
          Hide
          Olga Natkovich added a comment -

          Ok, then lets remove the synchronization. The rest looks good. Could you also put perf numbers that you see with this change

          Show
          Olga Natkovich added a comment - Ok, then lets remove the synchronization. The rest looks good. Could you also put perf numbers that you see with this change
          Hide
          Ying He added a comment -

          remove synchronization

          Show
          Ying He added a comment - remove synchronization
          Ying He made changes -
          Attachment PIG-975.patch3 [ 12420570 ]
          Hide
          Ying He added a comment -

          performance numbers

          Show
          Ying He added a comment - performance numbers
          Ying He made changes -
          Attachment internalbag.xls [ 12420571 ]
          Hide
          Pradeep Kamath added a comment -

          I think it might be a good idea to have a config parameter (maybe a java -D property) which can allow users to choose between spillableBagForReduce and NonSpillableBagForReduce with the Non spillable one being the default. This way if for some reason users find the spillablebag better for their query they can use it.

          Show
          Pradeep Kamath added a comment - I think it might be a good idea to have a config parameter (maybe a java -D property) which can allow users to choose between spillableBagForReduce and NonSpillableBagForReduce with the Non spillable one being the default. This way if for some reason users find the spillablebag better for their query they can use it.
          Hide
          Ying He added a comment -

          I think this is too implementation specific to expose to end user. Frankly, I don't think user cares which class we use for the data bags.

          Show
          Ying He added a comment - I think this is too implementation specific to expose to end user. Frankly, I don't think user cares which class we use for the data bags.
          Hide
          Olga Natkovich added a comment -

          Ying, what Pradeep is asking for is more like a safety switch - to give users a way to go back to the old implementation if they run into problem with new. Once we verify that the new code is as stable as the old, we would remove the switch. We would also not expose it to users unless they do run into trouble.

          Show
          Olga Natkovich added a comment - Ying, what Pradeep is asking for is more like a safety switch - to give users a way to go back to the old implementation if they run into problem with new. Once we verify that the new code is as stable as the old, we would remove the switch. We would also not expose it to users unless they do run into trouble.
          Hide
          Ying He added a comment -

          Add switch to old bag. Setting property pig.cachedbag.type=default would switch to old default bag. If not specified, use InternalCachedBag.l

          Show
          Ying He added a comment - Add switch to old bag. Setting property pig.cachedbag.type=default would switch to old default bag. If not specified, use InternalCachedBag.l
          Ying He made changes -
          Attachment PIG-975.patch4 [ 12420603 ]
          Olga Natkovich made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Olga Natkovich made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Affects Version/s 0.4.0 [ 12314042 ]
          Affects Version/s 0.2.0 [ 12313783 ]
          Fix Version/s 0.6.0 [ 12314214 ]
          Fix Version/s 0.2.0 [ 12313783 ]
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12420603/PIG-975.patch4
          against trunk revision 819691.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The applied patch generated 406 javac compiler warnings (more than the trunk's current 403 warnings).

          -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

          -1 release audit. The applied patch generated 278 release audit warnings (more than the trunk's current 277 warnings).

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/10/testReport/
          Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/10/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/10/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/10/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12420603/PIG-975.patch4 against trunk revision 819691. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. -1 javac. The applied patch generated 406 javac compiler warnings (more than the trunk's current 403 warnings). -1 findbugs. The patch appears to introduce 1 new Findbugs warnings. -1 release audit. The applied patch generated 278 release audit warnings (more than the trunk's current 277 warnings). +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/10/testReport/ Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/10/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/10/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/10/console This message is automatically generated.
          Hide
          Olga Natkovich added a comment -

          patch committed. Thanks, Ying

          Show
          Olga Natkovich added a comment - patch committed. Thanks, Ying
          Olga Natkovich made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Ying He made changes -
          Link This issue is cloned as PIG-1000 [ PIG-1000 ]
          Alan Gates made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Transition Time In Source Status Execution Times Last Executer Last Execution Date
          Patch Available Patch Available Open Open
          4d 3h 1m 1 Olga Natkovich 29/Sep/09 01:28
          Open Open Patch Available Patch Available
          1h 23m 2 Olga Natkovich 29/Sep/09 01:29
          Patch Available Patch Available Resolved Resolved
          7d 17m 1 Olga Natkovich 06/Oct/09 01:46
          Resolved Resolved Closed Closed
          169d 21h 29m 1 Alan Gates 24/Mar/10 22:15

            People

            • Assignee:
              Ying He
              Reporter:
              Ying He
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development