Pig
  1. Pig
  2. PIG-975

Need a databag that does not register with SpillableMemoryManager and spill data pro-actively

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.4.0
    • Fix Version/s: 0.6.0
    • Component/s: None
    • Labels:
      None

      Description

      POPackage uses DefaultDataBag during reduce process to hold data. It is registered with SpillableMemoryManager and prone to OutOfMemoryException. It's better to pro-actively managers the usage of the memory. The bag fills in memory to a specified amount, and dump the rest the disk. The amount of memory to hold tuples is configurable. This can avoid out of memory error.

      1. PIG-975.patch4
        11 kB
        Ying He
      2. PIG-975.patch3
        10 kB
        Ying He
      3. PIG-975.patch2
        10 kB
        Ying He
      4. PIG-975.patch
        10 kB
        Ying He
      5. internalbag.xls
        80 kB
        Ying He

        Issue Links

          Activity

          Hide
          Ying He added a comment -

          implement a new bag and use it in POPackage

          Show
          Ying He added a comment - implement a new bag and use it in POPackage
          Hide
          Ying He added a comment -

          remove System.out.println

          Show
          Ying He added a comment - remove System.out.println
          Hide
          Olga Natkovich added a comment -

          Couple of questions comments on the patch:

          • Why do we need to synchronize in add. Who else is accessing the bag since it is no longer managed by spillable manager?
          • Memory fraction should be a java property so that users can control it they choose so
          • Why do we have limit of only 100 tuples in memory since we already have memory limit? Also, if we do need it, shouldn't it be configurable?
          Show
          Olga Natkovich added a comment - Couple of questions comments on the patch: Why do we need to synchronize in add. Who else is accessing the bag since it is no longer managed by spillable manager? Memory fraction should be a java property so that users can control it they choose so Why do we have limit of only 100 tuples in memory since we already have memory limit? Also, if we do need it, shouldn't it be configurable?
          Hide
          Ying He added a comment -

          Answer to Olga's questions:

          1. The synchronization can be removed.
          2. Memory fraction is configurable. the property name is pig.cachedbag.memusage, default value is 0.5
          3. The first 100 tuples are used to calculate tuple size in memory to determine how many tuples can fit into the configured memusage. It's not the number of tuples kept in memory

          Show
          Ying He added a comment - Answer to Olga's questions: 1. The synchronization can be removed. 2. Memory fraction is configurable. the property name is pig.cachedbag.memusage, default value is 0.5 3. The first 100 tuples are used to calculate tuple size in memory to determine how many tuples can fit into the configured memusage. It's not the number of tuples kept in memory
          Hide
          Olga Natkovich added a comment -

          Ok, then lets remove the synchronization. The rest looks good. Could you also put perf numbers that you see with this change

          Show
          Olga Natkovich added a comment - Ok, then lets remove the synchronization. The rest looks good. Could you also put perf numbers that you see with this change
          Hide
          Ying He added a comment -

          remove synchronization

          Show
          Ying He added a comment - remove synchronization
          Hide
          Ying He added a comment -

          performance numbers

          Show
          Ying He added a comment - performance numbers
          Hide
          Pradeep Kamath added a comment -

          I think it might be a good idea to have a config parameter (maybe a java -D property) which can allow users to choose between spillableBagForReduce and NonSpillableBagForReduce with the Non spillable one being the default. This way if for some reason users find the spillablebag better for their query they can use it.

          Show
          Pradeep Kamath added a comment - I think it might be a good idea to have a config parameter (maybe a java -D property) which can allow users to choose between spillableBagForReduce and NonSpillableBagForReduce with the Non spillable one being the default. This way if for some reason users find the spillablebag better for their query they can use it.
          Hide
          Ying He added a comment -

          I think this is too implementation specific to expose to end user. Frankly, I don't think user cares which class we use for the data bags.

          Show
          Ying He added a comment - I think this is too implementation specific to expose to end user. Frankly, I don't think user cares which class we use for the data bags.
          Hide
          Olga Natkovich added a comment -

          Ying, what Pradeep is asking for is more like a safety switch - to give users a way to go back to the old implementation if they run into problem with new. Once we verify that the new code is as stable as the old, we would remove the switch. We would also not expose it to users unless they do run into trouble.

          Show
          Olga Natkovich added a comment - Ying, what Pradeep is asking for is more like a safety switch - to give users a way to go back to the old implementation if they run into problem with new. Once we verify that the new code is as stable as the old, we would remove the switch. We would also not expose it to users unless they do run into trouble.
          Hide
          Ying He added a comment -

          Add switch to old bag. Setting property pig.cachedbag.type=default would switch to old default bag. If not specified, use InternalCachedBag.l

          Show
          Ying He added a comment - Add switch to old bag. Setting property pig.cachedbag.type=default would switch to old default bag. If not specified, use InternalCachedBag.l
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12420603/PIG-975.patch4
          against trunk revision 819691.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The applied patch generated 406 javac compiler warnings (more than the trunk's current 403 warnings).

          -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

          -1 release audit. The applied patch generated 278 release audit warnings (more than the trunk's current 277 warnings).

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/10/testReport/
          Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/10/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/10/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/10/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12420603/PIG-975.patch4 against trunk revision 819691. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. -1 javac. The applied patch generated 406 javac compiler warnings (more than the trunk's current 403 warnings). -1 findbugs. The patch appears to introduce 1 new Findbugs warnings. -1 release audit. The applied patch generated 278 release audit warnings (more than the trunk's current 277 warnings). +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/10/testReport/ Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/10/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/10/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/10/console This message is automatically generated.
          Hide
          Olga Natkovich added a comment -

          patch committed. Thanks, Ying

          Show
          Olga Natkovich added a comment - patch committed. Thanks, Ying

            People

            • Assignee:
              Ying He
              Reporter:
              Ying He
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development