Pig / PIG-3979

group all performance, garbage collection, and incremental aggregation


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.12.0, 0.11.1
    • Fix Version/s: 0.14.0
    • Component/s: impl
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      I have a Pig statement similar to:
      summary = foreach (group data ALL) generate
      COUNT(data.col1), SUM(data.col2), SUM(data.col2)
      , Moments(data.col3)
      , Moments(data.col4);

      There are a couple of hundred columns.

      I set the following:
      SET pig.exec.mapPartAgg true;
      SET pig.exec.mapPartAgg.minReduction 3;
      SET pig.cachedbag.memusage 0.05;

      I found that when I ran this on a JVM with insufficient memory, the process eventually timed out because of an infinite garbage collection loop.

      The problem occurred regardless of the pig.cachedbag.memusage setting.

      I solved the problem by making changes to:
      org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperator.POPartialAgg.java

      Rather than reading in 10000 records to establish an estimate of the reduction, I make the estimate after reading in enough tuples to fill the fraction of Runtime.getRuntime().maxMemory() given by pig.cachedbag.memusage (0.05 means 5% of the maximum heap).
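
      The memory-based sampling rule described above can be sketched roughly as follows. This is only an illustration, not the code from the attached patches: the class name, the method names, and the idea of dividing the budget by an average tuple size are assumptions made for the example.

```java
/** Hypothetical sketch of a memory-based sampling rule; not the PIG-3979 patch code. */
final class ReductionEstimateSketch {

    /** Bytes the first-tier map may hold before the reduction is estimated. */
    static long memoryBudgetBytes(long maxMemoryBytes, double memUsageFraction) {
        return (long) (maxMemoryBytes * memUsageFraction);
    }

    /** Tuples to read before estimating, given an assumed average tuple size. */
    static long tuplesToSample(long budgetBytes, long avgTupleBytes) {
        // Always sample at least one tuple, and never divide by zero.
        return Math.max(1L, budgetBytes / Math.max(1L, avgTupleBytes));
    }

    /** True if tuplesRead collapsed into distinctKeys by at least minReduction. */
    static boolean reductionIsSufficient(long tuplesRead, long distinctKeys, int minReduction) {
        if (distinctKeys <= 0) {
            return false;
        }
        return tuplesRead >= (long) minReduction * distinctKeys;
    }

    public static void main(String[] args) {
        // 0.05 mirrors "SET pig.cachedbag.memusage 0.05"; 200 bytes per tuple is made up.
        long budget = memoryBudgetBytes(Runtime.getRuntime().maxMemory(), 0.05);
        long sample = tuplesToSample(budget, 200L);
        System.out.println("budget bytes: " + budget + ", tuples to sample: " + sample);
        // With GROUP ALL every tuple shares one key, so the reduction is very high.
        System.out.println("keep aggregating: " + reductionIsSufficient(sample, 1L, 3));
    }
}
```

      The point of sizing the sample by memory rather than by a fixed record count is that wide tuples (a couple of hundred columns) reach the budget after far fewer records than narrow ones.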

      I also made a change to guarantee that second tier storage always has room for at least one record. In the current implementation, if the reduction is very high (for example 1000:1), the space allotted to second tier storage is zero.
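
      One way a very high reduction can produce a zero-sized second tier is integer division. The sketch below illustrates that failure mode and the clamp that avoids it. The formula, class name, and method names are assumptions for illustration; the actual sizing logic in POPartialAgg is not shown in this issue.

```java
/** Hypothetical sketch of sizing second tier storage; not the PIG-3979 patch code. */
final class SecondTierCapacitySketch {

    /** Naive sizing: integer division truncates to 0 when the reduction exceeds the capacity. */
    static int unclampedCapacity(int firstTierCapacity, int reductionFactor) {
        return firstTierCapacity / Math.max(1, reductionFactor);
    }

    /** Clamped sizing: always leaves room for at least one record. */
    static int clampedCapacity(int firstTierCapacity, int reductionFactor) {
        return Math.max(1, unclampedCapacity(firstTierCapacity, reductionFactor));
    }

    public static void main(String[] args) {
        // A first tier of 500 records with a 1000:1 reduction.
        System.out.println("unclamped: " + unclampedCapacity(500, 1000));
        System.out.println("clamped:   " + clampedCapacity(500, 1000));
    }
}
```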

      With these changes, I can summarize large data sets with small JVMs. I also find that setting pig.cachedbag.memusage to a small number such as 0.05 results in much better garbage collection performance without reducing throughput. I suppose tuning the garbage collector itself would also address the excessive garbage collection.

      The performance is sweet.

      Attachments

        1. PIG-3979-v1.patch (3 kB, David Dreyfus)
        2. POPartialAgg.java.patch (38 kB, David Dreyfus)
        3. SpillableMemoryManager.java.patch (9 kB, David Dreyfus)
        4. PIG-3979-3.patch (20 kB, Daniel Dai)
        5. PIG-3979-4.patch (21 kB, Daniel Dai)
        6. PIG-3979-synchronous-spill.patch (20 kB, Rohini Palaniswamy)
        7. PIG-3979-5.patch (32 kB, Rohini Palaniswamy)
        8. PIG-3979-6.patch (34 kB, Rohini Palaniswamy)
        9. PIG-3979-7.patch (35 kB, Rohini Palaniswamy)


            People

              Assignee: Rohini Palaniswamy (rohini)
              Reporter: David Dreyfus (ddreyfus)
              Votes: 0
              Watchers: 6
