Pig / PIG-1249

Safe-guards against misconfigured Pig scripts without PARALLEL keyword

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.8.0
    • Fix Version/s: 0.8.0
    • Component/s: None
    • Labels:
      None
    • Release Note:
      In previous versions of Pig, if the number of reducers was not specified (via PARALLEL or default_parallel), a value of 1 was used, which in many cases was a poor choice and caused severe performance problems.

      In Pig 0.8.0, a simple heuristic is used to come up with a better number based on the size of the input data. There are several parameters that the user can control:

      pig.exec.reducers.bytes.per.reducer - defines the number of input bytes per reducer; the default value is 1000*1000*1000 (1 GB)
      pig.exec.reducers.max - defines the upper bound on the number of reducers; the default is 999

      The formula is very simple:

      #reducers = MIN(pig.exec.reducers.max, total input size (in bytes) / pig.exec.reducers.bytes.per.reducer)

      This is a simplistic formula that will need to be improved over time. Note that the computed value takes all inputs within the script into account and is applied to all the jobs within the Pig script.

      Note that this is not a backward-compatible change; to restore the old behavior of a single reducer, set default_parallel to 1.
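      For example, with the default settings a script reading 5 TB of input gets MIN(999, 5,000,000,000,000 / 1,000,000,000) = MIN(999, 5000) = 999 reducers instead of 1. The sketch below shows how these knobs might be tuned from within a script; the property names come from the release note above, while setting them with the script-level set command (rather than via -D on the command line or pig.properties) is an assumption made for illustration.

      -- Minimal sketch; property names are taken from the release note above.
      -- Setting them via the script-level 'set' command is assumed here for illustration.
      set pig.exec.reducers.bytes.per.reducer 256000000;  -- target roughly 256 MB of input per reducer
      set pig.exec.reducers.max 500;                       -- cap the estimated reducer count at 500

      -- Restore the pre-0.8.0 behavior of a single reducer for jobs without PARALLEL:
      set default_parallel 1;

      Explicit PARALLEL clauses and default_parallel still take precedence; the heuristic only applies when neither is specified.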

      Description

      It would be very useful for Pig to have safe-guards against naive scripts which process a lot of data without the use of the PARALLEL keyword.

      We've seen a fair number of instances where naive users process huge data-sets (>10 TB) with a badly misconfigured number of reduces, e.g. 1 reduce.
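      As a hypothetical illustration (the path and field names below are made up, not taken from this issue), a naive aggregation such as the following would previously have run its entire reduce phase on a single reducer unless PARALLEL was given explicitly:

      -- Hypothetical script; '/data/weblogs' and its schema are illustrative only.
      logs = LOAD '/data/weblogs' AS (user:chararray, bytes:long);
      grouped = GROUP logs BY user;            -- pre-0.8.0 default: a single reducer
      totals = FOREACH grouped GENERATE group, SUM(logs.bytes);
      STORE totals INTO '/data/weblogs_totals';

      -- The explicit safeguard: request reducers with the PARALLEL keyword.
      grouped = GROUP logs BY user PARALLEL 400;

      The heuristic described in the release note supplies a reasonable default for the first form; scripts that already use PARALLEL are unaffected.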

      1. PIG-1249-4.patch
        9 kB
        Alan Gates
      2. PIG-1249.patch
        7 kB
        Jeff Zhang
      3. PIG-1249_5.patch
        10 kB
        Jeff Zhang
      4. PIG_1249_3.patch
        9 kB
        Jeff Zhang
      5. PIG_1249_2.patch
        8 kB
        Jeff Zhang

        Issue Links

          This issue relates to MAPREDUCE-1521

          Activity

          Arun C Murthy created issue -
          Arun C Murthy made changes -
          Field Original Value New Value
          Link This issue relates to MAPREDUCE-1521 [ MAPREDUCE-1521 ]
          Olga Natkovich made changes -
          Fix Version/s 0.8.0 [ 12314562 ]
          Jeff Zhang made changes -
          Assignee Jeff Zhang [ zjffdu ]
          Jeff Zhang made changes -
          Attachment PIG-1249.patch [ 12444699 ]
          Jeff Zhang made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Affects Version/s 0.8.0 [ 12314562 ]
          Jeff Zhang made changes -
          Attachment PIG_1249_2.patch [ 12445332 ]
          Jeff Zhang made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Jeff Zhang made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Jeff Zhang made changes -
          Attachment PIG_1249_3.patch [ 12445559 ]
          Alan Gates made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Alan Gates made changes -
          Attachment PIG-1249-4.patch [ 12446173 ]
          Alan Gates made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Jeff Zhang made changes -
          Attachment PIG-1249_5.patch [ 12450579 ]
          Jeff Zhang made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Jeff Zhang made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Olga Natkovich made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Olga Natkovich made changes -
          Release Note (added; see the Release Note section above)
          Olga Natkovich made changes -
          Release Note (updated: default_parallelism corrected to default_parallel; backward-compatibility note added)
          Olga Natkovich made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              Jeff Zhang
            • Reporter:
              Arun C Murthy
            • Votes:
              0
            • Watchers:
              3
