Pig / PIG-3931

DUMP should limit how much data it emits

    Details

    • Release Note:
      DUMP limits the amount of data it will emit. After pig.max_dump_bytes (default 1 million) bytes of data have been emitted, it outputs a warning and carries on executing. Also, you can use nested operators with DUMP, allowing you to write "DUMP (LIMIT foo 10);" to dump only the first ten rows.

      Description

      The DUMP command is fairly dangerous: leave a stray DUMP uncommented in your script after debugging on reduced data, and it will spew a terabyte of data into your console with no apology.

      1. By (configurable) default, DUMP should not emit more than 1MB of data
      2. The DUMP statement should accept a limit on rows

      Safety Valve limit on output size

      Pig should gain a pig.max_dump_bytes configuration variable imposing an approximate upper bound on how much data DUMP will emit. Since a GROUP BY statement can generate an extremely large bag, this safety valve limit should be bytes and not rows. I propose a default of 1,000,000 bytes – good for about 1000 records of 1k each. Pig should emit a warning to the console if the max_dump_bytes limit is hit.

      This is a breaking change, but users shouldn't be using DUMP other than for experimentation. Pig should favor the experimentation use case, and let the foolhardy push the max_dump_bytes limit back up on their own.
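
      Assuming the property lands with the name proposed here (pig.max_dump_bytes is this proposal's name, not an existing Pig property), a user who really does want an enormous DUMP could raise the cap per-script with Pig's SET command:

      -- hypothetical: raise the proposed dump cap to 10 MB for this script
      SET pig.max_dump_bytes 10000000;
      DUMP teams;
      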

      DUMP can elegantly limit the number of rows

      Right now I have to write the following annoyingly-wordy statement:

      dumpable = LIMIT teams 10 ; DUMP dumpable;
      

      One approach would be to allow DUMP to accept an inline (nested) operator. Assignment statements can already have inline operators, but DUMP can't:

      -- these work, which is so awesome:
      some = FOREACH (LIMIT teams 10) GENERATE team_id, park_id;
      some = GROUP (LIMIT teams 10) BY park_id;
      STORE (LIMIT teams 10) INTO '/tmp/some_teams';
      -- these don't work, but maybe they should:
      DUMP (LIMIT teams 10);
      DUMP (GROUP teams BY team_id);
      

      Alternatively, DUMP could accept an argument:

      dumpable = DUMP teams LIMIT 10;
      dumpable = DUMP teams LIMIT ALL;
      

      The generated plan should be equivalent to that from `some = LIMIT teams 10 ; DUMP some` so that optimizations on LIMIT kick in.
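
      To make the equivalence concrete, these two forms should compile to identical plans (the nested form is the proposed syntax, not yet valid):

      -- works today: explicit throwaway alias
      some = LIMIT teams 10;
      DUMP some;
      -- proposed: nested operator, same plan, no throwaway alias
      DUMP (LIMIT teams 10);
      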

            People

            • Assignee: Unassigned
            • Reporter: Philip (flip) Kromer (mrflip)
            • Votes: 1
            • Watchers: 2
