Description
The DUMP command is fairly dangerous: leave a stray DUMP uncommented after debugging your script on reduced data, and it will spew a terabyte of data into your console with no apology.
1. By (configurable) default, DUMP should not emit more than 1MB of data
2. The DUMP statement should accept a limit on rows
Safety Valve limit on output size
Pig should gain a pig.max_dump_bytes configuration variable imposing an approximate upper bound on how much data DUMP will emit. Since a GROUP BY statement can produce an extremely large bag, this safety valve should be expressed in bytes, not rows. I propose a default of 1,000,000 bytes, enough for about 1,000 records of 1 KB each. Pig should emit a warning to the console if the pig.max_dump_bytes limit is hit.
This is a breaking change, but users shouldn't be using DUMP for anything other than experimentation. Pig should favor the experimentation use case, and let the foolhardy push the pig.max_dump_bytes limit back up on their own.
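As a sketch of how pushing the limit back up might look (the pig.max_dump_bytes property is proposed here and does not exist in Pig today), a script could override the default with Pig's existing set statement:
-- hypothetical: raise the proposed dump cap to 10 MB for this script only
set pig.max_dump_bytes 10000000;
DUMP teams;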
DUMP can elegantly limit the number of rows
Right now I have to write the following annoyingly wordy pair of statements:
dumpable = LIMIT teams 10;
DUMP dumpable;
One approach would be to allow DUMP to accept an inline (nested) operator. Other statements can take inline operators, but DUMP can't:
-- these work, which is so awesome:
some = FOREACH (LIMIT teams 10) GENERATE team_id, park_id;
some = GROUP (LIMIT teams 10) BY park_id;
STORE (LIMIT teams 10) INTO '/tmp/some_teams';
-- these don't work, but maybe they should:
DUMP (LIMIT teams 10);
DUMP (GROUP teams BY team_id);
Alternatively, DUMP could accept a LIMIT argument directly:
DUMP teams LIMIT 10;
DUMP teams LIMIT ALL;
The generated plan should be equivalent to that from `some = LIMIT teams 10; DUMP some;` so that the existing optimizations on LIMIT kick in.
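To make that equivalence concrete (the DUMP ... LIMIT syntax is still hypothetical), the two forms below should compile to the same plan:
-- proposed shorthand (does not exist yet)
DUMP teams LIMIT 10;
-- today's equivalent, which the shorthand should desugar to
some = LIMIT teams 10;
DUMP some;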