Description
The DUMP command is fairly dangerous: leave a stray DUMP uncommented after debugging your script on reduced data, and it will spew a terabyte of data into your console with no apology.
1. By (configurable) default, DUMP should not emit more than 1MB of data
2. The DUMP statement should accept a limit on rows
Safety Valve limit on output size
Pig should gain a pig.max_dump_bytes configuration variable imposing an approximate upper bound on how much data DUMP will emit. Since a GROUP BY statement can produce an extremely large bag, this safety valve should be expressed in bytes, not rows. I propose a default of 1,000,000 bytes, enough for about 1,000 records of 1 KB each. Pig should emit a warning to the console if the pig.max_dump_bytes limit is hit.
This is a breaking change, but users shouldn't be using DUMP for anything other than experimentation. Pig should favor the experimentation use case, and let the foolhardy push the pig.max_dump_bytes limit back up on their own.
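As a sketch of how pushing the limit back up might look (the pig.max_dump_bytes property is proposed here and does not exist in Pig today), a script could override the default with Pig's existing set statement:
-- hypothetical: raise the proposed dump cap to 10 MB for this script only
set pig.max_dump_bytes 10000000;
DUMP teams;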
DUMP can elegantly limit the number of rows
Right now I have to write the following annoyingly wordy pair of statements:
dumpable = LIMIT teams 10;
DUMP dumpable;
One approach would be to allow DUMP to accept an inline (nested) operator. Other statements can take inline operators, but DUMP can't:
-- these work, which is so awesome:
some = FOREACH (LIMIT teams 10) GENERATE team_id, park_id;
some = GROUP (LIMIT teams 10) BY park_id;
STORE (LIMIT teams 10) INTO '/tmp/some_teams';
-- these don't work, but maybe they should:
DUMP (LIMIT teams 10);
DUMP (GROUP teams BY team_id);
Alternatively, DUMP could accept a LIMIT argument directly:
DUMP teams LIMIT 10;
DUMP teams LIMIT ALL;
The generated plan should be equivalent to that from `some = LIMIT teams 10; DUMP some;` so that the existing optimizations on LIMIT kick in.
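To make that equivalence concrete (the DUMP ... LIMIT syntax is still hypothetical), the two forms below should compile to the same plan:
-- proposed shorthand (does not exist yet)
DUMP teams LIMIT 10;
-- today's equivalent, which the shorthand should desugar to
some = LIMIT teams 10;
DUMP some;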