Affects Version/s: None
Fix Version/s: None
Release Note: DUMP limits the amount of data it will emit. Once pig.max_dump_bytes (default 1,000,000) bytes of data have been emitted, it prints a warning and continues executing. DUMP also accepts nested operators, so you can write "DUMP (LIMIT foo 10);" to dump only the first ten rows.
The DUMP command is fairly dangerous: leave a stray DUMP uncommented after debugging your script on reduced data, and when the script later runs against the full dataset it will spew a terabyte of data into your console with no apology.
1. By (configurable) default, DUMP should not emit more than 1MB of data
2. The DUMP statement should accept a limit on rows
Pig should gain a pig.max_dump_bytes configuration variable imposing an approximate upper bound on how much data DUMP will emit. Since a GROUP BY statement can generate an extremely large bag, this safety-valve limit should be in bytes, not rows. I propose a default of 1,000,000 bytes, enough for roughly 1,000 records of 1 KB each. Pig should emit a warning to the console when the max_dump_bytes limit is hit.
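As a sketch (assuming the property lands under the proposed name, which it has not yet), a script that genuinely needs a larger dump could raise the cap explicitly:

    -- hypothetical: pig.max_dump_bytes is the property proposed above
    SET pig.max_dump_bytes 10000000;  -- allow DUMP to emit up to ~10 MB
    DUMP teams;

Presumably the same override could also be set on the command line or in pig.properties, like any other job property.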
This is a breaking change, but users shouldn't be using DUMP other than for experimentation. Pig should favor the experimentation use case, and let the foolhardy push the max_dump_bytes limit back up on their own.
Right now I have to write the following annoyingly-wordy statement:
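    -- current workaround ('teams' is the example relation used below)
    some = LIMIT teams 10;
    DUMP some;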
One approach would be to allow DUMP to accept an inline (nested) operator. Assignment statements can have inline operators, but DUMP can't:
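    -- proposed syntax (not valid in current Pig): nest the operator inside DUMP
    DUMP (LIMIT teams 10);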
Alternatively, DUMP could accept an argument:
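    -- alternative proposed syntax (hypothetical shape; the exact form is not specified here)
    DUMP teams 10;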
The generated plan should be equivalent to that from `some = LIMIT teams 10; DUMP some;` so that optimizations on LIMIT kick in.