Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Description
I have a script that takes a command-line parameter:
pig -p number=100 script.pig
The script contains the following statements:
A = load '/user/viraj/test' using PigStorage() as (a,b,c);
B = SAMPLE A 1/$number;
dump B;
In realistic use cases of SAMPLE, statisticians need to compute the sampling fraction on demand.
Ideally, I would like to compute the SAMPLE argument within a single Pig script, without having to run one Pig script to get its result and then a second script that takes that result as a parameter.
Ideal use case:
A = load '/user/viraj/input' using PigStorage() as (col1, col2, col3);
... ...
W = group X by col1;
Z = foreach Y generate AVG(X);
AA = load '/user/viraj/test' using PigStorage() as (a,b,c);
BB = SAMPLE AA 1/Z;
dump BB;
Viraj
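For illustration only, the requested behavior can be sketched outside Pig. The Python below is a minimal sketch, not Pig's implementation: it assumes SAMPLE behaves like an independent per-row coin flip with the given probability, and the names `sample`, `average`, `x`, `aa`, `z`, and `bb` are hypothetical stand-ins for the relations in the ideal use case. The point it shows is that the fraction is derived from an aggregate in the same program rather than passed in from a previous run.

```python
import random


def sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`.

    Assumption: this mimics a probabilistic sampler, so the number of
    rows returned varies from run to run unless a seed is fixed.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


def average(values):
    """Arithmetic mean, standing in for AVG in the ideal use case."""
    values = list(values)
    return sum(values) / len(values)


# Hypothetical data standing in for the relations in the issue.
x = [50, 100, 150]          # values that feed the average
aa = list(range(10_000))    # relation to be sampled

z = average(x)                    # computed on demand, in the same program
bb = sample(aa, 1 / z, seed=42)   # analogous to: BB = SAMPLE AA 1/Z;
print(len(bb))
```

With a fraction of 1/100 and 10,000 input rows, about 100 rows are expected, but the exact count is not guaranteed.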
Change this Jira to track only the sampling algorithm. PIG-1926 has been opened to track LIMIT/SAMPLE taking a scalar.
This is a candidate project for Google Summer of Code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012