[PIG-1713] SAMPLE command should accept parameters to specify alternative sampling algorithm - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels:
- gsoc2012

Description

I have a script which takes in a command line parameter.

pig -p number=100 script.pig

The script uses the parameter in the following statements:

A = load '/user/viraj/test' using PigStorage() as (a,b,c);

B = SAMPLE A 1/$number;

dump B;

Realistic use cases of SAMPLE require statisticians to compute the sampling fraction on demand.

Ideally, I would like to compute the SAMPLE fraction from within a single Pig script, without having to run one Pig script first to get its results and then a second script that takes those results as a parameter.

Ideal use case:

A = load '/user/viraj/input' using PigStorage() as (col1, col2, col3);

...
...

W = group X by col1;

Z = foreach Y generate AVG(X);

AA = load '/user/viraj/test' using PigStorage() as (a,b,c);

BB = SAMPLE AA 1/Z;

dump BB;

Viraj

Changed this Jira to track only the sampling algorithm. PIG-1926 was opened to track LIMIT/SAMPLE taking a scalar.
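To make the idea of an alternative sampling algorithm concrete, here is a minimal Python sketch (not Pig code, and not a proposed implementation) contrasting two strategies. It assumes that the existing SAMPLE behaves like an independent per-row probability filter, and it uses reservoir sampling (Algorithm R) as the alternative. The function names `bernoulli_sample` and `reservoir_sample` are hypothetical and exist only for this illustration.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def bernoulli_sample(rows: Iterable[T], fraction: float,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Keep each row independently with probability `fraction`.

    Assumption: this mirrors how a fraction-based SAMPLE behaves.
    The output size varies from run to run.
    """
    rng = rng or random.Random()
    return [row for row in rows if rng.random() < fraction]


def reservoir_sample(rows: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return exactly k rows chosen uniformly at random (Algorithm R).

    Makes a single pass and does not need the input size in advance.
    If the input has fewer than k rows, all rows are returned.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Row i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


if __name__ == "__main__":
    data = range(1000)
    print(len(bernoulli_sample(data, 0.01)))  # size varies around 10
    print(len(reservoir_sample(data, 10)))    # always 10
```

The design difference is that the per-row filter needs a fraction known up front, while the reservoir approach takes a fixed sample size and needs no prior count of the input, which is one reason a user might want to select the algorithm through a SAMPLE parameter.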

This is a candidate project for Google Summer of Code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

Issue Links

relates to
- PIG-3221 Bootstrap sampling (Open)
- PIG-3224 Reservoir sampling (Open)
- PIG-3225 Stratified sampling (Open)

People

Assignee: Unassigned
Reporter: Viraj Bhat
Votes: 0
Watchers: 2

Dates

Created: 09/Nov/10 00:32
Updated: 25/Apr/13 20:07