[PIG-2137] SAMPLE should not be pushed above DISTINCT - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 0.8.0, 0.8.1, 0.9.0, 0.10.0
Fix Version/s: 0.9.0, 0.10.0
Component/s: None
Labels: None

Description

I have an input file that contains 50,000 distinct integers. Each integer is repeated twice, for a total of 100,000 lines.

Script 1, using GROUP BY to get distinct entries in the data, works:

grunt> f = load 'tmp/dupnumbers.txt';              
grunt> d = foreach (group f by $0) generate group; 
grunt> s = sample d 0.01;                          
grunt> n = foreach (group s all) generate COUNT(s);
grunt> dump n;
(493)

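Script 2, using DISTINCT for the same purpose, returns roughly twice as many rows. This indicates that SAMPLE is being pushed above DISTINCT, so sampling runs on the 100,000 raw lines rather than on the 50,000 distinct values: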

grunt> f = load 'tmp/dupnumbers.txt';
grunt> d = distinct f;
grunt> s = sample d 0.01;
grunt> n = foreach (group s all) generate COUNT(s);
grunt> dump n;
(980)
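
The gap between the two counts matches what reordering the operators would predict. With a 1% sample, deduplicating first leaves about 50,000 × 0.01 = 500 rows. Sampling the 100,000 raw lines first and deduplicating afterwards leaves about 50,000 × (1 − 0.99²) ≈ 995 rows. The Python sketch below simulates both orderings. It is not Pig code and does not reflect Pig's implementation: the function names are hypothetical, and it assumes SAMPLE keeps each row independently with probability p.

```python
import random


def distinct_then_sample(values, p, rng):
    """Deduplicate first, then keep each remaining row with probability p."""
    return [v for v in sorted(set(values)) if rng.random() < p]


def sample_then_distinct(values, p, rng):
    """Sample the raw rows first, then deduplicate the sample.

    This mirrors the plan where SAMPLE is pushed above DISTINCT.
    """
    return sorted({v for v in values if rng.random() < p})


# Stand-in for the input file: 50,000 distinct integers, each appearing twice.
values = [i for i in range(50_000) for _ in range(2)]

rng = random.Random(0)  # fixed seed so repeated runs give the same counts
intended = len(distinct_then_sample(values, 0.01, rng))
pushed_up = len(sample_then_distinct(values, 0.01, rng))

print("DISTINCT then SAMPLE:", intended)
print("SAMPLE then DISTINCT:", pushed_up)
```

The first count should land near 500, in line with Script 1, and the second near 1,000, in line with Script 2.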

Attachments

- PIG-2137.1.patch, 23/Jun/11 00:58, 4 kB, Thejas Nair
- PIG-2137.2.patch, 23/Jun/11 01:19, 5 kB, Thejas Nair
- PIG-2137.patch, 22/Jun/11 04:31, 4 kB, Dmitriy V. Ryaboy

Issue Links

is related to: PIG-2014 SAMPLE shouldn't be pushed up (Closed)

People

Assignee: Dmitriy V. Ryaboy
Reporter: Dmitriy V. Ryaboy
Votes: 0
Watchers: 1

Dates

Created: 22/Jun/11 03:33
Updated: 04/Aug/11 00:35
Resolved: 24/Jun/11 02:39