Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Description
Reported by Barbara Mucha (Issue #92 on GitHub):
ReservoirSample does not behave as expected when grouping by a key other than ALL.
It appears that the sample is drawn from the full input rather than from each group's input.
Given input:
a1,5
a1,6
a1,7
a2,5
a2,6
a2,7
with the following program
DEFINE ReservoirSample datafu.pig.sampling.ReservoirSample('2');
data = LOAD 'input.txt' USING PigStorage(',') AS (key: chararray, value: chararray);
grouped = GROUP data BY key;
sample2 = FOREACH grouped GENERATE ReservoirSample(data);
the expected output should be similar to
(a1, {(a1,5),(a1,7)})
(a2, {(a2,5),(a2,7)})
However, the actual output may be
(a1, {(a1,5),(a1,7)})
(a2, {(a1,5),(a1,7)})
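
For reference, the expected per-group behavior can be modeled outside Pig. The following is a minimal Python sketch, not DataFu's implementation: `reservoir_sample` is a hypothetical helper using the standard reservoir-sampling scheme (Algorithm R), and the point is only that each group gets its own fresh reservoir, so every sampled tuple carries its group's key.

```python
import random
from collections import defaultdict


def reservoir_sample(items, k, rng):
    """Return a uniform random sample of up to k items (Algorithm R).

    Hypothetical helper for illustration; not part of DataFu.
    """
    reservoir = []  # fresh state on every call, i.e. per group
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Same rows as input.txt in the report.
rows = [("a1", "5"), ("a1", "6"), ("a1", "7"),
        ("a2", "5"), ("a2", "6"), ("a2", "7")]

# Equivalent of: grouped = GROUP data BY key;
groups = defaultdict(list)
for key, value in rows:
    groups[key].append((key, value))

# Equivalent of sampling each group's bag independently with size 2.
rng = random.Random(0)  # seed chosen arbitrarily for repeatability
sampled = {key: reservoir_sample(bag, 2, rng) for key, bag in groups.items()}

for key, bag in sampled.items():
    print(key, bag)
```

With correct per-group sampling, the bag printed for a2 contains only a2 tuples. The reported output, where a2's bag holds a1 tuples, is what this model would produce if the reservoir were shared across groups instead of being recreated for each one.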