  1. DataFu
  2. DATAFU-11

ReservoirSample does not behave as expected when grouping by a key other than ALL


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.0
    • Labels:
      None

      Description

      Reported by Barbara Mucha (Issue #92 on GitHub):

      ReservoirSample does not behave as expected when grouping by a key other than ALL.

      It appears that the sample is drawn from the full input rather than from each group's input.

      Given input:

      a1,5
      a1,6
      a1,7
      a2,5
      a2,6
      a2,7
      

      with the following program

      DEFINE ReservoirSample datafu.pig.sampling.ReservoirSample('2');
      data = LOAD 'input.txt' USING PigStorage(',') AS (key: chararray, value: chararray);
      grouped = GROUP data BY key;
      sample2 = FOREACH grouped GENERATE group, ReservoirSample(data);
      

      the expected output should be similar to

      (a1, {(a1,5),(a1,7)})
      (a2, {(a2,5),(a2,7)})
      

      However, actual output may show up as

      (a1, {(a1,5),(a1,7)})
      (a2, {(a1,5),(a1,7)})
      
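      The expected per-group behavior can be illustrated outside Pig. The following is a minimal Python sketch, not DataFu's implementation: it assumes a textbook reservoir sampler (Algorithm R) and builds a fresh reservoir for every group, so a sampled bag can only contain tuples from its own key. The function names reservoir_sample and sample_per_group are hypothetical and exist only for this illustration.

```python
import random
from collections import defaultdict


def reservoir_sample(items, k, rng):
    """Uniformly sample up to k items from an iterable (Algorithm R)."""
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def sample_per_group(rows, k, seed=0):
    """Group (key, value) rows by key and sample each group independently."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append((key, value))
    # A new reservoir is created for every group, so no state carries over.
    return {key: reservoir_sample(bag, k, rng) for key, bag in groups.items()}


rows = [("a1", "5"), ("a1", "6"), ("a1", "7"),
        ("a2", "5"), ("a2", "6"), ("a2", "7")]

for key, sample in sample_per_group(rows, 2).items():
    print(key, sample)
```

      Each printed bag holds two tuples whose first field matches the group key. The actual output reported above violates this property, because the bag for a2 contains tuples keyed a1.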

        Attachments

        1. DATAFU-11-v2.patch
          10 kB
          Matthew Hayes
        2. DATAFU-11.patch
          10 kB
          Matthew Hayes


            People

            • Assignee:
              mhayes Matthew Hayes
              Reporter:
              william.g.vaughan Will Vaughan
            • Votes:
              0
              Watchers:
              4

              Dates

              • Created:
                Updated:
                Resolved: