  1. DataFu
  2. DATAFU-11

ReservoirSample does not behave as expected when grouping by a key other than ALL

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • None
    • 1.3.0
    • None

    Description

      Reported by Barbara Mucha (Issue #92 on GitHub):

      ReservoirSample does not behave as expected when grouping by a key other than ALL.

      The sample appears to be drawn from the full input rather than from each group's own bag.

      Given input:

      a1,5
      a1,6
      a1,7
      a2,5
      a2,6
      a2,7
      

      with the following program

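      -- Keep a random sample of at most 2 tuples from each bag passed in
      DEFINE ReservoirSample datafu.pig.sampling.ReservoirSample('2');
      data = LOAD 'input.txt' USING PigStorage(',') AS (key: chararray, value: chararray);
      grouped = GROUP data BY key;
      -- Emit the group key with the sample so the output has the (key, bag) shape shown below
      sample2 = FOREACH grouped GENERATE group, ReservoirSample(data);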
      

      the expected output should be similar to

      (a1, {(a1,5),(a1,7)})
      (a2, {(a2,5),(a2,7)})
      

      However, actual output may show up as

      (a1, {(a1,5),(a1,7)})
      (a2, {(a1,5),(a1,7)})
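
      For reference, the per-group behavior the report expects can be sketched outside Pig. The Python below is only an illustration under assumptions, not DataFu's implementation: `reservoir_sample` and `sample_per_group` are hypothetical names, the sampler is the textbook Algorithm R, and the fixed seed is there only to make runs repeatable. The point it shows is that each key gets its own fresh reservoir, so a sampled tuple can only come from its own group.

```python
import random
from collections import defaultdict


def reservoir_sample(items, k, rng):
    """Algorithm R: uniform random sample of at most k items from an iterable."""
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def sample_per_group(rows, k, seed=0):
    """Group (key, value) rows by key, then sample each group independently."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append((key, value))
    # A new reservoir is built for every key, so no state carries over between groups.
    return {key: reservoir_sample(bag, k, rng) for key, bag in groups.items()}


rows = [("a1", "5"), ("a1", "6"), ("a1", "7"),
        ("a2", "5"), ("a2", "6"), ("a2", "7")]
result = sample_per_group(rows, 2)
for key, bag in sorted(result.items()):
    print(key, bag)
```

      Whichever tuples are picked, every tuple in the bag for `a2` should carry the key `a2`. The actual output above violates exactly that property.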
      

      Attachments

        1. DATAFU-11.patch
          10 kB
          Matthew Hayes
        2. DATAFU-11-v2.patch
          10 kB
          Matthew Hayes

          People

            Assignee: Matthew Hayes (mhayes)
            Reporter: Will Vaughan (william.g.vaughan)
            Votes: 0
            Watchers: 4
