Pig / PIG-4819

RANDOM() udf can lead to missing or redundant records

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.16.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

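      When a RANDOM() value is used as a key for grouping, DISTINCT, and similar operations, it breaks the MapReduce assumption that a re-executed task produces the same output as the original attempt. This can lead to redundant or missing records.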

      Some discussion can be found in
      https://issues.apache.org/jira/browse/PIG-3257?focusedCommentId=13669195#comment-13669195

      We should make RANDOM() less random so that every retry of a task produces the same sequence of random values.
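
      A minimal sketch of the idea, not the code from the attached patches: seed the generator from identifiers that stay the same across attempts of one task, so a retry replays the same sequence. The class name RetryStableRandom and the inputs jobId, taskIndex, and callSiteIndex are hypothetical and chosen only for illustration. How Pig actually derives its seed is not stated in this issue.

```java
import java.util.Random;

// Hypothetical illustration: a random source whose sequence depends only on
// values that are identical for every attempt of the same task.
public class RetryStableRandom {
    private final Random rng;

    // jobId, taskIndex and callSiteIndex are assumed inputs; none of them
    // includes the attempt number, so a retried task gets the same seed.
    public RetryStableRandom(String jobId, int taskIndex, int callSiteIndex) {
        long seed = jobId.hashCode();
        seed = 31L * seed + taskIndex;
        seed = 31L * seed + callSiteIndex;
        this.rng = new Random(seed);
    }

    // Returns the next value in [0.0, 1.0), like a RANDOM()-style UDF would.
    public double next() {
        return rng.nextDouble();
    }

    public static void main(String[] args) {
        // Two attempts of the same task build their generators independently.
        RetryStableRandom firstAttempt = new RetryStableRandom("job_0001", 3, 0);
        RetryStableRandom retryAttempt = new RetryStableRandom("job_0001", 3, 0);
        for (int i = 0; i < 5; i++) {
            double a = firstAttempt.next();
            double b = retryAttempt.next();
            // Prints each value and whether the retry reproduced it.
            System.out.println(a + " " + (a == b));
        }
    }
}
```

      Because the seed leaves out the attempt number, a record that the first attempt sent to a given reducer is sent to the same reducer by the retry, which is the property the grouping and DISTINCT cases rely on.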

        Attachments

        1. pig-4819-v02.patch
          10 kB
          Koji Noguchi
        2. pig-4819-v02_fix_v06.patch
          8 kB
          Koji Noguchi
        3. pig-4819-v02_fix_v05.patch
          8 kB
          Koji Noguchi
        4. pig-4819-v02_fix_v04.patch
          7 kB
          Koji Noguchi
        5. pig-4819-v02_fix_v03.patch
          7 kB
          Koji Noguchi
        6. pig-4819-v02_fix_v02.patch
          7 kB
          Koji Noguchi
        7. pig-4819-v02_fix_v01.patch
          7 kB
          Koji Noguchi
        8. pig-4819-v01.patch
          8 kB
          Koji Noguchi

            People

            • Assignee:
              knoguchi Koji Noguchi
              Reporter:
              knoguchi Koji Noguchi
            • Votes:
              0
              Watchers:
              2
