Pig / PIG-4819

RANDOM() UDF can lead to missing or redundant records

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.16.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      When a RANDOM() value is used for grouping, DISTINCT, etc., it breaks the MapReduce assumption that a re-executed task produces the same output as the original attempt, and this can lead to redundant or missing records.

      Some discussion can be found in
      https://issues.apache.org/jira/browse/PIG-3257?focusedCommentId=13669195#comment-13669195

      We should make RANDOM less random, so that it produces the same sequence of random values across task retries.
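
      The idea can be illustrated with a minimal Java sketch. It is not the attached patch: the class name, the helper seedFor, and the choice of seed inputs (a job ID string, a task index, and a per-UDF instance number) are all hypothetical. The one point it shows is that a seed derived only from values that are stable across retries of a task, never from the attempt number, makes every attempt replay the same sequence.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative sketch only; names and seed inputs are assumptions, not Pig's API.
class DeterministicRandomSketch {

    // Hypothetical helper: combine identifiers that do not change when a task
    // is retried. The attempt number is deliberately left out of the seed.
    static long seedFor(String jobId, int taskIndex, int udfInstance) {
        long seed = jobId.hashCode();
        seed = seed * 31 + taskIndex;
        seed = seed * 31 + udfInstance;
        return seed;
    }

    // Returns the first n values one task attempt would draw.
    static double[] firstValues(String jobId, int taskIndex, int udfInstance, int n) {
        Random random = new Random(seedFor(jobId, taskIndex, udfInstance));
        double[] values = new double[n];
        for (int i = 0; i < n; i++) {
            values[i] = random.nextDouble();
        }
        return values;
    }

    public static void main(String[] args) {
        // Two simulated attempts of the same task draw identical sequences.
        double[] firstAttempt = firstValues("job_0001", 3, 0, 5);
        double[] retryAttempt = firstValues("job_0001", 3, 0, 5);
        System.out.println(Arrays.equals(firstAttempt, retryAttempt)); // prints "true"
    }
}
```

      With a seed like this, a retried map task emits the same RANDOM() values for the same input records, so records routed by that value reach the same reducers as in the first attempt.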

      Attachments

        1. pig-4819-v01.patch
          8 kB
          Koji Noguchi
        2. pig-4819-v02.patch
          10 kB
          Koji Noguchi
        3. pig-4819-v02_fix_v01.patch
          7 kB
          Koji Noguchi
        4. pig-4819-v02_fix_v02.patch
          7 kB
          Koji Noguchi
        5. pig-4819-v02_fix_v03.patch
          7 kB
          Koji Noguchi
        6. pig-4819-v02_fix_v04.patch
          7 kB
          Koji Noguchi
        7. pig-4819-v02_fix_v05.patch
          8 kB
          Koji Noguchi
        8. pig-4819-v02_fix_v06.patch
          8 kB
          Koji Noguchi

          People

            Assignee: Koji Noguchi (knoguchi)
            Reporter: Koji Noguchi (knoguchi)
            Votes: 0
            Watchers: 2