Uploaded image for project: 'Pig'
  1. Pig
  2. PIG-3900

SAMPLE and RANDOM should optionally stabilize their output from run-to-run, even across a large input set

Add voteVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Minor
    • Resolution: Unresolved
    • None
    • None
    • None
    • SAMPLE and RANDOM can now optionally stabilize their output from run-to-run, even across a large input set.

    Description

      SAMPLE and RANDOM should be able to give output that is stable from run-to-run, yet random across a large input set. Although PIG-2965 allows the RANDOM function to be constructed with a seed, each mapper will generate the same sequence of values, which is unacceptable.

      It's typically undesirable to have the output of a large job be completely non-deterministic. Testing becomes difficult, and failed map tasks don't provide the same output from attempt to attempt, which complicates debugging.

      The most desirable implementation would provide a guarantee that a given seed and input data would produce an identical result in any environment. I believe this is difficult in a distributed environment, however.

      If each mapper added the index of its task ID to the provided seed, then the output would be stable for most practical purposes – as long as the assignment of input splits to mappers doesn't change from job to job, the number produced for each row won't change from job to job. Doing it this way would be backwards compatible with the current Pig 0.12.0 implementation (PIG-2965) in the case of a single mapper (which is the only justifiable use of the current seed feature). Alternatively, one could use a hash of the input file path, the split offset, and the provided seed. Both approaches are not stable if the splitCombination logic is not stable.

      Suggested documentation for new functionality of RANDOM:

      This example constructs a function, providing a seed to control the series of numbers generated. Each of the three fields will have an independent series of random values, and the output will be stable from run to run. (Note that the result is only stable if the input splits remain stable).

      DEFINE rollRand  RANDOM('12345');
      DEFINE yawRand   RANDOM('69');
      DEFINE pitchRand RANDOM('42');
      position = LOAD 'position.tsv';
      orientation = FOREACH position GENERATE rollRand() AS roll:double, pitchRand() AS pitch:double, yawRand() AS yaw:double;
      

      Suggested documentation for new functionality of SAMPLE:

      In this example, we provide a seed that stabilizes which rows are selected from run to run. (Note that the result is only stable if the input splits remain stable).

      a = LOAD 'a.txt';
      b = SAMPLE A 0.1 SEED 42;
      

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            Unassigned Unassigned
            mrflip Flip Kromer

            Dates

              Created:
              Updated:

              Slack

                Issue deployment