SAMPLE and RANDOM should be able to give output that is stable from run-to-run, yet random across a large input set. Although
PIG-2965 allows the RANDOM function to be constructed with a seed, each mapper will generate the same sequence of values, which is unacceptable.
It's typically undesirable to have the output of a large job be completely non-deterministic. Testing becomes difficult, and failed map tasks don't provide the same output from attempt to attempt, which complicates debugging.
The most desirable implementation would provide a guarantee that a given seed and input data would produce an identical result in any environment. I believe this is difficult in a distributed environment, however.
If each mapper added the index of its task ID to the provided seed, then the output would be stable for most practical purposes – as long as the assignment of input splits to mappers doesn't change from job to job, the number produced for each row won't change from job to job. Doing it this way would be backwards compatible with the current Pig 0.12.0 implementation (
PIG-2965) in the case of a single mapper (which is the only justifiable use of the current seed feature). Alternatively, one could use a hash of the input file path, the split offset, and the provided seed. Both approaches are not stable if the splitCombination logic is not stable.
Suggested documentation for new functionality of RANDOM:
This example constructs a function, providing a seed to control the series of numbers generated. Each of the three fields will have an independent series of random values, and the output will be stable from run to run. (Note that the result is only stable if the input splits remain stable).
Suggested documentation for new functionality of SAMPLE:
In this example, we provide a seed that stabilizes which rows are selected from run to run. (Note that the result is only stable if the input splits remain stable).