[PIG-3900] SAMPLE and RANDOM should optionally stabilize their output from run-to-run, even across a large input set - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels:
- features
- random
- sample
- seed

Release Note:
SAMPLE and RANDOM can now optionally stabilize their output from run-to-run, even across a large input set.

Description

SAMPLE and RANDOM should be able to give output that is stable from run-to-run, yet random across a large input set. Although ~~PIG-2965~~ allows the RANDOM function to be constructed with a seed, each mapper will generate the same sequence of values, which is unacceptable.

It's typically undesirable to have the output of a large job be completely non-deterministic. Testing becomes difficult, and failed map tasks don't provide the same output from attempt to attempt, which complicates debugging.

The most desirable implementation would provide a guarantee that a given seed and input data would produce an identical result in any environment. I believe this is difficult in a distributed environment, however.

If each mapper added the index of its task ID to the provided seed, then the output would be stable for most practical purposes – as long as the assignment of input splits to mappers doesn't change from job to job, the number produced for each row won't change from job to job. Doing it this way would be backwards compatible with the current Pig 0.12.0 implementation (~~PIG-2965~~) in the case of a single mapper (which is the only justifiable use of the current seed feature). Alternatively, one could use a hash of the input file path, the split offset, and the provided seed. Both approaches are not stable if the splitCombination logic is not stable.

Suggested documentation for new functionality of RANDOM:

This example constructs a function, providing a seed to control the series of numbers generated. Each of the three fields will have an independent series of random values, and the output will be stable from run to run. (Note that the result is only stable if the input splits remain stable).
DEFINE rollRand  RANDOM('12345');
DEFINE yawRand   RANDOM('69');
DEFINE pitchRand RANDOM('42');
position = LOAD 'position.tsv';
orientation = FOREACH position GENERATE rollRand() AS roll:double, pitchRand() AS pitch:double, yawRand() AS yaw:double;

Suggested documentation for new functionality of SAMPLE:

In this example, we provide a seed that stabilizes which rows are selected from run to run. (Note that the result is only stable if the input splits remain stable).
a = LOAD 'a.txt';
b = SAMPLE A 0.1 SEED 42;

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: Flip Kromer

Votes:: 1 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 16/Apr/14 23:42

Updated:: 19/May/15 09:09