Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 1.6.0, 1.6.1
- None
Description
If you use monotonically_increasing_id() to append a column of IDs to a DataFrame, the IDs do not have a stable, deterministic relationship to the rows they are appended to. A given ID value can land on different rows depending on what happens in the task graph.
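A rough way to see why this happens, without running Spark: the API docs describe the generated ID as the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The sketch below is a plain-Python simulation of that layout, not Spark code. The helper `simulate_ids` and the two partition layouts are hypothetical and exist only for illustration.

```python
from typing import Dict, List

# Per the Spark API docs: the lower 33 bits hold the record number within
# the partition, and the upper bits hold the partition ID.
RECORD_BITS = 33


def simulate_ids(partitions: List[List[str]]) -> Dict[str, int]:
    """Hypothetical helper: map each row to (partition index << 33) + position."""
    ids: Dict[str, int] = {}
    for partition_index, rows in enumerate(partitions):
        for offset, row in enumerate(rows):
            ids[row] = (partition_index << RECORD_BITS) + offset
    return ids


# Assumed layouts: the same four rows as they might be laid out under two
# different physical plans (for example, before and after a coalesce).
two_partitions = [["a", "b"], ["c", "d"]]
one_partition = [["a", "b", "c", "d"]]

ids_before = simulate_ids(two_partitions)
ids_after = simulate_ids(one_partition)

print(ids_before)
print(ids_after)
```

In this simulation rows "c" and "d" receive different IDs under the two layouts, even though the data is identical. If the ID expression is re-evaluated after the partitioning changes, the same effect would be expected in a real job.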
From a user perspective this behavior is very unexpected, and many things one would normally like to do with an ID column are in fact only possible under very narrow circumstances. The function should either be made deterministic, or there should be a prominent warning note in the API docs regarding its behavior.
Issue Links
- is duplicated by
  - SPARK-17833 'monotonicallyIncreasingId()' should be deterministic (Resolved)
- relates to
  - SPARK-14393 values generated by non-deterministic functions shouldn't change after coalesce or union (Resolved)
  - SPARK-13473 Predicate can't be pushed through project with nondeterministic field (Resolved)