Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Duplicate
Description
Right now it is (IMHO) too easy to shoot yourself in the foot with 'monotonicallyIncreasingId()': it is natural to expect the generated numbers to act as a stable primary key and then to use that key in joins and other downstream operations, but the IDs are not guaranteed to stay attached to the same rows when the DataFrame is re-evaluated.
Is there any reason why this function can't be made deterministic? Or could a deterministic analogue of it be added (e.g. 'withPrimaryKey(columnName = ...)')?
One workaround is to cache / persist the table immediately after calling 'monotonicallyIncreasingId()'; the documentation should perhaps also spell that out loud and clear.
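To illustrate why the generated numbers cannot be trusted as a primary key, here is a minimal sketch in plain Python (no Spark involved). It assumes the ID layout described in the Spark documentation, with the partition index in the upper 31 bits and the row position within the partition in the lower 33 bits. The helper 'assign_ids' and the sample rows are hypothetical and only model the behaviour; they are not Spark's implementation.

```python
# Simplified model of the documented ID layout: partition index in the
# upper 31 bits, row position within the partition in the lower 33 bits.
# assign_ids is a hypothetical helper, not part of any Spark API.
def assign_ids(partitions):
    """Return {row: id} for a list of partitions (each a list of rows)."""
    ids = {}
    for partition_index, rows in enumerate(partitions):
        for position, row in enumerate(rows):
            ids[row] = (partition_index << 33) + position
    return ids


rows = ["a", "b", "c", "d"]

# First evaluation: two partitions holding two rows each.
first = assign_ids([rows[:2], rows[2:]])

# Re-evaluation: the same rows, but laid out differently across partitions.
second = assign_ids([rows[:1], rows[1:]])

# Rows whose ID differs between the two evaluations.
changed = sorted(row for row in rows if first[row] != second[row])

print(first)
print(second)
print(changed)
```

In this model, any row whose partition or position shifts between evaluations gets a different ID, so a join on that ID across two evaluations would pair up the wrong rows. Persisting the table right after generating the IDs is meant to avoid the second evaluation.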
Issue Links
- duplicates SPARK-14241 Output of monotonically_increasing_id lacks stable relation with rows of DataFrame (Resolved)