[SPARK-14241] Output of monotonically_increasing_id lacks stable relation with rows of DataFrame - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.6.0, 1.6.1
Fix Version/s: 2.0.0
Component/s: PySpark, Spark Core
Labels: None

Description

If you use monotonically_increasing_id() to append a column of IDs to a DataFrame, the IDs do not have a stable, deterministic relationship to the rows they are appended to. A given ID value can land on different rows depending on what happens in the task graph:

http://stackoverflow.com/questions/35705038/how-do-i-add-an-persistent-column-of-row-ids-to-spark-dataframe/35706321#35706321

From a user perspective this behavior is very unexpected, and many things one would normally like to do with an ID column are in fact only possible under very narrow circumstances. The function should either be made deterministic, or there should be a prominent warning note in the API docs regarding its behavior.
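One way to see why the IDs are tied to execution layout rather than to the rows: per the Spark API docs, the generated value packs the partition index into the upper 31 bits and the record number within the partition into the lower 33 bits. The sketch below is a plain-Python model of that documented layout, not Spark's own code; `simulated_ids` and the sample rows are hypothetical names used only for illustration.

```python
from typing import Dict, List, Sequence


def simulated_ids(partitions: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Model the documented ID layout: partition index in the upper bits,
    record number within the partition in the lower 33 bits."""
    ids: Dict[str, int] = {}
    for partition_index, rows in enumerate(partitions):
        for record_number, row in enumerate(rows):
            ids[row] = (partition_index << 33) + record_number
    return ids


rows: List[str] = ["a", "b", "c", "d"]

# The same four rows, laid out two different ways.
ids_two_partitions = simulated_ids([rows[:2], rows[2:]])
ids_one_partition = simulated_ids([rows])

for row in rows:
    print(row, ids_two_partitions[row], ids_one_partition[row])
```

In this model, rows "c" and "d" receive different IDs under the two layouts even though the data is identical, so any step that changes how rows are partitioned or ordered before the ID expression is evaluated (or re-evaluated) can change which ID lands on which row.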

Issue Links

is duplicated by

SPARK-17833 'monotonicallyIncreasingId()' should be deterministic

Resolved

relates to

SPARK-14393 values generated by non-deterministic functions shouldn't change after coalesce or union

Resolved

SPARK-13473 Predicate can't be pushed through project with nondeterministic field

Resolved

People

Assignee: Cheng Lian

Reporter: Paul Shearer

Votes: 0

Watchers: 3

Dates

Created: 29/Mar/16 13:50

Updated: 03/Jan/18 14:02

Resolved: 02/Nov/16 19:05