Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-14241

Output of monotonically_increasing_id lacks stable relation with rows of DataFrame

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.0, 1.6.1
    • Fix Version/s: 2.0.0
    • Component/s: PySpark, Spark Core
    • Labels:
      None

      Description

      If you use monotonically_increasing_id() to append a column of IDs to a DataFrame, the IDs do not have a stable, deterministic relationship to the rows they are appended to. A given ID value can land on different rows depending on what happens in the task graph:

      http://stackoverflow.com/questions/35705038/how-do-i-add-an-persistent-column-of-row-ids-to-spark-dataframe/35706321#35706321

      From a user perspective this behavior is very unexpected, and many things one would normally like to do with an ID column are in fact only possible under very narrow circumstances. The function should either be made deterministic, or there should be a prominent warning note in the API docs regarding its behavior.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                lian cheng Cheng Lian
                Reporter:
                pshearer Paul Shearer
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: