Spark / SPARK-14241

Output of monotonically_increasing_id lacks stable relation with rows of DataFrame

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.0, 1.6.1
    • Fix Version/s: 2.0.0
    • Component/s: PySpark, Spark Core
    • Labels: None

    Description

      If you use monotonically_increasing_id() to append a column of IDs to a DataFrame, the IDs do not have a stable, deterministic relationship to the rows they are appended to. A given ID value can land on different rows depending on what happens in the task graph:

      http://stackoverflow.com/questions/35705038/how-do-i-add-an-persistent-column-of-row-ids-to-spark-dataframe/35706321#35706321

      From a user perspective this behavior is very unexpected, and many things one would normally like to do with an ID column are in fact only possible under very narrow circumstances. The function should either be made deterministic, or there should be a prominent warning note in the API docs regarding its behavior.
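
      The instability described above can be illustrated without a cluster. The sketch below is a plain-Python simulation, not Spark code: it assumes the layout stated in the Spark API docs (partition index in the upper 31 bits, record number within the partition in the lower 33 bits), and the helper name assign_ids and the sample rows are hypothetical. It shows that the same row receives a different ID once the rows are partitioned differently, which is what can happen when the task graph is re-evaluated.

```python
from typing import Dict, List


def assign_ids(partitions: List[List[str]]) -> Dict[str, int]:
    """Simulate the documented ID layout for a given partitioning.

    Assumption: ID = (partition index << 33) + record number within
    the partition. This is a hypothetical helper, not a Spark API.
    """
    ids: Dict[str, int] = {}
    for partition_index, rows in enumerate(partitions):
        for record_number, row in enumerate(rows):
            ids[row] = (partition_index << 33) + record_number
    return ids


# Same four rows, two different partitionings.
first_run = assign_ids([["a", "b"], ["c", "d"]])
second_run = assign_ids([["a"], ["b", "c"], ["d"]])

for row in ["a", "b", "c", "d"]:
    print(row, first_run[row], second_run[row])
```

      In this simulation row "b" gets ID 1 under the first partitioning and 8589934592 under the second, while the value 8589934592 moves from row "c" to row "b". A join or lookup keyed on such IDs is therefore only safe if the IDs are materialized once (for example by writing the DataFrame out) before they are reused.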

            People

              Assignee: Cheng Lian
              Reporter: Paul Shearer (pshearer)