Spark / SPARK-14241

Output of monotonically_increasing_id lacks stable relation with rows of DataFrame


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.0, 1.6.1
    • Fix Version/s: 2.0.0
    • Component/s: PySpark, Spark Core
    • Labels: None

    Description

      If you use monotonically_increasing_id() to append a column of IDs to a DataFrame, the IDs do not have a stable, deterministic relationship to the rows they are appended to. A given ID value can land on different rows depending on what happens in the task graph:

      http://stackoverflow.com/questions/35705038/how-do-i-add-an-persistent-column-of-row-ids-to-spark-dataframe/35706321#35706321

      From a user perspective this behavior is very unexpected, and many things one would normally like to do with an ID column are in fact only possible under very narrow circumstances. The function should either be made deterministic, or there should be a prominent warning note in the API docs regarding its behavior.
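To illustrate why an ID can land on different rows, here is a minimal pure-Python sketch, not Spark code. It assumes the layout described in the Spark API docs for monotonically_increasing_id (partition index in the upper bits, record position within the partition in the lower 33 bits). The helper name `simulated_ids` and the sample rows are hypothetical, chosen only for illustration.

```python
# A simplified model of the ID layout, NOT Spark's implementation.
# Assumption: id = (partition index << 33) + position of the row in its partition.

def simulated_ids(partitions):
    """Return a dict mapping each row to the ID it would get under this model."""
    ids = {}
    for partition_index, rows in enumerate(partitions):
        for offset, row in enumerate(rows):
            ids[row] = (partition_index << 33) + offset
    return ids

# The same four rows, split across partitions in two different ways,
# as could happen if the plan is re-evaluated with a different partitioning.
first = simulated_ids([["a", "b"], ["c", "d"]])
second = simulated_ids([["a"], ["b", "c", "d"]])

print(first)
print(second)
```

In this model, row "b" gets ID 1 under the first partitioning and 8589934592 (1 << 33) under the second, so any logic that treats the ID as a stable row key would see different values if the column is recomputed.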


          People

            Assignee: Cheng Lian (lian cheng)
            Reporter: Paul Shearer (pshearer)
            Votes: 0
            Watchers: 3
