  1. Spark
  2. SPARK-14241

Output of monotonically_increasing_id lacks stable relation with rows of DataFrame


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.0, 1.6.1
    • Fix Version/s: 2.0.0
    • Component/s: PySpark, Spark Core
    • Labels:
      None

      Description

      If you use monotonically_increasing_id() to append a column of IDs to a DataFrame, the IDs do not have a stable, deterministic relationship to the rows they are appended to. A given ID value can land on different rows depending on what happens in the task graph:

      http://stackoverflow.com/questions/35705038/how-do-i-add-an-persistent-column-of-row-ids-to-spark-dataframe/35706321#35706321

      From a user perspective this behavior is very unexpected, and many things one would normally like to do with an ID column are in fact only possible under very narrow circumstances. The function should either be made deterministic, or there should be a prominent warning note in the API docs regarding its behavior.
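      The dependence on the task graph can be illustrated without Spark. The sketch below is a pure-Python simulation, not Spark code: it assumes the layout described in the function's API documentation (partition index in the upper bits, record position within the partition in the lower 33 bits), and the helper name `simulated_ids` is hypothetical. It shows that the same four rows receive different IDs once they are laid out across a different set of partitions, which is one way an ID can end up on a different row when the column is recomputed.

      ```python
      def simulated_ids(partitions):
          """Assign IDs the way the documented layout describes (assumption):
          partition index shifted into the upper bits, plus the row's
          position within its partition in the lower 33 bits."""
          return {
              row: (part_idx << 33) + pos
              for part_idx, rows in enumerate(partitions)
              for pos, row in enumerate(rows)
          }

      rows = ["a", "b", "c", "d"]

      # Same rows, two different partition layouts.
      two_parts = simulated_ids([rows[:2], rows[2:]])
      one_part = simulated_ids([rows])

      # Rows whose ID differs between the two layouts.
      changed = [r for r in rows if two_parts[r] != one_part[r]]

      print(two_parts)
      print(one_part)
      print(changed)
      ```

      In this simulation the IDs are monotonically increasing and unique within each layout, but they are only meaningful relative to that layout. Code that needs a stable row-to-ID mapping would have to materialize the ID column once (for example by writing it out) rather than rely on it being recomputed identically.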


            People

            • Assignee:
              Cheng Lian (lian cheng)
            • Reporter:
              Paul Shearer (pshearer)
