Spark / SPARK-15382

monotonicallyIncreasingId doesn't work when data is upsampled

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.1
    • Fix Version/s: 2.0.1, 2.1.0
    • Component/s: SQL
    • Labels: None

    Description

      Ids assigned after upsampling with replacement are not unique:

      from pyspark.sql import Row
      from pyspark.sql.functions import monotonicallyIncreasingId
      
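      # Assumes an existing hiveContext (Spark 1.6.x).
      # Upsample with replacement (fraction 10.0), then assign an id to each row.
      df = hiveContext.createDataFrame([Row(a=1), Row(a=2)])
      df.sample(True, 10.0).withColumn('id', monotonicallyIncreasingId()).collect()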
      

      Output:

      [Row(a=1, id=429496729600),
       Row(a=1, id=429496729600),
       Row(a=1, id=429496729600),
       Row(a=1, id=429496729600),
       Row(a=1, id=429496729600),
       Row(a=1, id=429496729600),
       Row(a=1, id=429496729600),
       Row(a=2, id=867583393792),
       Row(a=2, id=867583393792),
       Row(a=2, id=867583393792),
       Row(a=2, id=867583393792),
       Row(a=2, id=867583393792),
       Row(a=2, id=867583393792),
       Row(a=2, id=867583393792),
       Row(a=2, id=867583393792),
       Row(a=2, id=867583393792),
       Row(a=2, id=867583393792),
       Row(a=2, id=867583393792)]
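
The Spark API documentation describes the generated id as the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The following is a minimal sketch, assuming that documented layout applies to the output above; `decode_id` and `observed_ids` are hypothetical names used only for illustration. It suggests that every duplicated row carries record number 0, so the per-partition counter apparently does not advance for rows repeated by the sampler.

```python
# Bit layout as documented for monotonically increasing ids:
# upper 31 bits = partition ID, lower 33 bits = record number in the partition.
RECORD_BITS = 33


def decode_id(value):
    """Split a generated id into (partition_id, record_number)."""
    partition_id = value >> RECORD_BITS
    record_number = value & ((1 << RECORD_BITS) - 1)
    return partition_id, record_number


# The two distinct id values that appear in the reported output.
observed_ids = [429496729600, 867583393792]
decoded = [decode_id(v) for v in observed_ids]

for value, (partition_id, record_number) in zip(observed_ids, decoded):
    print(value, "-> partition", partition_id, "record", record_number)
# 429496729600 -> partition 50 record 0
# 867583393792 -> partition 101 record 0
```

Under this reading, each source row sits alone in its partition, and all of its sampled copies receive the id of the first record in that partition.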
      

            People

              Assignee: Takeshi Yamamuro (maropu)
              Reporter: Mateusz Buśkiewicz (sixers)
              Votes: 0
              Watchers: 4
