Spark / SPARK-45216

Fix non-deterministic seeded Dataset APIs

Details

    Description

      If we run the following example, the result is two equal columns, as expected:

      import org.apache.spark.sql.functions.rand
      val c = rand() // one Column object, selected twice below
      df.select(c, c).show()
      
      +--------------------------+--------------------------+
      |rand(-4522010140232537566)|rand(-4522010140232537566)|
      +--------------------------+--------------------------+
      |        0.4520819282997137|        0.4520819282997137|
      +--------------------------+--------------------------+
      

       
      But if we use other similar APIs, the result is incorrect, because the two columns built from the same expression differ:

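      import org.apache.spark.sql.functions.{col, random, shuffle, uuid}
      val r1 = random()
      val r2 = uuid()
      val r3 = shuffle(col("x"))
      // each Column is selected twice, so each pair should be identical
      val x = df.select(r1, r1, r2, r2, r3, r3)
      x.show()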
      
      +------------------+------------------+--------------------+--------------------+----------+----------+
      |            rand()|            rand()|              uuid()|              uuid()|shuffle(x)|shuffle(x)|
      +------------------+------------------+--------------------+--------------------+----------+----------+
      |0.7407604956381952|0.7957319451135009|e55bc4b0-74e6-4b0...|a587163b-d06b-4bb...| [1, 2, 3]| [2, 1, 3]|
      +------------------+------------------+--------------------+--------------------+----------+----------+
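
The difference between the two outputs can be illustrated with a toy model that does not use Spark at all. It contrasts an expression object that carries a fixed seed (as the `rand(-4522010140232537566)` column name suggests for `rand()`) with one that draws a fresh seed every time it is used. The names `Expr`, `SeededAtCreation`, `SeededAtUse`, and `SeedDemo` are hypothetical, and this sketch is an assumption about the general mechanism, not a description of Spark's actual implementation.

```scala
import scala.util.Random

// Hypothetical stand-ins for column expressions; these are not Spark classes.
sealed trait Expr { def eval(): Double }

// The seed is fixed when the object is built, so every use yields the same value.
final case class SeededAtCreation(seed: Long) extends Expr {
  def eval(): Double = new Random(seed).nextDouble()
}

// A fresh seed is drawn on every use, so two uses of the same object
// will almost always disagree.
case object SeededAtUse extends Expr {
  def eval(): Double = new Random(Random.nextLong()).nextDouble()
}

object SeedDemo {
  // Evaluates the same expression object twice, analogous to df.select(c, c).
  def selectTwice(e: Expr): (Double, Double) = (e.eval(), e.eval())

  def main(args: Array[String]): Unit = {
    val (a, b) = selectTwice(SeededAtCreation(Random.nextLong()))
    println(a == b) // true: same seed, same value

    val (c, d) = selectTwice(SeededAtUse)
    println(s"$c vs $d") // values are random and very likely differ
  }
}
```

In this model, making the second kind of expression behave like the first means choosing the seed once, at the point where the expression object is created, and reusing it wherever that object appears.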
      

            People

              petertoth Peter Toth