Spark / SPARK-48184

Always set the seed of DataFrame.sample on the client side


Details

    Description

      The DataFrame returned by `sample` is not immutable in Spark Connect: when no seed is passed, repeated actions on the same DataFrame return different results.

       

      In Spark Classic:

      In [1]: df = spark.range(10000).sample(0.1)
      In [2]: [df.count() for i in range(10)]
      Out[2]: [1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006]

       

      In Spark Connect:

      In [1]: df = spark.range(10000).sample(0.1)
      In [2]: [df.count() for i in range(10)]
      Out[2]: [969, 1005, 958, 996, 987, 1026, 991, 1020, 1012, 979]
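
The difference between the two sessions above can be illustrated with a small pure-Python model. This is only a sketch under assumptions, not Spark's implementation: `SamplePlan`, `sample_count`, and the `fix_seed_on_client` flag are hypothetical names invented for illustration. The idea is that if the seed is chosen once when the plan is built, every execution of that plan sees the same sample, whereas a plan that carries no seed gets a fresh one each time it runs.

```python
import random
import sys


def sample_count(n, fraction, seed):
    """Bernoulli-sample range(n) with the given seed and return the row count."""
    rng = random.Random(seed)
    return sum(1 for _ in range(n) if rng.random() < fraction)


class SamplePlan:
    """Toy stand-in for a sample plan node (hypothetical, not a Spark class)."""

    def __init__(self, n, fraction, seed=None, fix_seed_on_client=True):
        self.n = n
        self.fraction = fraction
        if seed is None and fix_seed_on_client:
            # Pick the seed once, at plan-construction time, so the plan
            # itself fully determines the sample.
            seed = random.randint(0, sys.maxsize)
        self.seed = seed

    def count(self):
        # A plan without a stored seed draws a new one on every execution,
        # which is what makes repeated counts disagree.
        seed = self.seed if self.seed is not None else random.randint(0, sys.maxsize)
        return sample_count(self.n, self.fraction, seed)


fixed = SamplePlan(10000, 0.1)
print([fixed.count() for _ in range(5)])    # same value repeated

unfixed = SamplePlan(10000, 0.1, fix_seed_on_client=False)
print([unfixed.count() for _ in range(5)])  # values typically differ
```

Until the seed is always set on the client, passing an explicit seed, for example `spark.range(10000).sample(fraction=0.1, seed=42)`, should pin the sample in the same way.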
       

       

       


            People

              podongfeng Ruifeng Zheng
