  1. Spark
  2. SPARK-16686

Dataset.sample with seed: result seems to depend on downstream usage

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.2, 2.0.0
    • Fix Version/s: 2.0.1, 2.1.0
    • Component/s: SQL
    • Labels: None
    • Environment: Spark 1.6.2 and Spark 2.0 - RC4, Standalone, single-worker cluster

    Description

      Summary to reproduce bug:

      • Create a DataFrame DF, and sample it with a fixed seed.
      • Collect that DataFrame -> result1
      • Call a particular UDF on that DataFrame -> result2

      You would expect result1 and result2 to be built from the same sampled rows of DF, but they appear not to be.
      Note: result1 and result2 are each deterministic.

      See the attached notebook for details. Cells in the notebook were executed in order.
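
The expectation in the summary above can be sketched in plain Python. This is not Spark code and does not reproduce the bug; it only models the property the report expects to hold. The `sample` helper, the `my_udf` function, the fraction and the seed are all hypothetical stand-ins, since the report does not say which UDF or parameters the notebook used.

```python
import random


def sample(rows, fraction, seed):
    """Keep each row with probability `fraction`, using a seeded generator."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


def my_udf(value):
    # Hypothetical stand-in for the UDF in the report; any pure function works.
    return value * 10


df = list(range(100))                        # stand-in for DataFrame DF
sampled = sample(df, fraction=0.1, seed=42)  # sample once with a fixed seed

result1 = sampled                                  # "collect" the sample
result2 = [(row, my_udf(row)) for row in sampled]  # apply the UDF downstream

# Expected property: both results are built from the same sampled rows.
rows_in_result2 = [row for row, _ in result2]
print(rows_in_result2 == result1)
```

In this model the sample is fixed before any downstream step runs, so the comparison holds by construction. The report describes Spark behaving as if that were not the case.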

      Attachments

        1. DataFrame.sample bug - 2.0.html
          71 kB
          Joseph K. Bradley

            People

              Assignee: L. C. Hsieh (viirya)
              Reporter: Joseph K. Bradley (josephkb)
              Votes: 0
              Watchers: 4
