  Spark / SPARK-11303

sample (without replacement) + filter returns wrong results in DataFrame

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5.1
    • Fix Version/s: 1.5.2, 1.6.0
    • Component/s: SQL
    • Labels: None
    • Environment: PySpark local mode, Linux.

    Description

      When sampling and then filtering a DataFrame from Python, we get inconsistent results unless the sampled DataFrame is cached. This bug doesn't appear in Spark 1.4.1.

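      # Assumes the pyspark shell, where `sc` and `sqlContext` are predefined.
      # 100 rows: 50 with t = 1 and 50 with t = 2.
      d = sqlContext.createDataFrame(sc.parallelize([[1]] * 50 + [[2]] * 50), ['t'])
      # Sample without replacement: fraction 0.1, seed 1.
      d_sampled = d.sample(False, 0.1, 1)
      print(d_sampled.count())
      print(d_sampled.filter('t = 1').count())
      print(d_sampled.filter('t != 1').count())
      # Cache the sample, then repeat the same three counts.
      d_sampled.cache()
      print(d_sampled.count())
      print(d_sampled.filter('t = 1').count())
      print(d_sampled.filter('t != 1').count())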
      

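      Output (before caching, the two filtered counts sum to 7 + 8 = 15 rather than 14; after caching, 7 + 7 = 14):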

      14
      7
      8
      14
      7
      7
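
The inconsistency can be checked mechanically: a filter and its negation should split the sample into two parts whose counts add up to the total. Below is a minimal sketch of such a check. The helper name `filter_counts_are_consistent` is hypothetical and not part of Spark; it only assumes that `df` exposes `count()` and `filter(sql_string)` as used in the report, and that the filtered column contains no nulls (rows with a null value match neither branch).

```python
def filter_counts_are_consistent(df, condition):
    """Compare a DataFrame's total count with the counts of a filter
    and of its negation.

    Hypothetical helper (not a Spark API). `df` only needs count() and
    filter(sql_expression_string). Assumes the filtered column has no
    nulls; a null row would satisfy neither the condition nor its negation.

    Returns (total, matching, non_matching, ok).
    """
    total = df.count()
    matching = df.filter(condition).count()
    # Negate the SQL expression instead of hand-writing the complement.
    non_matching = df.filter("NOT ({0})".format(condition)).count()
    return total, matching, non_matching, matching + non_matching == total
```

In the pyspark shell this could be called as `filter_counts_are_consistent(d_sampled, 't = 1')` both before and after `d_sampled.cache()`; the last element of the result is expected to be `True` when sampling and filtering agree.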
      

          People

            Assignee: Yanbo Liang (yanboliang)
            Reporter: Yuval Tanny (yuvalt)