Spark / SPARK-4963

SchemaRDD.sample may return wrong results

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.0
    • Fix Version/s: 1.3.0
    • Component/s: SQL
    • Labels: None

    Description

      The issue can be easily reproduced in an sbt/sbt hive/console session:

      sql("SELECT * FROM src WHERE key % 2 = 0").
        sample(withReplacement = false, fraction = 0.05).
        registerTempTable("sampled")
      
      println(table("sampled").queryExecution)
      
      val query = sql("SELECT * FROM sampled WHERE key % 2 = 1")
      println(query.queryExecution)
      
      // Should print `true'
      println((1 to 10).map(_ => query.collect().isEmpty).reduce(_ && _))
      

      Notice that when fraction is less than 0.4, GapSamplingIterator is used to do the sampling. My guess is that this has something to do with the underlying mutable row objects used in HiveTableScan, but I haven't figured out the root cause.
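
To make the suspicion above concrete, here is a minimal sketch of how a reused mutable row can go wrong under a skip-ahead sampler. It is plain Scala and does not use Spark. MutableRow, scan, skipAhead and sampledKeys are hypothetical names invented for this illustration, and the sketch assumes that the sampler skips ahead before handing an element back. It shows the general hazard only and is not a confirmed root cause.

```scala
object MutableRowReuseSketch {
  // Hypothetical stand-in for a mutable row; not Spark's actual row class.
  final class MutableRow(var key: Int) {
    def copy(): MutableRow = new MutableRow(key)
  }

  // A scan that reuses ONE row object for every record to avoid allocation.
  def scan(keys: Seq[Int]): Iterator[MutableRow] = {
    val shared = new MutableRow(0)
    keys.iterator.map { k => shared.key = k; shared }
  }

  // Simplified, deterministic stand-in for a gap sampler (assumption):
  // it fetches an element, skips `gap` elements, and only then returns
  // the element it fetched first.
  def skipAhead[T](data: Iterator[T], gap: Int): Iterator[T] = new Iterator[T] {
    def hasNext: Boolean = data.hasNext
    def next(): T = {
      val r = data.next()
      var i = 0
      while (i < gap && data.hasNext) { data.next(); i += 1 }
      r // if `r` is the shared row, the skipped records have overwritten it
    }
  }

  // Collects the keys of the sampled rows, optionally copying each row first.
  def sampledKeys(keys: Seq[Int], gap: Int, copyRows: Boolean): List[Int] = {
    val rows = if (copyRows) scan(keys).map(_.copy()) else scan(keys)
    skipAhead(rows, gap).map(_.key).toList
  }
}

// Without copying, the sampled rows carry the keys of the skipped records:
// MutableRowReuseSketch.sampledKeys(Seq(0, 1, 2, 3, 4, 5), 1, copyRows = false)  // List(1, 3, 5)
// With a defensive copy, the intended records come back:
// MutableRowReuseSketch.sampledKeys(Seq(0, 1, 2, 3, 4, 5), 1, copyRows = true)   // List(0, 2, 4)
```

If the hypothesis holds, copying each row before it reaches the sampler would be one way to avoid the overwrite, at the cost of an extra allocation per row.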

          People

            Assignee: Yanbo Liang
            Reporter: Cheng Lian
            Votes: 1
            Watchers: 5
