[SPARK-17752] Spark returns incorrect result when 'collect()'ing a cached Dataset with many columns


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.0.1, 2.1.0
    • Component/s: SparkR
    • Labels: None

    Description

      Run the following code (modify SPARK_HOME to point to a Spark 2.0.0 installation as necessary):

      SPARK_HOME <- path.expand("~/Library/Caches/spark/spark-2.0.0-bin-hadoop2.7")
      Sys.setenv(SPARK_HOME = SPARK_HOME)
      
      library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
      sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))
      
      # Build a one-row data.frame with 1000 integer columns, all 1L.
      n <- 1E3
      df <- as.data.frame(replicate(n, 1L, simplify = FALSE))
      names(df) <- paste("X", 1:n, sep = "")
      
      # Round-trip through Spark: convert, cache, then collect back to R.
      tbl <- as.DataFrame(df)
      cache(tbl) # works fine without this
      cl <- collect(tbl)
      
      identical(df, cl) # FALSE
      

      Although this is reproducible with SparkR, it seems more likely that this is an error in the Java / Scala Spark sources.
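      Since the report suggests the fault lies on the JVM side, an equivalent cache-then-collect round trip can be sketched directly against the Scala API. This is a sketch, not taken from the report: the SparkSession setup, the X1..Xn column names, and the foldLeft construction of the wide DataFrame are assumptions, and it needs a Spark 2.0.0 installation (e.g. run in spark-shell) to reproduce.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("SPARK-17752-repro")
  .getOrCreate()
import spark.implicits._

// One row, 1000 integer columns of constant 1, mirroring the SparkR example.
val n = 1000
val base = Seq(1).toDF("X1")
val wide = (2 to n).foldLeft(base)((d, i) => d.withColumn(s"X$i", lit(1)))

wide.cache() // the report indicates the result is correct without this
val row = wide.collect().head

// Count columns that did not survive the cached round trip intact;
// if the cached scan is correct this should print 0.
println((0 until n).count(i => row.getInt(i) != 1))
```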

      For posterity:

      > sessionInfo()
      R version 3.3.1 Patched (2016-07-30 r71015)
      Platform: x86_64-apple-darwin13.4.0 (64-bit)
      Running under: macOS Sierra (10.12)


          People

            Assignee: Shivaram Venkataraman (shivaram)
            Reporter: Kevin Ushey (kevinushey)
            Votes: 0
            Watchers: 2
