[SPARK-17752] Spark returns incorrect result when 'collect()'ing a cached Dataset with many columns


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.0.1, 2.1.0
    • Component/s: SparkR
    • Labels: None

    Description

      Run the following code (modify SPARK_HOME to point to a Spark 2.0.0 installation as necessary):

      SPARK_HOME <- path.expand("~/Library/Caches/spark/spark-2.0.0-bin-hadoop2.7")
      Sys.setenv(SPARK_HOME = SPARK_HOME)
      
      library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
      sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))
      
      # Build a one-row data.frame with 1000 integer columns, all 1L.
      n <- 1E3
      df <- as.data.frame(replicate(n, 1L, simplify = FALSE))
      names(df) <- paste("X", 1:n, sep = "")
      
      # Round-trip through Spark: convert, cache, then collect back to R.
      tbl <- as.DataFrame(df)
      cache(tbl) # works fine without this
      cl <- collect(tbl)
      
      identical(df, cl) # FALSE
      

      Although this is reproducible with SparkR, it seems more likely that this is an error in the Java / Scala Spark sources.
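      Since the report suggests the fault lies on the JVM side, an equivalent cache-then-collect round trip can be sketched directly against the Scala API. This is a sketch, not taken from the report: the SparkSession setup, the X1..Xn column names, and the foldLeft construction of the wide DataFrame are assumptions, and it needs a Spark 2.0.0 installation (e.g. run in spark-shell) to reproduce.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("SPARK-17752-repro")
  .getOrCreate()
import spark.implicits._

// One row, 1000 integer columns of constant 1, mirroring the SparkR example.
val n = 1000
val base = Seq(1).toDF("X1")
val wide = (2 to n).foldLeft(base)((d, i) => d.withColumn(s"X$i", lit(1)))

wide.cache() // the report indicates the result is correct without this
val row = wide.collect().head

// Count columns that did not survive the cached round trip intact;
// if the cached scan is correct this should print 0.
println((0 until n).count(i => row.getInt(i) != 1))
```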

      For posterity:

      > sessionInfo()
      R version 3.3.1 Patched (2016-07-30 r71015)
      Platform: x86_64-apple-darwin13.4.0 (64-bit)
      Running under: macOS Sierra (10.12)


          People

            Assignee: Shivaram Venkataraman (shivaram)
            Reporter: Kevin Ushey (kevinushey)
            Votes: 0
            Watchers: 2
