Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Invalid
- Affects Version: 3.1.2
Description
I was looking into a question on StackOverflow where a user had an error due to a mismatch of field types (https://stackoverflow.com/questions/68032084/how-to-fix-error-in-readbin-using-arrow-package-with-sparkr/). I've made a reproducible example below.
library(SparkR)
library(arrow)

SparkR::sparkR.session(sparkConfig = list(spark.sql.execution.arrow.sparkr.enabled = "true"))

readr::write_csv(iris, "iris.csv")

dfSchema <- structType(
  structField("Sepal.Length", "int"),
  structField("Sepal.Width", "double"),
  structField("Petal.Length", "double"),
  structField("Petal.Width", "double"),
  structField("Species", "string")
)

spark_df <- SparkR::read.df(path = "iris.csv", source = "csv", schema = dfSchema)

# Apply an R native function to each partition.
returnSchema <- structType(structField("Sep_add_one", "int"))
df <- SparkR::dapply(spark_df, function(rdf) {
  data.frame(rdf$Sepal.Length + 1)
}, returnSchema)

collect(df)
It results in this error:
Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : invalid 'n' argument
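For what it's worth, that message looks like the one readBin() itself raises when it is handed an invalid count, e.g. when the length value it is given comes back NA or negative. A minimal standalone sketch of that mechanism (my assumption about where the message comes from, not SparkR code):

con <- rawConnection(as.raw(c(0x01, 0x02, 0x03)))
dataLen <- NA_real_  # stand-in for a corrupted / mis-read length prefix
try(readBin(con, raw(), as.integer(dataLen), endian = "big"))
# Error in readBin(con, raw(), as.integer(dataLen), endian = "big") :
#   invalid 'n' argument
close(con)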
I've never used SparkR before, so I could be wrong, but I think it may be due to this line in DataFrame.R:
arrowTable <- arrow::read_ipc_stream(readRaw(conn))
readRaw assumes that the value is an integer; however, while returnSchema declares an int column, the function actually produces a double (the code runs without error if it's updated to `rdf$Sepal.Length + 1L`).
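To make the mismatch concrete, here are the two variants I'd expect to line up the declared return type with what the function actually produces. Only (a) is the one I actually verified; (b) is the equivalent change on the schema side and is an untested assumption on my part:

# (a) keep returnSchema as "int" and return an actual integer
returnSchema <- structType(structField("Sep_add_one", "int"))
df <- SparkR::dapply(spark_df, function(rdf) {
  data.frame(rdf$Sepal.Length + 1L)
}, returnSchema)

# (b) or declare the column as "double" and keep the original `+ 1` (untested)
returnSchema <- structType(structField("Sep_add_one", "double"))
df <- SparkR::dapply(spark_df, function(rdf) {
  data.frame(rdf$Sepal.Length + 1)
}, returnSchema)

collect(df)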
I wasn't sure whether readTypedObject needs to be used here, or whether some type checking (something like the sketch at the end of this description) might be useful?
Apologies for the partial explanation - not entirely sure of the cause.
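To clarify what I mean by type checking, here is a purely illustrative sketch in plain base R (not existing SparkR code) of validating the data.frame a dapply function returns against the declared column types before it gets serialized, so a mismatch fails with a clear message instead of surfacing later in readBin():

# declaredTypes: character vector of Spark SQL type names in column order,
# e.g. c("int") to mirror returnSchema above
checkColumnTypes <- function(rdf, declaredTypes) {
  # hypothetical mapping from a few Spark SQL type names to base R predicates
  predicates <- list(
    int    = is.integer,
    double = is.double,
    string = is.character
  )
  for (i in seq_along(declaredTypes)) {
    check <- predicates[[declaredTypes[[i]]]]
    if (!is.null(check) && !check(rdf[[i]])) {
      stop(sprintf("column %d is '%s' but the schema declares '%s'",
                   i, class(rdf[[i]])[1], declaredTypes[[i]]))
    }
  }
  invisible(TRUE)
}

# e.g. this would fail loudly, since `+ 1` yields a double but "int" is declared:
# checkColumnTypes(data.frame(iris$Sepal.Length + 1), c("int"))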