Description
collect() returns a double (numeric) column instead of POSIXct for a timestamp column when an NA appears at the top of the column.
The following code and output show how the bug can be reproduced:
> sparkR.session(master = "local")
Spark package found in SPARK_HOME: /home/titicaca/spark-2.1
Launching java with spark-submit command /home/titicaca/spark-2.1/bin/spark-submit sparkr-shell /tmp/RtmpqmpZUg/backend_port363a898be92
Java ref type org.apache.spark.sql.SparkSession id 1
> df <- data.frame(col1 = c(0, 1, 2),
+                  col2 = c(as.POSIXct("2017-01-01 00:00:01"), NA, as.POSIXct("2017-01-01 12:00:01")))
> sdf1 <- createDataFrame(df)
> print(dtypes(sdf1))
[[1]]
[1] "col1"   "double"
[[2]]
[1] "col2"      "timestamp"
> df1 <- collect(sdf1)
> print(lapply(df1, class))
$col1
[1] "numeric"
$col2
[1] "POSIXct" "POSIXt"
> sdf2 <- filter(sdf1, "col1 > 0")
> print(dtypes(sdf2))
[[1]]
[1] "col1"   "double"
[[2]]
[1] "col2"      "timestamp"
> df2 <- collect(sdf2)
> print(lapply(df2, class))
$col1
[1] "numeric"
$col2
[1] "numeric"
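As we can see, col2 is unexpectedly converted to numeric in the collected local data frame df2, even though dtypes(sdf2) still reports it as timestamp. The filter leaves the NA row first, so the column starts with NA.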
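The same symptom can be reproduced in base R without Spark, which may help when narrowing down the cause. The sketch below relies only on the documented behavior that c() dispatches on its first argument. Whether collect() actually combines column values this way is an assumption, not something this report confirms. The variable names (ts, na_first, ts_first, restored) are illustrative, and the final conversion is only a possible local workaround, not a fix.

```r
# Base R only; no Spark session is needed for this illustration.
ts <- as.POSIXct("2017-01-01 12:00:01", tz = "UTC")

# c() dispatches on its first argument. With a leading logical NA the
# POSIXct method is not selected, so the class attribute is dropped.
na_first <- c(NA, ts)
ts_first <- c(ts, NA)

print(class(na_first))  # "numeric"
print(class(ts_first))  # "POSIXct" "POSIXt"

# Possible workaround on an already collected column (assumption: the
# numeric values are seconds since the Unix epoch).
restored <- as.POSIXct(na_first, origin = "1970-01-01", tz = "UTC")
print(class(restored))  # "POSIXct" "POSIXt"
```

Applied to the example above, the equivalent call would be as.POSIXct(df2$col2, origin = "1970-01-01"), again assuming the collected values are epoch seconds.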