[SPARK-42005] SparkR cannot collect dataframe with NA in a date column along with another timestamp column - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Resolved
Affects Version/s: 3.3.0
Fix Version/s: 3.5.0
Component/s: SparkR
Labels:
None

Language:
- R
- scala

Description

This issue appears to be related to https://issues.apache.org/jira/browse/SPARK-17811, which was resolved by https://github.com/apache/spark/pull/15421.

If a Spark dataframe contains a column of data type `date` that is entirely NA, together with another column of data type `timestamp`, SparkR cannot collect it into an R dataframe.

The reproducible code snippet is below.

df <- data.frame(x = as.Date(NA), y = as.POSIXct("2022-01-01"))
SparkR::collect(SparkR::createDataFrame(df))

#> Error in handleErrors(returnStatus, conn): org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 25.0 failed 1 times, most recent failure: Lost task 0.0 in stage 25.0 (TID 25) (ip-10-172-210-194.us-west-2.compute.internal executor driver): java.lang.IllegalArgumentException: Invalid type N
#> at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:94)
#> at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:68)
#> at org.apache.spark.sql.api.r.SQLUtils$.$anonfun$bytesToRow$1(SQLUtils.scala:129)
#> at org.apache.spark.sql.api.r.SQLUtils$.$anonfun$bytesToRow$1$adapted(SQLUtils.scala:128)
#> at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
#> at scala.collection.immutable.Range.foreach(Range.scala:158)
#> ...

This issue does not appear if the `date` column is not missing, or if there is no other column of data type `timestamp`.

df <- data.frame(x = as.Date("2022-01-01"), y = as.POSIXct("2022-01-01"))
SparkR::collect(SparkR::createDataFrame(df))

#>            x          y
#> 1 2022-01-01 2022-01-01

df <- data.frame(x = as.Date(NA), y = as.character("2022-01-01"))
SparkR::collect(SparkR::createDataFrame(df))

#>      x          y
#> 1 <NA> 2022-01-01
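
The failing condition described above can be checked on the local R data frame before it is handed to `createDataFrame`. The following is only a minimal sketch in base R: the helper name `has_all_na_date_with_timestamp` is hypothetical and not part of SparkR, and it assumes that `Date` columns map to `date` and `POSIXct` columns map to `timestamp`, as in the examples in this report.

```r
# Hypothetical helper (not part of SparkR): flags local data frames that match
# the failing pattern described in this report.
has_all_na_date_with_timestamp <- function(df) {
  # TRUE for each non-empty Date column in which every value is NA
  all_na_date <- vapply(
    df,
    function(col) inherits(col, "Date") && length(col) > 0 && all(is.na(col)),
    logical(1)
  )
  # TRUE for each POSIXct column (assumed to correspond to `timestamp`)
  is_timestamp <- vapply(df, function(col) inherits(col, "POSIXct"), logical(1))
  any(all_na_date) && any(is_timestamp)
}

# The three data frames used in this report
failing  <- data.frame(x = as.Date(NA), y = as.POSIXct("2022-01-01"))
passing1 <- data.frame(x = as.Date("2022-01-01"), y = as.POSIXct("2022-01-01"))
passing2 <- data.frame(x = as.Date(NA), y = as.character("2022-01-01"))

has_all_na_date_with_timestamp(failing)   # TRUE
has_all_na_date_with_timestamp(passing1)  # FALSE
has_all_na_date_with_timestamp(passing2)  # FALSE
```

The check only mirrors the pattern reported here; it does not inspect SparkR's serialization and may not cover every input that triggers the error.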

Issue Links

is related to
- SPARK-17811 SparkR cannot parallelize data.frame with NA or NULL in Date columns (Resolved)
- SPARK-18011 SparkR serialize "NA" throws exception (Resolved)

People

Assignee: Unassigned
Reporter: Vivek Atal
Votes: 0
Watchers: 2

Dates

Created: 12/Jan/23 03:56
Updated: 31/Jan/23 06:26
Resolved: 31/Jan/23 05:05