Details
Type: Bug
Status: Closed
Priority: Minor
Resolution: Information Provided
Affects Version/s: 6.0.0
Fix Version/s: None
Component/s: None
Environment: Ubuntu focal, R 4.1.1, RStudio 1.4.1772
Description
When we write a dataframe column of type `<dttm>` to parquet using the `arrow` package, reading the parquet file back into a dataframe returns a slightly different value.
This behaviour does not replicate with columns of type `<double>`.
Reprex:
```r
library(arrow)
library(dplyr)

# Create sample dataframe
n <- 1631494810.376999855041503906250000000000000000000000000000000000
df <- data.frame(x = "a", n = n, t = as.POSIXct(n, origin = "1970-01-01"))

# Write to disk
df %>% write_parquet("/tmp/tmp.parquet")

# Extract time-based cols
dft <- df %>% filter(x == "a") %>% pull(t) %>% as.numeric
pqt <- read_parquet("/tmp/tmp.parquet") %>% filter(x == "a") %>% pull(t) %>% as.numeric
dft == pqt
sprintf("%.54f", dft)
sprintf("%.54f", pqt)

# Extract numeric cols
dfn <- df %>% filter(x == "a") %>% pull(n) %>% as.numeric
pqn <- read_parquet("/tmp/tmp.parquet") %>% filter(x == "a") %>% pull(n) %>% as.numeric
dfn == pqn
sprintf("%.54f", dfn)
sprintf("%.54f", pqn)
```
The critical issue is that `dft == pqt` returns `FALSE` while `dfn == pqn` returns `TRUE`.
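The gap is consistent with `arrow` representing timestamps as an integer count of microseconds, so sub-microsecond detail in the underlying double is dropped on the round trip. A minimal sketch of this hypothesis, assuming the reprex above has been run (truncation to whole microseconds is an assumption; the exact rounding mode is not confirmed here):

```r
# If arrow keeps only whole microseconds, truncating the original value
# to microsecond resolution should reproduce the round-tripped value.
trunc(dft * 1e6) / 1e6 == pqt
```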
Why is this a problem? We use `arrow` to store dataframes to disk. When we want to update these parquet files, we first check whether any data has actually changed, and we have tripwires in place so that if a significant proportion of the data has changed, the pipeline fails and is flagged for manual review.
With the current behaviour described above, all of the dataframes that contain `<dttm>`-type columns fail this check.
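A possible workaround for the tripwire (our suggestion, not part of the ticket's resolution) is to compare `<dttm>` columns with a sub-microsecond tolerance rather than exact equality. A minimal sketch using the reprex values:

```r
# Treat timestamps as unchanged when they agree to within one
# microsecond, i.e. below the resolution the timestamps appear
# to be stored at.
abs(dft - pqt) < 1e-6
```

Base R's `all.equal()` offers a similar tolerance-based comparison if an element-wise check is not required.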