Apache Arrow / ARROW-16010

[R] write_parquet alters <dttm> value


Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Information Provided
    • Affects Version/s: 6.0.0
    • Fix Version/s: None
    • Component/s: R
    • Labels: None
    • Environment: Ubuntu focal, R 4.1.1, RStudio 1.4.1772

    Description

      When we write a dataframe column of type `<dttm>` to parquet using the arrow package and then read the parquet file back into a dataframe, the value returned differs slightly from the original.

      This behaviour does not occur with columns of type `<double>`.

      Reprex:

      
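      #Load the packages the reprex relies on
      library(arrow)   # write_parquet(), read_parquet()
      library(dplyr)   # %>%, filter(), pull()

      #Create sample dataframe
      n <-  1631494810.376999855041503906250000000000000000000000000000000000
      df <- data.frame(x = "a",
                       n = n,
                       t = as.POSIXct(n, origin = "1970-01-01"))
      #Write to disk
      df %>% write_parquet("/tmp/tmp.parquet")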
      
      
      #Extract time-based cols
      dft <- df %>% 
        filter(x == "a") %>% 
        pull(t) %>% 
        as.numeric 
      
      pqt <- read_parquet("/tmp/tmp.parquet") %>% 
        filter(x == "a") %>% 
        pull(t) %>% 
        as.numeric 
      dft == pqt
      sprintf("%.54f",dft)
      sprintf("%.54f",pqt)
      
      #Extract numeric cols
      dfn <- df %>% 
        filter(x == "a") %>% 
        pull(n) %>% 
        as.numeric 
      
      pqn <- read_parquet("/tmp/tmp.parquet") %>% 
        filter(x == "a") %>% 
        pull(n) %>% 
        as.numeric 
      dfn == pqn
      sprintf("%.54f",dfn)
      sprintf("%.54f",pqn) 

       

      The critical issue is that `dft == pqt` returns `FALSE` while `dfn == pqn` returns `TRUE`.

       

      Why is this a problem? We use `arrow` to store dataframes to disk. Before updating these parquet files, we first check whether any data has actually changed. Tripwires ensure that, if a significant proportion of the data has changed, the pipeline fails and is flagged for manual review.

       

      With the current behaviour described above, this check fails for every dataframe that contains a `<dttm>` column.
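
      The change check described above could, as a stopgap, compare `<dttm>` columns with a tolerance instead of with `==`. The following is a minimal base R sketch, not the pipeline's actual code: the helper name `prop_changed()` and the tolerance of 1e-5 seconds are hypothetical, and the "round trip" is only simulated by truncating to whole microseconds, which is an assumption made for illustration rather than a statement about what arrow does.

```r
# Hypothetical helper: proportion of values that differ by more than
# `tol` seconds between two equally long vectors (POSIXct or numeric).
prop_changed <- function(old, new, tol = 1e-5) {
  stopifnot(length(old) == length(new))
  diffs <- abs(as.numeric(new) - as.numeric(old))
  mean(diffs > tol)
}

n <- 1631494810.376999855
original <- as.POSIXct(n, origin = "1970-01-01", tz = "UTC")

# Simulated round trip: keep only whole microseconds (illustrative assumption).
round_trip <- as.POSIXct(floor(n * 1e6) / 1e6,
                         origin = "1970-01-01", tz = "UTC")

exact_match  <- original == round_trip              # strict comparison
changed_frac <- prop_changed(original, round_trip)  # tolerant comparison
```

      With a tolerance, sub-microsecond differences no longer count as changed rows, while setting `tol = 0` reproduces the strict behaviour of `==`.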

          People

            Assignee: Unassigned
            Reporter: Riaz Arbi (riazarbi)