Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-3543

[R] Better support for timestamp format and time zones in R

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • None
    • 0.17.0
    • R

    Description

      See below for original description and reports. In sum, there is a mismatch between how the C++ library and R interpret data without a timezone, and it turns out that we're not passing the timezone to R if it is set in Arrow C++ anyway.

      The C++ library docs say "If a timezone-aware field contains a recognized timezone, its values may be localized to that locale upon display; the values of timezone-naive fields must always be displayed “as is”, with no localization performed on them." But R's print default, as well as the parsing default, is the current time zone: https://stat.ethz.ch/R-manual/R-devel/library/base/html/strptime.html

      The C++ library seems to parse timestamp strings that don't have timezone information as if they are UTC, so when you read timezone-naive timestamps from Arrow and print them in R, they are shifted to be localized to the current timezone. If you print timestamp data from Arrow with print(timestamp_var, tz="GMT") it would look as you expect.

      On further inspection, the arrow-to-vector code for timestamp doesn't seem to consider time zone information even if it does exist. So we don't have the means currently in R to display timestamp data faithfully, whether or not it is timezone-aware.

      Among the tasks here:

      • Include the timezone attribute in the POSIXct R vector that gets created from a timestamp Arrow array
      • Ensure that timezone-naive data from Arrow is printed in R "as is" with no localization

      Original description:

      Hello the dream team,

      Pasting from https://github.com/wesm/feather/issues/351

      Thanks for this wonderful package. I was playing with feather and some timestamps and I noticed some dangerous behavior. Maybe it is a bug.

      Consider this

       

      import pandas as pd
      import feather
      import numpy as np
      df = pd.DataFrame(
      {'string_time_utc' : [pd.to_datetime('2018-02-01 14:00:00.531'), pd.to_datetime('2018-02-01 14:01:00.456'), pd.to_datetime('2018-03-05 14:01:02.200')]}
      )
      df['timestamp_est'] = pd.to_datetime(df.string_time_utc).dt.tz_localize('UTC').dt.tz_convert('US/Eastern').dt.tz_localize(None)
      df
       Out[17]: 
       string_time_utc timestamp_est
       0 2018-02-01 14:00:00.531 2018-02-01 09:00:00.531
       1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456
       2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200
      

      Here I create the corresponding `EST` timestamp of my original timestamps (in `UTC` time).

      Now saving the dataframe to `csv` or to `feather` will generate two completely different results.

       

      df.to_csv('P://testing.csv')
      df.to_feather('P://testing.feather')
      

      Switching to R.

      Using the good old `csv` gives me something a bit annoying, but expected. R thinks my timezone is `UTC` by default, and wrongly attached this timezone to `timestamp_est`. No big deal, I can always use `with_tz` or even better: import as character and process as timestamp while in R.

       

      > dataframe <- read_csv('P://testing.csv')
       Parsed with column specification:
       cols(
       X1 = col_integer(),
       string_time_utc = col_datetime(format = ""),
       timestamp_est = col_datetime(format = "")
       )
       Warning message:
       Missing column names filled in: 'X1' [1] 
       > 
       > dataframe %>% mutate(mytimezone = tz(timestamp_est))
      
      A tibble: 3 x 4
       X1 string_time_utc timestamp_est 
       <int> <dttm> <dttm> 
       1 0 2018-02-01 14:00:00.530 2018-02-01 09:00:00.530
       2 1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456
       3 2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200
       mytimezone
       <chr> 
       1 UTC 
       2 UTC 
       3 UTC  
      #Now look at what happens with feather:
       
       > dataframe <- read_feather('P://testing.feather')
       > 
       > dataframe %>% mutate(mytimezone = tz(timestamp_est))
      
      A tibble: 3 x 3
       string_time_utc timestamp_est mytimezone
       <dttm> <dttm> <chr> 
       1 2018-02-01 09:00:00.531 2018-02-01 04:00:00.531 "" 
       2 2018-02-01 09:01:00.456 2018-02-01 04:01:00.456 "" 
       3 2018-03-05 09:01:02.200 2018-03-05 04:01:02.200 "" 

      My timestamps have been converted!!! pure insanity.
      Am I missing something here?

      Thanks!!

      Attachments

        Issue Links

          Activity

            People

              npr Neal Richardson
              Olafsson Olaf
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 2h 10m
                  2h 10m