Details
-
Bug
-
Status: Resolved
-
Critical
-
Resolution: Fixed
-
None
Description
See below for original description and reports. In sum, there is a mismatch between how the C++ library and R interpret data without a timezone, and it turns out that we're not passing the timezone to R if it is set in Arrow C++ anyway.
The C++ library docs say "If a timezone-aware field contains a recognized timezone, its values may be localized to that locale upon display; the values of timezone-naive fields must always be displayed “as is”, with no localization performed on them." But R's print default, as well as the parsing default, is the current time zone: https://stat.ethz.ch/R-manual/R-devel/library/base/html/strptime.html
The C++ library seems to parse timestamp strings that don't have timezone information as if they are UTC, so when you read timezone-naive timestamps from Arrow and print them in R, they are shifted to be localized to the current timezone. If you print timestamp data from Arrow with print(timestamp_var, tz="GMT") it would look as you expect.
On further inspection, the arrow-to-vector code for timestamp doesn't seem to consider time zone information even if it does exist. So we don't have the means currently in R to display timestamp data faithfully, whether or not it is timezone-aware.
Among the tasks here:
- Include the timezone attribute in the POSIXct R vector that gets created from a timestamp Arrow array
- Ensure that timezone-naive data from Arrow is printed in R "as is" with no localization
Original description:
Hello the dream team,
Pasting from https://github.com/wesm/feather/issues/351
Thanks for this wonderful package. I was playing with feather and some timestamps and I noticed some dangerous behavior. Maybe it is a bug.
Consider this
import pandas as pd import feather import numpy as np df = pd.DataFrame( {'string_time_utc' : [pd.to_datetime('2018-02-01 14:00:00.531'), pd.to_datetime('2018-02-01 14:01:00.456'), pd.to_datetime('2018-03-05 14:01:02.200')]} ) df['timestamp_est'] = pd.to_datetime(df.string_time_utc).dt.tz_localize('UTC').dt.tz_convert('US/Eastern').dt.tz_localize(None) df Out[17]: string_time_utc timestamp_est 0 2018-02-01 14:00:00.531 2018-02-01 09:00:00.531 1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456 2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200
Here I create the corresponding `EST` timestamp of my original timestamps (in `UTC` time).
Now saving the dataframe to `csv` or to `feather` will generate two completely different results.
df.to_csv('P://testing.csv') df.to_feather('P://testing.feather')
Switching to R.
Using the good old `csv` gives me something a bit annoying, but expected. R thinks my timezone is `UTC` by default, and wrongly attached this timezone to `timestamp_est`. No big deal, I can always use `with_tz` or even better: import as character and process as timestamp while in R.
> dataframe <- read_csv('P://testing.csv') Parsed with column specification: cols( X1 = col_integer(), string_time_utc = col_datetime(format = ""), timestamp_est = col_datetime(format = "") ) Warning message: Missing column names filled in: 'X1' [1] > > dataframe %>% mutate(mytimezone = tz(timestamp_est)) A tibble: 3 x 4 X1 string_time_utc timestamp_est <int> <dttm> <dttm> 1 0 2018-02-01 14:00:00.530 2018-02-01 09:00:00.530 2 1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456 3 2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200 mytimezone <chr> 1 UTC 2 UTC 3 UTC
#Now look at what happens with feather:
> dataframe <- read_feather('P://testing.feather')
>
> dataframe %>% mutate(mytimezone = tz(timestamp_est))
A tibble: 3 x 3
string_time_utc timestamp_est mytimezone
<dttm> <dttm> <chr>
1 2018-02-01 09:00:00.531 2018-02-01 04:00:00.531 ""
2 2018-02-01 09:01:00.456 2018-02-01 04:01:00.456 ""
3 2018-03-05 09:01:02.200 2018-03-05 04:01:02.200 ""
My timestamps have been converted!!! pure insanity.
Am I missing something here?
Thanks!!
Attachments
Issue Links
- links to