Apache Arrow / ARROW-14442

[R] fix behaviour when converting timestamps with "" as tzone

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 8.0.0
    • Component/s: R

    Description

      From the comments, we've decided to go with option 3:

      • Set the timezone to local time without changing the integer value of the timestamp. We store whatever integer R passes to us (21600), with CST as the timezone set. The display is then "1970-01-01 00:00:00 CST".
        This is surprising because we are asserting the local timezone when that is not specified in R.

      ============================================

      POSIXct in R can have timezones specified as "" which is typically interpreted as the session local timezone.

      This can lead to surprising results like:

      > Sys.timezone()
      [1] "America/Chicago"
      > as.integer(as.POSIXct("1970-01-01"))
      [1] 21600
      > Sys.setenv(TZ = "UTC")
      > as.integer(as.POSIXct("1970-01-01"))
      [1] 0
      > Sys.setenv(TZ = "Australia/Brisbane")
      > as.integer(as.POSIXct("1970-01-01"))
      [1] -36000
      

      See also: https://stackoverflow.com/questions/69670142/how-can-i-store-timezone-agnostic-dates-for-sharing-between-r-and-python-using-p/69678923#69678923

      This runs counter to how timestamps without timezones are interpreted in Arrow: https://github.com/apache/arrow/blob/03669438bbce53078616c7f943a63fb0c11db196/format/Schema.fbs#L333-L336

      > However, it may also be encoded into a Timestamp column with an empty timezone. The timestamp values should be computed "as if" the timezone of the date-time values was UTC; for example, the naive date-time "January 1st 1970, 00h00" would be encoded as timestamp value 0.

      Critically in R, when as.POSIXct("1970-01-01 00:00:00") is run, the timestamp value is computed "as if" the timezone of the date-time values was the local timezone (and not UTC like the Arrow spec says).
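
      The difference can be sketched in base R by passing the timezone explicitly instead of relying on the session setting. This is only an illustration: to_naive_utc() is a hypothetical helper written for this example, not part of R or the arrow package, and "America/Chicago" stands in for the local timezone.

```r
# The same naive wall-clock string, interpreted under two timezones.
naive <- "1970-01-01 00:00:00"

as_utc     <- as.POSIXct(naive, tz = "UTC")              # "as if" UTC (Arrow spec)
as_chicago <- as.POSIXct(naive, tz = "America/Chicago")  # "as if" local (R session in US Central)

utc_value     <- as.integer(as_utc)      # 0
chicago_value <- as.integer(as_chicago)  # 21600, i.e. 6 hours after the epoch

# Hypothetical helper: keep the wall-clock fields of x and recompute the
# timestamp value as if those fields were in UTC.
to_naive_utc <- function(x) {
  as.POSIXct(format(x, "%Y-%m-%d %H:%M:%S"), tz = "UTC")
}

recomputed <- as.integer(to_naive_utc(as_chicago))  # 0
```

      The 21600 second gap is exactly the UTC-6 offset of US Central time in January 1970, which is the value R hands to Arrow when the tzone is "".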

      This can lead to some surprising results when converting these timezoneless timestamps from R to Arrow. Take as.POSIXct("1970-01-01 00:00:00") as an example and assume US Central time. We have a few options:

      1. Warn when the timezone is "" or not set that the behavior might be surprising.
         We store whatever integer R passes to us (21600), with no timezone set. When someone sees this formatted, the times/dates will be what the time was at UTC ("1970-01-01 06:00:00").
      2. Set the timezone to UTC without changing the integer value of the timestamp. We store whatever integer R passes to us (21600), with UTC as the timezone set. When someone sees this formatted, the times/dates will be in UTC ("1970-01-01 06:00:00 UTC").
         This might be surprising or counterintuitive because the timestamps will suddenly look different: they will be based in UTC and not in local time as people expect.
      3. Set the timezone to local time without changing the integer value of the timestamp. We store whatever integer R passes to us (21600), with CST as the timezone set. The display is then "1970-01-01 00:00:00 CST".
         This is surprising because we are asserting the local timezone when that is not specified in R.
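
      The three displays can be approximated in base R without involving Arrow at all. This is a rough sketch under stated assumptions: the variable names are made up for the example, "America/Chicago" stands in for the session timezone, and formatting in UTC without a suffix is used to mimic how a timezoneless Arrow timestamp would be rendered.

```r
# The integer R passes along for as.POSIXct("1970-01-01 00:00:00") in US Central.
stored <- 21600
fmt <- "%Y-%m-%d %H:%M:%S"

# Option 1: no timezone stored; the value reads as if it were UTC wall-clock time.
opt1 <- format(as.POSIXct(stored, origin = "1970-01-01", tz = "UTC"), fmt)

# Option 2: same integer, timezone explicitly set to UTC.
opt2 <- format(as.POSIXct(stored, origin = "1970-01-01", tz = "UTC"), fmt, usetz = TRUE)

# Option 3: same integer, timezone set to the local zone (assumed US Central here).
opt3 <- format(as.POSIXct(stored, origin = "1970-01-01", tz = "America/Chicago"),
               fmt, usetz = TRUE)

print(c(opt1, opt2, opt3))
```

      Only option 3 shows the wall-clock time the user originally typed; the timezone abbreviation appended by usetz = TRUE comes from the platform's timezone database.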

      If someone is using a timestamp without a tzone in R to represent a timezoneless timestamp, options 2 and 3 above violate that intent when it is put into Arrow. Conversely, if someone is using a timestamp that just happens to lack a tzone but assumes it is in local time, option 1 leads to (very) surprising results.

    People

      • Assignee: Dragoș Moldovan-Grünfeld (dragosmg)
      • Reporter: Jonathan Keane (jonkeane)

    Time Tracking

      • Original Estimate: Not Specified
      • Remaining Estimate: 0h
      • Time Spent: 13h 50m