Apache Arrow / ARROW-4965

[Python] Timestamp array type detection should use tzname of datetime.datetime objects

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: Python
    • Labels: None

    Description

      When pa.array infers a type from datetime.datetime objects, it appears to ignore any tzinfo on them and stores the values as naive timestamp columns.

      Python code:

      import datetime
      import pytz
      import pyarrow as pa
      
      naive_datetime = datetime.datetime(2019, 1, 13, 12, 11, 10)
      utc_datetime = datetime.datetime(2019, 1, 13, 12, 11, 10, tzinfo=pytz.utc)
      tzaware_datetime = utc_datetime.astimezone(pytz.timezone('America/Los_Angeles'))
      
      def inspect(varname):
          print(varname)
          arr = globals()[varname]
          print(arr.type)
          print(arr)
          print()
      
      auto_naive_arr = pa.array([naive_datetime])
      inspect("auto_naive_arr")
      
      auto_utc_arr = pa.array([utc_datetime])
      inspect("auto_utc_arr")
      
      auto_tzaware_arr = pa.array([tzaware_datetime])
      inspect("auto_tzaware_arr")
      
      auto_mixed_arr = pa.array([utc_datetime, tzaware_datetime])
      inspect("auto_mixed_arr")
      
      naive_type = pa.timestamp("us", naive_datetime.tzname())
      utc_type = pa.timestamp("us", utc_datetime.tzname())
      tzaware_type = pa.timestamp("us", tzaware_datetime.tzname())
      
      naive_arr = pa.array([naive_datetime], type=naive_type)
      inspect("naive_arr")
      
      utc_arr = pa.array([utc_datetime], type=utc_type)
      inspect("utc_arr")
      
      tzaware_arr = pa.array([tzaware_datetime], type=tzaware_type)
      inspect("tzaware_arr")
      
      mixed_arr = pa.array([utc_datetime, tzaware_datetime], type=utc_type)
      inspect("mixed_arr")
      

      This prints:

      $ python detect_timezone.py
      auto_naive_arr
      timestamp[us]
      [
        1547381470000000
      ]
      
      auto_utc_arr
      timestamp[us]
      [
        1547381470000000
      ]
      
      auto_tzaware_arr
      timestamp[us]
      [
        1547352670000000
      ]
      
      auto_mixed_arr
      timestamp[us]
      [
        1547381470000000,
        1547352670000000
      ]
      
      naive_arr
      timestamp[us]
      [
        1547381470000000
      ]
      
      utc_arr
      timestamp[us, tz=UTC]
      [
        1547381470000000
      ]
      
      tzaware_arr
      timestamp[us, tz=PST]
      [
        1547352670000000
      ]
      
      mixed_arr
      timestamp[us, tz=UTC]
      [
        1547381470000000,
        1547352670000000
      ]
      

      But I would expect the following types instead:

      • auto_naive_arr: timestamp[us]
      • auto_utc_arr: timestamp[us, tz=UTC]
      • auto_tzaware_arr: timestamp[us, tz=PST] (Or maybe tz='America/Los_Angeles'. I'm not sure why pytz returns PST as the tzname)
      • auto_mixed_arr: timestamp[us, tz=UTC]
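
The expected types listed above amount to a small inference rule, which can be sketched with the standard library alone. This is only an illustration of the behavior requested in this issue, not pyarrow's implementation: `infer_timestamp_tz` is a hypothetical helper, and the fallback to UTC for values with differing zones is an assumption based on the expected `auto_mixed_arr` type.

```python
import datetime


def infer_timestamp_tz(values):
    """Hypothetical helper: pick a timezone name for a list of datetimes.

    Returns None when every value is naive, the shared tzname when all
    tz-aware values agree, and "UTC" when tz-aware values disagree.
    Naive values mixed with tz-aware ones do not affect the result.
    """
    names = set()
    for dt in values:
        # utcoffset() is None for naive datetimes.
        if dt.utcoffset() is not None:
            names.add(dt.tzname())
    if not names:
        return None
    if len(names) == 1:
        return names.pop()
    return "UTC"


UTC = datetime.timezone.utc
# Fixed -08:00 offset standing in for the pytz zone used in the report.
PST = datetime.timezone(datetime.timedelta(hours=-8), "PST")

naive_datetime = datetime.datetime(2019, 1, 13, 12, 11, 10)
utc_datetime = datetime.datetime(2019, 1, 13, 12, 11, 10, tzinfo=UTC)
tzaware_datetime = utc_datetime.astimezone(PST)

inferred = {
    "auto_naive_arr": infer_timestamp_tz([naive_datetime]),
    "auto_utc_arr": infer_timestamp_tz([utc_datetime]),
    "auto_tzaware_arr": infer_timestamp_tz([tzaware_datetime]),
    "auto_mixed_arr": infer_timestamp_tz([utc_datetime, tzaware_datetime]),
}
```

The resulting name could then be passed as the second argument to `pa.timestamp("us", ...)`, as the reproduction script already does with `tzname()`.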

      Also, in the "mixed" case, I'd expect the actual stored microseconds to be the same for both rows, since utc_datetime and tzaware_datetime both refer to the same point in time. It seems reasonable for any naive datetime objects mixed in with tz-aware datetimes to be interpreted as UTC.
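
The point about identical stored values can be checked without pyarrow. The sketch below uses a hypothetical helper, `to_utc_micros`, that normalizes a datetime to UTC before converting it to epoch microseconds, and treats naive values as UTC as suggested above. The fixed-offset `PST` zone is a stand-in for the pytz zone in the report. The last line shows that dropping tzinfo while keeping the local wall-clock fields gives a value matching the second entry printed for `mixed_arr`.

```python
import datetime

UTC = datetime.timezone.utc
EPOCH = datetime.datetime(1970, 1, 1, tzinfo=UTC)


def to_utc_micros(dt):
    """Hypothetical helper: microseconds since the Unix epoch for `dt`.

    Naive datetimes are interpreted as UTC (an assumption taken from the
    suggestion in this issue).
    """
    if dt.utcoffset() is None:
        dt = dt.replace(tzinfo=UTC)
    # Subtracting two aware datetimes compares instants, not wall-clock fields.
    return (dt - EPOCH) // datetime.timedelta(microseconds=1)


# Fixed -08:00 offset standing in for America/Los_Angeles in January.
PST = datetime.timezone(datetime.timedelta(hours=-8), "PST")

utc_datetime = datetime.datetime(2019, 1, 13, 12, 11, 10, tzinfo=UTC)
tzaware_datetime = utc_datetime.astimezone(PST)

# Both objects name the same instant, so the normalized values agree.
same_instant = to_utc_micros(utc_datetime) == to_utc_micros(tzaware_datetime)

# Dropping tzinfo keeps the local wall-clock fields, shifting the value by 8 hours.
wall_clock_micros = to_utc_micros(tzaware_datetime.replace(tzinfo=None))
```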

            People

              Assignee: Krisztian Szucs (kszucs)
              Reporter: Tim Swast (tswast)