Apache Arrow / ARROW-5878

[Python][C++] Parquet reader not forward compatible for timestamps without timezone

      Description

      Timestamps without a timezone that are written by pyarrow 0.14.0 can no longer be read as timestamps by earlier versions. When such a file is read with pyarrow 0.13.0, the column comes back as an integer (int64) instead.

      Comparing the Parquet schemas as seen by each version suggests that older versions do not understand the new logical type annotation; see below.

      File generation with pyarrow 0.14.0

      import datetime
      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq
      
      df = pd.DataFrame(
          {
              # timezone-naive timestamp column
              "datetime64": pd.Series(["2018-01-01"], dtype="datetime64[ns]"),
              # timezone-aware timestamp column
              "datetime64_ts": pd.Series(
                  [pd.Timestamp(datetime.datetime(2018, 1, 1), tz="Europe/Berlin")],
                  dtype="datetime64[ns]",
              ),
          }
      )
      pq.write_table(pa.Table.from_pandas(df), "timezones_pyarrow_14.paquet")
      

      Reading with pyarrow 0.13.0

      In [1]: import pyarrow.parquet as pq
      
      In [2]: import pyarrow as pa
      
      In [3]: with open("timezones_pyarrow_14.paquet", "rb") as fd:
         ...:     table = pq.read_pandas(fd)
         ...:
      
      In [4]: table.to_pandas()
      Out[4]:
               datetime64             datetime64_ts
      0  1514764800000000 2018-01-01 00:00:00+01:00
      
      In [5]: table.to_pandas().dtypes
      Out[5]:
      datetime64                               int64
      datetime64_ts    datetime64[ns, Europe/Berlin]
      dtype: object
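
The integer in the `datetime64` column is the raw INT64 storage value. Going by the `timeUnit=microseconds` annotation in the 0.14.0 schema listed further below, it can be read as microseconds since the Unix epoch. A minimal sketch using only the standard library (the helper name `micros_to_datetime` is hypothetical, not a pyarrow API) suggests that the value still encodes the original timestamp:

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)  # naive: the column carries no timezone


def micros_to_datetime(value):
    """Interpret an integer as microseconds since the Unix epoch (assumed unit)."""
    return EPOCH + datetime.timedelta(microseconds=value)


# Raw value that pyarrow 0.13.0 returned for the "datetime64" column.
raw = 1514764800000000
print(micros_to_datetime(raw))  # 2018-01-01 00:00:00
```

On the affected reader, the same conversion could presumably be applied to the whole column with `pd.to_datetime(df["datetime64"], unit="us")`. This is only a reader-side workaround sketch, not a fix for the issue.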
      

      Parquet schema of the same file as seen by each pyarrow version:

      pyarrow 0.13.0 parquet schema

      datetime64: INT64
      datetime64_ts: INT64 TIMESTAMP_MICROS
      

      pyarrow 0.14.0 parquet schema

      datetime64: INT64 Timestamp(isAdjustedToUTC=false, timeUnit=microseconds)
      datetime64_ts: INT64 Timestamp(isAdjustedToUTC=true, timeUnit=microseconds)
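
One reading of the two schema dumps is that an older reader only sees a timestamp when a legacy annotation accompanies the new logical type: the UTC-adjusted column still appears as `TIMESTAMP_MICROS` to 0.13.0, while the non-adjusted column appears as bare `INT64`. The sketch below models that observation. The function names and the lookup table are illustrative assumptions derived from the output above, not pyarrow or Parquet C++ code.

```python
# Legacy annotations visible to an older reader, keyed by
# (isAdjustedToUTC, timeUnit). Both rows are taken from the schema dumps
# above; the table itself is an illustrative assumption.
LEGACY_ANNOTATION = {
    (True, "microseconds"): "TIMESTAMP_MICROS",
    (False, "microseconds"): None,  # no legacy annotation was visible
}


def legacy_annotation(is_adjusted_to_utc, time_unit):
    """Return the annotation an older reader would see, or None for bare INT64."""
    return LEGACY_ANNOTATION.get((is_adjusted_to_utc, time_unit))


def schema_line(name, is_adjusted_to_utc, time_unit):
    """Render a column the way the pyarrow 0.13.0 dump prints it."""
    annotation = legacy_annotation(is_adjusted_to_utc, time_unit)
    return f"{name}: INT64" + (f" {annotation}" if annotation else "")


print(schema_line("datetime64", False, "microseconds"))
print(schema_line("datetime64_ts", True, "microseconds"))
```

Under this model the timezone-naive column loses its timestamp meaning for the older reader, which matches the `int64` dtype reported above.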
      

      Attachments

      1. timezones_pyarrow_14.paquet (1 kB, uploaded by Florian Jetter)

      People

      Assignee: Ben Kietzman (bkietz)
      Reporter: Florian Jetter (fjetter)

      Time Tracking

      Original Estimate: Not Specified
      Remaining Estimate: 0h
      Time Spent: 5h 20m