  1. Apache Arrow
  2. ARROW-5878

[Python][C++] Parquet reader not forward compatible for timestamps without timezone

Details

    Description

      Timestamps without a timezone that are written by pyarrow 0.14.0 can no longer be read as timestamps by earlier versions. When the file is read with pyarrow 0.13.0, the timestamp column comes back as an integer.

      Looking at the Parquet schemas, it seems that older versions do not understand the new logical type annotation and only see the physical type (see below).

      File generation with pyarrow 0.14.0

      import datetime
      import pyarrow as pa  # needed for pa.Table.from_pandas below
      import pyarrow.parquet as pq
      import pandas as pd
      
      df = pd.DataFrame(
          {
              "datetime64": pd.Series(["2018-01-01"], dtype="datetime64[ns]"),
              "datetime64_ts": pd.Series(
                  [pd.Timestamp(datetime.datetime(2018, 1, 1), tz="Europe/Berlin")],
                  dtype="datetime64[ns]",
              ),
          }
      )
      pq.write_table(pa.Table.from_pandas(df), "timezones_pyarrow_14.paquet")
      

      Reading with pyarrow 0.13.0

      In [1]: import pyarrow.parquet as pq
      
      In [2]: import pyarrow as pa
      
      In [3]: with open("timezones_pyarrow_14.paquet", "rb") as fd:
         ...:     table = pq.read_pandas(fd)
         ...:
      
      In [4]: table.to_pandas()
      Out[4]:
               datetime64             datetime64_ts
      0  1514764800000000 2018-01-01 00:00:00+01:00
      
      In [5]: table.to_pandas().dtypes
      Out[5]:
      datetime64                               int64
      datetime64_ts    datetime64[ns, Europe/Berlin]
      dtype: object
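
Until the reader is fixed, the integer column can be converted back by hand. The following is a minimal sketch, assuming the values are microseconds since the Unix epoch (which is what the 0.14.0 schema below reports as `timeUnit=microseconds`). The helper name `micros_to_naive_datetime` is made up for illustration and is not part of pyarrow or pandas.

```python
import datetime

# Assumption: the int64 values are microseconds since 1970-01-01 00:00:00,
# with no timezone attached (isAdjustedToUTC=false).
EPOCH = datetime.datetime(1970, 1, 1)


def micros_to_naive_datetime(value):
    """Interpret an int64 microsecond count as a timezone-naive datetime."""
    return EPOCH + datetime.timedelta(microseconds=value)


# Value shown for the "datetime64" column when read with pyarrow 0.13.0.
raw = 1514764800000000
restored = micros_to_naive_datetime(raw)
print(restored)  # 2018-01-01 00:00:00
```

On a whole pandas column the same conversion can be done with `pd.to_datetime(df["datetime64"], unit="us")`.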
      

      Parquet schema of the file as seen by each pyarrow version:

      pyarrow 0.13.0 parquet schema

      datetime64: INT64
      datetime64_ts: INT64 TIMESTAMP_MICROS
      

      pyarrow 0.14.0 parquet schema

      datetime64: INT64 Timestamp(isAdjustedToUTC=false, timeUnit=microseconds)
      datetime64_ts: INT64 Timestamp(isAdjustedToUTC=true, timeUnit=microseconds)
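
The difference between the two schema dumps can be modelled in a few lines. This is only an illustrative sketch, not the actual parquet-cpp logic: it assumes that the 0.14.0 writer attaches a legacy converted type (`TIMESTAMP_MILLIS` or `TIMESTAMP_MICROS`) only when `isAdjustedToUTC=true`, which is what the dumps above suggest. The function names are hypothetical.

```python
from typing import Optional


def legacy_converted_type(is_adjusted_to_utc: bool, time_unit: str) -> Optional[str]:
    """Hypothetical model: legacy annotation an older reader would find, if any."""
    if not is_adjusted_to_utc:
        # Assumption: no legacy annotation is written for local timestamps.
        return None
    legacy = {
        "milliseconds": "TIMESTAMP_MILLIS",
        "microseconds": "TIMESTAMP_MICROS",
    }
    # Units without a legacy counterpart (e.g. nanoseconds) yield None.
    return legacy.get(time_unit)


def legacy_view(name: str, is_adjusted_to_utc: bool, time_unit: str) -> str:
    """Render a column the way a reader without logical type support prints it."""
    converted = legacy_converted_type(is_adjusted_to_utc, time_unit)
    return f"{name}: INT64" + (f" {converted}" if converted else "")


print(legacy_view("datetime64", False, "microseconds"))
print(legacy_view("datetime64_ts", True, "microseconds"))
```

Under that assumption the model reproduces the two lines of the 0.13.0 dump: the timezone-naive column is left with only its physical type, so the older reader returns plain integers.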
      

      Attachments

        1. timezones_pyarrow_14.paquet
          1 kB
          Florian Jetter

            People

              bkietz Ben Kietzman
              fjetter Florian Jetter


                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 5h 20m