Apache Arrow / ARROW-4967

[C++] Parquet: Object type and stats lost when using 96-bit timestamps


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 0.12.1
    • Fix Version/s: None
    • Component/s: C++, Python
    • Labels:
    • Environment:
      PyArrow: 0.12.1
      Python: 2.7.15, 3.7.2
      Pandas: 0.24.2

      Description

      Run the following code:

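      import datetime as dt
      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq

      # Build a one-row table holding a single timestamp column.
      dataframe = pd.DataFrame({'foo': [dt.datetime.now()]})
      table = pa.Table.from_pandas(dataframe, preserve_index=False)

      # Default: timestamps are stored as INT64.
      pq.write_table(table, 'int64.parq')
      # Same data, stored as deprecated 96-bit (INT96) timestamps.
      pq.write_table(table, 'int96.parq', use_deprecated_int96_timestamps=True)
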

      Examining the int64.parq file, we see that the column metadata includes an object type of TIMESTAMP_MICROS and also gives some stats. All is well.

      file schema: schema 
      --------------------------------------------------------------------------------
      foo:         OPTIONAL INT64 O:TIMESTAMP_MICROS R:0 D:1
      
      row group 1: RC:1 TS:76 OFFSET:4 
      --------------------------------------------------------------------------------
      foo:          INT64 SNAPPY ... ST:[min: 2019-12-31T23:59:59.999000, max: 2019-12-31T23:59:59.999000, num_nulls: 0]
      

      However, in int96.parq that metadata appears to be lost: there is no object type and there are no column stats.

      file schema: schema 
      --------------------------------------------------------------------------------
      foo:         OPTIONAL INT96 R:0 D:1
      
      row group 1: RC:1 TS:58 OFFSET:4 
      --------------------------------------------------------------------------------
      foo:          INT96 SNAPPY ... ST:[no stats for this column]
      

      This is a bit confusing, since the metadata for the exact same data can look different depending on whether a seemingly unrelated flag is set.

            People

            • Assignee: Unassigned
            • Reporter: yiannisliodakis Diego Argueta