Apache Arrow / ARROW-2026

[Python] Cast all timestamp resolutions to INT96 when use_deprecated_int96_timestamps=True


Details

    Description

      When writing a Parquet file with `use_deprecated_int96_timestamps` set to True, timestamps are written as 96-bit integers only if they have nanosecond resolution. This is a problem because Amazon Redshift timestamps have only microsecond resolution, yet Redshift requires them to be stored in the 96-bit format in Parquet files.

      I'd expect the `use_deprecated_int96_timestamps` flag to cause all timestamps to be written as 96-bit integers, regardless of resolution. If the current behavior is a deliberate design decision, it would be immensely helpful to document it explicitly in the argument's description.
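
      For context, here is a minimal sketch of the 12-byte INT96 timestamp layout as it is commonly described for Impala-style readers: 8 little-endian bytes of nanoseconds within the day, followed by 4 little-endian bytes holding the Julian day number. The function name `encode_int96_timestamp` is hypothetical and this is not Arrow's implementation; it only illustrates that a value of any coarser resolution fits the format once scaled to nanoseconds.

```python
import datetime
import struct

# Julian day number of the Unix epoch (1970-01-01).
JULIAN_DAY_OF_EPOCH = 2440588
NANOS_PER_SECOND = 1_000_000_000


def encode_int96_timestamp(value: datetime.datetime) -> bytes:
    """Pack a naive UTC datetime into a 12-byte INT96 layout (illustrative sketch)."""
    delta = value - datetime.datetime(1970, 1, 1)
    # timedelta keeps 0 <= seconds < 86400 and 0 <= microseconds < 10**6,
    # so these two fields together are the time elapsed within the day.
    nanos_of_day = delta.seconds * NANOS_PER_SECOND + delta.microseconds * 1000
    julian_day = delta.days + JULIAN_DAY_OF_EPOCH
    # '<' means little-endian with no padding: 8-byte integer, then 4-byte integer.
    return struct.pack('<qi', nanos_of_day, julian_day)


encoded = encode_int96_timestamp(datetime.datetime(1970, 1, 2, 0, 0, 1))
print(len(encoded), encoded.hex())
```

      A microsecond or millisecond timestamp loses nothing in this encoding, since the conversion only multiplies by a power of ten.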

       

      To reproduce:

       

      1. Create a table with a timestamp having microsecond or millisecond resolution, and save it to a Parquet file. Be sure to set `use_deprecated_int96_timestamps` to True.

       

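      import datetime
      import pyarrow
      from pyarrow import parquet
      
      # Intended schema: one microsecond-resolution timestamp column.
      # (Not passed to the writer; the table below takes its type from the array.)
      schema = pyarrow.schema([
          pyarrow.field('last_updated', pyarrow.timestamp('us')),
      ])
      
      # A single-row column holding the current time at microsecond resolution.
      data = [
          pyarrow.array([datetime.datetime.now()], pyarrow.timestamp('us')),
      ]
      
      table = pyarrow.Table.from_arrays(data, ['last_updated'])
      
      # Write with the INT96 flag enabled; the column is still stored as INT64.
      with open('test_file.parquet', 'wb') as fdesc:
          parquet.write_table(table, fdesc,
                              use_deprecated_int96_timestamps=True)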
      
      

       

      2. Inspect the file. I used parquet-tools:

       

      dak@tux ~ $ parquet-tools meta test_file.parquet
      file:         file:/Users/dak/test_file.parquet
      creator:      parquet-cpp version 1.3.2-SNAPSHOT
      
      file schema:  schema
      --------------------------------------------------------------------------------
      last_updated: OPTIONAL INT64 O:TIMESTAMP_MICROS R:0 D:1
      
      row group 1:  RC:1 TS:76 OFFSET:4
      --------------------------------------------------------------------------------
      last_updated:  INT64 SNAPPY DO:4 FPO:28 SZ:76/72/0.95 VC:1 ENC:PLAIN,PLAIN_DICTIONARY,RLE

       


            People

              fsaintjacques Francois Saint-Jacques
              yiannisliodakis Diego Argueta

                Time Tracking

                  Original Estimate - Not Specified
                  Remaining Estimate - 0h
                  Time Spent - 3h 50m