Apache Arrow / ARROW-16184

[Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Not A Bug
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Python
    • Labels: None

    Description

      As pointed out in https://issues.apache.org/jira/browse/ARROW-2429, the following code produces a table whose schema changes after a round trip through a Parquet file.

      #!/usr/bin/env python
      
      import pyarrow as pa
      import pyarrow.parquet as pq
      import pandas as pd
      
      # create DataFrame with a datetime column
      df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
      df['created'] = pd.to_datetime(df['created'])
      
      # create Arrow table from DataFrame
      table = pa.Table.from_pandas(df, preserve_index=False)
      
      # write the table as a parquet file, then read it back again
      pq.write_table(table, 'foo.parquet')
      table2 = pq.read_table('foo.parquet')
      
      print(table.schema[0])  # pyarrow.Field<created: timestamp[ns]> (nanosecond units)
      print(table2.schema[0]) # pyarrow.Field<created: timestamp[us]> (microsecond units)
      

      ARROW-2429 was closed as a limitation of the Parquet 1.x format, which cannot represent nanosecond timestamps. That is fine; however, the Arrow schema embedded in the Parquet metadata still describes the column as a nanosecond array, even though the data was written in microseconds. This causes problems depending on which schema the reader opts to "trust".
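To make the risk concrete, here is a minimal standard-library sketch. It assumes a reader that takes the stored int64 value as-is and interprets it with whichever unit it trusts, without any conversion; `decode_timestamp` is a hypothetical helper written for this illustration and is not part of pyarrow or arrow-rs.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
TICKS_PER_SECOND = {"us": 1_000_000, "ns": 1_000_000_000}


def decode_timestamp(raw: int, unit: str) -> datetime:
    """Interpret a raw int64 epoch offset in the given unit (hypothetical helper)."""
    seconds, ticks = divmod(raw, TICKS_PER_SECOND[unit])
    # datetime only resolves microseconds, so sub-microsecond ticks are dropped.
    micros = ticks * 1_000_000 // TICKS_PER_SECOND[unit]
    return EPOCH + timedelta(seconds=seconds, microseconds=micros)


# The value from the reproduction above, as it would be stored after
# coercion to microseconds.
original = datetime(2018, 4, 4, 10, 14, 14, tzinfo=timezone.utc)
stored_us = (original - EPOCH) // timedelta(microseconds=1)

# A reader that trusts the Parquet column type decodes microseconds.
trusting_parquet = decode_timestamp(stored_us, "us")
# A reader that trusts the embedded Arrow schema decodes nanoseconds.
trusting_embedded = decode_timestamp(stored_us, "ns")

print(trusting_parquet)   # matches the original instant
print(trusting_embedded)  # off by a factor of 1000, lands in January 1970
```

Under that assumption the two interpretations differ by a factor of 1000, so the choice of schema decides whether the timestamp survives the round trip.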

      This was discovered while investigating a bug report against the arrow-rs Parquet implementation: https://github.com/apache/arrow-rs/issues/1459

      Specifically, the Arrow schema written to the file metadata is:

      Schema {
          endianness: Little,
          fields: Some(
              [
                  Field {
                      name: Some(
                          "created",
                      ),
                      nullable: true,
                      type_type: Timestamp,
                      type_: Timestamp {
                          unit: NANOSECOND,
                          timezone: Some(
                              "UTC",
                          ),
                      },
                      dictionary: None,
                      children: Some(
                          [],
                      ),
                      custom_metadata: None,
                  },
              ],
          ),
          custom_metadata: Some(
              [
                  KeyValue {
                      key: Some(
                          "pandas",
                      ),
                      value: Some(
                          "{\"index_columns\": [], \"column_indexes\": [], \"columns\": [{\"name\": \"created\", \"field_name\": \"created\", \"pandas_type\": \"datetimetz\", \"numpy_type\": \"datetime64[ns]\", \"metadata\": {\"timezone\": \"UTC\"}}], \"creator\": {\"library\": \"pyarrow\", \"version\": \"6.0.1\"}, \"pandas_version\": \"1.4.0\"}",
                      ),
                  },
              ],
          ),
          features: None,
      } 

          People

            Assignee: Unassigned
            Reporter: Raphael Taylor-Davies (tustvold)