Details
Type: Bug
Status: Closed
Priority: Minor
Resolution: Not A Bug
Description
As pointed out in https://issues.apache.org/jira/browse/ARROW-2429, the following code shows the schema changing when a Parquet file is written and then read back.
#!/usr/bin/env python
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# create DataFrame with a datetime column
df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
df['created'] = pd.to_datetime(df['created'])

# create Arrow table from DataFrame
table = pa.Table.from_pandas(df, preserve_index=False)

# write the table as a parquet file, then read it back again
pq.write_table(table, 'foo.parquet')
table2 = pq.read_table('foo.parquet')

print(table.schema[0])   # pyarrow.Field<created: timestamp[ns]> (nanosecond units)
print(table2.schema[0])  # pyarrow.Field<created: timestamp[us]> (microsecond units)
This was closed as a limitation of the Parquet 1.x format, which cannot represent nanosecond timestamps. That is fine; however, the Arrow schema embedded within the Parquet metadata still lists the column as a nanosecond array. This causes issues depending on which schema the reader opts to "trust".
This was discovered while investigating a bug report against the arrow-rs parquet implementation: https://github.com/apache/arrow-rs/issues/1459
Specifically, the Arrow schema metadata written is:
Schema {
    endianness: Little,
    fields: Some(
        [
            Field {
                name: Some("created"),
                nullable: true,
                type_type: Timestamp,
                type_: Timestamp {
                    unit: NANOSECOND,
                    timezone: Some("UTC"),
                },
                dictionary: None,
                children: Some([]),
                custom_metadata: None,
            },
        ],
    ),
    custom_metadata: Some(
        [
            KeyValue {
                key: Some("pandas"),
                value: Some(
                    "{\"index_columns\": [], \"column_indexes\": [], \"columns\": [{\"name\": \"created\", \"field_name\": \"created\", \"pandas_type\": \"datetimetz\", \"numpy_type\": \"datetime64[ns]\", \"metadata\": {\"timezone\": \"UTC\"}}], \"creator\": {\"library\": \"pyarrow\", \"version\": \"6.0.1\"}, \"pandas_version\": \"1.4.0\"}",
                ),
            },
        ],
    ),
    features: None,
}