[ARROW-8944] [Python] Pandas - Parquet - Pandas roundtrip causes out of bounds timestamp - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: 0.17.0, 0.17.1
Fix Version/s: None
Component/s: Python
Labels:
None
Environment:
pandas==1.0.3
pyarrow==0.17.1
Python==3.7.6 @ Windows 10 64-bit

External issue URL:
https://github.com/apache/arrow/issues/25071

Description

The following pandas -> parquet -> pandas roundtrip raises an out of bounds timestamp error with pyarrow 0.17.0 and 0.17.1:

import pandas

target = 'ts_roundtrip.parquet'

dataframe = pandas.DataFrame({'id':[1,2,3],'timestamp':['', '', '']})
dataframe['timestamp'] = pandas.to_datetime(dataframe['timestamp'],errors='raise')

dataframe2 = pandas.DataFrame({'id':[4,5,6,7],'timestamp':['', '2020-03-02T03:03:17.791062Z','','']})
dataframe2['timestamp'] = pandas.to_datetime(dataframe2['timestamp'],errors='raise')
dataframe = dataframe.append(dataframe2)

print(dataframe.head(10))

dataframe.to_parquet(target, coerce_timestamps=None, index=False, version='2.0')

dataframe_new = pandas.read_parquet(target)
print(dataframe_new.head())

Output:

   id                         timestamp
0   1                               NaT
1   2                               NaT
2   3                               NaT
0   4                               NaT
1   5  2020-03-02 03:03:17.791062+00:00
2   6                               NaT
3   7                               NaT
Traceback (most recent call last):
  File "c:\some\path\pyarrow_ts_test.py", line 16, in <module>
    dataframe_new = pandas.read_parquet(target)
  File "c:\some\path\venv\lib\site-packages\pandas\io\parquet.py", line 310, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File "c:\some\path\venv\lib\site-packages\pandas\io\parquet.py", line 125, in read
    path, columns=columns, **kwargs
  File "pyarrow\array.pxi", line 587, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow\table.pxi", line 1640, in pyarrow.lib.Table._to_pandas
  File "c:\some\path\venv\lib\site-packages\pyarrow\pandas_compat.py", line 766, in table_to_blockmanager
    blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
  File "c:\some\path\venv\lib\site-packages\pyarrow\pandas_compat.py", line 1102, in _table_to_blocks
    list(extension_columns.keys()))
  File "pyarrow\table.pxi", line 1107, in pyarrow.lib.table_to_blocks
  File "pyarrow\error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: -62135596800000000
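
To make the number in the error message easier to interpret, here is a minimal sketch that decodes it using only the Python standard library. It assumes the value is a count of microseconds since the Unix epoch, as the timestamp[us] type in the message suggests. The variable names are illustrative and not part of pyarrow.

```python
from datetime import datetime, timedelta

# Value copied from the ArrowInvalid message; assumed to be
# microseconds since 1970-01-01 (timestamp[us]).
micros = -62135596800000000

# Interpret the value as a calendar date.
as_datetime = datetime(1970, 1, 1) + timedelta(microseconds=micros)
print(as_datetime)

# timestamp[ns] stores nanoseconds in a signed 64-bit integer,
# so check whether the converted value would still fit.
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1
nanos = micros * 1000
fits_in_int64_ns = INT64_MIN <= nanos <= INT64_MAX
print(fits_in_int64_ns)
```

Under that assumption the stored value is 0001-01-01 00:00:00 rather than a null, and multiplying it by 1000 falls outside the signed 64-bit range, which matches the cast error above.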

Background:
We have a dataset with a sparsely populated timestamp column that originates from many JSON files. It is therefore very likely that some of those JSON files contain an empty string instead of a timestamp (an ISO-format string). Each JSON file was read into a pandas dataframe, the timestamp column was cast to datetime, and all dataframes were appended. This was done with pyarrow<0.17.0; those parquet files can no longer be read and fail with the error message shown above.

A closer look at our old parquet files shows that the NaTs are converted to "1754-08-30 22:43:41.128654848" when read back into a pandas dataframe. You get the same result when you run the above code with pyarrow==0.16.0.
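
One possible explanation for the 1754 value, offered as an assumption rather than something confirmed in this report, is that older versions converted microseconds to nanoseconds without an overflow check, so the result wrapped around in a signed 64-bit integer. The sketch below reproduces that arithmetic with the standard library only. `wrap_int64` is a hypothetical helper written for this illustration, not a pyarrow function.

```python
from datetime import datetime, timedelta

def wrap_int64(value):
    """Simulate two's-complement wraparound of a signed 64-bit integer."""
    return (value + 2**63) % 2**64 - 2**63

# Microsecond value from the ArrowInvalid message (0001-01-01 00:00:00).
micros = -62135596800000000

# Unchecked microsecond -> nanosecond conversion (assumed old behaviour).
wrapped_ns = wrap_int64(micros * 1000)

# Split into whole seconds and leftover nanoseconds, because
# datetime only has microsecond resolution.
seconds, nanos = divmod(wrapped_ns, 1_000_000_000)
wrapped = datetime(1970, 1, 1) + timedelta(seconds=seconds)
print(wrapped, nanos)
```

The wrapped value lands on 1754-08-30 22:43:41 with 128654848 nanoseconds left over, which matches the timestamp observed with pyarrow==0.16.0.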

Issue Links

is fixed by

ARROW-842 [Python] Handle more kinds of null sentinel objects from pandas 0.x

Resolved

People

Assignee: Unassigned

Reporter: Daniel Figus

Votes: 0

Watchers: 3

Dates

Created: 26/May/20 17:43

Updated: 11/Jan/23 11:03

Resolved: 27/Oct/20 15:30
