[ARROW-8100] [Python] timestamp[ms] and date64 data types not working as expected on write - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Not A Problem
Affects Version/s: 0.15.1, 0.16.0
Fix Version/s: None
Component/s: Python
Labels: None

External issue URL: https://github.com/apache/arrow/issues/24310

Description

I expect that either timestamp[ms] or date64 will give me a millisecond-precision datetime/timestamp when written to a Parquet file. Instead, this is the behavior I see:

>>> from datetime import datetime
>>> import pyarrow as pa
>>> from pyarrow import parquet
>>> arr = pa.array([datetime(2020, 12, 20)])

(I have also tried pa.array([datetime(2020, 12, 20)], type=pa.timestamp('ms')) with no later cast.)

>>> arr.cast(pa.timestamp('ms'), safe=False)

<pyarrow.lib.TimestampArray object at 0x117f3d4c8>
[
2020-12-20 00:00:00.000
]

>>> table = pa.Table.from_arrays([arr], names=["start_date"])

>>> table
pyarrow.Table
start_date: timestamp[us]

// just to make sure

>>> table.column("start_date").cast(pa.timestamp('ms'), safe=False)
<pyarrow.lib.ChunkedArray object at 0x117f5e9a8>
[
[
2020-12-20 00:00:00.000
]
]

// just to make extra sure

>>> schema = pa.schema([pa.field("start_date", pa.timestamp("ms"))])

>>> table.cast(schema, safe=False)

>>> parquet.write_table(table,
...                     "sldkfjasldkfj.parquet",
...                     coerce_timestamps="ms",
...                     compression="SNAPPY",
...                     allow_truncated_timestamps=True)

Result for the written file:

Schema:

{
  "type" : "record",
  "name" : "schema",
  "fields" : [ {
    "name" : "start_date",
    "type" : [ "null", { "type" : "long", "logicalType" : "timestamp-millis" } ],
    "default" : null
  } ]
}

Data:

start_date
1608422400000

That is a microsecond [us] value, despite casting to [ms] and passing coerce_timestamps="ms" to write_table. If it were a millisecond timestamp, datetime.fromtimestamp should translate it back to the original datetime, but:
>>> from datetime import datetime
>>> datetime.fromtimestamp(1608422400000)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: year 52938 is out of range
>>> datetime.fromtimestamp(1608422400000 / 1000)
datetime.datetime(2020, 12, 19, 16, 0)
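
A note on units, as a minimal sketch rather than anything taken from pyarrow: datetime.fromtimestamp interprets its argument as seconds since the epoch, and without an explicit tz it renders the result in local time. Assuming the stored value 1608422400000 is a count of milliseconds in UTC, it decodes to midnight on 2020-12-20. The helper name from_epoch_ms is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_epoch_ms(ms: int) -> datetime:
    """Hypothetical helper: decode milliseconds since the UNIX epoch as UTC."""
    # Integer timedelta arithmetic avoids float rounding and local-time conversion.
    return EPOCH + timedelta(milliseconds=ms)


stored = 1608422400000  # value read back from the Parquet file above
decoded = from_epoch_ms(stored)
print(decoded.isoformat())  # 2020-12-20T00:00:00+00:00

# Same instant via fromtimestamp, which expects seconds, with an explicit UTC zone.
assert decoded == datetime.fromtimestamp(stored / 1000, tz=timezone.utc)
```

Under that assumption, the datetime(2020, 12, 19, 16, 0) result above appears consistent with the same instant displayed in a UTC-8 local time zone.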

OK, so then we should use the date64() type; after all, the docs say "Create instance of 64-bit date (milliseconds since UNIX epoch 1970-01-01)".

>>> arr = pa.array([datetime(2020, 12, 20, 0, 0, 0, 123)], type=pa.date64())
>>> arr
<pyarrow.lib.Date64Array object at 0x11da877c8>
[
2020-12-20
]

>>> table = pa.Table.from_arrays([arr], names=["start_date"])
>>> table
pyarrow.Table
start_date: date64[ms]

>>> parquet.write_table(table,
...                     "bebedabeep.parquet",
...                     coerce_timestamps="ms",
...                     compression="SNAPPY",
...                     allow_truncated_timestamps=True)

Result for the written file:

Schema:

{
  "type" : "record",
  "name" : "schema",
  "fields" : [ {
    "name" : "start_date",
    "type" : [ "null", { "type" : "int", "logicalType" : "date" } ],
    "default" : null
  } ]
}

Data:

start_date
18616

That is "days since UNIX epoch 1970-01-01", just like the date32() type: the time information is stripped off. We can confirm this:
>>> arr.to_pylist()
[datetime.date(2020, 12, 20)]
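
The two stored values can be cross-checked with plain arithmetic. As a sketch, assuming 18616 is a count of days since the epoch (the Parquet date logical type shown in the schema) and that the same calendar day expressed in milliseconds is that count times 86,400,000, both values land on 2020-12-20. The constant and helper names are illustrative only.

```python
from datetime import date, timedelta

MS_PER_DAY = 24 * 60 * 60 * 1000  # 86_400_000 milliseconds in one day


def date_from_epoch_days(days: int) -> date:
    """Hypothetical helper: decode days since 1970-01-01 into a calendar date."""
    return date(1970, 1, 1) + timedelta(days=days)


days = 18616  # value read back from the date64 Parquet file above
print(date_from_epoch_days(days))  # 2020-12-20
print(days * MS_PER_DAY)           # 1608422400000
```

The millisecond product equals the value stored in the timestamp-millis file earlier, so under these assumptions both files encode the same midnight, one in days and one in milliseconds.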

How do I write a millisecond precision timestamp with pyarrow.parquet?

People

Assignee: Unassigned
Reporter: paul hess
Votes: 0
Watchers: 3

Dates

Created: 12/Mar/20 19:54
Updated: 11/Jan/23 07:58
Resolved: 13/Mar/20 01:58