Details
- Improvement
- Status: Open
- Major
- Resolution: Unresolved
- None
- None
Description
Currently, we can't parse "our own" string representation of a timestamp array with the timestamp parser strptime:
    import datetime
    import pyarrow as pa
    import pyarrow.compute as pc

    >>> pa.array([datetime.datetime(2022, 3, 5, 9)])
    <pyarrow.lib.TimestampArray object at 0x7f00c1d53dc0>
    [
      2022-03-05 09:00:00.000000
    ]

    # trying to parse the above representation as string
    >>> pc.strptime(["2022-03-05 09:00:00.000000"], format="%Y-%m-%d %H:%M:%S", unit="us")
    ...
    ArrowInvalid: Failed to parse string: '2022-03-05 09:00:00.000000' as a scalar of type timestamp[us]
The reason for this is the fractional second part, so the following works:
    >>> pc.strptime(["2022-03-05 09:00:00"], format="%Y-%m-%d %H:%M:%S", unit="us")
    <pyarrow.lib.TimestampArray object at 0x7f00c1d6f940>
    [
      2022-03-05 09:00:00.000000
    ]
I think this fails because strptime only supports parsing seconds as an integer (https://man7.org/linux/man-pages/man3/strptime.3.html).
However, this creates an odd situation in which the timestamp parser cannot parse the representation we ourselves use for timestamps.
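The integer-seconds behaviour can be reproduced outside Arrow. The following is a rough illustration using only Python's standard library, not pyarrow's implementation: `%S` on its own rejects the trailing fractional part, while Python's own `%f` directive (an extension that is not part of the C strptime linked above) accepts it. The helper name `try_parse` is made up for this sketch.

```python
import datetime

TEXT = "2022-03-05 09:00:00.000000"


def try_parse(text, fmt):
    """Hypothetical helper: return a datetime, or None if the format does not match."""
    try:
        return datetime.datetime.strptime(text, fmt)
    except ValueError:
        # Raised when the string has leftover characters or does not match the format.
        return None


# "%S" only consumes integer seconds, so ".000000" is left over and parsing fails.
without_fraction = try_parse(TEXT, "%Y-%m-%d %H:%M:%S")

# Python's "%f" consumes the fractional part explicitly.
with_fraction = try_parse(TEXT, "%Y-%m-%d %H:%M:%S.%f")

print(without_fraction)
print(with_fraction)
```

Whether pyarrow's strptime kernel should gain a comparable directive is exactly the open question of this issue; the sketch only shows the distinction between the two format strings.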
In addition, for CSV we have a custom ISO parser (used by default), so when parsing the strings while reading a CSV file, the same string with fractional seconds does work:
    import io
    from pyarrow import csv

    s = b"""a
    2022-03-05 09:00:00.000000"""

    >>> csv.read_csv(io.BytesIO(s))
    pyarrow.Table
    a: timestamp[ns]
    ----
    a: [[2022-03-05 09:00:00.000000000]]
I realize that you can use the generic "cast" for doing this string parsing:
    >>> pc.cast(["2022-03-05 09:00:00.000000"], pa.timestamp("us"))
    <pyarrow.lib.TimestampArray object at 0x7f00c1d53d60>
    [
      2022-03-05 09:00:00.000000
    ]
But this was not the first approach I thought of. I think it is quite typical to reach for strptime first, and it is confusing that it doesn't work; the error message is also not helpful.
cc apitrou rokm
Issue Links
- duplicates
  - ARROW-9907 [Python] Failed to parse string into timestamp (Closed)
  - ARROW-10430 [C++][Python] strptime fails to parse subsecond timestamps (Closed)
- is a child of
  - ARROW-15894 [C++] Strptime issues umbrella (Open)