Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
Not 100% sure this is a "bug", but I find it at least an unexpected interplay between two options.
By default, when reading JSON, we infer the data types of the columns, and when an explicit schema is specified, we also by default infer the types of columns that are not included in that schema. The docs for unexpected_field_behavior say:
> How JSON fields outside of explicit_schema (if given) are treated
But it seems that if you specify a schema and the parsing of one of the columns fails according to that schema, we still fall back to this default of inferring the data type (while I would have expected an error, since we should only infer types for columns not in the schema).
Example code using pyarrow:
import io

import pyarrow as pa
from pyarrow import json

s_json = """{"column":"2022-09-05T08:08:46.000"}"""
opts = json.ParseOptions(explicit_schema=pa.schema([("column", pa.timestamp("s"))]))
json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
The parsing fails here because the value has milliseconds and the type is "s", but the explicit schema is ignored and we get a table with a string column instead:
pyarrow.Table
column: string
----
column: [["2022-09-05T08:08:46.000"]]
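To make the mismatch concrete, a small check along these lines can be added (a sketch reusing the snippet above; tbl is just a name introduced here for the returned table):

tbl = json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
# The returned column has the inferred string type, not the requested timestamp[s]
assert tbl.schema.field("column").type == pa.string()
assert tbl.schema.field("column").type != pa.timestamp("s")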
But when adding unexpected_field_behavior="ignore", we actually get the expected parse error:
opts = json.ParseOptions(
    explicit_schema=pa.schema([("column", pa.timestamp("s"))]),
    unexpected_field_behavior="ignore",
)
json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
gives
ArrowInvalid: Failed of conversion of JSON to timestamp[s], couldn't parse:2022-09-05T08:08:46.000
It might be that this is specific to timestamps; I don't directly see a similar issue with e.g. "column": "A" and setting the schema for "column" to int64. A sketch of that counterexample is below.
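A minimal, untested sketch of that int64 counterexample, assuming the behavior described above (per the report, this case errors rather than silently falling back to an inferred string column):

import io

import pyarrow as pa
from pyarrow import json

s_json = """{"column":"A"}"""
opts = json.ParseOptions(explicit_schema=pa.schema([("column", pa.int64())]))
# Per the observation above, the int64 conversion failure surfaces as an error
# instead of the explicit schema being ignored.
json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)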