Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
Not 100% sure this is a "bug", but I find it at least an unexpected interplay between two options.
By default, when reading JSON, we infer the data types of the columns, and when an explicit schema is specified, we also by default infer the types of columns that are not included in that schema. The docs for unexpected_field_behavior say:
> How JSON fields outside of explicit_schema (if given) are treated
But it seems that if you specify a schema and the parsing of one of the columns fails according to that schema, we still fall back to this default of inferring the data type (while I would have expected an error, since we should only infer types for columns not in the schema).
Example code using pyarrow:
import io

import pyarrow as pa
from pyarrow import json

s_json = """{"column":"2022-09-05T08:08:46.000"}"""
opts = json.ParseOptions(explicit_schema=pa.schema([("column", pa.timestamp("s"))]))
json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
The parsing fails here because the value has milliseconds and the type is "s", but the explicit schema is ignored and we get a table with a string column instead:
pyarrow.Table
column: string
----
column: [["2022-09-05T08:08:46.000"]]
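To make the mismatch concrete, a small check along these lines can be added (a sketch reusing the snippet above; tbl is just a name introduced here for the returned table):

tbl = json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
# The returned column has the inferred string type, not the requested timestamp[s]
assert tbl.schema.field("column").type == pa.string()
assert tbl.schema.field("column").type != pa.timestamp("s")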
But when adding unexpected_field_behavior="ignore", we actually get the expected parse error:
opts = json.ParseOptions(
    explicit_schema=pa.schema([("column", pa.timestamp("s"))]),
    unexpected_field_behavior="ignore",
)
json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
gives
ArrowInvalid: Failed of conversion of JSON to timestamp[s], couldn't parse:2022-09-05T08:08:46.000
It might be that this is specific to timestamps; I don't directly see a similar issue with e.g. "column": "A" and setting the schema for "column" to int64. A sketch of that counterexample is below.
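A minimal, untested sketch of that int64 counterexample, assuming the behavior described above (per the report, this case errors rather than silently falling back to an inferred string column):

import io

import pyarrow as pa
from pyarrow import json

s_json = """{"column":"A"}"""
opts = json.ParseOptions(explicit_schema=pa.schema([("column", pa.int64())]))
# Per the observation above, the int64 conversion failure surfaces as an error
# instead of the explicit schema being ignored.
json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)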