  1. Apache Arrow
  2. ARROW-18106

[C++] JSON reader ignores explicit schema with default unexpected_field_behavior="infer"

Details

    Description

      Not 100% sure this is a "bug", but at least I find it an unexpected interplay between two options.

      By default, when reading JSON, we infer the data type of each column. When an explicit schema is specified, we also by default infer the type of columns that are not listed in that schema. The docs for unexpected_field_behavior say:

      > How JSON fields outside of explicit_schema (if given) are treated

      But it seems that if you specify a schema and the parsing of one of its columns fails according to that schema, we still fall back to this default of inferring the data type. I would have expected an error, since we should only infer for columns not in the schema.

      Example code using pyarrow:

      import io
      import pyarrow as pa
      from pyarrow import json
      
      s_json = """{"column":"2022-09-05T08:08:46.000"}"""
      
      opts = json.ParseOptions(explicit_schema=pa.schema([("column", pa.timestamp("s"))]))
      json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
      

      The parsing fails here because the value has milliseconds while the type is timestamp("s"), but the explicit schema is ignored and we get a string column as the result:

      pyarrow.Table
      column: string
      ----
      column: [["2022-09-05T08:08:46.000"]]
      

      But when adding unexpected_field_behavior="ignore", we actually get the expected parse error:

      opts = json.ParseOptions(explicit_schema=pa.schema([("column", pa.timestamp("s"))]), unexpected_field_behavior="ignore")
      json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
      

      gives

      ArrowInvalid: Failed of conversion of JSON to timestamp[s], couldn't parse:2022-09-05T08:08:46.000
      

      This might be specific to timestamps: I don't directly see a similar issue with e.g. "column": "A" and a schema declaring "column" as int64.

            People

              benpharkins Ben Harkins
              jorisvandenbossche Joris Van den Bossche