Apache Arrow / ARROW-9020

[Python] read_json won't respect explicit_schema in parse_options

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.17.1
    • Fix Version/s: 1.0.0
    • Component/s: Python
    • Environment: CPython 3.8.2, macOS Mojave 10.14.6

    Description

      I am trying to read a JSON file using an explicit schema, but the schema appears to be ignored. Moreover, if my schema contains a field that is not present in the JSON file, the output table contains all the fields in the JSON file plus the schema fields that were not found in the file.

      A minimal example:

      import pyarrow as pa
      from pyarrow import json
      
      # allowing for type inference
      print(json.read_json('tmp.json'))
      # prints:
      # pyarrow.Table
      # foo: string
      # baz: string
      
      # using an explicit schema that would read only "foo"
      schema = pa.schema([('foo', pa.string())])
      print(json.read_json('tmp.json', parse_options=json.ParseOptions(explicit_schema=schema)))
      # prints:
      # pyarrow.Table
      # foo: string
      # baz: string
      
      # using an explicit schema that would read only "not_a_field",
      # which is not present in the json file
      schema = pa.schema([('not_a_field', pa.string())])
      print(json.read_json('tmp.json', parse_options=json.ParseOptions(explicit_schema=schema)))
      # prints:
      # pyarrow.Table
      # not_a_field: string
      # foo: string
      # baz: string
      

      And the tmp.json file looks like:

      {"foo": "bar", "baz": "1"}
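
The behaviour the report expects from `explicit_schema` (keep only the schema's fields, and fill fields missing from the file with nulls) can be sketched in plain Python. This is only an illustration of the expected projection, not pyarrow's implementation; `project_records` is a hypothetical helper name and it uses only the standard-library `json` module.

```python
import json


def project_records(lines, field_names):
    """Keep only the named fields of each line-delimited JSON record.

    Hypothetical helper: fields absent from a record become None,
    and fields not listed in field_names are dropped.
    """
    columns = {name: [] for name in field_names}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        for name in field_names:
            columns[name].append(record.get(name))
    return columns


# Same content as tmp.json in the report
lines = ['{"foo": "bar", "baz": "1"}']

print(project_records(lines, ["foo"]))          # {'foo': ['bar']}
print(project_records(lines, ["not_a_field"]))  # {'not_a_field': [None]}
```

Recent pyarrow releases document an `unexpected_field_behavior` parameter on `json.ParseOptions` (with the values "ignore", "error" and "infer") that controls how fields outside the explicit schema are handled. Whether it is available in the versions listed on this issue is not stated here, so check the documentation for the installed version.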
      
      

            People

              Assignee: Krisztian Szucs (kszucs)
              Reporter: Felipe Santos (felipegssantos)
              Votes: 0
              Watchers: 2

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 1h