Spark / SPARK-23520

Add support for MapType fields in JSON schema inference

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.2.1
    • Fix Version/s: None
    • Component/s: Spark Core, SQL

    Description

      InferSchema currently does not support inferring MapType fields from JSON data, and for good reason: in JSON, a map is indistinguishable from a struct. In SPARK-23494, I proposed exposing some methods of InferSchema to users so that they can build on top of the inference primitives defined by this class. In this issue, I propose giving users more control by letting them specify a set of fields that should be forced to MapType.

      Use-case

      Some JSON datasets contain high-cardinality fields, i.e. fields whose key space is very large. These fields shouldn't be interpreted as StructType, for the following reasons:

      • they are not really structs. The key space and the value space may both be infinite, so what best defines the schema of this data is the type of the keys and the type of the values, not a struct containing all possible key-value pairs.
      • interpreting high-cardinality fields as structs can lead to enormous schemata that don't even fit into memory.
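
      To make the size difference concrete, here is a rough sketch in plain Python (not Spark code). The dataset, the field name `clicks`, and the helpers `infer_struct` and `infer_map` are all made up for illustration. It compares a struct-style schema, which needs one member per distinct key, with a map-style schema, which only needs the key type and the value type.

```python
import json

def infer_struct(records, field):
    """Struct-style inference: one schema member per distinct key seen."""
    members = {}
    for record in records:
        for key, value in record[field].items():
            members[key] = type(value).__name__
    return members

def infer_map(records, field):
    """Map-style inference: only the key type and the value type(s) matter."""
    value_types = {type(v).__name__ for r in records for v in r[field].values()}
    return ("str", sorted(value_types))

# Hypothetical dataset: click counts keyed by user id, one JSON document per line.
lines = [json.dumps({"clicks": {f"user_{i}": i}}) for i in range(10_000)]
records = [json.loads(line) for line in lines]

struct_schema = infer_struct(records, "clicks")
map_schema = infer_map(records, "clicks")

print(len(struct_schema))  # grows with the number of distinct keys
print(map_schema)          # stays the same size however many keys appear
```

      In this toy dataset the struct-style schema ends up with as many members as there are distinct user ids, while the map-style schema stays constant in size.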

      Proposition

      We would add a public overloaded signature for InferSchema.inferField that accepts a set of field accessors (a field accessor being a class that represents access to any JSON field, including nested ones) for which we do not want to recurse and instead force a schema. In particular, that would make it possible to request that a few fields be inferred as maps rather than structs.
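
      The idea can be sketched as follows, again in plain Python rather than against the real InferSchema class. Everything here is an assumption made for illustration: the function names `infer_field`, `infer_schema` and `merge`, the representation of a field accessor as a tuple of key names, and the dictionary-based schema encoding. Arrays and most type-conflict handling are left out.

```python
import json

def merge(a, b):
    """Merge two inferred types; None means 'no information yet' (e.g. JSON null)."""
    if a is None or a == b:
        return b
    if b is None:
        return a
    if isinstance(a, dict) and isinstance(b, dict) and a["type"] == b["type"] == "struct":
        fields = dict(a["fields"])
        for name, field_type in b["fields"].items():
            fields[name] = merge(fields.get(name), field_type)
        return {"type": "struct", "fields": fields}
    return "string"  # simplification: conflicting types fall back to string

def infer_field(value, forced=None, path=()):
    """Infer the type of `value`; `forced` maps accessors (key tuples) to schemas."""
    forced = forced or {}
    if path in forced:
        return forced[path]  # do not recurse: the caller-supplied schema wins
    if isinstance(value, dict):
        return {
            "type": "struct",
            "fields": {k: infer_field(v, forced, path + (k,)) for k, v in value.items()},
        }
    if isinstance(value, bool):  # must be checked before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    return None  # JSON null (arrays are omitted from this sketch and also land here)

def infer_schema(lines, forced=None):
    """Fold per-record inference over a JSON-lines dataset."""
    schema = None
    for line in lines:
        schema = merge(schema, infer_field(json.loads(line), forced))
    return schema

lines = [
    '{"id": 1, "clicks": {"user_a": 3}}',
    '{"id": 2, "clicks": {"user_b": 7, "user_c": 1}}',
]
# Hypothetical accessor for the top-level "clicks" field, forced to a map schema.
clicks_as_map = {("clicks",): {"type": "map", "key": "string", "value": "long"}}

default_schema = infer_schema(lines)
forced_schema = infer_schema(lines, clicks_as_map)

print(default_schema["fields"]["clicks"])  # struct with one member per key seen
print(forced_schema["fields"]["clicks"])   # the forced map schema
```

      Without the forced accessor, `clicks` is inferred as a struct that accumulates every key seen. With it, inference stops at that path and returns the supplied map schema, while the other fields are still inferred as usual.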

      I am very open to discussing this with people who are better versed in the Spark codebase than I am, because I realize my proposal may feel somewhat patchy. I'll be more than happy to contribute some development effort if we manage to sketch a reasonably simple solution.

          People

            Assignee: Unassigned
            Reporter: David Courtinot (dicee)
            Votes: 0
            Watchers: 1