  1. Apache Drill
  2. DRILL-8204

Allow Provided Schema for HTTP Plugin in JSON Mode


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.20.0
    • Fix Version/s: 1.21.0
    • Component/s: Storage - Other
    • Labels: None

    Description

      One of the challenges of querying APIs is inconsistent data. Drill allows you to provide a schema for individual endpoints. You can do this in one of two ways: either by
      listing the fields in the endpoint's `providedSchema` option, or by providing a serialized TupleMetadata of the desired schema as a JSON string. This is advanced functionality and should only be used by experienced Drill users.

      The schema provisioning currently supports complex types of Arrays and Maps at any nesting level.

          1. Example Schema Provisioning:
            ```json
            "jsonOptions": {
              "providedSchema": [
                {
                  "fieldName": "int_field",
                  "fieldType": "bigint"
                },
                {
                  "fieldName": "jsonField",
                  "fieldType": "varchar",
                  "properties": {
                    "drill.json-mode": "json"
                  }
                },
                {
                  // Array field
                  "fieldName": "stringField",
                  "fieldType": "varchar",
                  "isArray": true
                },
                {
                  // Map field
                  "fieldName": "mapField",
                  "fieldType": "map",
                  "fields": [
                    { "fieldName": "nestedField", "fieldType": "int" },
                    { "fieldName": "nestedField2", "fieldType": "varchar" }
                  ]
                }
              ]
            }
            ```
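Because maps can contain further fields, each `providedSchema` entry is naturally recursive. The sketch below models one entry and renders it to the JSON shape used above. It is only an illustration: `ProvidedSchemaSketch` and its `Field` class are hypothetical helpers, not Drill classes, and the code assumes field names and property values contain no characters that need JSON escaping.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ProvidedSchemaSketch {

    /** Hypothetical model of one providedSchema entry (not part of Drill). */
    static final class Field {
        final String name;
        final String type;
        boolean isArray;
        final List<Field> fields = new ArrayList<>();
        final Map<String, String> properties = new LinkedHashMap<>();

        Field(String name, String type) {
            this.name = name;
            this.type = type;
        }

        Field array() { isArray = true; return this; }

        Field child(Field f) { fields.add(f); return this; }

        Field property(String key, String value) { properties.put(key, value); return this; }

        /** Renders this entry, recursing into nested fields of a map. */
        String toJson() {
            StringBuilder sb = new StringBuilder();
            sb.append("{\"fieldName\":\"").append(name)
              .append("\",\"fieldType\":\"").append(type).append('"');
            if (isArray) {
                sb.append(",\"isArray\":true");
            }
            if (!fields.isEmpty()) {
                List<String> parts = new ArrayList<>();
                for (Field f : fields) {
                    parts.add(f.toJson());
                }
                sb.append(",\"fields\":[").append(String.join(",", parts)).append(']');
            }
            if (!properties.isEmpty()) {
                List<String> parts = new ArrayList<>();
                for (Map.Entry<String, String> e : properties.entrySet()) {
                    parts.add("\"" + e.getKey() + "\":\"" + e.getValue() + "\"");
                }
                sb.append(",\"properties\":{").append(String.join(",", parts)).append('}');
            }
            return sb.append('}').toString();
        }
    }

    public static void main(String[] args) {
        Field map = new Field("mapField", "map")
            .child(new Field("nestedField", "int"))
            .child(new Field("nestedField2", "varchar"));
        System.out.println(map.toJson());
    }
}
```

A map nested inside another map is expressed the same way, by passing a `Field` of type `map` to `child`.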

          1. Example Provisioning the Schema with a JSON String
            ```json
            "jsonOptions": {
              "jsonSchema": "…"
            }
            ```

      You can print out a JSON string of a schema with the Java code below.

      ```java
      TupleMetadata schema = new SchemaBuilder()
          .addNullable("a", MinorType.BIGINT)
          .addNullable("m", MinorType.VARCHAR)
          .build();
      ColumnMetadata m = schema.metadata("m");
      m.setProperty(JsonLoader.JSON_MODE, JsonLoader.JSON_LITERAL_MODE);

      System.out.println(schema.jsonString());
      ```

      This will generate something like the JSON string below:

      ```json
      {
        "type": "tuple_schema",
        "columns": [
          {"name":"a","type":"BIGINT","mode":"OPTIONAL"},
          {"name":"m","type":"VARCHAR","mode":"OPTIONAL","properties":{"drill.json-mode":"json"}}
        ]
      }
      ```
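To use a serialized schema like this in the `jsonSchema` option, it has to be embedded as a string value, so its inner quotes and backslashes must be escaped. The sketch below does that with the standard library only. `JsonSchemaOption` and its methods are hypothetical names, and it assumes the option takes the whole serialized schema as a single string.

```java
public class JsonSchemaOption {

    /** Escapes text so it can sit inside a JSON string literal. */
    static String escapeForJsonString(String raw) {
        StringBuilder sb = new StringBuilder(raw.length() + 16);
        for (char c : raw.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '"':  sb.append("\\\""); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Builds the jsonOptions fragment; the layout is an assumption based on the examples above. */
    static String toConfigFragment(String serializedSchema) {
        return "\"jsonOptions\": {\n  \"jsonSchema\": \""
            + escapeForJsonString(serializedSchema) + "\"\n}";
    }

    public static void main(String[] args) {
        String schema = "{\"type\":\"tuple_schema\",\"columns\":"
            + "[{\"name\":\"a\",\"type\":\"BIGINT\",\"mode\":\"OPTIONAL\"}]}";
        System.out.println(toConfigFragment(schema));
    }
}
```

In practice, the argument would be the output of `schema.jsonString()` from the Java snippet above.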

        1. Dealing With Inconsistent Schemas
          One of the major challenges of working with JSON data is an inconsistent schema. Drill has a `UNION` data type, which is marked as experimental. At the time of
          writing, the HTTP plugin does not support `UNION`; however, supplying a schema can resolve many of these issues.
          1. JSON Mode
            Drill offers the option of reading all JSON values as strings. While this can complicate downstream analytics, it can also be a more memory-efficient way of reading data with
            an inconsistent schema. At the time of writing, JSON mode is only available with a provided schema; future work will allow this mode to be enabled for
            any JSON data.
            1. Enabling JSON Mode:
              You can enable JSON mode simply by adding the `drill.json-mode` property with a value of `json` to a field, as shown below:

      ```json
      {
        "fieldName": "jsonField",
        "fieldType": "varchar",
        "properties": {
          "drill.json-mode": "json"
        }
      }
      ```
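Since a JSON-mode field arrives as a string, downstream code has to work out what each value actually is. The sketch below classifies a value by its first characters. It assumes each string holds the raw JSON text of one value and that string values keep their surrounding quotes; neither is confirmed here. `JsonLiteralKind` is a hypothetical helper, not part of Drill.

```java
public class JsonLiteralKind {

    enum Kind { OBJECT, ARRAY, STRING, NUMBER, BOOLEAN, NULL, UNKNOWN }

    /** Guesses the JSON type of a literal from its leading characters, without fully parsing it. */
    static Kind classify(String literal) {
        if (literal == null) {
            return Kind.NULL;
        }
        String s = literal.trim();
        if (s.isEmpty()) {
            return Kind.UNKNOWN;
        }
        char first = s.charAt(0);
        if (first == '{') return Kind.OBJECT;
        if (first == '[') return Kind.ARRAY;
        if (first == '"') return Kind.STRING;
        if (s.equals("true") || s.equals("false")) return Kind.BOOLEAN;
        if (s.equals("null")) return Kind.NULL;
        if (first == '-' || Character.isDigit(first)) return Kind.NUMBER;
        // Bare text that is not a recognizable JSON literal.
        return Kind.UNKNOWN;
    }

    public static void main(String[] args) {
        // The same field may hold different shapes from one record to the next.
        String[] values = { "42", "\"abc\"", "{\"a\":1}", "[1,2]", "null" };
        for (String v : values) {
            System.out.println(v + " -> " + classify(v));
        }
    }
}
```

A check like this only routes values; objects and arrays still need a real JSON parser before their contents can be used.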

          People

            Assignee: Charles Givre (cgivre)
            Reporter: Charles Givre (cgivre)