Apache Drill / DRILL-5949

JSON format options should be part of plugin config; not session options


    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.12.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Drill provides a JSON record reader and two ways to configure it:

      • Using the JSON plugin configuration.
      • Using a set of session options.

      The plugin configuration defines the file suffix associated with JSON files. The session options are:

      • store.json.all_text_mode
      • store.json.read_numbers_as_double
      • store.json.reader.skip_invalid_records
      • store.json.reader.print_skipped_invalid_record_number

      Suppose I have two JSON files from different sources (and keep them in distinct directories). For the first, I want all_text_mode off because the data is nicely formatted. Its numbers are fine too, so I also want read_numbers_as_double off.

      But the other file is a mess and uses a rather ad-hoc format, so I want both options turned on.

      As it turns out I often query both files. Today, I must set the session options one way to query my "clean" file, then reverse them to query the "dirty" file.
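To make the toggling concrete, here is a minimal sketch in Python that only builds the ALTER SESSION statements that would have to be issued before each query. The helper name `session_statements` and the `CLEAN`/`DIRTY` profiles are made up for illustration; the option names are the ones listed above. The sketch produces SQL text and does not connect to Drill.

```python
# Hypothetical helper: builds the ALTER SESSION statements needed before a query.
# It only produces SQL text; it does not talk to Drill.

# Illustrative option profiles for the two kinds of files described above.
CLEAN = {
    "store.json.all_text_mode": False,
    "store.json.read_numbers_as_double": False,
}
DIRTY = {
    "store.json.all_text_mode": True,
    "store.json.read_numbers_as_double": True,
}


def session_statements(options):
    """Return one ALTER SESSION statement per option, sorted by option name."""
    return [
        "ALTER SESSION SET `{}` = {}".format(name, "true" if value else "false")
        for name, value in sorted(options.items())
    ]


# Statements to run before querying the "dirty" file ...
for stmt in session_statements(DIRTY):
    print(stmt)

# ... and the statements to run again before going back to the "clean" file.
for stmt in session_statements(CLEAN):
    print(stmt)
```

Every switch between the two files repeats this round of statements, which is the tedium described here.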

      Next, I want to join the two files. How do I set the options one way for the "clean" file and the other way for the "dirty" file within the same query? I can't: session options apply to the whole query.

      Now, consider the text format plugin, which can read CSV, TSV, PSV and so on. It has a variety of options, but they are not session options; they are options in the plugin definition. This allows me to, say, have a plugin config for the CSV-with-headers files that I get from source A, and a different plugin config for my CSV-without-headers files from source B.

      Suppose we applied the text reader technique to the JSON reader. We'd move the session options listed above into the JSON format plugin config. Then I can define one plugin config for my "clean" files and a different plugin config for my "dirty" files.
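As a rough sketch of what such plugin configs might look like, the Python below assembles two JSON format entries and prints them as JSON. The property names `allTextMode` and `readNumbersAsDouble`, the format names `clean_json` and `dirty_json`, and the helper `json_format` are assumptions chosen for illustration; this ticket does not define them. Only `type` and `extensions` mirror the shape of existing format configs.

```python
import json


def json_format(extensions, all_text_mode, read_numbers_as_double):
    """Build one hypothetical JSON format config entry."""
    return {
        "type": "json",
        "extensions": extensions,
        # Assumed property names standing in for the current session options.
        "allTextMode": all_text_mode,
        "readNumbersAsDouble": read_numbers_as_double,
    }


# Two illustrative format entries, one per kind of file.
formats = {
    "clean_json": json_format(["json"], False, False),
    "dirty_json": json_format(["json"], True, True),
}

print(json.dumps(formats, indent=2))
```

With entries like these, the choice of options travels with the format definition instead of with the session.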

      What's more, I can then use table functions to adjust the format for each file as needed within a single query. Since table functions are part of a query, I can add them to a view that I define for the various JSON files.
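The sketch below shows how a single query might then carry different JSON options for each file. It follows the `table(... (type => ..., name => value))` table-function syntax Drill already supports for text files, but the JSON parameters `allTextMode` and `readNumbersAsDouble`, the file paths, the column names, and the helper `json_table` are all hypothetical. The code only builds the SQL text.

```python
def json_table(path, **options):
    """Build a table-function expression for a JSON file.

    The option names passed in are assumptions, not existing Drill parameters.
    """
    args = ["type => 'json'"]
    for name, value in sorted(options.items()):
        args.append("{} => {}".format(name, "true" if value else "false"))
    return "table(dfs.`{}`({}))".format(path, ", ".join(args))


# Illustrative paths; each file gets its own reader options.
clean = json_table("/data/clean/orders.json",
                   allTextMode=False, readNumbersAsDouble=False)
dirty = json_table("/data/dirty/events.json",
                   allTextMode=True, readNumbersAsDouble=True)

# One query, two differently configured JSON readers.
query = (
    "SELECT c.id, d.payload "
    "FROM {} AS c JOIN {} AS d ON c.id = d.id".format(clean, dirty)
)
print(query)
```

A query of this shape could be wrapped in a view, so the per-file options would be written once rather than before every query.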

      The result is a far simpler user experience than the tedium of resetting session options for every query.

              People

              • Assignee:
                Unassigned
                Reporter:
                Paul Rogers (Paul.Rogers)
              • Votes:
                1
                Watchers:
                2
