The final step in the ongoing "result set loader" saga is to merge the revised JSON reader into master. This reader does three key things:
- Demonstrates the prototypical "late schema" style of data reading (discover schema while reading).
- Implements many tricks and hacks to handle schema changes while loading.
- Shows that, even with all these tricks, the only true solution is to actually have a schema.
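To make the "late schema" idea concrete, here is a minimal sketch (not Drill's actual code; all names are invented for illustration) of inferring a column's type while reading, deferred until the first non-null value appears:

```java
import java.util.List;

// Hypothetical sketch of "late schema" discovery: the reader does not know a
// column's type up front; it infers the type from the first non-null value.
public class LateSchemaSketch {

  enum InferredType { UNKNOWN, BIGINT, FLOAT8, VARCHAR }

  // Infer a column type from a stream of parsed JSON values. Nulls carry no
  // type information, so the type stays UNKNOWN until a real value arrives.
  static InferredType inferType(List<?> values) {
    for (Object v : values) {
      if (v == null) continue;            // defer: no type information yet
      if (v instanceof Long)   return InferredType.BIGINT;
      if (v instanceof Double) return InferredType.FLOAT8;
      if (v instanceof String) return InferredType.VARCHAR;
    }
    return InferredType.UNKNOWN;          // all-null column: type never discovered
  }
}
```

The all-null case at the end is exactly the hole that no amount of inference can fill, which is the point of the third bullet above.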
The new JSON reader:
- Uses an expanded state machine when parsing rather than the complex set of if-statements in the current version.
- Handles reading a run of nulls before seeing the first data value (as long as the data value shows up in the first record batch).
- Uses the result-set loader to generate fixed-size batches regardless of the complexity, depth of structure, or width of variable-length fields.
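The null-run case deserves a sketch. The idea is to count leading nulls until the first real value reveals the type, then back-fill those positions as missing values of that type. This is an illustrative stand-alone version, not Drill's implementation (which works through its column writers):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of handling a run of nulls before the first data value:
// buffer the null count, then back-fill once the first value fixes the type.
public class NullRunSketch {

  // Materializes a Double column from untyped JSON values. Fails if the whole
  // batch is null, which is the case only a real schema can solve.
  static List<Double> readDoubleColumn(List<?> jsonValues) {
    List<Double> out = new ArrayList<>();
    int pendingNulls = 0;
    boolean typeKnown = false;
    for (Object v : jsonValues) {
      if (v == null && !typeKnown) { pendingNulls++; continue; }
      if (!typeKnown) {
        typeKnown = true;
        for (int i = 0; i < pendingNulls; i++) out.add(null);  // back-fill
      }
      out.add(v == null ? null : ((Number) v).doubleValue());
    }
    if (!typeKnown) {
      throw new IllegalStateException("all-null column: type never discovered");
    }
    return out;
  }
}
```

Note the limitation stated in the bullet above: this trick only works if the first data value shows up within the first record batch, since earlier batches have already been sent downstream with whatever type was guessed.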
While the JSON reader itself is helpful, the key contribution is that it shows how to use the entire kit of parts: result set loader, projection framework, and so on. Since the projection framework can handle an external schema, it is also a handy foundation for the ongoing schema project.
Key work to complete after this merge will be to reconcile the actual data with the external schema. For example, if we know a column is supposed to be a VarChar, then read the column as a VarChar regardless of the type JSON itself would pick. Or, if a column is supposed to be a Double, then convert Int and String JSON values into Doubles.
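The Double case can be sketched as a simple coercion rule. This is a hedged illustration of the intended behavior; the class and method names are assumptions, not Drill's API:

```java
// Hypothetical sketch of reconciling JSON data with an external schema: when
// the schema declares a column as Double, coerce Int and String JSON values.
public class SchemaCoercionSketch {

  // Coerce a parsed JSON value to the schema-declared Double type.
  static double toDouble(Object jsonValue) {
    if (jsonValue instanceof Number) {        // JSON Int or Float
      return ((Number) jsonValue).doubleValue();
    }
    if (jsonValue instanceof String) {        // JSON String, e.g. "12.5"
      return Double.parseDouble((String) jsonValue);
    }
    throw new IllegalArgumentException("Cannot convert to Double: " + jsonValue);
  }
}
```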
The Row Set framework was designed to allow inserting custom column writers; this merge is a good opportunity to do the work needed to create them. The new JSON framework could then use those writers to parse a JSON field as a specified Drill type.
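The shape of such a custom writer might look like the following. To be clear, this is an invented sketch: the interface and class names are hypothetical stand-ins, and Drill's actual column-writer interfaces differ.

```java
// Hypothetical sketch of a custom column writer: a wrapper around a plain
// value sink that coerces any incoming JSON value to the declared Drill type
// before writing. All names here are invented for illustration.
public class CustomWriterSketch {

  // Minimal stand-in for a scalar column writer's value sink.
  interface ScalarSink { void setDouble(double v); }

  // A custom writer that accepts untyped JSON values and coerces to Double.
  static class CoercingDoubleWriter {
    private final ScalarSink sink;

    CoercingDoubleWriter(ScalarSink sink) { this.sink = sink; }

    void write(Object jsonValue) {
      if (jsonValue instanceof Number) {
        sink.setDouble(((Number) jsonValue).doubleValue());
      } else if (jsonValue instanceof String) {
        sink.setDouble(Double.parseDouble((String) jsonValue));
      } else {
        throw new IllegalArgumentException("Not convertible: " + jsonValue);
      }
    }
  }
}
```

The design point is that the JSON parser stays type-agnostic: the declared type lives entirely in the writer, so swapping the writer changes how the same JSON field materializes.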