The introduction of record-oriented processors was a huge improvement for NiFi in terms of usability. However, they only improve usability if you have a schema for your data. There have been several comments along the lines of "I would really love to use the record-oriented processors, but I don't have a schema for my data."
Sometimes users have no schema because creating one is a burden in itself; the schema becomes a usability issue. This is especially true for very large documents that contain many nested Records. Other times, users cannot create a schema because they retrieve arbitrary data from some source and have no idea what the data will look like.
We do not want to remove the notion of a schema, however. Schemas provide a very powerful construct for many use cases, and they give Processors a much easier-to-use API. If we provide the ability to Infer the Schema on Read, though, we can offer the best of both worlds. While we do have processors for inferring schemas for JSON and CSV data, those are not always sufficient. They cannot be used, for instance, by ConsumeKafkaRecord, ExecuteSQL, etc., because those Processors need the schema before a separate inference Processor could ever see the data. Additionally, we have no ability to infer a schema for XML, logs, etc.
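To make the idea of inferring a schema on read concrete, the sketch below shows one plausible approach: walk each parsed record, infer a simple type for every value, and merge the observed types into a single running schema, widening where records disagree. The class and method names (`SchemaInferenceSketch`, `inferType`, `mergeType`) are illustrative assumptions, not NiFi's actual API, and the type lattice here is deliberately minimal.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of schema inference on read: merge the observed type
// of every field, across all records, into one running schema.
// These names are illustrative only; this is not NiFi's Record API.
public class SchemaInferenceSketch {

    // Infer a simple type name for a single value.
    static String inferType(Object value) {
        if (value == null) return "null";
        if (value instanceof Integer || value instanceof Long) return "long";
        if (value instanceof Float || value instanceof Double) return "double";
        if (value instanceof Boolean) return "boolean";
        return "string";
    }

    // Widen two observed types into one that can represent both.
    static String mergeType(String a, String b) {
        if (a.equals(b)) return a;
        if (a.equals("null")) return b;
        if (b.equals("null")) return a;
        if ((a.equals("long") && b.equals("double"))
                || (a.equals("double") && b.equals("long"))) return "double";
        return "string"; // fall back to the most permissive type
    }

    // Merge every record's fields into a single inferred schema.
    static Map<String, String> inferSchema(List<Map<String, Object>> records) {
        Map<String, String> schema = new LinkedHashMap<>();
        for (Map<String, Object> record : records) {
            for (Map.Entry<String, Object> e : record.entrySet()) {
                schema.merge(e.getKey(), inferType(e.getValue()),
                        SchemaInferenceSketch::mergeType);
            }
        }
        return schema;
    }
}
```

A real implementation would also need to handle nested Records, arrays, dates, and choice types consistently across all Readers, which is exactly why the goals below call for a shared, reusable mechanism.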
Finally, we need to consider processors that are designed to manipulate the data, such as UpdateRecord, JoltTransformRecord, LookupRecord (when used for enrichment), and QueryRecord. These Processors follow a typical pattern of "get the reader's schema, then provide it to the writer in order to get the writer's schema." This means that if the Record Writer inherits the record's schema, and we infer that schema, then any newly added fields will simply be dropped by the writer, because the writer's schema doesn't know about those fields. As a result, we need to ensure that we transform the first record, obtain the schema of that transformed record, and then pass that schema to the Writer, so that the Writer inherits a schema describing the data after transformation.
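The "transform first, then derive the writer's schema" ordering can be sketched as follows. This is a simplified illustration using plain maps rather than NiFi's Record API; the names `WriterSchemaSketch`, `schemaOf`, and `writerSchemaFor` are assumptions for this example only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: apply the transform to the first record, then derive
// the writer's schema from the *transformed* record, so fields added by the
// transform are not dropped. Not NiFi's actual API; types are simplified.
public class WriterSchemaSketch {

    // Derive a trivial name -> type schema from one record.
    static Map<String, String> schemaOf(Map<String, Object> record) {
        Map<String, String> schema = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : record.entrySet()) {
            Object v = e.getValue();
            schema.put(e.getKey(),
                    v instanceof Number ? "number"
                  : v instanceof Boolean ? "boolean" : "string");
        }
        return schema;
    }

    // Transform the first record BEFORE deriving the schema the writer uses.
    static Map<String, String> writerSchemaFor(
            Map<String, Object> firstRecord,
            UnaryOperator<Map<String, Object>> transform) {
        Map<String, Object> transformed = transform.apply(firstRecord);
        return schemaOf(transformed);
    }
}
```

If the schema were taken from the untransformed record instead, a field added by the transform (an enrichment field from LookupRecord, say) would be absent from the writer's schema and silently dropped on write.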
Design/Implementation Goals should include:
- High Performance: schema inference should impact performance as little as is feasible.
- Usability: users should be able to infer schemas with as little configuration as is reasonable.
- Ease of Development: code should be written in a way that makes it easy for new Record Readers to provide schema inference that is fast, efficient, correct, and consistent with how the other readers infer schemas.
- Implementations: At a minimum, we should provide the ability to infer schemas for JSON, XML, and CSV data.
- Backward Compatibility: The new feature should not break backward compatibility for any Record Reader.