[NIFI-13843] Unknown fields not dropped by JSON Writer as expected by specified schema - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.27.0, 2.0.0-M4
Fix Version/s: 2.0.0
Component/s: Extensions
Labels: None

Description

Consider the following use case:

GenerateFlowFile (GFF) processor, generating a JSON with 3 fields: a, b, and c
ConvertRecord with JSON Reader / JSON Writer
- Both reader and writer are configured with a schema only specifying fields a and b

The expected result is a JSON that only contains fields a and b.

We're following the below path in the code:

AbstractRecordProcessor (L131)

Record firstRecord = reader.nextRecord();

In this case, the default method for nextRecord() is defined in RecordReader (L50)

default Record nextRecord() throws IOException, MalformedRecordException {
    return nextRecord(true, false);
}

where we are NOT dropping the unknown fields (the Javadoc needs fixing here, as it says the opposite)

We get to

writer.write(firstRecord);

which gets us to

WriteJsonResult (L206)

Here, we do a check

isUseSerializeForm(record, writeSchema)

which currently returns true when it should not. Because of this, we write the serialised form, which ignores the writer schema.

In this method isUseSerializeForm(), we do check

record.getSchema().equals(writeSchema)

But at this point record.getSchema() returns the schema defined in the reader which is equal to the one defined in the writer - even though the record has additional fields compared to the defined schema.

The suggested fix is to also add a check on

record.isDropUnknownFields()

If dropUnknownFields is false, then we do not use the serialised form.
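
The check described above can be sketched with simplified stand-in types. This is only an illustration under assumptions: `SimpleRecord`, `useSerializedFormBefore`, and `useSerializedFormAfter` are hypothetical names, schemas are reduced to lists of field names, and none of this is the actual NiFi code.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical, simplified stand-ins; these are NOT the NiFi classes.
final class SerializedFormCheckSketch {

    /**
     * Simplified record: the schema it claims to have, the fields it really
     * carries, the dropUnknownFields flag, and an optional cached serialized form.
     */
    record SimpleRecord(List<String> schemaFields,
                        Set<String> actualFields,
                        boolean dropUnknownFields,
                        Optional<String> serializedForm) { }

    /** Behaviour described in the issue: only the declared schemas are compared. */
    static boolean useSerializedFormBefore(SimpleRecord rec, List<String> writeSchema) {
        return rec.serializedForm().isPresent() && rec.schemaFields().equals(writeSchema);
    }

    /** Suggested fix: additionally require that unknown fields were dropped on read. */
    static boolean useSerializedFormAfter(SimpleRecord rec, List<String> writeSchema) {
        return useSerializedFormBefore(rec, writeSchema) && rec.dropUnknownFields();
    }

    /** Record read with schema (a, b) from input that also carried field c. */
    static SimpleRecord sampleRecord() {
        return new SimpleRecord(List.of("a", "b"), Set.of("a", "b", "c"), false,
                Optional.of("{\"a\":1,\"b\":2,\"c\":3}"));
    }

    public static void main(String[] args) {
        SimpleRecord rec = sampleRecord();
        List<String> writeSchema = List.of("a", "b");
        // Schemas match, so the old check reuses the serialized form (which contains c).
        System.out.println("before fix: " + useSerializedFormBefore(rec, writeSchema));
        // dropUnknownFields is false, so the new check falls back to field-by-field writing.
        System.out.println("after fix:  " + useSerializedFormAfter(rec, writeSchema));
    }
}
```

In this model the record's declared schema equals the writer schema even though the record still carries c, which is why the schema comparison alone is not enough.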

While this does solve the issue, I'm a bit conflicted about the current approach. Not only could this have a performance impact (we are likely to use the serialized form less often), but it also feels like the default should be to ignore the unknown fields when reading the record.

If we consider the below scenario:

GFF Processor, generating a JSON with 3 fields: a, b, and c
ConvertRecord with JSON Reader / JSON Writer
- JSON reader with a schema only specifying fields a and b
- JSON writer with a schema specifying fields a, b, and c (c defaulting to null)

It feels like the expected result should be a JSON with the field c and a null value, because the reader would drop the field when reading the JSON and converting it into a record and pass it to the writer.

If we agree on the above, then it may be easier to just override nextRecord() in AbstractJsonRowRecordReader and default to nextRecord(true, true).
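
A rough sketch of that alternative follows, again with hypothetical stand-ins rather than NiFi types: `SimpleReader`, `KeepingReader`, and `DroppingReader` are invented for illustration, records are modelled as plain maps, and the first boolean (type coercion) is ignored.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified stand-ins; these are NOT the NiFi classes.
final class NextRecordDefaultSketch {

    /** Mirrors the shape of the default method quoted in the issue. */
    interface SimpleReader {
        Map<String, Object> nextRecord(boolean coerceTypes, boolean dropUnknownFields);

        default Map<String, Object> nextRecord() {
            return nextRecord(true, false); // unknown fields are kept by default
        }
    }

    /** Reader that relies on the interface default, so unknown fields survive. */
    static class KeepingReader implements SimpleReader {
        protected final Set<String> schemaFields;
        protected final Map<String, Object> raw;

        KeepingReader(Set<String> schemaFields, Map<String, Object> raw) {
            this.schemaFields = schemaFields;
            this.raw = raw;
        }

        @Override
        public Map<String, Object> nextRecord(boolean coerceTypes, boolean dropUnknownFields) {
            // coerceTypes is ignored in this sketch; only field dropping is modelled.
            Map<String, Object> result = new LinkedHashMap<>(raw);
            if (dropUnknownFields) {
                result.keySet().retainAll(schemaFields);
            }
            return result;
        }
    }

    /** Reader that overrides the no-argument method to drop unknown fields. */
    static final class DroppingReader extends KeepingReader {
        DroppingReader(Set<String> schemaFields, Map<String, Object> raw) {
            super(schemaFields, raw);
        }

        @Override
        public Map<String, Object> nextRecord() {
            return nextRecord(true, true);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> raw = new LinkedHashMap<>();
        raw.put("a", 1);
        raw.put("b", 2);
        raw.put("c", 3);
        Set<String> schema = Set.of("a", "b");
        System.out.println("default:  " + new KeepingReader(schema, raw).nextRecord().keySet());
        System.out.println("override: " + new DroppingReader(schema, raw).nextRecord().keySet());
    }
}
```

With the override, field c never reaches the writer, so a writer schema that declares c would have to fill it from its default, as in the second scenario.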

Issue Links

duplicates: NIFI-13362 JSONRecordSetWriter does not account for schema changes when writing serialized form (Resolved)
relates to: NIFI-13963 Unknown fields not dropped by JSON Writer as expected by specified schema (Resolved)
links to: GitHub Pull Request #9347

People

Assignee: Pierre Villard
Reporter: Pierre Villard
Votes: 0
Watchers: 4

Dates

Created: 04/Oct/24 15:01
Updated: 04/Nov/24 19:20
Resolved: 29/Oct/24 19:19

Time Tracking

Estimated: Not Specified
Remaining:
Logged: 1h 10m