  1. Apache NiFi
  2. NIFI-13843

Unknown fields not dropped by JSON Writer as expected by specified schema

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.27.0, 2.0.0-M4
    • Fix Version/s: 2.0.0
    • Component/s: Extensions
    • Labels: None

    Description

      Consider the following use case:

      • GenerateFlowFile (GFF) processor, generating a JSON document with 3 fields: a, b, and c
      • ConvertRecord with JSON Reader / JSON Writer
        • Both reader and writer are configured with a schema that specifies only fields a and b

      The expected result is a JSON that only contains fields a and b.

      We're following the below path in the code:

      • AbstractRecordProcessor (L131)
      Record firstRecord = reader.nextRecord(); 

      In this case, the default method for nextRecord() is defined in RecordReader (L50)

      default Record nextRecord() throws IOException, MalformedRecordException {
          return nextRecord(true, false);
      } 

      This calls nextRecord(coerceTypes = true, dropUnknownFields = false), so we are NOT dropping the unknown fields (the Javadoc needs fixing here, as it says the opposite).

      We get to 

      writer.write(firstRecord); 

      which gets us to

      • WriteJsonResult (L206)

      Here, we do a check

      isUseSerializeForm(record, writeSchema) 

      which currently returns true when it should not. Because of this, we write the serialised form, which ignores the writer schema.

      Inside isUseSerializeForm(), we check

      record.getSchema().equals(writeSchema) 

      But at this point record.getSchema() returns the schema defined in the reader, which is equal to the one defined in the writer, even though the record has additional fields compared to the defined schema.

      The suggested fix is to also add a check on

      record.isDropUnknownFields() 

      If dropUnknownFields is false, then we do not use the serialised form.
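
      The difference between the two checks can be sketched as below. This is a simplified model, not the NiFi code: SimpleRecord, the list-of-names schema, and both method names are hypothetical stand-ins, and the real isUseSerializeForm() may test further conditions that are not discussed here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SerializedFormCheckSketch {

    // Hypothetical stand-in for a record: the schema it reports (field names only),
    // the values actually read, and the flag the reader was called with.
    static final class SimpleRecord {
        final List<String> schemaFields;
        final Map<String, Object> values;
        final boolean dropUnknownFields;

        SimpleRecord(List<String> schemaFields, Map<String, Object> values, boolean dropUnknownFields) {
            this.schemaFields = schemaFields;
            this.values = values;
            this.dropUnknownFields = dropUnknownFields;
        }
    }

    // Check as described in the issue: schema equality alone.
    static boolean useSerializedFormSchemaOnly(SimpleRecord rec, List<String> writeSchema) {
        return rec.schemaFields.equals(writeSchema);
    }

    // Suggested check: also require that unknown fields were dropped on read.
    static boolean useSerializedFormWithDropCheck(SimpleRecord rec, List<String> writeSchema) {
        return rec.dropUnknownFields && rec.schemaFields.equals(writeSchema);
    }

    // A record read with dropUnknownFields = false: it reports schema (a, b) but still carries c.
    static SimpleRecord recordReadWithoutDropping() {
        Map<String, Object> values = new LinkedHashMap<>();
        values.put("a", 1);
        values.put("b", 2);
        values.put("c", 3);
        return new SimpleRecord(List.of("a", "b"), values, false);
    }

    public static void main(String[] args) {
        List<String> writeSchema = List.of("a", "b");
        SimpleRecord rec = recordReadWithoutDropping();
        System.out.println("schema-only check: " + useSerializedFormSchemaOnly(rec, writeSchema));
        System.out.println("with drop check:   " + useSerializedFormWithDropCheck(rec, writeSchema));
    }
}
```

      In this model the schema-only check accepts the record although it still holds field c, while the combined check rejects it, so the writer would fall back to writing field by field against its own schema.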

      While this does solve the issue, I'm a bit conflicted on the current approach. Not only could this have a performance impact (we are likely to use the serialized form less often), but it also feels like the default should be to ignore the unknown fields when reading the record.

      If we consider the below scenario:

      • GFF Processor, generating a JSON with 3 fields: a, b, and c
      • ConvertRecord with JSON Reader / JSON Writer
        • JSON reader with a schema only specifying fields a and b
        • JSON writer with a schema specifying fields a, b, and c (c defaulting to null)

      It feels like the expected result should be a JSON with the field c set to null, because the reader would drop the field when reading the JSON and converting it into a record, before passing the record to the writer.
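
      The reasoning in this scenario can be modelled with a small sketch. It is only an illustration under stated assumptions: read() and write() are hypothetical helpers operating on plain maps, not NiFi's reader and writer, and the writer is assumed to emit null for any schema field missing from the record.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class WriterProjectionSketch {

    // Hypothetical reader: copies the raw fields and, if asked, keeps only those in the read schema.
    static Map<String, Object> read(Map<String, Object> raw, List<String> readSchema, boolean dropUnknownFields) {
        Map<String, Object> rec = new LinkedHashMap<>(raw);
        if (dropUnknownFields) {
            rec.keySet().retainAll(readSchema);
        }
        return rec;
    }

    // Hypothetical writer: emits exactly the write-schema fields, using null when a field is absent.
    static Map<String, Object> write(Map<String, Object> rec, List<String> writeSchema) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (String field : writeSchema) {
            out.put(field, rec.get(field));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> raw = new LinkedHashMap<>();
        raw.put("a", 1);
        raw.put("b", 2);
        raw.put("c", 3);
        List<String> readSchema = List.of("a", "b");
        List<String> writeSchema = List.of("a", "b", "c");

        // Reader drops unknown fields: c reaches the writer as absent and is written as null.
        System.out.println(write(read(raw, readSchema, true), writeSchema));
        // Reader keeps unknown fields: the original value of c leaks through to the output.
        System.out.println(write(read(raw, readSchema, false), writeSchema));
    }
}
```

      With dropping enabled the output carries c as null; without it, the value of c from the input survives even though the reader schema never declared it.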

      If we agree on the above, then it may be easier to just override nextRecord() in AbstractJsonRowRecordReader and default to nextRecord(true, true).
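
      A minimal sketch of that alternative, assuming a much-reduced reader interface: SimpleReader and SimpleJsonReader are hypothetical stand-ins for RecordReader and AbstractJsonRowRecordReader, records are plain maps, and type coercion is ignored. It only shows how overriding the no-argument default method changes the default behaviour for one reader family.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NextRecordOverrideSketch {

    // Stand-in for the reader interface: the no-arg default keeps unknown fields.
    interface SimpleReader {
        Map<String, Object> nextRecord(boolean coerceTypes, boolean dropUnknownFields);

        default Map<String, Object> nextRecord() {
            return nextRecord(true, false);
        }
    }

    // Stand-in for a JSON reader that overrides the default to drop unknown fields.
    static class SimpleJsonReader implements SimpleReader {
        private final List<String> schemaFields;
        private final Map<String, Object> raw;

        SimpleJsonReader(List<String> schemaFields, Map<String, Object> raw) {
            this.schemaFields = schemaFields;
            this.raw = raw;
        }

        @Override
        public Map<String, Object> nextRecord(boolean coerceTypes, boolean dropUnknownFields) {
            // coerceTypes is ignored in this sketch.
            Map<String, Object> rec = new LinkedHashMap<>(raw);
            if (dropUnknownFields) {
                rec.keySet().retainAll(schemaFields);
            }
            return rec;
        }

        @Override
        public Map<String, Object> nextRecord() {
            // The proposed change: default to dropping unknown fields.
            return nextRecord(true, true);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> raw = new LinkedHashMap<>();
        raw.put("a", 1);
        raw.put("b", 2);
        raw.put("c", 3);
        SimpleReader reader = new SimpleJsonReader(List.of("a", "b"), raw);
        System.out.println(reader.nextRecord().keySet());
    }
}
```

      Callers that invoke nextRecord() without arguments, as AbstractRecordProcessor does in the path above, would then receive records without field c, while callers passing explicit flags are unaffected.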

            People

              pvillard Pierre Villard
              pvillard Pierre Villard

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 10m