Apache NiFi / NIFI-12670

JoltTransform processors incorrectly encode/decode text in the Jolt Specification

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0-M1, 1.24.0, 1.25.0, 2.0.0-M2, 1.26.0, 2.0.0-M3
    • Fix Version/s: 2.0.0-M4
    • Component/s: Configuration, Extensions
    • Environment: JVM with non-UTF-8 default encoding (e.g. default Windows installation)

    Description

      Environment

      This issue affects environments where the JVM default encoding is not UTF-8. Standard Java installations on Windows are affected, as they usually use the default encoding windows-1252. To reproduce the issue on Linux, change the default encoding to windows-1252 by adding the following line to your bootstrap.conf:

      java.arg.21=-Dfile.encoding=windows-1252
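
To confirm which default encoding a JVM actually picks up, a small standalone check like the following can help. This is only a sketch: the class name DefaultCharsetCheck and the helper isUtf8 are made up for illustration, and the program reports the charset of the JVM it runs in, so it has to be started with the same -Dfile.encoding argument as NiFi to be meaningful.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetCheck {

    // Hypothetical helper: true if the given charset is UTF-8.
    static boolean isUtf8(Charset charset) {
        return StandardCharsets.UTF_8.equals(charset);
    }

    public static void main(String[] args) {
        Charset defaultCharset = Charset.defaultCharset();
        // file.encoding is the system property set by the bootstrap.conf line above.
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + defaultCharset.name());
        System.out.println(isUtf8(defaultCharset)
                ? "Default charset is UTF-8: this issue should not reproduce."
                : "Default charset is not UTF-8: this issue should reproduce.");
    }
}
```

For example, running it with java -Dfile.encoding=windows-1252 DefaultCharsetCheck mimics the bootstrap.conf setting shown above.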

      Summary

      The Jolt Specification of both the JoltTransformJSON and JoltTransformRecord processors is read internally using the system default encoding, even though it is always stored in UTF-8. This garbles non-ASCII characters in the Jolt Specification and results in incorrect transformations (missing data or garbled keys).

      Steps to reproduce

      1. Make sure NiFi runs with a non-UTF-8 default encoding, see "Environment"
      2. Create a GenerateFlowFile processor with the following content:
        Unknown macro: {   "regularString"}
      3. Connect the processor to a JoltTransformJSON and/or JoltTransformRecord processor.
        (If using the record-based processor, use a default JsonTreeReader and JsonRecordSetWriter. The record reader and writer don't affect this bug.)
        Set the Jolt Specification to:

        [
          {
            "operation": "shift",
            "spec":

        Unknown macro: {       "regularString"}

          }
        ]

      4. Connect the outputs of the Jolt processor(s) to funnels to be able to observe the result in the queue.
      5. Start the Jolt processor(s) and run the GenerateFlowFile processor once.
        The flow should look similar to this:

        I also attached a JSON export of the example flow.
      6. Observe the content of the resulting FlowFile(s) in the queue.

      Expected Result

      Actual Result

      • Remapped key containing non-ASCII characters is garbled, since the key value originated from the Jolt Specification.
      • The key "keyWithÜmlaut" could not be matched at all, since it contains non-ASCII characters, resulting in missing data in the output.
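
Both symptoms can be imitated outside NiFi with a toy lookup. This is a rough sketch, not Jolt itself: the shift method is a simplified stand-in for a Jolt shift operation, and the target names "targetÜ" and "plainTarget" are hypothetical (only "regularString" and "keyWithÜmlaut" come from this report). The garble method mimics the bug by decoding UTF-8 bytes as windows-1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class GarbledKeyLookup {

    // Mimics the bug: UTF-8 bytes are decoded as windows-1252.
    static String garble(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), Charset.forName("windows-1252"));
    }

    // Toy stand-in for a shift: copies each input value whose key is named
    // in the spec to the target key given by the spec.
    static Map<String, Object> shift(Map<String, Object> input, Map<String, String> spec) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : spec.entrySet()) {
            if (input.containsKey(e.getKey())) {
                out.put(e.getValue(), input.get(e.getKey()));
            }
        }
        return out;
    }

    // Runs the toy shift with either the intact or the garbled spec.
    static Map<String, Object> demo(boolean garbleSpec) {
        Map<String, Object> input = new LinkedHashMap<>();
        input.put("regularString", "a");
        input.put("keyWith\u00DCmlaut", "b");        // keyWithÜmlaut

        Map<String, String> spec = new LinkedHashMap<>();
        spec.put("regularString", "target\u00DC");   // hypothetical target name
        spec.put("keyWith\u00DCmlaut", "plainTarget"); // hypothetical target name

        Map<String, String> effective = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : spec.entrySet()) {
            effective.put(garbleSpec ? garble(e.getKey()) : e.getKey(),
                          garbleSpec ? garble(e.getValue()) : e.getValue());
        }
        return shift(input, effective);
    }

    public static void main(String[] args) {
        System.out.println("intact spec:  " + demo(false));
        System.out.println("garbled spec: " + demo(true));
    }
}
```

With the garbled spec, the ASCII key still matches but its non-ASCII target name comes out garbled, and the entry for "keyWithÜmlaut" is dropped because the garbled spec key no longer equals the input key.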

      Root Cause Analysis

      Both processors use the readTransform method of AbstractJoltTransform to read the Jolt Specification property. This method uses an InputStreamReader without specifying an encoding, so the reader falls back to the default charset of the environment. Text properties are always encoded in UTF-8. When the default charset is not UTF-8, the UTF-8 bytes are interpreted in a different encoding when they are converted to a string, and a garbled Jolt Specification is used.
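
The effect can be shown in isolation with a short sketch. The class JoltSpecEncodingDemo and the helper readAll are hypothetical and are not the actual AbstractJoltTransform code. To keep the result independent of the machine it runs on, windows-1252 is passed explicitly to stand in for "InputStreamReader without a charset on a windows-1252 JVM".

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

public class JoltSpecEncodingDemo {

    // Hypothetical helper: reads the whole stream line by line with the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.lines().collect(Collectors.joining("\n"));
        }
    }

    public static void main(String[] args) throws IOException {
        // A fragment containing the key from this report (\u00DC is Ü).
        String spec = "{\"keyWith\u00DCmlaut\": \"target\"}";
        byte[] storedBytes = spec.getBytes(StandardCharsets.UTF_8); // stored as UTF-8

        // Simulates the bug: the UTF-8 bytes are decoded with a non-UTF-8 charset.
        // Ü is stored as the two bytes 0xC3 0x9C, which windows-1252 decodes as "Ãœ".
        String garbled = readAll(new ByteArrayInputStream(storedBytes),
                Charset.forName("windows-1252"));

        // Naming the charset explicitly avoids the dependency on the JVM default.
        String intact = readAll(new ByteArrayInputStream(storedBytes),
                StandardCharsets.UTF_8);

        System.out.println("garbled equals original: " + garbled.equals(spec));
        System.out.println("intact equals original:  " + intact.equals(spec));
    }
}
```

Passing StandardCharsets.UTF_8 to the InputStreamReader constructor is one way to make such a read independent of the environment. Whether the actual fix took this route or avoided the stream reader entirely is not stated in this report.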

      Workaround

      This issue is not present when any attribute expression language is present in the Jolt Specification. Simply adding ${literal('')} anywhere in the Jolt Specification works around this issue.

      This happens because a different code path is used when expression language is present.
      I don't know why the property is even read line-by-line using a stream reader when no expression language is present. It seems like just using getValue() would work fine even without expression language, and that method doesn't have the encoding bug.

      Attachments

        1. Jolt_Transform_Encoding_Bug.json
          22 kB
          René Zeidler
        2. Jolt_Transform_Encoding_Bug_M2.json
          22 kB
          René Zeidler
        3. image-2024-01-25-12-00-09-544.png
          6 kB
          René Zeidler
        4. image-2024-01-25-11-59-56-662.png
          5 kB
          René Zeidler
        5. image-2024-01-25-11-01-15-405.png
          101 kB
          René Zeidler

            People

              Assignee: Jim Steinebrey (jrsteinebrey)
              Reporter: René Zeidler (Rene_Z)