[NIFI-12670] JoltTransform processors incorrectly encode/decode text in the Jolt Specification - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.0.0-M1, 1.24.0, 1.25.0, 2.0.0-M2, 1.26.0, 2.0.0-M3
Fix Version/s: 2.0.0-M4
Component/s: Configuration, Extensions
Labels:
- encoding
- jolt
- json
- utf8
- windows
Environment:
JVM with non-UTF-8 default encoding (e.g. default Windows installation)

Description

Environment

This issue affects environments where the JVM default encoding is not UTF-8. Standard Java installations on Windows are affected, as they usually use the default encoding windows-1252. To reproduce the issue on Linux, change the default encoding to windows-1252 by adding the following line to your bootstrap.conf:

java.arg.21=-Dfile.encoding=windows-1252

Summary

The Jolt Specification of both the JoltTransformJSON and JoltTransformRecord processors is read internally using the system default encoding, even though it is always stored in UTF-8. This causes non-ASCII characters to be garbled in the Jolt Specification, resulting in incorrect transformations (missing data or garbled keys).
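The garbling can be reproduced outside NiFi. The following is a minimal sketch, not NiFi code: the class and the `misdecode` helper are hypothetical names used only for illustration. It encodes the key "keyWithÜmlaut" from this report as UTF-8 and decodes the bytes as windows-1252, which is what effectively happens to the Jolt Specification on an affected JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class JoltSpecMojibakeDemo {

    // Hypothetical helper: encodes the text as UTF-8 (how the property is stored)
    // and decodes the bytes as windows-1252 (the non-UTF-8 default charset).
    static String misdecode(String original) {
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // \u00DC is "Ü"; the escape keeps this source file pure ASCII.
        String key = "keyWith\u00DCmlaut";
        String garbled = misdecode(key);

        // U+00DC is two bytes in UTF-8 (C3 9C), so it turns into two characters.
        System.out.println(key.length() + " -> " + garbled.length());

        // The garbled key no longer equals the key in the incoming JSON,
        // so a shift spec keyed on it cannot match.
        System.out.println(garbled.equals(key));

        // Pure ASCII keys survive because both encodings agree on ASCII.
        System.out.println(misdecode("regularString").equals("regularString"));
    }
}
```

This is consistent with the two symptoms listed under "Actual Result": a key taken from the specification is written out garbled, and a key that has to be matched against the input is not found.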

Steps to reproduce

Make sure NiFi runs with a non-UTF-8 default encoding (see "Environment").
Create a GenerateFlowFile processor with the following content:
Unknown macro: { "regularString"}
Connect the processor to a JoltTransformJSON and/or JoltTransformRecord processor.
(If using the record based processor, use a default JsonTreeReader and JsonRecordSetWriter. The record reader/writer don't affect this bug.)
Set the Jolt Specification to:
[
{
"operation": "shift",
"spec":

Unknown macro: { "regularString"}

}
]
Connect the outputs of the Jolt processor(s) to funnels to be able to observe the result in the queue.
Start the Jolt processor(s) and run the GenerateFlowFile processor once.
The flow should look similar to this:

I also attached a JSON export of the example flow.
Observe the content of the resulting FlowFile(s) in the queue.

Expected Result

Actual Result

Remapped key containing non-ASCII characters is garbled, since the key value originated from the Jolt Specification.
The key "keyWithÜmlaut" could not be matched at all, since it contains non-ASCII characters, resulting in missing data in the output.

Root Cause Analysis

Both processors use the readTransform method of AbstractJoltTransform to read the Jolt Specification property. This method uses an InputStreamReader without specifying an encoding, so the reader falls back to the default charset of the environment. Text properties are always encoded in UTF-8. When the default charset is not UTF-8, the UTF-8 bytes are interpreted in a different encoding when they are converted to a string, and a garbled Jolt Specification is used.
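The difference between the two ways of constructing the reader can be sketched as follows. This is an illustration under assumptions, not the actual readTransform implementation or the merged fix: the class name, the `readSpec` helper and the sample text are hypothetical. The only point is that `new InputStreamReader(in)` uses the JVM default charset, while `new InputStreamReader(in, StandardCharsets.UTF_8)` does not depend on the environment.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

public class SpecReaderSketch {

    // Hypothetical helper: reads a stream line by line with an explicit charset.
    // Passing the charset explicitly removes the dependency on file.encoding.
    static String readSpec(InputStream in, Charset charset) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.lines().collect(Collectors.joining("\n"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the stored property value, which is always UTF-8.
        String text = "keyWith\u00DCmlaut"; // \u00DC is "Ü"
        byte[] stored = text.getBytes(StandardCharsets.UTF_8);

        // Explicit UTF-8: the text round-trips unchanged.
        String fixed = readSpec(new ByteArrayInputStream(stored), StandardCharsets.UTF_8);
        System.out.println(fixed.equals(text));

        // Simulates "new InputStreamReader(in)" on a JVM whose default
        // charset is windows-1252: the text no longer round-trips.
        String buggy = readSpec(new ByteArrayInputStream(stored), Charset.forName("windows-1252"));
        System.out.println(buggy.equals(text));
    }
}
```

The windows-1252 charset is passed explicitly here only to make the faulty behavior reproducible on any machine; on an affected installation the same result comes from omitting the charset argument.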

Workaround

This issue is not present when any attribute expression language is present in the Jolt Specification. Simply adding ${literal('')} anywhere in the Jolt Specification works around this issue.

This happens because a different code path is used when expression language is present.
I don't know why the property is even read line-by-line using a stream reader when no expression language is present. It seems like just using getValue() would work fine even without expression language, and that method doesn't have the encoding bug.

Attachments

image-2024-01-25-11-01-15-405.png
25/Jan/24 10:01
101 kB
René Zeidler
image-2024-01-25-11-59-56-662.png
25/Jan/24 10:59
5 kB
René Zeidler
image-2024-01-25-12-00-09-544.png
25/Jan/24 11:00
6 kB
René Zeidler
Jolt_Transform_Encoding_Bug_M2.json
02/Feb/24 10:23
22 kB
René Zeidler
Jolt_Transform_Encoding_Bug.json
25/Jan/24 10:02
22 kB
René Zeidler

Issue Links

is related to

NIFI-10666 PrometheusReportingTask does not use UTF-8 encoding on /metrics/ endpoint

Resolved

relates to

NIFI-12669 EvaluateXQuery processor incorrectly encodes result attributes

Resolved

links to

GitHub Pull Request #8842

Activity

People

Assignee: Jim Steinebrey

Reporter: René Zeidler

Votes: 0

Watchers: 3

Dates

Created: 25/Jan/24 11:12

Updated: 17/May/24 13:57

Resolved: 17/May/24 13:57

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

0.5h