Details
Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.0.0-M1, 1.24.0, 1.25.0, 2.0.0-M2, 1.26.0, 2.0.0-M3
Environment: JVM with non-UTF-8 default encoding (e.g. default Windows installation)
Description
Environment
This issue affects environments where the JVM default encoding is not UTF-8. Standard Java installations on Windows are affected, as they usually use the default encoding windows-1252. To reproduce the issue on Linux, change the default encoding to windows-1252 by adding the following line to your bootstrap.conf:
java.arg.21=-Dfile.encoding=windows-1252
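To confirm which default encoding a JVM actually picks up, a small check like the following can help. This is only a sketch: the class and method names are made up for illustration, and it reports the charset of the JVM it is run in, so it has to be started with the same Java installation and the same -Dfile.encoding argument as NiFi to be meaningful.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetCheck {

    // Hypothetical helper (not part of NiFi): names the JVM default charset
    // and notes whether this issue would apply to it.
    static String describeDefaultCharset() {
        Charset charset = Charset.defaultCharset();
        boolean utf8 = charset.equals(StandardCharsets.UTF_8);
        return charset.name() + (utf8 ? " (not affected)" : " (affected)");
    }

    public static void main(String[] args) {
        // Run with e.g. -Dfile.encoding=windows-1252 to mimic the setup above.
        System.out.println(describeDefaultCharset());
    }
}
```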
Summary
The Jolt Specification of both the JoltTransformJSON and JoltTransformRecord processors is read internally using the system default encoding, even though it is always stored in UTF-8. This garbles non-ASCII characters in the Jolt Specification, resulting in incorrect transformations (missing data or garbled keys).
Steps to reproduce
- Make sure NiFi runs with a non-UTF-8 default encoding, see "Environment"
- Create a GenerateFlowFile processor with the following content:
Unknown macro: { "regularString"}
- Connect the processor to a JoltTransformJSON and/or JoltTransformRecord processor.
(If using the record based processor, use a default JsonTreeReader and JsonRecordSetWriter. The record reader/writer don't affect this bug.)
- Set the Jolt Specification to:
[
{
"operation": "shift",
"spec":Unknown macro: { "regularString"}}
]
- Connect the outputs of the Jolt processor(s) to funnels to be able to observe the result in the queue.
- Start the Jolt processor(s) and run the GenerateFlowFile processor once.
The flow should look similar to this:
I also attached a JSON export of the example flow.
- Observe the content of the resulting FlowFile(s) in the queue.
Expected Result
Actual Result
- Remapped key containing non-ASCII characters is garbled, since the key value originated from the Jolt Specification.
- The key "keyWithÜmlaut" could not be matched at all, since it contains non-ASCII characters, resulting in missing data in the output.
Root Cause Analysis
Both processors use the readTransform method of AbstractJoltTransform to read the Jolt Specification property. This method uses an InputStreamReader without specifying an encoding, so the reader falls back to the default charset of the environment. Text properties, however, are always encoded in UTF-8. When the default charset is not UTF-8, the UTF-8 bytes are decoded with the wrong encoding when they are converted to a string, and the processor ends up using a garbled Jolt Specification.
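The effect can be reproduced outside NiFi with a minimal sketch. The class and helper below are hypothetical and are not the actual readTransform code. To make the result independent of the machine it runs on, the "wrong" charset is passed explicitly as windows-1252 instead of relying on the JVM default.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SpecDecodingDemo {

    // Hypothetical helper, not NiFi code: reads the bytes line by line through
    // an InputStreamReader that decodes with the given charset.
    static String decode(byte[] bytes, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String key = "keyWith\u00DCmlaut"; // "keyWithÜmlaut"

        // The property value is stored as UTF-8 bytes.
        byte[] stored = key.getBytes(StandardCharsets.UTF_8);

        // Stand-in for a JVM whose default charset is windows-1252.
        String garbled = decode(stored, Charset.forName("windows-1252"));
        // Decoding with the charset the bytes were written in.
        String intact = decode(stored, StandardCharsets.UTF_8);

        System.out.println(garbled.equals(key)); // false
        System.out.println(intact.equals(key));  // true
    }
}
```

Passing StandardCharsets.UTF_8 to the InputStreamReader constructor is one way to make the result independent of the environment; whether the actual fix took this form is not stated in this report.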
Workaround
This issue does not occur when the Jolt Specification contains any expression language. Simply adding ${literal('')} anywhere in the Jolt Specification works around it.
This works because a different code path is used when expression language is present.
I don't know why the property is read line-by-line using a stream reader at all when no expression language is present. It seems like just using getValue() would work fine even without expression language, and that method doesn't have the encoding bug.
Attachments
Issue Links
- is related to: NIFI-10666 PrometheusReportingTask does not use UTF-8 encoding on /metrics/ endpoint (Resolved)
- relates to: NIFI-12669 EvaluateXQuery processor incorrectly encodes result attributes (Resolved)