[NIFI-12669] EvaluateXQuery processor incorrectly encodes result attributes - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.27.0, 2.0.0-M4
Component/s: Configuration, Extensions
Labels:
- encoding
- utf8
- windows
- xml
Environment:
JVM with non-UTF-8 default encoding (e.g. default Windows installation)

Description

Environment

This issue affects environments where the JVM default encoding is not UTF-8. Standard Java installations on Windows are affected, as they usually use the default encoding windows-1252. To reproduce the issue on Linux, change the default encoding to windows-1252 by adding the following line to your bootstrap.conf:

java.arg.21=-Dfile.encoding=windows-1252
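
To confirm which default encoding a JVM actually picks up, a small check like the following can help. This is only a sketch: the class name DefaultCharsetCheck is made up for illustration, and the program has to be started with the same Java installation and the same -Dfile.encoding argument as NiFi for the result to be meaningful.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of NiFi.
public class DefaultCharsetCheck {

    /** Returns true when the JVM default charset is UTF-8, i.e. the environment is not affected. */
    static boolean defaultIsUtf8() {
        return StandardCharsets.UTF_8.equals(Charset.defaultCharset());
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
        System.out.println("Affected by this issue: " + !defaultIsUtf8());
    }
}
```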

Summary

The EvaluateXQuery processor incorrectly decodes result values when storing them in FlowFile attributes. As a result, non-ASCII characters in the attribute values are garbled.
Example:

Steps to reproduce

1. Make sure NiFi runs with a non-UTF-8 default encoding (see "Environment").
2. Create a GenerateFlowFile processor with the following content:
   <?xml version="1.0" encoding="UTF-8"?>
   <myRoot>
   <myData>This text contains non-ASCII characters: ÄÖÜäöüßéèóò</myData>
   </myRoot>
3. Connect the processor to an EvaluateXQuery processor.
4. Set the Destination to flowfile-attribute.
5. Create a custom property myData with value string(/myRoot/myData).
6. Connect the outputs of the EvaluateXQuery processor to funnels so that the result can be observed in the queue.
7. Start the EvaluateXQuery processor and run the GenerateFlowFile processor once.
   The flow should look similar to the attached screenshot. I also attached a JSON export of the example flow.
8. Observe the attributes of the resulting FlowFile in the queue.

Expected Result

The FlowFile should contain an attribute myData with the value "This text contains non-ASCII characters: ÄÖÜäöüßéèóò".

Actual Result

The attribute has the value "This text contains non-ASCII characters: Ã„Ã–ÃœÃ¤Ã¶Ã¼ÃŸÃ©Ã¨Ã³Ã²".
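
The garbling can be reproduced outside NiFi with a few lines of Java. The sketch below is not NiFi code; the class and method names are made up for illustration. It encodes text as UTF-8 and then decodes the bytes as windows-1252, which is what effectively happens when the JVM default charset is windows-1252. The non-ASCII characters are written as Unicode escapes so that the encoding of the source file does not influence the result.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of NiFi.
public class MojibakeDemo {

    /** Encodes the text as UTF-8, then decodes the resulting bytes with the given charset. */
    static String misdecode(String text, Charset decodeCharset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, decodeCharset);
    }

    public static void main(String[] args) {
        // Unicode escapes for the umlauts and accented letters used in the report
        String original = "\u00C4\u00D6\u00DC\u00E4\u00F6\u00FC\u00DF\u00E9\u00E8\u00F3\u00F2";

        String garbled = misdecode(original, Charset.forName("windows-1252"));
        String intact = misdecode(original, StandardCharsets.UTF_8);

        // Each of these characters takes two bytes in UTF-8, and windows-1252
        // turns every byte into one character, so the garbled text is twice as long.
        System.out.println("original length: " + original.length());
        System.out.println("garbled length:  " + garbled.length());
        System.out.println("UTF-8 round trip intact: " + intact.equals(original));
    }
}
```

Every affected character becomes two characters, the first of which is always "Ã" (byte 0xC3), which matches the pattern in the actual result above.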

Root Cause Analysis

EvaluateXQuery uses the method formatItem to write the query result to an attribute. This method calls ByteArrayOutputStream's toString method without a charset argument, so the bytes are decoded using the default charset of the environment. However, bytes are always written to this output stream as UTF-8 (.getBytes(StandardCharsets.UTF_8)). When the default charset is not UTF-8, the UTF-8 bytes are therefore interpreted in a different encoding when they are converted to a string, resulting in garbled text (see above).
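
The pattern described above, and one possible way to avoid it, can be sketched as follows. This is not the actual formatItem code and not necessarily the change made in the linked pull request; the class and method names are hypothetical. The idea is simply to decode with the same charset that was used for writing.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the pattern, not the NiFi implementation.
public class FormatItemSketch {

    /** Writes the text into a fresh stream as UTF-8 bytes. */
    static ByteArrayOutputStream writeUtf8(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        out.write(bytes, 0, bytes.length);
        return out;
    }

    /** Problematic pattern: toString() decodes with the JVM default charset. */
    static String decodeWithDefaultCharset(ByteArrayOutputStream out) {
        return out.toString();
    }

    /** Safer pattern: decode explicitly as UTF-8, independent of the environment. */
    static String decodeAsUtf8(ByteArrayOutputStream out) {
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00C4\u00D6\u00DC"; // three umlauts as Unicode escapes
        ByteArrayOutputStream out = writeUtf8(original);

        // Only correct when the JVM default charset happens to be UTF-8
        System.out.println("default charset decode intact: "
                + decodeWithDefaultCharset(out).equals(original));
        // Correct regardless of the JVM default charset
        System.out.println("explicit UTF-8 decode intact:  "
                + decodeAsUtf8(out).equals(original));
    }
}
```

On Java 10 and later, out.toString(StandardCharsets.UTF_8) is an equivalent alternative to constructing the String from the byte array.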

Attachments

- EvaluateXQuery_Encoding_Bug.json (8 kB), 25/Jan/24 09:22, René Zeidler
- image-2024-01-25-10-24-17-005.png (6 kB), 25/Jan/24 09:24, René Zeidler
- image-2024-01-25-10-31-35-200.png (59 kB), 25/Jan/24 09:31, René Zeidler

Issue Links

is related to

- NIFI-10666 PrometheusReportingTask does not use UTF-8 encoding on /metrics/ endpoint (Resolved)
- NIFI-12670 JoltTransform processors incorrectly encode/decode text in the Jolt Specification (Resolved)

links to

GitHub Pull Request #8826

People

Assignee: Jim Steinebrey
Reporter: René Zeidler
Votes: 0
Watchers: 5

Dates

Created: 25/Jan/24 09:54
Updated: 27/Jun/24 14:48
Resolved: 14/May/24 15:23

Time Tracking

Estimated: Not Specified
Remaining:
Logged: 0.5h