  1. Apache NiFi
  2. NIFI-12669

EvaluateXQuery processor incorrectly encodes result attributes

Details

    Description

      Environment

      This issue affects environments where the JVM default encoding is not UTF-8. Standard Java installations on Windows are affected, as they usually use the default encoding windows-1252. To reproduce the issue on Linux, change the default encoding to windows-1252 by adding the following line to your bootstrap.conf:

      java.arg.21=-Dfile.encoding=windows-1252
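To confirm which default encoding a JVM actually picks up, a small check like the following can help. This is a minimal sketch, not part of NiFi: the class name `DefaultCharsetCheck` is made up for illustration, and it only reports on the JVM it runs in, so it would need to be started with the same Java installation and the same `-Dfile.encoding` argument as NiFi to be meaningful.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of NiFi): reports the default charset of the running JVM.
public class DefaultCharsetCheck {

    // True when the JVM default charset is UTF-8, in which case the issue should not reproduce.
    public static boolean defaultIsUtf8() {
        return StandardCharsets.UTF_8.equals(Charset.defaultCharset());
    }

    public static void main(String[] args) {
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset().name());
        System.out.println("UTF-8 default   = " + defaultIsUtf8());
    }
}
```

Running it as `java -Dfile.encoding=windows-1252 DefaultCharsetCheck` mirrors the bootstrap.conf setting above.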

      Summary

      The EvaluateXQuery processor incorrectly encodes result values when storing them in attributes. This causes non-ASCII characters to be garbled.

      Steps to reproduce

      1. Make sure NiFi runs with a non-UTF-8 default encoding, see "Environment"
      2. Create a GenerateFlowFile processor with the following content:

        <?xml version="1.0" encoding="UTF-8"?>
        <myRoot>
          <myData>This text contains non-ASCII characters: ÄÖÜäöüßéèóò</myData>
        </myRoot>

      3. Connect the processor to an EvaluateXQuery processor.
        Set the Destination to flowfile-attribute.
        Create a custom property myData with value string(/myRoot/myData).
      4. Connect the outputs of the EvaluateXQuery processor to funnels to be able to observe the result in the queue.
      5. Start the EvaluateXQuery processor and run the GenerateFlowFile processor once.
        The flow should look similar to the attached screenshot.
        I also attached a JSON export of the example flow (EvaluateXQuery_Encoding_Bug.json).
      6. Observe the attributes of the resulting FlowFile in the queue.

      Expected Result

      The FlowFile should contain an attribute myData with the value "This text contains non-ASCII characters: ÄÖÜäöüßéèóò".

      Actual Result

      The attribute has the value "This text contains non-ASCII characters: Ã„Ã–ÃœÃ¤Ã¶Ã¼ÃŸÃ©Ã¨Ã³Ã²".
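The garbling can be reproduced outside NiFi by encoding the sample characters as UTF-8 and decoding the bytes as windows-1252, which turns each non-ASCII character into two characters. The sketch below is illustrative only: `MojibakeSketch`, `garble`, and `repair` are hypothetical names, and the reverse step is shown purely as a way to confirm the diagnosis, not as a recommended workaround.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration class (not part of NiFi).
public class MojibakeSketch {

    private static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    // Encode as UTF-8, then decode as windows-1252: the mismatch described in this issue.
    public static String garble(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), WINDOWS_1252);
    }

    // Reverse the mismatch: re-encode as windows-1252, then decode as UTF-8.
    public static String repair(String garbled) {
        return new String(garbled.getBytes(WINDOWS_1252), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The sample characters are written as Unicode escapes so that the
        // encoding of this source file does not influence the result.
        String sample = "\u00c4\u00d6\u00dc\u00e4\u00f6\u00fc\u00df\u00e9\u00e8\u00f3\u00f2";
        String garbled = garble(sample);
        System.out.println(sample.length() + " characters become " + garbled.length());
        System.out.println("round trip restores original: " + repair(garbled).equals(sample));
    }
}
```

The round trip only works when every UTF-8 byte maps to a defined windows-1252 character, which holds for the sample text used here but not for all input.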

      Root Cause Analysis

      EvaluateXQuery uses the method formatItem to write the query result to an attribute. This method calls ByteArrayOutputStream's toString method without an encoding argument, so the bytes are decoded using the default charset of the environment. Bytes are always written to this output stream using UTF-8 (.getBytes(StandardCharsets.UTF_8)). When the default charset is not UTF-8, the UTF-8 bytes are interpreted in a different encoding when they are converted to a string, which produces the garbled text described under "Actual Result".
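The pattern described above can be sketched in isolation as follows. This is not the NiFi source: `FormatItemSketch` and both method names are hypothetical stand-ins, and the second method shows only one possible way to make the decode charset explicit.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the pattern in formatItem (not the actual NiFi code).
public class FormatItemSketch {

    // Problematic pattern: bytes are written as UTF-8 but decoded with the JVM default charset.
    public static String formatWithDefaultCharset(String value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        out.write(utf8, 0, utf8.length);
        return out.toString(); // uses Charset.defaultCharset()
    }

    // Same write path, but the decode charset matches the charset used for writing.
    public static String formatWithUtf8(String value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        out.write(utf8, 0, utf8.length);
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String sample = "\u00e4\u00f6\u00fc"; // three non-ASCII characters, written as escapes
        System.out.println("default-charset decode intact: "
                + formatWithDefaultCharset(sample).equals(sample));
        System.out.println("explicit UTF-8 decode intact:  "
                + formatWithUtf8(sample).equals(sample));
    }
}
```

Started with `-Dfile.encoding=windows-1252`, the first line should report `false` and the second `true`. On Java 10 and later, `out.toString(StandardCharsets.UTF_8)` is an equivalent way to pass the charset explicitly.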

      Attachments

        1. EvaluateXQuery_Encoding_Bug.json
          8 kB
          René Zeidler
        2. image-2024-01-25-10-24-17-005.png
          6 kB
          René Zeidler
        3. image-2024-01-25-10-31-35-200.png
          59 kB
          René Zeidler

            People

              Assignee: jrsteinebrey Jim Steinebrey
              Reporter: Rene_Z René Zeidler

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 0.5h