  1. Apache NiFi
  2. NIFI-12750

ExecuteStreamCommand incorrectly decodes error stream


Details

    Description

      Summary

      The ExecuteStreamCommand processor stores everything the invoked command writes to the error stream (stderr) into the FlowFile attribute execution.error.

      When converting the bytes from the stream to a String, it interprets each individual byte as a Unicode code point. Because it reads one byte at a time, this effectively decodes the stream as ISO-8859-1 (Latin-1).

      Instead, it should use the system default encoding (like it already does for writing stdout if Output Destination Attribute is set) or use a configurable encoding (for both stdout and stderr).
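
      A minimal sketch of the "configurable encoding" idea, assuming a hypothetical processor property that holds a charset name. The class and method names below are illustrative only and are not part of NiFi; the sketch falls back to the system default encoding when the property is unset.

```java
import java.nio.charset.Charset;

final class CharsetOption {

    // Returns the charset named by a (hypothetical) processor property,
    // falling back to the JVM default when the property is unset or empty.
    static Charset resolve(String configuredName) {
        if (configuredName == null || configuredName.trim().isEmpty()) {
            return Charset.defaultCharset();
        }
        // Throws an unchecked exception if the name is illegal or unsupported.
        return Charset.forName(configuredName.trim());
    }

    // Decodes raw process output with the resolved charset.
    static String decode(byte[] raw, String configuredName) {
        return new String(raw, resolve(configuredName));
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xC3, (byte) 0x84}; // UTF-8 bytes of U+00C4
        System.out.println(decode(raw, "UTF-8"));
        System.out.println(decode(raw, null)); // result depends on the JVM default
    }
}
```

      With such a helper, stdout and stderr could share one decoding path, so both attributes would always agree on the encoding.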

      Details

      When reading/writing FlowFiles, NiFi always uses raw bytes, so encoding issues are the responsibility of the flow designer, and NiFi has the ConvertCharacterSet processor to deal with those issues.

      When writing to attributes, the API uses Java String objects, which are encoding agnostic (they represent Unicode codepoints, not bytes). Therefore, processors receiving bytes have to interpret them using an encoding.

      The ExecuteStreamCommand processor writes the output of the command (stdout) to the Output Destination Attribute (if set). To do that, it converts bytes into a String using the system default encoding* by calling new String without an encoding argument:
      https://github.com/apache/nifi/blob/72f6d8a6800c643d5f283ae9bff6d7de25b503e9/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ExecuteStreamCommand.java#L499

      When converting stderr to a String to write into the execution.error attribute, it uses this weird algorithm:
      https://github.com/apache/nifi/blob/72f6d8a6800c643d5f283ae9bff6d7de25b503e9/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ExecuteStreamCommand.java#L507-L517
      It reads individual bytes from the error stream (as ints) and casts them to chars. What Java does in this case is interpret the integer as a Unicode code point. For single bytes, this matches the ISO-8859-1 encoding. Instead, it should use the same decoding method as for stdout.
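
      The difference between the two decoding approaches can be sketched as follows. This is not the NiFi source code; it is a simplified stand-in (class and method names are made up) that mimics the byte-to-char cast described above and contrasts it with decoding the collected bytes using an explicit charset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class StderrDecodingSketch {

    // Mimics the reported behavior: every byte read is cast to a char,
    // which is equivalent to decoding the stream as ISO-8859-1.
    static String decodeByCast(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try {
            int b;
            while ((b = in.read()) != -1) {
                sb.append((char) b);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Possible alternative: collect all bytes first, then decode once,
    // so multi-byte sequences are never split across reads.
    static String decodeWithCharset(InputStream in, Charset charset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buffer.toByteArray(), charset);
    }

    public static void main(String[] args) {
        // "\u00C4\u00D6\u00DC" are three umlauts, written as escapes
        // so the source file itself stays ASCII-only.
        byte[] stderrBytes = "error: \u00C4\u00D6\u00DC".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeByCast(new ByteArrayInputStream(stderrBytes)));
        System.out.println(decodeWithCharset(new ByteArrayInputStream(stderrBytes), StandardCharsets.UTF_8));
    }
}
```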

      Reproduction steps

      These steps are for a Linux environment, but can be adapted with a different executable for Windows.

      1. Create the file /opt/nifi/data/encodingTest.sh (attached) with the following contents and make it executable:

      #!/bin/bash
      echo "|out static: ÄÖÜäöüß"
      echo "|error static: ÄÖÜäöüß" >&2
      echo "|out arg: $1"
      echo "|error arg: $1" >&2
      echo "|out arg hexdump:"
      printf '%s' "$1" | od -A x -t x1z -v
      echo "|error arg hexdump:" >&2
      printf '%s' "$1" | od -A x -t x1z -v >&2

      The script writes identical data to both stdout and stderr. It contains non-ASCII characters to make the encoding issues visible.

      2. Import the attached flow (ExecuteStreamCommand_Encoding_Bug.json) or create it manually.
      3. Run the GenerateFlowFile processor once and observe the attributes of the FlowFile in the final queue:

        The output attribute (stdout) is correctly decoded. The execution.error attribute (stderr) contains garbled text (UTF-8 bytes interpreted as ISO-8859-1 and re-encoded as UTF-8).
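
        The garbling can be reproduced outside NiFi with a few lines of Java. The sketch below (hypothetical class name) also shows that the damage is reversible, because ISO-8859-1 maps every byte value to exactly one character. This only holds under the assumption that the command really wrote UTF-8 to stderr.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeRoundTrip {

    // Simulates the bug: UTF-8 bytes are decoded as ISO-8859-1.
    static String garble(String original) {
        return new String(original.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Reverses the damage: recover the original bytes via ISO-8859-1,
    // then decode them as UTF-8.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The seven non-ASCII characters from the test script, as escapes.
        String text = "\u00C4\u00D6\u00DC\u00E4\u00F6\u00FC\u00DF";
        String garbled = garble(text);
        System.out.println(garbled.length());             // two chars per original char
        System.out.println(repair(garbled).equals(text)); // round trip is lossless
    }
}
```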

      *On the system default encoding

      The system default encoding is a property of the JVM. It is typically UTF-8 on Linux, but Windows-1252 (or a different code page, depending on locale) in Windows environments. It can be overridden using the file.encoding JVM arg on startup.
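
      To check which encoding a given JVM actually uses, something like the following sketch can help (the class name is made up). It also illustrates that new String(bytes) without a charset argument behaves the same as passing Charset.defaultCharset() explicitly.

```java
import java.nio.charset.Charset;

final class DefaultCharsetCheck {

    // Reports the JVM default charset and the file.encoding system property.
    static String describe() {
        return "default charset: " + Charset.defaultCharset().name()
                + ", file.encoding: " + System.getProperty("file.encoding");
    }

    // Decoding without a charset argument uses the JVM default implicitly.
    static String decodeImplicit(byte[] raw) {
        return new String(raw);
    }

    // Same decoding, with the default made explicit.
    static String decodeExplicit(byte[] raw) {
        return new String(raw, Charset.defaultCharset());
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xC3, (byte) 0x84}; // UTF-8 bytes of U+00C4
        System.out.println(describe());
        System.out.println(decodeImplicit(raw).equals(decodeExplicit(raw)));
    }
}
```

      Running it once normally and once with -Dfile.encoding set to a different charset shows whether the override takes effect on the JVM in use.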

      Relying on the system default encoding is dangerous and can lead to subtle bugs, like the ones I previously reported (NIFI-12669 and NIFI-12670).

      In this case, it might make sense to use the system default encoding, as it concerns data passed between NiFi and another process that runs on the host system. Also, the ProcessBuilder class used to create the process always passes arguments in the system default encoding, and there doesn't seem to be a way to change that. This behavior should probably be documented.

      Attachments

        1. image-2024-02-07-15-20-11-684.png
          32 kB
          René Zeidler
        2. image-2024-02-07-15-14-54-841.png
          25 kB
          René Zeidler
        3. image-2024-02-07-15-14-08-518.png
          72 kB
          René Zeidler
        4. ExecuteStreamCommand_Encoding_Bug.json
          10 kB
          René Zeidler
        5. encodingTest.sh
          0.3 kB
          René Zeidler

        Activity

          People

            Unassigned Unassigned
            Rene_Z René Zeidler