  1. Apache NiFi
  2. NIFI-12708

UnpackContent should allow the user to specify a character set to apply in reading paths and filenames

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0-M3
    • Component/s: None
    • Labels: None

    Description

      https://apachenifi.slack.com/archives/C0L9VCD47/p1706716977280569

      Timon Faerber
      1 hour ago
      I am currently struggling with an encoding problem for unzipped files.
      The situation is as follows:
      I have a .zip in my content, and I don't know how it was created (I don't know the character set).
      Then I use the UnpackContent processor.
      After that, the path (folder) and filename of the unpacked files are not encoded in UTF-8, and the characters are output as ?.
      I have already tried solutions like https://community.cloudera.com/t5/Support-Questions/Unable-to-write-a-file-with-Chinese-Characters-filename-in/m-p/177183, for example, but they do not work for me.
      Does anyone know another solution?

      Joe Witt
      43 minutes ago
      If you take nifi out of the equation and just unpack the zip using a command line tool - does it see the paths/names correctly?

      Joe Witt
      43 minutes ago
      is there a sample zip you can share which has this problem?

      Umar Hussain
      9 minutes ago
      We tried it with unzip on Linux: if we pass the parameter -O CP437, the German characters ü ä ö in the path appear correctly in the output.
      A plain unzip command without that option, however, does not produce correct paths in the output.

      Joe Witt
      5 minutes ago
      Interesting. So if you tell the zip program the encoding is CP437, the output appears correct; otherwise it is incorrect, yes?

      Umar Hussain
      3 minutes ago
      Yes, I think it's the encoding of the zip: the zip was created on a Windows machine, and on Linux the default encoding is a different one.
      The processor's current implementation uses the platform's default encoding.
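
The mismatch described here can be reproduced without any zip file at all. The following is a minimal sketch, not NiFi code: it assumes the entry names were written in CP437 (the code page named in the linked Stack Overflow question), uses the JDK's `IBM437` charset (present in standard JDK builds, but not one of the charsets every Java runtime is guaranteed to provide), and uses a hypothetical class name, helper method, and sample path.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FilenameDecodingDemo {

    // Decode the raw bytes of an entry name with the given character set.
    public static String decode(byte[] rawName, Charset charset) {
        return new String(rawName, charset);
    }

    public static void main(String[] args) {
        // Assumption: the archive was written with CP437 entry names.
        Charset cp437 = Charset.forName("IBM437");

        // Hypothetical path containing u-umlaut and a-umlaut,
        // written with Unicode escapes so the source file stays ASCII.
        String original = "B\u00fccher/M\u00e4rz.txt";
        byte[] rawName = original.getBytes(cp437);

        // Decoding with the charset the name was written in restores it.
        System.out.println(decode(rawName, cp437).equals(original));

        // Decoding the same bytes as UTF-8 does not: the umlaut bytes are
        // not valid UTF-8, so each becomes the replacement character U+FFFD.
        String wrong = decode(rawName, StandardCharsets.UTF_8);
        System.out.println(wrong.equals(original));
        System.out.println(wrong.indexOf('\uFFFD') >= 0);
    }
}
```

The bytes are identical in both cases; only the character set used to interpret them differs, which is why the same archive unpacks correctly with one setting and not with another.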

      Joe Witt
      3 minutes ago
      Yeah this is probably a good summary of behavior we need to consider. https://stackoverflow.com/questions/13261347/correctly-decoding-zip-entry-file-names-cp437-utf-8-or

      Stack Overflow
      Correctly decoding zip entry file names – CP437, UTF-8 or?
      I recently wrote a zip file I/O library called zipzap, but I'm struggling with correctly decoding zip entry file names from arbitrary zip files.
      Now, the PKWARE spec states:
      D.1 The ZIP format ...

      Joe Witt
      2 minutes ago
      My guess is we need to allow the user to override the default behavior by selecting the character set we'll use to read the filenames/paths, for cases such as reading zips created by legacy applications.
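
As an illustration of the proposed override, the sketch below reads entry names with a caller-supplied character set using only the JDK's `java.util.zip` classes. It is not the UnpackContent implementation, and the class name, method names, and `filenameCharset` parameter are hypothetical. The sample archive is built in memory with CP437 names (`IBM437` in the JDK, assumed to be available in the runtime) so that the example is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipEntryNames {

    // List entry names, decoding them with the character set chosen by the user.
    public static List<String> listEntryNames(InputStream zipContent, Charset filenameCharset)
            throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(zipContent, filenameCharset)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    // Build a one-entry zip in memory whose entry name is encoded with the given charset.
    public static byte[] buildSampleZip(String entryName, Charset filenameCharset)
            throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer, filenameCharset)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write("sample".getBytes(StandardCharsets.US_ASCII));
            zip.closeEntry();
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a legacy tool wrote the names in CP437.
        Charset cp437 = Charset.forName("IBM437");
        byte[] zipBytes = buildSampleZip("B\u00fccher/M\u00e4rz.txt", cp437);

        // Reading with the matching charset recovers the original path.
        List<String> names = listEntryNames(new ByteArrayInputStream(zipBytes), cp437);
        System.out.println(names);
    }
}
```

Passing the character set as a parameter, rather than relying on a default, mirrors the idea in this thread: the user who knows how the archive was produced supplies the charset, and the default behavior applies otherwise.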

            People

              umar.hussain Umar Hussain
              joewitt Joe Witt
              Votes: 0
              Watchers: 2

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 3.5h