  1. Commons Compress
  2. COMPRESS-51

Enable creation of tool-readable ZIP archives with file names containing non-ASCII characters

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0
    • Component/s: None
    • Labels:
      None
    • Environment:

      Any / All

      Description

      Currently it is not possible to generate externally readable ZIP archives with java.util.zip.* or org.apache.commons.compress.* when the entries to include have names containing characters outside US-ASCII. This should be changed so that at least org.apache.commons.compress.* can produce ZIP archives in an international context that are readable by common ZIP archiver tools such as pkzip, gzip, WinZIP, PowerArchiver, WinRAR / rar, StuffIt...

      For java.util.zip.* this is due to a really old flaw in the handling of entry names: they are always written as UTF-8, which is more or less Java-specific, and not as Cp437, which is expected and written by most (possibly all) ZIP archiver tools. For more details see:

      http://bugs.sun.com/bugdatabase/view_bug.do;:YfiG?bug_id=4244499
      http://bugs.sun.com/bugdatabase/view_bug.do;:YfiG?bug_id=4820807
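The mismatch described above can be illustrated without any ZIP code at all. The following is a minimal sketch, assuming the JDK in use ships the IBM437 (Cp437) charset; the class and method names are hypothetical and only serve the illustration. It shows that a name with one non-ASCII character yields different byte sequences under the two encodings, and that UTF-8 bytes decoded as Cp437 no longer give the original name.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of java.util.zip or Commons Compress.
final class EntryNameEncodingDemo {
    // Assumption: the running JDK provides the IBM437 (Cp437) charset.
    static final Charset CP437 = Charset.forName("IBM437");

    /** Returns the bytes that would be stored in the archive for this name. */
    static byte[] encode(String name, Charset charset) {
        return name.getBytes(charset);
    }

    /** Returns what a tool assuming Cp437 would show for a name stored as UTF-8. */
    static String utf8NameSeenAsCp437(String name) {
        return new String(name.getBytes(StandardCharsets.UTF_8), CP437);
    }

    public static void main(String[] args) {
        String name = "\u00e4.txt"; // a-umlaut followed by ".txt"
        System.out.println("UTF-8 byte count: " + encode(name, StandardCharsets.UTF_8).length);
        System.out.println("Cp437 byte count: " + encode(name, CP437).length);
        System.out.println("Name survives misreading: " + name.equals(utf8NameSeenAsCp437(name)));
    }
}
```

Pure ASCII names are unaffected, because both encodings agree on that range; this is why the problem only shows up in an international context.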

      For org.apache.commons.compress.archivers.zip.* the "compress & save" operation can be easily improved by extending ZipArchive:

      // Add member:
      protected String m_encoding = null;

      // Add constructor:
      public ZipArchive(String encoding) {
          m_encoding = encoding;
      }

      // Extend doSave(FileOutputStream):
      // ...
      // Pack-Operation
      ZipOutputStream out = null;
      try {
          out = new ZipOutputStream(new BufferedOutputStream(output));
          if (m_encoding != null) {            // added
              out.setEncoding(m_encoding);     // added
          }                                    // added
          while(iterator.hasNext()) {
          // ...

      Now it is possible to instantiate a ZipArchive with "Cp437" as encoding, and external tools can figure out the original entry names even if they contain non-ASCII characters. (On the other hand, Java cannot read back & inflate such an archive, since it expects UTF-8!)
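For comparison, here is a hedged sketch of the same idea using only the JDK. It assumes Java 7 or later, where java.util.zip.ZipOutputStream and ZipInputStream accept a Charset for entry names; those constructors are newer than the java.util.zip behaviour this issue describes. It also assumes the IBM437 charset is available. The class and method names are hypothetical. The sketch writes one entry with a Cp437-encoded name and reads the name back with the same charset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class; relies on the Charset-aware constructors of Java 7+.
final class Cp437ZipRoundTrip {
    // Assumption: the running JDK provides the IBM437 (Cp437) charset.
    static final Charset CP437 = Charset.forName("IBM437");

    /** Builds an in-memory ZIP with a single entry whose name is encoded with nameCharset. */
    static byte[] writeZip(String entryName, Charset nameCharset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer, nameCharset)) {
            out.putNextEntry(new ZipEntry(entryName));
            out.write("hello".getBytes(StandardCharsets.US_ASCII));
            out.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Reads the name of the first entry, decoding it with nameCharset. */
    static String readFirstName(byte[] zip, Charset nameCharset) {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zip), nameCharset)) {
            ZipEntry entry = in.getNextEntry();
            return entry == null ? null : entry.getName();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String name = "\u00e4.txt"; // a-umlaut followed by ".txt"
        byte[] zip = writeZip(name, CP437);
        System.out.println("Round trip preserved name: " + name.equals(readFirstName(zip, CP437)));
    }
}
```

The round trip only works because writer and reader agree on the charset; a reader that assumes a different encoding sees a different name or rejects the entry.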

      The "read & inflate" operation for ZipArchive is more difficult to extend, since it currently relies completely on java.util.zip.*. The other reason is that ZIP archives do not contain any hint about the character encoding used for file names etc. It seems that the usual tools simply use Cp437 and Java simply uses UTF-8, without stating any reason. Thus an extractor has to guess.
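One way such guessing could look is sketched below. This is a heuristic under stated assumptions, not how Commons Compress or any particular tool does it: the raw name bytes are first decoded as strict UTF-8, and only if that fails are they interpreted as Cp437. The class and method names are hypothetical, and the IBM437 charset is assumed to be available in the JDK.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class illustrating a "try UTF-8, fall back to Cp437" heuristic.
final class EntryNameGuesser {
    // Assumption: the running JDK provides the IBM437 (Cp437) charset.
    static final Charset CP437 = Charset.forName("IBM437");

    /** Decodes raw entry-name bytes as UTF-8 if they are well-formed, otherwise as Cp437. */
    static String decodeName(byte[] rawName) {
        try {
            // REPORT makes the decoder throw instead of silently substituting U+FFFD.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(rawName))
                    .toString();
        } catch (CharacterCodingException e) {
            // Every byte value maps to a character in Cp437, so this never fails.
            return new String(rawName, CP437);
        }
    }

    public static void main(String[] args) {
        String name = "\u00e4.txt"; // a-umlaut followed by ".txt"
        System.out.println(name.equals(decodeName(name.getBytes(StandardCharsets.UTF_8))));
        System.out.println(name.equals(decodeName(name.getBytes(CP437))));
    }
}
```

The heuristic is not reliable in general: some Cp437 byte sequences are also well-formed UTF-8 and would be decoded as UTF-8 even though the writer meant something else. It remains a guess.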

      For TarArchive the problem is unclear. Here the commons-compress implementation does not rely on third-party code as far as I can see, and TAR is not a Java-bound file type (unlike JAR). Thus chances are that everything works well, even when entry names with non-ASCII characters come into play.

        Attachments

        1. utf8-winzip-test.zip
          0.6 kB
          Wolfgang Glas
        2. utf8-7zip-test.zip
          0.4 kB
          Wolfgang Glas
        3. commons-compress-utf8-creation-svn741897.patch
          33 kB
          Wolfgang Glas

            People

            • Assignee:
              bodewig Stefan Bodewig
              Reporter:
              zipwiz Christian Gosch
            • Votes:
              1
              Watchers:
              2
