  1. Commons VFS
  2. VFS-637

Zip files with legacy encoding and special characters cause VFS to crash

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1
    • Fix Version/s: 2.3
    • Labels:
    • Environment:

      Windows 10 64 Bit, Java 8

    • Flags:
      Important

      Description

      Oracle reworked the ZipFile class in Java 7. Since then, the constructor used by commons-vfs2 2.1 (the one without a charset argument) is more restrictive than it was in Java 6. The ZipFile constructor now accepts a second parameter (Charset) that explicitly specifies the legacy charset to use when the ZIP file does not declare UTF-8 compliance internally. This affects all ZIP files whose file names are encoded in a legacy charset rather than in UTF-8, as is common today. An example is a ZIP file whose archived file names contain German umlauts or Russian characters.
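The behaviour described above can be reproduced without VFS. The following is a minimal sketch, assuming a JDK that ships the IBM437 charset; the class LegacyZipNames and its helper methods are hypothetical names invented for this illustration. It writes a one-entry ZIP whose entry name is stored in IBM437 and reads it back with an explicit charset. Reading the same file as UTF-8 is expected to fail, but the exception type differs between Java versions, so the sketch only catches RuntimeException and prints whatever it gets.

```java
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, not part of Commons VFS.
final class LegacyZipNames {

    /** Writes a one-entry ZIP whose entry name is encoded with the given charset. */
    static File writeZip(String entryName, Charset nameCharset) {
        try {
            File zip = File.createTempFile("legacy", ".zip");
            zip.deleteOnExit();
            try (OutputStream out = Files.newOutputStream(zip.toPath());
                 ZipOutputStream zos = new ZipOutputStream(out, nameCharset)) {
                zos.putNextEntry(new ZipEntry(entryName));
                zos.write(1); // one byte of payload
                zos.closeEntry();
            }
            return zip;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Lists all entry names, decoding them with the given charset. */
    static List<String> entryNames(File zip, Charset nameCharset) {
        try (ZipFile zf = new ZipFile(zip, nameCharset)) {
            List<String> names = new ArrayList<>();
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Charset cp437 = Charset.forName("IBM437");
        // Entry name with a-umlaut, o-umlaut, u-umlaut, written as Unicode escapes.
        File zip = writeZip("\u00e4\u00f6\u00fc.txt", cp437);

        // Explicit legacy charset: the name is decoded as written.
        System.out.println(entryNames(zip, cp437));

        // UTF-8 (what ZipFile uses when no charset is given): expected to fail.
        try {
            System.out.println(entryNames(zip, StandardCharsets.UTF_8));
        } catch (RuntimeException e) {
            System.out.println("Decoding as UTF-8 failed: " + e);
        }
    }
}
```

The helpers wrap IOException in UncheckedIOException only to keep the example short; VFS itself wraps such failures in FileSystemException.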

      To support this new parameter with (more or less) default values, the class org.apache.commons.vfs2.provider.zip.ZipFileSystem has to be extended with a default charset parameter, plus a getter or setter (as you like), so that the setting can be forwarded to the java.util.zip.ZipFile constructor.
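The shape of such an extension could look roughly like the sketch below. It is a standalone illustration, not the actual Commons VFS change: CharsetAwareZipOpener is a hypothetical stand-in for ZipFileSystem, and falling back to UTF-8 when no charset is configured is an assumption chosen to match what ZipFile does when no charset is passed.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipFile;

// Hypothetical stand-in for a ZipFileSystem that carries a configurable charset.
final class CharsetAwareZipOpener {

    private final Charset charset;

    /** A null charset falls back to UTF-8 (assumed default for this sketch). */
    CharsetAwareZipOpener(Charset charset) {
        this.charset = charset != null ? charset : StandardCharsets.UTF_8;
    }

    /** Returns the charset used to decode entry names and comments. */
    Charset getCharset() {
        return charset;
    }

    /** Forwards the configured charset to the java.util.zip.ZipFile constructor. */
    ZipFile open(File file) throws IOException {
        return new ZipFile(file, charset);
    }
}
```

Keeping the charset in a final field makes the opener immutable; a setter would work as well if the value has to change after construction.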

      My quick workaround was to create a new OwnZipFileProvider that refers to an equally new OwnZipFileSystem (extending ZipFileSystem) with the following modified function. The change is the Charset argument passed to the ZipFile constructor:

          protected ZipFile createZipFile(final File file) throws FileSystemException {
              try {
                  // Changed: decode entry names as IBM437 instead of relying on the UTF-8 default.
                  return new ZipFile(file, Charset.forName("IBM437"));
              } catch (final IOException ioe) {
                  throw new FileSystemException("vfs.provider.zip/open-zip-file.error", file, ioe);
              }
          }

      Presetting charset 437 as the legacy default seems to be a good workaround, as stated in Appendix D of the ZIP specification at https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT :

      "D.1 The ZIP format has historically supported only the original IBM PC character encoding set, commonly referred to as IBM Code Page 437. This limits storing file name characters to only those within the original MS-DOS range of values and does not properly support file names in other character encodings, or languages. [...]"
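At the byte level, the difference between the two charsets can be illustrated as follows. This is a small sketch with a hypothetical class name (Cp437Decoding), again assuming the JDK provides the IBM437 charset. The byte 0x81 is the letter u-umlaut in code page 437, but on its own it is not a valid UTF-8 sequence, so a strict UTF-8 decoder rejects it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
final class Cp437Decoding {

    /**
     * Decodes strictly: returns null on malformed or unmappable input
     * instead of silently inserting replacement characters.
     */
    static String decodeStrict(byte[] bytes, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // 0x81 followed by ASCII "ber": the German word for "over" in code page 437.
        byte[] raw = {(byte) 0x81, 'b', 'e', 'r'};

        String asCp437 = decodeStrict(raw, Charset.forName("IBM437"));
        String asUtf8 = decodeStrict(raw, StandardCharsets.UTF_8);

        System.out.println("IBM437 decodes successfully: " + (asCp437 != null));
        System.out.println("UTF-8 decodes successfully:  " + (asUtf8 != null));
    }
}
```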

            People

            • Assignee:
              Unassigned
              Reporter:
              gschnepp Guido Schnepp
                Time Tracking

                • Original Estimate: 24h
                • Remaining Estimate: 24h
                • Time Spent: Not Specified