Commons Compress
  1. Commons Compress
  2. COMPRESS-176

ArchiveInputStream#getNextEntry(): Problems with WinZip directories with Umlauts

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.3
    • Fix Version/s: 1.4
    • Component/s: Archivers
    • Labels:
      None
    • Environment:

      Windows 7

      Description

      There is a problem when handling a WinZip-created zip with Umlauts in directories.

      I'm accessing a zip file created with WinZip containing a directory with an umlaut ("ä") with ArchiveInputStream. When creating the zip file the unicode-flag of winzip had been active.

      The following problem occurs when accessing the entries of the zip:
      the ArchiveEntry for a directory containing an umlaut is not marked as a directory and the file names for the directory and all files contained in that directory contain backslashes instead of slashes (i.e. completely different to all other files in directories with no umlaut in their path).

      There is no difference when letting the ArchiveStreamFactory decide which ArchiveInputStream to create or when using the ZipArchiveInputStream constructor with the correct encoding (I've tried different encodings CP437, CP850, ISO-8859-15, but still the problem persisted).

      This problem does not occur when using the very same zip file but compressed by 7zip or the built-in Windows 7 zip functionality.

      1. MkZip.java
        0.6 kB
        Stefan Bodewig
      2. test-7zip.zip
        0.8 kB
        Wurstbrot mit Senf
      3. test-doublevertical.zip
        0.3 kB
        Stefan Bodewig
      4. test-windows.zip
        0.7 kB
        Wurstbrot mit Senf
      5. test-winzip.zip
        0.9 kB
        Wurstbrot mit Senf
      6. testzap-winzip.zip
        0.9 kB
        Sebb

        Activity

        Hide
        Sebb added a comment -

        Could you attach minimal sample archives which show the problem?

        Show
        Sebb added a comment - Could you attach minimal sample archives which show the problem?
        Hide
        Wurstbrot mit Senf added a comment -

        Minimum test zip attached.

        Show
        Wurstbrot mit Senf added a comment - Minimum test zip attached.
        Hide
        Sebb added a comment -

        Thanks, but you have not granted the ASF licence to use the file, which means we cannot include it in our test suite.

        Please could you delete and reattach it?

        Also, we will need the equivalent 7zip and Win7 archives for comparison.

        Show
        Sebb added a comment - Thanks, but you have not granted the ASF licence to use the file, which means we cannot include it in our test suite. Please could you delete and reattach it? Also, we will need the equivalent 7zip and Win7 archives for comparison.
        Hide
        Wurstbrot mit Senf added a comment -

        re-added winzip zip file plus identical ones packed with 7zip and windows built-in zip facility.

        Show
        Wurstbrot mit Senf added a comment - re-added winzip zip file plus identical ones packed with 7zip and windows built-in zip facility.
        Hide
        Sebb added a comment -

        Thanks.

        I'm beginning to wonder if Winzip is faulty.
        The unicode filename that is stored uses \ whereas the base name uses /.

        Show
        Sebb added a comment - Thanks. I'm beginning to wonder if Winzip is faulty. The unicode filename that is stored uses \ whereas the base name uses /.
        Hide
        Wurstbrot mit Senf added a comment -

        Btw.: I have no problems to handle this jar using java.util.zip (Java 6) for some reason

        Show
        Wurstbrot mit Senf added a comment - Btw.: I have no problems to handle this jar using java.util.zip (Java 6) for some reason
        Hide
        Stefan Bodewig added a comment -

        This is what InfoZIP's zip on Linux says:

        stefanb@brick:~$ zip -Tv Desktop/test-winzip.zip 
        Archive:  Desktop/test-winzip.zip
            testing: doc.txt.gz               OK
            testing: doc2.txt                 OK
            testing: ??\                      OK
            testing: ??\??zip.zip             OK
            testing: ??\??.txt                OK
        No errors detected in compressed data of Desktop/test-winzip.zip.
        test of Desktop/test-winzip.zip OK
        

        The entry for the directory contains a Unicode extra field with 0xc3 0xa4 0x5c as UTF-8 encoded name. This actually is "ä\".

        Since directory names in ZIP archives must end with "/" Compress doesn't detect this as a directory. It may be possible to create a workaround like "if the 'plain name ends with a / and the unicode name uses a \ then bend it", but I can't say I'd like that.

        Java6 likely works because it doesn't have any idea about unicode extra fields and simply uses the "plain" name. You'd get the same behavior from ZipArchiveInputStream by setting useUnicodeExtraFields to false in the constructor.

        Show
        Stefan Bodewig added a comment - This is what InfoZIP's zip on Linux says: stefanb@brick:~$ zip -Tv Desktop/test-winzip.zip Archive: Desktop/test-winzip.zip testing: doc.txt.gz OK testing: doc2.txt OK testing: ??\ OK testing: ??\??zip.zip OK testing: ??\??.txt OK No errors detected in compressed data of Desktop/test-winzip.zip. test of Desktop/test-winzip.zip OK The entry for the directory contains a Unicode extra field with 0xc3 0xa4 0x5c as UTF-8 encoded name. This actually is "ä\". Since directory names in ZIP archives must end with "/" Compress doesn't detect this as a directory. It may be possible to create a workaround like "if the 'plain name ends with a / and the unicode name uses a \ then bend it", but I can't say I'd like that. Java6 likely works because it doesn't have any idea about unicode extra fields and simply uses the "plain" name. You'd get the same behavior from ZipArchiveInputStream by setting useUnicodeExtraFields to false in the constructor.
        Hide
        Sebb added a comment -

        The plain names use / and look OK when using CP437.

        For some odd reason, the unicode extra fields use \ instead of /
        I think that may be a Winzip bug - it does not make sense to use a different separator for the extra fields.

        To confirm this is a bug, it would be useful to see how other zip tools use the extra fields - are there any?
        Apart from Ant or other code based on Commons Compress, of course!

        Alternatively, find some documentation as to the correct contents of the field.

        My version of Winzip is too old to support the fields; if you have purchased a more recent one perhaps you could e-mail their support desk?

        A possible work-round would be to make the \ => / behaviour optional; I agree we should not do this by default

        Show
        Sebb added a comment - The plain names use / and look OK when using CP437. For some odd reason, the unicode extra fields use \ instead of / I think that may be a Winzip bug - it does not make sense to use a different separator for the extra fields. To confirm this is a bug, it would be useful to see how other zip tools use the extra fields - are there any? Apart from Ant or other code based on Commons Compress, of course! Alternatively, find some documentation as to the correct contents of the field. My version of Winzip is too old to support the fields; if you have purchased a more recent one perhaps you could e-mail their support desk? A possible work-round would be to make the \ => / behaviour optional; I agree we should not do this by default
        Hide
        Stefan Bodewig added a comment -

        AFAIK what we have written down based on findings by Wolfgang Glas in http://commons.apache.org/compress/zip.html still stands, WinZIP is the only one using Unicode extra fields, all other implementations have switched to the language encoding flag. The only exceptions are Windows compressed folders - which doesn't understand either - and InfoZIP based tools if they are compiled to use the extra fields.

        A question to the original reporter (I'm German so I know the name's a fake 8-): since you also have an installation of 7zip, what does 7zip think of your WinZIP created archive?

        Show
        Stefan Bodewig added a comment - AFAIK what we have written down based on findings by Wolfgang Glas in http://commons.apache.org/compress/zip.html still stands, WinZIP is the only one using Unicode extra fields, all other implementations have switched to the language encoding flag. The only exceptions are Windows compressed folders - which doesn't understand either - and InfoZIP based tools if they are compiled to use the extra fields. A question to the original reporter (I'm German so I know the name's a fake 8-): since you also have an installation of 7zip, what does 7zip think of your WinZIP created archive?
        Hide
        Sebb added a comment -

        I have 7zip installed, and it reads the archive OK.

        However, I don't think that proves anything, since the plain names are correct.

        I guess we could look at the 7zip source code to see if it uses the extra fields.

        A better test would be to create a zip file using a filename that cannot be represented in CP437, i.e. only the extra field would show the correct name.

        Show
        Sebb added a comment - I have 7zip installed, and it reads the archive OK. However, I don't think that proves anything, since the plain names are correct. I guess we could look at the 7zip source code to see if it uses the extra fields. A better test would be to create a zip file using a filename that cannot be represented in CP437, i.e. only the extra field would show the correct name.
        Hide
        Sebb added a comment -

        Copy of test-winzip.zip, but with plain file name changed from 3zip.zip to 3zap.zap.

        This shows only the plain file name in 7zip and in my copy of Winzip (9.0).

        This suggests that neither is processing the unicode extra fields.

        Show
        Sebb added a comment - Copy of test-winzip.zip, but with plain file name changed from 3zip.zip to 3zap.zap. This shows only the plain file name in 7zip and in my copy of Winzip (9.0). This suggests that neither is processing the unicode extra fields.
        Hide
        Stefan Bodewig added a comment -

        OK, this means nobody except for Commons Compress and InfoZIP tools seems to read the Unicode extra field.

        This is what I get when trying to extract the original ZIP on Linux:

        stefan@birdy:~/Desktop$ unzip test-winzip.zip 
        Archive:  test-winzip.zip
          inflating: doc.txt.gz              
         extracting: doc2.txt                
        warning:  test-winzip.zip appears to use backslashes as path separators
           creating: ??/
          inflating: ??/??zip.zip            
         extracting: ??/??.txt  
        

        and it creates an "ä" directory. I'll try to look through InfoZIPs sources what it bases it heuristics on, maybe we can use the same in Commons Compress to turn backslashes into slashes.

        Show
        Stefan Bodewig added a comment - OK, this means nobody except for Commons Compress and InfoZIP tools seems to read the Unicode extra field. This is what I get when trying to extract the original ZIP on Linux: stefan@birdy:~/Desktop$ unzip test-winzip.zip Archive: test-winzip.zip inflating: doc.txt.gz extracting: doc2.txt warning: test-winzip.zip appears to use backslashes as path separators creating: ??/ inflating: ??/??zip.zip extracting: ??/??.txt and it creates an "ä" directory. I'll try to look through InfoZIPs sources what it bases it heuristics on, maybe we can use the same in Commons Compress to turn backslashes into slashes.
        Hide
        Stefan Bodewig added a comment -

        In extract.c of unzip60 line 1310ff there is this code that replaces backslashes with slashes. It only replaces them in names that don't contain forward slashes (MBSCHR looks up a character in a character array) and only if "hostnum" indicates a FAT system.

                    /* for files from DOS FAT, check for use of backslash instead
                     *  of slash as directory separator (bug in some zipper(s); so
                     *  far, not a problem in HPFS, NTFS or VFAT systems)
                     */
        #ifndef SFX
                    if (G.pInfo->hostnum == FS_FAT_ && !MBSCHR(G.filename, '/')) {
                        char *p=G.filename;
        
                        if (*p) do {
                            if (*p == '\\') {
                                if (!G.reported_backslash) {
                                    Info(slide, 0x21, ((char *)slide,
                                      LoadFarString(BackslashPathSep), G.zipfn));
                                    G.reported_backslash = TRUE;
                                    if (!error_in_archive)
                                        error_in_archive = PK_WARN;
                                }
                                *p = '/';
                            }
                        } while (*PREINCSTR(p));
                    }
        #endif /* !SFX */
        

        "hostnum" is the upper byte of "version made by" inside the central directory header - this is ZipArchiveEntry's get/setPlatform - and FS_FAT_ is 0 (ZipArchiveEntry#PLATFORM_FAT). We'd have all pieces together to emulate this.

        Show
        Stefan Bodewig added a comment - In extract.c of unzip60 line 1310ff there is this code that replaces backslashes with slashes. It only replaces them in names that don't contain forward slashes (MBSCHR looks up a character in a character array) and only if "hostnum" indicates a FAT system. /* for files from DOS FAT, check for use of backslash instead * of slash as directory separator (bug in some zipper(s); so * far, not a problem in HPFS, NTFS or VFAT systems) */ #ifndef SFX if (G.pInfo->hostnum == FS_FAT_ && !MBSCHR(G.filename, '/')) { char *p=G.filename; if (*p) do { if (*p == '\\') { if (!G.reported_backslash) { Info(slide, 0x21, ((char *)slide, LoadFarString(BackslashPathSep), G.zipfn)); G.reported_backslash = TRUE; if (!error_in_archive) error_in_archive = PK_WARN; } *p = '/'; } } while (*PREINCSTR(p)); } #endif /* !SFX */ "hostnum" is the upper byte of "version made by" inside the central directory header - this is ZipArchiveEntry's get/setPlatform - and FS_FAT_ is 0 (ZipArchiveEntry#PLATFORM_FAT). We'd have all pieces together to emulate this.
        Hide
        Sebb added a comment -

        Excellent!
        Since \ and / are not allowed in file or folder names on Windows systems, there should be no case where a \ is incorrectly replaced.
        And it would still work if Winzip fixes its implementation to use /, and would also work with other applications that use / for the extra fields.

        ==

        There's still potentially the reverse problem - can Winzip handle / in the unicode extra field, or does it expect only \ ?
        If so, then I guess we might need to make the generated extra fields configurable to use \.
        I don't have the required version of Winzip to check that.

        Show
        Sebb added a comment - Excellent! Since \ and / are not allowed in file or folder names on Windows systems, there should be no case where a \ is incorrectly replaced. And it would still work if Winzip fixes its implementation to use /, and would also work with other applications that use / for the extra fields. == There's still potentially the reverse problem - can Winzip handle / in the unicode extra field, or does it expect only \ ? If so, then I guess we might need to make the generated extra fields configurable to use \. I don't have the required version of Winzip to check that.
        Hide
        Stefan Bodewig added a comment -

        Whether we need forward slashes in Unicode extra fields can only be answered by somebody using WinZIP. The best would be creating a test archive with a directory that contains a character in its name that is not part of CP437 - and to be safe not part of the platform's default encoding either.

        Show
        Stefan Bodewig added a comment - Whether we need forward slashes in Unicode extra fields can only be answered by somebody using WinZIP. The best would be creating a test archive with a directory that contains a character in its name that is not part of CP437 - and to be safe not part of the platform's default encoding either.
        Hide
        Stefan Bodewig added a comment -

        Workaround and tests are in svn revision 1294460

        I'll look into creating a test archive for the opposite direction today.

        Show
        Stefan Bodewig added a comment - Workaround and tests are in svn revision 1294460 I'll look into creating a test archive for the opposite direction today.
        Hide
        Wurstbrot mit Senf added a comment -

        Hi all, sounds promising. Thanks a lot, I'm looking forward to the next release.

        And by the way, how could you tell that the name's a fake?

        Show
        Wurstbrot mit Senf added a comment - Hi all, sounds promising. Thanks a lot, I'm looking forward to the next release. And by the way, how could you tell that the name's a fake?
        Hide
        Stefan Bodewig added a comment -

        The attached ZIP (created by the trivial attached class) contains a file named ‖.txt and its parent directory ‖ (that's a double vertical bar) using Unicode extra fields (and nothing else) and forward slashes.

        Can you please verify WinZIP is able to extract it?

        Show
        Stefan Bodewig added a comment - The attached ZIP (created by the trivial attached class) contains a file named ‖.txt and its parent directory ‖ (that's a double vertical bar) using Unicode extra fields (and nothing else) and forward slashes. Can you please verify WinZIP is able to extract it?
        Hide
        Wurstbrot mit Senf added a comment -

        Seems to be OK. Got a directory ‖ with the file ‖.txt in it.

        Show
        Wurstbrot mit Senf added a comment - Seems to be OK. Got a directory ‖ with the file ‖.txt in it.
        Hide
        Wurstbrot mit Senf added a comment -

        But 7Zip and windows built in zip both create a directory named %U2016 with a file named %U2016.txt in it.

        Show
        Wurstbrot mit Senf added a comment - But 7Zip and windows built in zip both create a directory named %U2016 with a file named %U2016.txt in it.
        Hide
        Stefan Bodewig added a comment -

        Great.

        I explicitly told ZipArchiveOutputStream to not use the language encoding flag to ensure WinZIP uses the Unicode extra field. Otherwise 7Zip would have worked. Windows Conmpressed Folders simply doesn't support file names with characters that are not part of the platform's namtive encoding.

        For a more complete discussion see http://commons.apache.org/compress/zip.html#encoding

        Show
        Stefan Bodewig added a comment - Great. I explicitly told ZipArchiveOutputStream to not use the language encoding flag to ensure WinZIP uses the Unicode extra field. Otherwise 7Zip would have worked. Windows Conmpressed Folders simply doesn't support file names with characters that are not part of the platform's namtive encoding. For a more complete discussion see http://commons.apache.org/compress/zip.html#encoding

          People

          • Assignee:
            Stefan Bodewig
            Reporter:
            Wurstbrot mit Senf
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development