Commons Compress / COMPRESS-114

TarUtils.parseName does not properly handle characters outside the range 0-127

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0
    • Fix Version/s: 1.1
    • Component/s: None
    • Labels: None
    • Environment: Windows/Suse

      Description

      If a tar file contains entries whose names include special characters, the names of the tar entries are resolved incorrectly.

      example:
      correct name: 0302-0601-3±±±F06±W220±ZB±LALALA±±±±±±±±±±CAN±±DC±±±04±060302±MOE.model
      name resolved by TarUtils.parseName: 0302-0101-3ᄆᄆᄆF06ᄆW220ᄆZBᄆHECKMODULᄆᄆᄆᄆᄆᄆᄆᄆᄆᄆECEᄆᄆDCᄆᄆᄆ07ᄆ060302ᄆDOERN.model

      please use:
      result.append(new String(new byte[] { buffer[i] }));

      instead of:
      result.append((char) buffer[i]);

      to solve this encoding problem.
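
A minimal sketch of why the `(char) buffer[i]` cast goes wrong for bytes above 0x7F. Java's `byte` is signed, so the value is sign-extended before it is narrowed to `char`. The class and method names here are illustrative only and are not part of Commons Compress. For the byte 0xB1 (`±` in ISO-8859-1) the cast yields U+FFB1, a halfwidth Hangul letter, which appears consistent with the garbled name in the example above.

```java
// Sketch: why (char) buffer[i] corrupts bytes above 0x7F.
class SignExtensionDemo {
    // The cast from the report: the signed byte is sign-extended to int
    // and then narrowed to a 16-bit char.
    static char naiveCast(byte b) {
        return (char) b;
    }

    public static void main(String[] args) {
        byte plusMinus = (byte) 0xB1; // '±' in ISO-8859-1, stored as -79
        char c = naiveCast(plusMinus);
        System.out.printf("U+%04X%n", (int) c); // prints U+FFB1
    }
}
```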

      1. plusMinusForJIRAwithLicense.tar
        10 kB
        Helmut Minst
      2. TarUtils.java
        9 kB
        Helmut Minst
      3. TarArchiveInputStream.java
        11 kB
        Helmut Minst
      4. TarArchiveEntry.java
        20 kB
        Helmut Minst

        Activity

        Sebb added a comment -

        Could you provide a small sample tar file containing some files with the special names?
        [Perhaps the file contents could be the expected name]

        We can then add the file to the test cases.

        Thanks!

        Sebb added a comment -

        Forgot to mention:

        The "String(byte[])" constructor depends on the default charset encoding, which might not always be what is wanted.
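
To illustrate the point about the default charset, here is a small sketch that decodes the same raw byte with two explicitly chosen charsets. It assumes a modern JDK (`StandardCharsets` arrived in Java 7, after this issue was filed), and the class and method names are made up for the example. `new String(byte[])` behaves like whichever of these matches the platform default, so the result can differ between machines.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetDependenceDemo {
    // Decodes a single raw byte using an explicitly chosen charset.
    static String decodeOne(byte b, Charset charset) {
        return new String(new byte[] { b }, charset);
    }

    public static void main(String[] args) {
        byte raw = (byte) 0xB1;
        String latin1 = decodeOne(raw, StandardCharsets.ISO_8859_1);
        String utf8 = decodeOne(raw, StandardCharsets.UTF_8);
        // In ISO-8859-1 the byte maps directly to U+00B1 ('±').
        System.out.printf("ISO-8859-1: U+%04X%n", (int) latin1.charAt(0)); // U+00B1
        // A lone 0xB1 is malformed UTF-8 and becomes the replacement character.
        System.out.printf("UTF-8:      U+%04X%n", (int) utf8.charAt(0));   // U+FFFD
        System.out.println("Default charset here: " + Charset.defaultCharset());
    }
}
```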

        Helmut Minst added a comment -

        A tar file that includes files whose names contain special characters.

        Helmut Minst added a comment -

        That's right, but casting from byte to char is not an elegant approach.

        String charsetName = "ISO-8859-1";
        try {
            result.append(new String(new byte[] { buffer[i] }, charsetName));
        } catch (UnsupportedEncodingException e) {
            // fall back to the platform default charset
            result.append(new String(new byte[] { buffer[i] }));
        }

        where charsetName may be set via a system property or some other mechanism by the user of Commons Compress.
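
One way the configurable-charset idea could look, as a sketch only. The property name `example.tar.nameCharset`, the class, and both methods are hypothetical and do not exist in Commons Compress. The sketch falls back to the platform default when the requested charset is missing or unknown, mirroring the fallback in the snippet above, and it decodes the whole name in one call rather than byte by byte so that multi-byte charsets would also work.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ConfigurableNameDecoder {
    // Hypothetical property name, chosen for this example only.
    static final String CHARSET_PROPERTY = "example.tar.nameCharset";

    // Resolves a charset name, falling back to the platform default
    // when the name is missing, illegal, or unsupported.
    static Charset resolveCharset(String requested) {
        if (requested == null) {
            return Charset.defaultCharset();
        }
        try {
            return Charset.forName(requested);
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException and UnsupportedCharsetException.
            return Charset.defaultCharset();
        }
    }

    // Decodes a NUL-terminated name field from a header buffer.
    static String parseName(byte[] buffer, int offset, int length, Charset charset) {
        int end = offset;
        while (end < offset + length && buffer[end] != 0) {
            end++;
        }
        return new String(buffer, offset, end - offset, charset);
    }

    public static void main(String[] args) {
        Charset charset = resolveCharset(System.getProperty(CHARSET_PROPERTY, "ISO-8859-1"));
        byte[] field = { 'a', (byte) 0xB1, 'b', 0, 'x' };
        System.out.println(parseName(field, 0, field.length, charset));
    }
}
```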

        Helmut Minst added a comment -

        Example where the charset may be set through a setter in TarArchiveInputStream.
        If an UnsupportedEncodingException occurs, the default charset of the system
        is used.

        Please have a look. Hope this helps to solve the problem.

        Sebb added a comment -

        Thanks for the test case - unfortunately you did not grant a license to the ASF to use it.
        Could you re-attach it with the option selected please?

        Helmut Minst added a comment -

        Same file, now attached with the license granted.

        Sebb added a comment -

        As to how to determine the charset, it looks as though "ASCII" or "ISO-8859-1" are suitable as the default.
        The GNU docs mention "local variant of ASCII".

        Helmut Minst added a comment -

        Yes, I think "ISO-8859-1" is a suitable default value for the charset, too.

        I've tested it with several special characters, for example the German äöü.

        Sebb added a comment -

        I think there may also be a problem with the TarUtils.formatNameBytes() method, which assumes that String.charAt() can be stored in a byte.

        I'll add some round-trip tests for formatNameBytes() / parseName()

        Sebb added a comment -

        Turned out to be easy to fix the round-trip problem - just ensure that the byte entries are treated as unsigned.
        So no need to worry about charsets.

        Still need to check that this works OK when reading from the test tar file.
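
A sketch of the "treat bytes as unsigned" idea, written from the description in this comment rather than copied from the committed fix. The class name and the exact method signatures are assumptions. Masking with `0xFF` maps each byte to the char with the same numeric value (0 to 255), and the narrowing cast on the way back restores the original byte, so names round-trip without any charset lookup. Characters above U+00FF would still be truncated when written.

```java
class UnsignedRoundTrip {
    // Reads a NUL-terminated name, mapping each byte to the char with the
    // same unsigned value (0..255).
    static String parseName(byte[] buffer, int offset, int length) {
        StringBuilder result = new StringBuilder(length);
        for (int i = offset; i < offset + length; i++) {
            byte b = buffer[i];
            if (b == 0) {
                break;
            }
            result.append((char) (b & 0xFF)); // mask prevents sign extension
        }
        return result.toString();
    }

    // Writes the low 8 bits of each char and pads the rest of the field with NULs.
    static void formatNameBytes(String name, byte[] buffer, int offset, int length) {
        int i;
        for (i = 0; i < length && i < name.length(); i++) {
            buffer[offset + i] = (byte) name.charAt(i);
        }
        for (; i < length; i++) {
            buffer[offset + i] = 0;
        }
    }

    public static void main(String[] args) {
        String name = "0302\u00B1F06\u00B1W220.model";
        byte[] field = new byte[100];
        formatNameBytes(name, field, 0, field.length);
        System.out.println(name.equals(parseName(field, 0, field.length))); // prints true
    }
}
```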

        Sebb added a comment -

        Now fixed; the test tar file reads OK

        Helmut Minst added a comment -

        I've had a look at your solution. This is a better way to solve this problem.
        Thanks a lot!

        Pavel added a comment - edited

        Hello,

        I've checked out the trunk from http://svn.apache.org/repos/asf/commons/proper/compress and run the testRoundTripNames() test from TarUtilsTest. It failed (the last checkName() call, with special characters). The test was performed on Ubuntu 8.10.

        Has the fix been tested on Linux? In which version can I find the final fix for this special characters problem?

        Thanks

        Stefan Bodewig added a comment -

        The test passes for me using Ubuntu 10.04 and OpenJDK 6 - I guess it may depend even more on the Java VM than on the OS. Which flavor of Java are you using, Pavel?

        Pavel added a comment - edited

        Stefan, thanks for a swift reply,

        I'm using Sun JDK 1.6.0_14_b08, but I've just tried it with OpenJDK 1.6.0_0-b12 and have the same result...

        In case it helps: I've checked out the trunk using Eclipse 3.6 (Subversive plugin) and built it using the Maven2 plugin.

        Do you know where I can get a commons-compress.jar (1.1) distro?

        thx

        Stefan Bodewig added a comment -

        A snapshot I compiled myself can be found at http://people.apache.org/~bodewig/commons-compress-1.1-SNAPSHOT.jar and I'll remove it once 1.1 has been released.

        The unit tests pass for me on my Ubuntu system and it's pretty likely that this is more of an environment setting issue. I may also note that the tests pass in the Apache Gump builds both on Linux (Ubuntu 8.04) and Solaris 10.

        Returning to the original problem, commons-compress really doesn't implement POSIX tar or even come close to it. It mostly lives at the least common denominator of all tar dialects, ustar. And this means the only characters that are really supported come from the seven-bit ASCII set - with anything else you can only hope it works.
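
Given that caveat, a caller who wants to stay on the safe side could check entry names before writing them. The following is a sketch under that assumption. The helper `isSevenBitAscii` is hypothetical and not part of Commons Compress, and it relies only on the standard `CharsetEncoder.canEncode` method.

```java
import java.nio.charset.StandardCharsets;

class AsciiNameCheck {
    // Returns true when every character of the name is in the 7-bit ASCII range.
    static boolean isSevenBitAscii(String name) {
        // A fresh encoder per call, since CharsetEncoder is not thread-safe.
        return StandardCharsets.US_ASCII.newEncoder().canEncode(name);
    }

    public static void main(String[] args) {
        System.out.println(isSevenBitAscii("plain-name.model"));   // prints true
        System.out.println(isSevenBitAscii("0302\u00B1F06.model")); // prints false
    }
}
```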

        Helmut Minst added a comment -

        Hi,

        I tested it at the time the bug was fixed by Sebb, on the following operating systems:

        • Ubuntu
        • Windows
        • Mac

        and the solution worked fine.

        grz

        Helmut Minst added a comment -

        See the last comment. The issue was tested on several operating systems.


          People

          • Assignee: Unassigned
          • Reporter: Helmut Minst
          • Votes: 0
          • Watchers: 0
