Bug 36851 - <tar> Task does not support multi-byte file names
Status: NEW
Product: Ant
Classification: Unclassified
Component: Core tasks
Version: 1.5.4
Hardware: PC Linux
Importance: P2 normal with 1 vote
Assigned To: Ant Notifications List
Duplicates: 41455
Depends on:
Blocks:
Reported: 2005-09-29 01:00 UTC by Daniel Rall
Modified: 2012-06-17 05:14 UTC (History)

Attachments
Tweaks illustrating a couple of the problem areas (1.41 KB, patch)
2005-09-29 01:06 UTC, Daniel Rall

Description Daniel Rall 2005-09-29 01:00:26 UTC
File names which contain multi-byte characters are not handled properly by the
supporting classes of Ant's <tar> Task.  These classes treat characters as
bytes, which is not a valid assumption for non-ASCII characters (e.g. Japanese,
Korean, Chinese, etc.).  This problem was first noticed against the 1.5.4
release, but I've determined that it exists in today's HEAD, and has probably
existed since Stefano/Conor first added this code.
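
To make the char-versus-byte mismatch concrete, here is a minimal sketch. It is not taken from Ant's sources; the class and method names (NameLengthDemo, charCount, utf8ByteCount) are made up for illustration, and it assumes the name is encoded as UTF-8. Any code that sizes a buffer or header field from the character count will come up short once the name is encoded.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Ant.
public class NameLengthDemo {

    // Length a "one char == one byte" implementation would assume.
    static int charCount(String name) {
        return name.length();
    }

    // Length actually needed once the name is encoded as UTF-8 (assumed encoding).
    static int utf8ByteCount(String name) {
        return name.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Three Japanese characters followed by ".txt"
        String name = "\u65e5\u672c\u8a9e.txt";
        System.out.println("chars: " + charCount(name));          // 7
        System.out.println("UTF-8 bytes: " + utf8ByteCount(name)); // 13
    }
}
```

Each of the three Japanese characters takes three bytes in UTF-8, so the encoded name is nearly twice as long as its character count suggests.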


Steps to reproduce:
===================
1. Create a file whose name contains Japanese characters, encoded as UTF-8.
2. Invoke the <tar> task on that file, either programmatically or via a
build.xml file, using the GNU tar long file names extension longfile="gnu".


Observed behavior:
==================
You're greeted with the error message:

Problem creating TAR: request to write '125' bytes exceeds size in header of
'104' bytes

Re-running Ant with -verbose/-debug will produce the following stack trace
(using Ant 1.5.4):

--- Nested Exception ---
java.io.IOException: request to write '125' bytes exceeds size in header of
'104' bytes
        at org.apache.tools.tar.TarOutputStream.write(TarOutputStream.java:274)
        at org.apache.tools.tar.TarOutputStream.write(TarOutputStream.java:256)
        at org.apache.tools.tar.TarOutputStream.putNextEntry(TarOutputStream.java:184)
        at org.apache.tools.ant.taskdefs.Tar.tarFile(Tar.java:410)
        at org.apache.tools.ant.taskdefs.Tar.execute(Tar.java:322)

(I'll attach a patch against HEAD which shows a couple of the problem areas, but
is untested.)


Expected behavior:
==================
As with GNU tar, archiving and unpacking of this data should be handled without
error.


Current work-around:
====================
Fork gtar from Ant.
Comment 1 Daniel Rall 2005-09-29 01:06:58 UTC
Created attachment 16547
Tweaks illustrating a couple of the problem areas

Much like the file name used in TarEntry.writeEntryHeader(), the userId,
groupId, linkName, magic, userName, and groupName may need similar adjustments
to avoid possible erroneous one-to-one mappings from their characters to bytes.


This patch assumes that the getBytes() call will use a character encoding like
UTF-8, and that the file names are using the same encoding.  This assumption
will likely often be false for the use case described by this issue.
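
A rough sketch of why that assumption matters, using only standard JDK calls. The class name CharsetDependenceDemo and its helper methods are hypothetical and do not appear in the patch. The no-argument getBytes() uses the platform default charset, so the same file name can yield different byte sequences, or lose information, depending on where Ant runs.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Ant or the attached patch.
public class CharsetDependenceDemo {

    // Number of bytes the name occupies in the given charset.
    static int encodedLength(String name, Charset cs) {
        return name.getBytes(cs).length;
    }

    // True if encoding and then decoding with the charset restores the name.
    static boolean roundTrips(String name, Charset cs) {
        return new String(name.getBytes(cs), cs).equals(name);
    }

    public static void main(String[] args) {
        String name = "\u65e5\u672c\u8a9e.txt"; // three Japanese characters + ".txt"

        // Platform dependent: whatever the JVM's default charset happens to be.
        System.out.println("default (" + Charset.defaultCharset() + "): "
                + name.getBytes().length + " bytes");

        // Explicit charsets give predictable results.
        System.out.println("UTF-8: " + encodedLength(name, StandardCharsets.UTF_8)
                + " bytes, round-trips: " + roundTrips(name, StandardCharsets.UTF_8));
        // ISO-8859-1 cannot represent these characters; each becomes '?'.
        System.out.println("ISO-8859-1: " + encodedLength(name, StandardCharsets.ISO_8859_1)
                + " bytes, round-trips: " + roundTrips(name, StandardCharsets.ISO_8859_1));
    }
}
```

Passing an explicit Charset removes the dependence on the JVM default, but it still only helps if the reader of the archive decodes with the same charset.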
Comment 2 Stefan Bodewig 2008-09-16 03:04:13 UTC
*** Bug 41455 has been marked as a duplicate of this bug. ***
Comment 3 Stefan Bodewig 2012-06-17 05:14:26 UTC
As of svn revision 1350857, the infrastructure for proper handling of encoding is in place, but the task doesn't use it.

The only real way to handle characters other than ASCII is to use PAX extension headers (which explicitly use UTF-8). Any other approach is non-portable, or will only work if the file name encodings of the source and target machines match.
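
For illustration, a minimal sketch of how a single PAX extended header record for a file name could be built. This is not the code from the svn revision mentioned above; the class and method names (PaxRecordDemo, paxRecord) are hypothetical. It relies on the record layout defined by the POSIX pax format: "<length> <keyword>=<value>\n", where <length> is the decimal size in bytes of the whole record, including the length digits themselves, and the value is UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not Ant's implementation.
public class PaxRecordDemo {

    // Builds one pax record: "<length> <key>=<value>\n", encoded as UTF-8.
    static byte[] paxRecord(String key, String value) {
        byte[] body = (" " + key + "=" + value + "\n").getBytes(StandardCharsets.UTF_8);

        // The length field counts its own digits, so iterate until it is stable.
        int total = body.length;
        int previous;
        do {
            previous = total;
            total = body.length + Integer.toString(previous).length();
        } while (total != previous);

        byte[] prefix = Integer.toString(total).getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream out = new ByteArrayOutputStream(total);
        out.write(prefix, 0, prefix.length);
        out.write(body, 0, body.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] ascii = paxRecord("path", "foo");
        System.out.print(new String(ascii, StandardCharsets.UTF_8)); // 12 path=foo

        byte[] japanese = paxRecord("path", "\u65e5\u672c\u8a9e.txt");
        System.out.println("record size: " + japanese.length + " bytes");
    }
}
```

Because the length is measured in bytes of the UTF-8 encoding rather than in characters, a reader can skip or parse the record without knowing the platform encoding of the machine that wrote it.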