Bug 19187 - ReplaceRegExp cannot handle multi-byte encodings
Status: RESOLVED FIXED
Alias: None
Product: Ant
Classification: Unclassified
Component: Optional Tasks
Version: 1.5.3
Hardware: Other other
Importance: P3 normal
Target Milestone: 1.6
Assignee: Ant Notifications List
Reported: 2003-04-21 07:47 UTC by J. David Beutel
Modified: 2004-11-16 19:05 UTC



Description J. David Beutel 2003-04-21 07:47:50 UTC
The ReplaceRegExp task throws IndexOutOfBoundsException for files containing
multi-byte encodings.

java.lang.IndexOutOfBoundsException
        at java.io.BufferedReader.read(BufferedReader.java:256)
        at org.apache.tools.ant.taskdefs.optional.ReplaceRegExp.doReplace(ReplaceRegExp.java:404)
        at org.apache.tools.ant.taskdefs.optional.ReplaceRegExp.execute(ReplaceRegExp.java:491)
        at org.apache.tools.ant.Task.perform(Task.java:319)
        at org.apache.tools.ant.Target.execute(Target.java:309)
        at org.apache.tools.ant.Target.performTasks(Target.java:336)
        at org.apache.tools.ant.Project.executeTarget(Project.java:1306)
        at org.apache.tools.ant.Project.executeTargets(Project.java:1250)
        at org.apache.tools.ant.Main.runBuild(Main.java:610)
        at org.apache.tools.ant.Main.start(Main.java:196)
        at org.apache.tools.ant.Main.main(Main.java:235)

The task was:
    <replaceregexp flags="g" file="regtst">
        <regexp pattern="((Header:\s+\S+|Revision)\s+\S+\s+\S+\s+\S+)\s+(\w+)"/>
        <substitution expression="\1"/>
    </replaceregexp>

The root cause seems to be the assumption that the length of the file is the
same as the number of characters in the file.  This assumption fails for
multi-byte encodings.  ReplaceRegExp.java lines 398 to 406 are: 
                int flen = (int) f.length();
                char tmpBuf[] = new char[flen]; 
                int numread = 0;
                int totread = 0; 

                while (numread != -1 && totread < flen) {
                    numread = br.read(tmpBuf, totread, flen);
                    totread += numread; 
                }
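flen is the number of bytes in the file, but it is being misused as the
number of characters.  With a multi-byte encoding the first read returns fewer
than flen characters, so the next call, read(tmpBuf, totread, flen), asks for
more characters than fit in the buffer after offset totread and throws.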
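
To illustrate the failure mode described above, here is a minimal sketch of a bounded read loop that requests only the remaining buffer space and stops at end of stream. It is not the Ant source and not necessarily how the task was eventually fixed; the class BoundedReadSketch, the method readUpTo, and the byteLength parameter (standing in for f.length()) are hypothetical names. It assumes the decoded character count never exceeds the byte count, which holds for UTF-8 but is only an assumption in general.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;

public class BoundedReadSketch {

    /**
     * Reads at most byteLength characters from in.
     * byteLength is a hypothetical stand-in for (int) f.length().
     */
    static String readUpTo(Reader in, int byteLength) throws IOException {
        char[] tmpBuf = new char[byteLength];
        int totread = 0;
        while (totread < byteLength) {
            // Request only the space that is left. Requesting byteLength
            // again once totread > 0 would run past the end of tmpBuf.
            int numread = in.read(tmpBuf, totread, byteLength - totread);
            if (numread == -1) {
                break; // end of stream: fewer characters than bytes
            }
            totread += numread;
        }
        return new String(tmpBuf, 0, totread);
    }

    public static void main(String[] args) throws IOException {
        // Three characters that occupy nine bytes in UTF-8.
        String text = "\u65e5\u672c\u8a9e";
        int byteLength = text.getBytes(StandardCharsets.UTF_8).length;
        String result = readUpTo(new StringReader(text), byteLength);
        System.out.println(result.length() + " chars read, byte length " + byteLength);
    }
}
```

The two differences from the quoted loop are the length argument (byteLength - totread instead of flen) and the end-of-stream check before totread is updated, so a return value of -1 is never added to the running total.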

Related symptom: if you use a fileset, you don't get the full stack trace, only a
summary:
[replaceregexp] An error occurred processing file: '/home/jdb/projects/foo/regtst': java.lang.IndexOutOfBoundsException

Workaround:
Setting byline="true" takes a different code path that avoids the exception.
(But it may still corrupt the file's encoding.)

Suggested enhancement:
Add a file encoding parameter to the task.
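
As a rough sketch of what an encoding-aware read could look like, the code below decodes a file with a caller-supplied charset name and grows the result as needed, so it never relies on the file's byte length. This is an illustration only, not Ant's implementation; EncodedReadSketch and readFile are hypothetical names, and treating a null encoding as "use the platform default" is an assumption.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

public class EncodedReadSketch {

    /**
     * Reads the whole file as text. A null encoding falls back to the
     * platform default charset (an assumption made for this sketch).
     */
    static String readFile(File f, String encoding) throws IOException {
        try (InputStream in = new FileInputStream(f);
             Reader reader = (encoding == null)
                     ? new InputStreamReader(in)
                     : new InputStreamReader(in, encoding)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            // Append whatever each read returns until end of stream.
            while ((n = reader.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: EncodedReadSketch <file> [encoding]");
            return;
        }
        String encoding = args.length > 1 ? args[1] : null;
        String text = readFile(new File(args[0]), encoding);
        System.out.println(text.length() + " chars, file length " + new File(args[0]).length());
    }
}
```

Writing the result back would need the same charset on the output side (for example an OutputStreamWriter constructed with the same encoding name), otherwise the replacement could still change the file's encoding.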

Sorry I don't have time to fix this right now.

Comment 1 Stefan Bodewig 2003-04-23 09:44:47 UTC
There is an encoding attribute in 1.6, but the code changed significantly
between 1.5.1 and 1.5.3, so chances are very good that it already works with
1.5.3, at least as long as the multi-byte encoding is your native
file encoding.
Comment 2 J. David Beutel 2003-04-24 05:59:47 UTC
It must still be broken in 1.5.3, because there were no code changes in
ReplaceRegExp.java from 1.5.1.  

But I don't know about 1.6, and an encoding attribute sounds good!
Comment 3 Stefan Bodewig 2003-04-24 06:12:49 UTC
You are correct, the changes didn't get merged over.

Moreover, I just rechecked CVS HEAD and realized that the code path taken when
byline is set to false still has this problem with multi-byte encodings
and, even worse, ignores the encoding attribute.
Comment 4 Stefan Bodewig 2003-04-24 09:13:13 UTC
OK, this should be fixed in the nightly build of 2003-04-25.