The ReplaceRegExp task throws IndexOutOfBoundsException for files containing multi-byte encodings.

    java.lang.IndexOutOfBoundsException
        at java.io.BufferedReader.read(BufferedReader.java:256)
        at org.apache.tools.ant.taskdefs.optional.ReplaceRegExp.doReplace(ReplaceRegExp.java:404)
        at org.apache.tools.ant.taskdefs.optional.ReplaceRegExp.execute(ReplaceRegExp.java:491)
        at org.apache.tools.ant.Task.perform(Task.java:319)
        at org.apache.tools.ant.Target.execute(Target.java:309)
        at org.apache.tools.ant.Target.performTasks(Target.java:336)
        at org.apache.tools.ant.Project.executeTarget(Project.java:1306)
        at org.apache.tools.ant.Project.executeTargets(Project.java:1250)
        at org.apache.tools.ant.Main.runBuild(Main.java:610)
        at org.apache.tools.ant.Main.start(Main.java:196)
        at org.apache.tools.ant.Main.main(Main.java:235)

The task was:

    <replaceregexp flags="g" file="regtst">
      <regexp pattern="((Header:\s+\S+|Revision)\s+\S+\s+\S+\s+\S+)\s+(\w+)"/>
      <substitution expression="\1"/>
    </replaceregexp>

The root cause seems to be the assumption that the length of the file is the same as the number of characters in the file. This assumption fails for multi-byte encodings.

ReplaceRegExp.java lines 398 to 406 are:

    int flen = (int) f.length();
    char tmpBuf[] = new char[flen];
    int numread = 0;
    int totread = 0;
    while (numread != -1 && totread < flen) {
        numread = br.read(tmpBuf, totread, flen);
        totread += numread;
    }

The flen is the number of bytes in the file, but it's being misused as the number of characters.

Related symptom: if you use a fileset, you don't get the full stacktrace, only a summary:

    [replaceregexp] An error occurred processing file: '/home/jdb/projects/foo/regtst': java.lang.IndexOutOfBoundsException

Work around: byline="true" uses a different block of code. (But it's still apt to munge your encoding.)

Suggested enhancement: add a file encoding parameter to the task.

Sorry I don't have time to fix this right now.
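For illustration, here is a minimal sketch of a read loop that does not treat f.length() as a character count. It is not the change that went into Ant: the class name ReadWholeFile and the helper readAll are hypothetical, and passing the encoding explicitly is only an assumption about how an encoding parameter could be used. In the quoted loop, once a first read returns fewer than flen characters (which is what a multi-byte file would produce), the next call asks for flen more characters starting at offset totread. That goes past the end of tmpBuf, which appears consistent with the exception raised in BufferedReader.read.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;

// Hypothetical helper class for illustration; not part of Ant.
class ReadWholeFile {

    // Reads every character of the file using the given encoding.
    // The buffer size is fixed, so the byte length of the file is never
    // used as a character count.
    static String readAll(File f, String encoding) throws IOException {
        try (Reader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(f), encoding))) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            // Always read into the start of buf and never request more
            // than buf.length, so offset + length stays within bounds.
            while ((n = in.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        }
    }

    // Usage: java ReadWholeFile <file> [encoding]  (encoding defaults to UTF-8 here)
    public static void main(String[] args) throws IOException {
        File f = new File(args[0]);
        String encoding = args.length > 1 ? args[1] : "UTF-8";
        String content = readAll(f, encoding);
        System.out.println("bytes=" + f.length() + " chars=" + content.length());
    }
}
```

Run against a file in a multi-byte encoding, the character count printed should be smaller than the byte count, which is the mismatch described above.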
There is an encoding attribute in 1.6. The code also changed significantly between 1.5.1 and 1.5.3, so chances are very good that this already works in 1.5.3, at least as long as the multi-byte encoding is your native file encoding.
It must still be broken in 1.5.3, because there were no code changes in ReplaceRegExp.java from 1.5.1. But I don't know about 1.6, and an encoding attribute sounds good!
You are correct, the changes didn't get merged over. Moreover, I just rechecked CVS HEAD and realized that the code path taken when byline is set to false still has this problem for multi-byte encodings and, even worse, ignores the encoding attribute.
OK, should be fixed in nightly build 2003-04-25