Bug 51645 - CSVDataSet does not read UTF-8 files when file.encoding is UTF-8
Status: RESOLVED FIXED
Product: JMeter
Classification: Unclassified
Component: Main
Version: 2.4
Hardware: All
OS: All
Importance: P2 major
Assigned To: JMeter issues mailing list
Reported: 2011-08-10 19:48 UTC by Jacob Zwiers
Modified: 2011-08-11 00:37 UTC



Attachments
Patch to fix issue. Variable not renamed to show just a matter of replacing class. (878 bytes, patch)
2011-08-10 19:48 UTC, Jacob Zwiers
Test cases to expose bug. Run with file.encoding=UTF-8 (3.10 KB, patch)
2011-08-10 19:56 UTC, Jacob Zwiers
.csv file for previous test patch (62 bytes, application/vnd.ms-excel)
2011-08-10 19:58 UTC, Jacob Zwiers

Description Jacob Zwiers 2011-08-10 19:48:30 UTC
Created attachment 27366 [details]
Patch to fix issue.  Variable not renamed to show just a matter of replacing class.

CSV Data Sets which are encoded in UTF-8 do not work on platforms where the default file.encoding is UTF-8.

UTF-8 is used to illustrate here, but this would presumably apply to other non-8bit character sets as well.

Reason: the use of ByteArrayOutputStream in the CSVSaveService.csvReadFile() method. Specifically, the baos.write(ch) call is implemented (internally in ByteArrayOutputStream) with a cast to the byte primitive type (buf[count] = (byte) b; in my JVM).

Later, the ByteArrayOutputStream is converted to a String via baos.toString(), which decodes the byte array according to the platform's default charset. If that charset (e.g. ISO-8859-1) is 8-bit, everything is fine. With other charsets (like UTF-8), however, the result is unpredictable and characters can fail to map.

For example, take the character \u00e7 (LATIN SMALL LETTER C WITH CEDILLA), decimal code point 231. When written to baos, it becomes the (8-bit signed) byte value -25. When converted via toString() with UTF-8 as the default charset, that byte is not a valid UTF-8 sequence, so \ufffd (decimal code point 65533, Unicode's REPLACEMENT CHARACTER placeholder) is placed in the return string instead.
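A minimal standalone sketch of the failure mode described above (not the JMeter code itself; the class and variable names are illustrative):

```java
import java.io.ByteArrayOutputStream;

public class TruncationDemo {
    public static void main(String[] args) throws Exception {
        char ch = '\u00E7'; // ç, decimal code point 231
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        baos.write(ch); // stored internally as (byte) ch, i.e. -25 (0xE7)
        // A lone 0xE7 byte is an incomplete UTF-8 sequence, so decoding
        // substitutes U+FFFD, the REPLACEMENT CHARACTER.
        String decoded = baos.toString("UTF-8");
        System.out.println((int) decoded.charAt(0)); // prints 65533
    }
}
```

Running this prints 65533 regardless of what character was originally written, which is exactly the data loss seen when reading UTF-8 CSV files.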

Fix: patch attached. Simply replacing ByteArrayOutputStream with CharArrayWriter makes UTF-8 files work regardless of the value of file.encoding.
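The same sketch with the proposed substitution: CharArrayWriter accumulates full 16-bit chars, so there is no cast to byte and no charset decoding step, and the result is independent of file.encoding (again illustrative, not the actual patch):

```java
import java.io.CharArrayWriter;

public class CharWriterDemo {
    public static void main(String[] args) {
        char ch = '\u00E7'; // ç, decimal code point 231
        CharArrayWriter caw = new CharArrayWriter();
        caw.write(ch); // stores the char itself; no truncation to byte
        String result = caw.toString(); // built from chars; no charset involved
        System.out.println((int) result.charAt(0)); // prints 231
    }
}
```

Because CharArrayWriter.toString() builds the String directly from the accumulated char array, the platform default charset never enters the picture.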
Comment 1 Jacob Zwiers 2011-08-10 19:56:50 UTC
Created attachment 27367 [details]
Test cases to expose bug. Run with file.encoding=UTF-8

Tests will execute successfully if the default file.encoding is ISO-8859-1 (or another 8-bit charset that can handle the characters in the test). However, run with the -Dfile.encoding=UTF-8 VM argument and the tests will fail. Requires the new bin/testfiles/testutf8.csv (attached next).
Comment 2 Jacob Zwiers 2011-08-10 19:58:17 UTC
Created attachment 27368 [details]
.csv file for previous test patch

Required for one of the previously attached tests. Belongs in bin/testfiles
Comment 3 Sebb 2011-08-11 00:37:08 UTC
Thanks very much.

Patch applied:

URL: http://svn.apache.org/viewvc?rev=1156416&view=rev
Log:
Bug 51645 - CSVDataSet does not read UTF-8 files when file.encoding is UTF-8

Added:
   jakarta/jmeter/trunk/bin/testfiles/testutf8.csv   (with props)
Modified:
   jakarta/jmeter/trunk/src/core/org/apache/jmeter/save/CSVSaveService.java
   jakarta/jmeter/trunk/test/src/org/apache/jmeter/config/TestCVSDataSet.java
   jakarta/jmeter/trunk/test/src/org/apache/jmeter/save/TestCSVSaveService.java
   jakarta/jmeter/trunk/xdocs/changes.xml