Details
Improvement
Status: Resolved
P2
Resolution: Fixed
0.6.0
Description
When I read a file encoded in ISO-8859-1 that contains the character é, I get an exception like:
Caused by: java.io.CharConversionException: Invalid UTF-8 middle byte 0x64 (at char #1061, byte #1012)
    at com.ctc.wstx.io.UTF8Reader.reportInvalidOther(UTF8Reader.java:314)
    at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:205)
    at com.ctc.wstx.io.MergedReader.read(MergedReader.java:105)
    at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:86)
    at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:56)
    at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:1001)
    ... 19 more
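The message is consistent with ISO-8859-1 bytes being decoded as UTF-8: ISO-8859-1 encodes é as the single byte 0xE9, which UTF-8 treats as the first byte of a three-byte sequence, so the next byte (here 0x64, the letter "d") is rejected as an invalid continuation byte. The following is a minimal sketch that reproduces the mismatch with the JDK's strict decoder rather than with Woodstox; the class name and the sample word are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Beam or Woodstox.
final class Latin1AsUtf8 {

    /** Returns true if the bytes decode as UTF-8 in strict mode (errors reported, not replaced). */
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Sample word starting with é (U+00E9) followed by 'd' (0x64).
        String word = "\u00e9dition";

        // In ISO-8859-1 the é is the single byte 0xE9.
        byte[] latin1 = word.getBytes(StandardCharsets.ISO_8859_1);
        // In UTF-8 the é is the two bytes 0xC3 0xA9.
        byte[] utf8 = word.getBytes(StandardCharsets.UTF_8);

        System.out.println("ISO-8859-1 bytes valid as UTF-8: " + isValidUtf8(latin1));
        System.out.println("UTF-8 bytes valid as UTF-8: " + isValidUtf8(utf8));
    }
}
```

Decoding the same bytes with `StandardCharsets.ISO_8859_1` recovers the original text, which is why a configurable charset would resolve the problem.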
The encoding is hardcoded at these locations:
https://github.com/apache/beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java#L190
https://github.com/apache/beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java#L238
https://github.com/apache/beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java#L342
It would be great if I could specify it like this:
XmlSource.from[MyClass](input)
  .withRootElement("ROOT_ELEMENT")
  .withRecordElement("MyClass")
  .withRecordClass(classOf[MyClass])
  .withCharset(StandardCharsets.ISO_8859_1)
I can provide a pull request if you want.
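As a rough illustration of what a configurable charset could do underneath, the sketch below passes an explicit encoding to the standard StAX factory method `XMLInputFactory.createXMLStreamReader(InputStream, String)`. It is not the XmlSource implementation: the class name, the `readText` helper, and the sample document are assumptions made for this example, and it runs against whichever StAX implementation is on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of Beam.
final class CharsetAwareXmlRead {

    /** Concatenates all character data in the XML stream, decoding bytes with the given charset. */
    public static String readText(InputStream in, Charset charset) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // The second argument tells the parser which encoding to use
            // instead of falling back to a fixed default.
            XMLStreamReader reader = factory.createXMLStreamReader(in, charset.name());
            StringBuilder text = new StringBuilder();
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.CHARACTERS) {
                        text.append(reader.getText());
                    }
                }
            } finally {
                reader.close();
            }
            return text.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Failed to parse XML as " + charset, e);
        }
    }

    public static void main(String[] args) {
        // Sample record containing é, serialized as ISO-8859-1 (made-up data).
        String xml = "<ROOT_ELEMENT><MyClass>\u00e9dition</MyClass></ROOT_ELEMENT>";
        byte[] latin1 = xml.getBytes(StandardCharsets.ISO_8859_1);

        String text = readText(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1);
        System.out.println(text);
    }
}
```

A `withCharset(...)` option on the source would then only need to carry the chosen `Charset` to the point where the reader is created.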