Beam / BEAM-2060

XmlIO uses a hardcoded Charset

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: P2
    • Resolution: Fixed
    • Affects Version/s: 0.6.0
    • Fix Version/s: 2.0.0
    • Component/s: sdk-java-core

    Description

      When I read a file encoded in ISO-8859-1 that contains the character é, I get an exception like:

      Caused by: java.io.CharConversionException: Invalid UTF-8 middle byte 0x64 (at char #1061, byte #1012)
      	at com.ctc.wstx.io.UTF8Reader.reportInvalidOther(UTF8Reader.java:314)
      	at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:205)
      	at com.ctc.wstx.io.MergedReader.read(MergedReader.java:105)
      	at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:86)
      	at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:56)
      	at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:1001)
      	... 19 more
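
The message above is consistent with ISO-8859-1 bytes being decoded as UTF-8: in ISO-8859-1, é is the single byte 0xE9, which UTF-8 treats as the lead byte of a three-byte sequence, so the byte that follows (here 0x64, the letter d) is rejected as an invalid continuation byte. A minimal sketch of that mismatch, using only the JDK rather than Beam or Woodstox; the class name, method name and sample text are hypothetical:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {

    // Decodes bytes strictly: malformed input raises an exception
    // instead of being silently replaced with U+FFFD.
    public static String strictDecode(byte[] bytes, Charset charset)
            throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample text; \u00e9 is 'é', encoded as the single byte 0xE9.
        byte[] latin1 = "m\u00e9decin".getBytes(StandardCharsets.ISO_8859_1);

        try {
            strictDecode(latin1, StandardCharsets.UTF_8);
            System.out.println("Decoded as UTF-8 without error");
        } catch (CharacterCodingException e) {
            // 0xE9 announces a three-byte sequence, but 0x64 ('d') is not a continuation byte.
            System.out.println("UTF-8 decoding failed: " + e);
        }

        // Decoding with the charset the file was actually written in succeeds.
        System.out.println(strictDecode(latin1, StandardCharsets.ISO_8859_1));
    }
}
```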
      

      The encoding is hardcoded in XmlSource at the following locations:

      https://github.com/apache/beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java#L190
      https://github.com/apache/beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java#L238
      https://github.com/apache/beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java#L342

      It would be great if I could specify it like this:

      XmlSource.from[MyClass](input)
            .withRootElement("ROOT_ELEMENT")
            .withRecordElement("MyClass")
            .withRecordClass(classOf[MyClass])
            .withCharset(StandardCharsets.ISO_8859_1)
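
One way such a setting could reach the parser is to wrap the input stream in a Reader built with the configured charset and hand that Reader to the StAX factory. The following is only a sketch under that assumption, written against the JDK's built-in javax.xml.stream API; ConfigurableCharsetXmlReader and readFirstElementText are hypothetical names and do not reflect how XmlSource is actually implemented:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class ConfigurableCharsetXmlReader {
    private final Charset charset;

    // The charset is supplied by the caller instead of being fixed in the reader.
    public ConfigurableCharsetXmlReader(Charset charset) {
        this.charset = charset;
    }

    // Returns the text of the first element with the given local name, or null if absent.
    public String readFirstElementText(InputStream in, String elementName)
            throws XMLStreamException {
        // Decode bytes with the configured charset before the XML parser sees them.
        Reader reader = new InputStreamReader(in, charset);
        XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(reader);
        try {
            while (xml.hasNext()) {
                if (xml.next() == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(xml.getLocalName())) {
                    return xml.getElementText();
                }
            }
            return null;
        } finally {
            xml.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical document, encoded as ISO-8859-1 (\u00e9 is 'é').
        byte[] bytes = "<ROOT_ELEMENT><MyClass>m\u00e9decin</MyClass></ROOT_ELEMENT>"
                .getBytes(StandardCharsets.ISO_8859_1);
        ConfigurableCharsetXmlReader reader =
                new ConfigurableCharsetXmlReader(StandardCharsets.ISO_8859_1);
        System.out.println(reader.readFirstElementText(new ByteArrayInputStream(bytes), "MyClass"));
    }
}
```

Passing a Reader rather than a raw InputStream means the caller's charset decides how bytes are decoded, which is the behaviour the proposed withCharset option asks for.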
      

      I can provide a pull request if you want.


          People

            Assignee: jbonofre Jean-Baptiste Onofré
            Reporter: dgouyette Damien GOUYETTE
            Votes: 0
            Watchers: 5
