Abdera / ABDERA-222

Parse failures reading UTF-8 XML files whose attribute values contain valid non-US-ASCII UTF-8 characters

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.4.0
    • Fix Version/s: None
    • Labels:
      None
    • Environment:
      Solaris x86_64, Mac OS X Leopard x86_64, Linux x86_64

      Description

      Parsing fails for XML files that are items fetched by HttpClient 3.1 when the response stream is passed directly to the parser:
      parser.parse(response.getResponseBodyAsStream());

      The same items parse correctly if they are first written to a byte array and a ByteArrayInputStream on that byte array is passed to parse.
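
      The byte-array workaround can be sketched as below. This is a minimal sketch using only JDK classes, not code from the attachment; the class name BufferedBody and the helpers readFully and buffer are hypothetical. The idea is to drain the response stream into memory first, so that the parser is handed a stream whose reads are never cut short at a network or chunk boundary.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper class, not part of Abdera or HttpClient.
class BufferedBody {
    // Drain the stream into memory, tolerating reads that return fewer bytes than requested.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    // Wrap the fully buffered body so that every later read is served from memory.
    static InputStream buffer(InputStream in) throws IOException {
        return new ByteArrayInputStream(readFully(in));
    }
}
```

      Under these assumptions, the failing call would become parser.parse(BufferedBody.buffer(response.getResponseBodyAsStream())). The cost is holding the whole response body in memory.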

      Caused by: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character (NULL, unicode 0) encountered: not valid in any content
      at [row,col {unknown-source}]: [3,56]
      at com.ctc.wstx.sr.StreamScanner.constructNullCharException(StreamScanner.java:615)
      at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:644)
      at com.ctc.wstx.sr.BasicStreamReader.readTextPrimary(BasicStreamReader.java:4554)
      at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2886)
      at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1019)
      at org.apache.abdera.parser.stax.FOMBuilder.getNextElementToParse(FOMBuilder.java:163)
      at org.apache.abdera.parser.stax.FOMBuilder.next(FOMBuilder.java:187)

      1. ChunkedTransferFailure.java
        9 kB
        Jason Venner (www.prohadoop.com)

        Activity

        jasnell James M Snell added a comment -

        Fix committed.

        jv_ning Jason Venner (www.prohadoop.com) added a comment -

        This code, when run against Abdera 0.4.0 using HttpClient 3.1, demonstrates the chunked-transfer multi-byte failures.

        There are two examples in the code:
        one that places a multi-byte character at position 0 in a chunk (the byte array rawChunkWithMultiByteAtStart),
        and one that does not place a multi-byte character at position 0 of any chunk (rawNoChunkWithMultiByteAtStart).

        jv_ning Jason Venner (www.prohadoop.com) added a comment -

        In all failing cases, if I pass the parser a new InputStreamReader(method.getResponseBodyAsStream(), "UTF-8"), the parse and element extraction are successful.

        This is definitely a bug in the new i18n code.
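
        The reader-based workaround can be sketched as follows. This is a minimal sketch with JDK classes only; the class name ReaderWorkaround and the helpers asUtf8Reader and readAll are hypothetical and stand in for the HttpClient and Abdera calls. It shows the body being decoded as UTF-8 by an InputStreamReader before anything downstream looks at the bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Abdera or HttpClient.
class ReaderWorkaround {
    // Decode the body as UTF-8 so the consumer receives characters, not raw bytes.
    static Reader asUtf8Reader(InputStream body) {
        return new InputStreamReader(body, StandardCharsets.UTF_8);
    }

    // Demo helper: read every character the reader produces.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // An attribute value containing a two-byte UTF-8 character.
        byte[] xml = "<entry title=\"caf\u00e9\"/>".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(asUtf8Reader(new ByteArrayInputStream(xml))));
    }
}
```

        In the failing application, the stream passed to asUtf8Reader would be the response body stream, and the resulting Reader would be handed to the parser in place of the raw stream.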

        jv_ning Jason Venner (www.prohadoop.com) added a comment -

        HttpClient is using a ChunkedInputStream under the covers, which ensures that no read spans a chunk boundary.
        The Jetty server on the other side is arranging chunks so that the multi-byte characters start the chunks.
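
        To reproduce this read pattern without a server, a stream that never lets a read cross a chunk boundary can be simulated. The sketch below is an assumption-laden stand-in, not HttpClient's ChunkedInputStream: the class name ChunkBoundedStream and its fixed chunkSize parameter are hypothetical, and real chunk sizes vary per chunk.

```java
import java.io.InputStream;

// Hypothetical simulation of a stream whose reads stop at chunk boundaries.
class ChunkBoundedStream extends InputStream {
    private final byte[] data;
    private final int chunkSize;
    private int pos = 0;

    ChunkBoundedStream(byte[] data, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.data = data;
        this.chunkSize = chunkSize;
    }

    @Override
    public int read() {
        return pos < data.length ? (data[pos++] & 0xff) : -1;
    }

    @Override
    public int read(byte[] buf, int off, int len) {
        if (len == 0) {
            return 0;
        }
        if (pos >= data.length) {
            return -1;
        }
        // Return at most the bytes remaining in the current chunk, even if more were requested.
        int leftInChunk = chunkSize - (pos % chunkSize);
        int n = Math.min(len, Math.min(leftInChunk, data.length - pos));
        System.arraycopy(data, pos, buf, off, n);
        pos += n;
        return n;
    }
}
```

        Choosing chunkSize so that a multi-byte character's lead byte falls at a multiple of chunkSize makes that character the first byte returned by a read, which is the condition described above.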

        jv_ning Jason Venner (www.prohadoop.com) added a comment -

        This appears to trigger when the socket read boundaries fall such that the first byte of a multi-byte character is the first byte in a read from the network socket.

        In our failing case, there are 3 reads issued against the input stream returned by the HttpMethod:
        1 for 4 bytes
        1 for 196 bytes
        1 for 3800 bytes
        and then for 4 KB.

        In our failing case, the read for 196 bytes returns fewer than 196 bytes, and the first byte returned by the next read is the start byte of our multi-byte character.
        The multi-byte character is returned in the 3rd READ_ARRAY call and written to position 200 in the input buffer.
        When the multi-byte character is not the first byte sequence returned by a read, there is no exception.

        "TIME" "method" "read byte count" "read byte count after mark resets" "where read data is written into the buffer passed to read" "read request size" "count read"
        1238017735367  " AVAILABLE"     0     0     0     4     4
        1238017735367  "READ_ARRAY"     0     0
        1238017735367  " AVAILABLE"     4     4
        1238017735367  "READ_ARRAY"     4     4     4   196   158
        1238017735367  " AVAILABLE"   162   162
        1238017735367  "READ_ARRAY"   162   162   200  3800  2890
        1238017735370  " CLOSE"      3052  3052
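
        A trace of this kind can be collected by wrapping the response stream in a logging filter. The following is a rough sketch, not the instrumentation actually used for the trace above: the class name ReadLoggingStream and its log format are hypothetical, it omits the timestamp and the "after mark resets" column, and it does not log single-byte read() calls.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper that records each call made against the wrapped stream.
class ReadLoggingStream extends FilterInputStream {
    final List<String> log = new ArrayList<>();
    private long total = 0; // bytes returned by array reads so far

    ReadLoggingStream(InputStream in) {
        super(in);
    }

    @Override
    public int available() throws IOException {
        log.add("AVAILABLE " + total);
        return super.available();
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        long before = total;
        int n = super.read(buf, off, len);
        if (n > 0) {
            total += n;
        }
        // Fields: bytes read before this call, buffer offset, request size, count returned.
        log.add("READ_ARRAY " + before + " " + off + " " + len + " " + n);
        return n;
    }

    @Override
    public void close() throws IOException {
        log.add("CLOSE " + total);
        super.close();
    }
}
```

        A read whose returned count is smaller than its request size, followed by a read that starts with a UTF-8 lead byte, is the pattern to look for in the resulting log.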

        merfifis Ronal Cori added a comment -

        interesting


          People

          • Assignee:
            jasnell James M Snell
            Reporter:
            jv_ning Jason Venner (www.prohadoop.com)
          • Votes:
            0
            Watchers:
            1
