  1. Xerces2-J
  2. XERCESJ-1668

Off-by-one bug w/ surrogates in UTF8Reader

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Other
    • Labels:
      None
    • Flags:
      Patch

      Description

      There's a bug in the surrogate handling when the reader buffer is exhausted and only the high surrogate can be written. On the next read the low surrogate gets added, but the calculation of the remaining buffer space is then off by one.
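To make the failure mode concrete, here is a minimal sketch of the bookkeeping involved. It is not the Xerces `UTF8Reader` source: the class name `PendingSurrogateSketch` and its fields are hypothetical, and it works on already-decoded code points rather than raw UTF-8 bytes. It only illustrates that a held-over low surrogate has to count against the space left in the buffer and against the returned length.

```java
// Hypothetical sketch, not the Xerces UTF8Reader implementation.
final class PendingSurrogateSketch {
    private final int[] codePoints; // already-decoded input, stands in for the byte stream
    private int next = 0;           // index of the next code point to emit
    private int pendingLow = -1;    // low surrogate held over from the previous read, or -1

    PendingSurrogateSketch(int... codePoints) {
        this.codePoints = codePoints;
    }

    /** Writes up to length chars into buf and returns the count, or -1 at end of input. */
    int read(char[] buf, int offset, int length) {
        if (length == 0) {
            return 0;
        }
        int out = offset;
        final int end = offset + length;

        // The held-over low surrogate goes first and uses up one slot of the buffer.
        if (pendingLow != -1) {
            buf[out++] = (char) pendingLow;
            pendingLow = -1;
        }

        while (out < end && next < codePoints.length) {
            int cp = codePoints[next++];
            if (Character.isSupplementaryCodePoint(cp)) {
                buf[out++] = Character.highSurrogate(cp);
                if (out < end) {
                    buf[out++] = Character.lowSurrogate(cp);
                } else {
                    // Buffer is full: remember the low half for the next call.
                    pendingLow = Character.lowSurrogate(cp);
                }
            } else {
                buf[out++] = (char) cp;
            }
        }

        // A lone pending low surrogate at end of input still counts as one char.
        int written = out - offset;
        return written == 0 ? -1 : written;
    }
}
```

With a two-char buffer and the input `'a'` followed by U+1F600, the first call fills the buffer with `'a'` and the high surrogate, the second call returns 1 for the low surrogate, and only the third call reports end of input.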

      This gets triggered when parsing the current enwiktionary dump file.

      org.xml.sax.SAXParseException; lineNumber: 99849520; columnNumber: 47; Invalid byte 2 of 4-byte UTF-8 sequence.
      

      The attached patch adds a fix and a test case for this bug. A related issue: when the low surrogate is written as the last part of the stream, -1 is returned instead of 1.

      Is UTF8Reader still necessary? It might be safer to just use a plain InputStreamReader.
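For comparison, a small sketch of what the plain `InputStreamReader` route could look like. The class and method names (`PlainReaderSketch`, `decodeInChunks`) are made up for illustration, and the tiny chunk size is only there to force surrogate pairs to straddle buffer boundaries. It says nothing about how Xerces would need to be changed internally.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Xerces.
final class PlainReaderSketch {
    /** Decodes UTF-8 bytes with the JDK decoder, reading at most chunkSize chars per call. */
    static String decodeInChunks(byte[] utf8, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[chunkSize];
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            int n;
            // read() returns the number of chars written, or -1 at end of stream
            while ((n = reader.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

As a possible workaround on the caller's side, a reader built this way could be passed to the parser through the `org.xml.sax.InputSource` constructor that takes a character stream, so that the JDK does the byte decoding.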

        Attachments

        1. surrogate.patch
          4 kB
          Jan Berkel

              People

              • Assignee:
                Unassigned
                Reporter:
                jberkel Jan Berkel
              • Votes:
                3
                Watchers:
                5
