Lucene - Core / LUCENE-2016

replace invalid U+FFFF character during indexing

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4, 2.4.1, 2.9
    • Fix Version/s: 2.9.1, 3.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      If the invalid U+FFFF character is embedded in a token, indexing silently corrupts the index by writing duplicate terms into the terms dictionary. CheckIndex will catch the error, and merging will hit exceptions (I think).

      We already replace invalid surrogate pairs with the replacement character U+FFFD, so I'll just do the same with U+FFFF.
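
The replacement described above can be illustrated with a small standalone sketch. This is not the attached patch and not Lucene's code: the class name `InvalidCharSanitizer` and the method `sanitize` are hypothetical, and the sketch assumes the token is available as a Java `String` (UTF-16). It maps U+FFFF and any unpaired surrogate to U+FFFD and leaves valid surrogate pairs untouched.

```java
// Illustrative sketch only; names are hypothetical and not part of Lucene.
final class InvalidCharSanitizer {
    static final char REPLACEMENT = '\uFFFD'; // Unicode replacement character
    static final char INVALID = '\uFFFF';     // noncharacter named in this issue

    // Returns a copy of token with U+FFFF and unpaired surrogates replaced by U+FFFD.
    static String sanitize(String token) {
        StringBuilder out = new StringBuilder(token.length());
        int i = 0;
        while (i < token.length()) {
            char c = token.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < token.length() && Character.isLowSurrogate(token.charAt(i + 1))) {
                    // Valid surrogate pair: copy both code units unchanged.
                    out.append(c).append(token.charAt(i + 1));
                    i += 2;
                    continue;
                }
                // High surrogate with no low surrogate after it.
                out.append(REPLACEMENT);
            } else if (Character.isLowSurrogate(c) || c == INVALID) {
                // Stray low surrogate, or the U+FFFF noncharacter.
                out.append(REPLACEMENT);
            } else {
                out.append(c);
            }
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String cleaned = sanitize("foo" + INVALID + "bar");
        // Print code units in hex so the result does not depend on console encoding.
        for (int k = 0; k < cleaned.length(); k++) {
            System.out.printf("U+%04X ", (int) cleaned.charAt(k));
        }
        System.out.println();
    }
}
```

Working on UTF-16 code units keeps the check for U+FFFF to a single comparison per `char`, alongside the surrogate checks the description says are already in place.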

      1. LUCENE-2016.patch
        2 kB
        Michael McCandless

        Activity

        Michael McCandless created issue
        Michael McCandless made changes
          Field: Attachment; Original Value: (none); New Value: LUCENE-2016.patch [ 12423593 ]
        Michael McCandless made changes
          Field: Status; Original Value: Open [ 1 ]; New Value: Resolved [ 5 ]
          Field: Resolution; Original Value: (none); New Value: Fixed [ 1 ]
        Michael McCandless made changes
          Field: Status; Original Value: Resolved [ 5 ]; New Value: Closed [ 6 ]
        Mark Thomas made changes
          Field: Workflow; Original Value: jira [ 12480784 ]; New Value: Default workflow, editable Closed status [ 12562660 ]
        Mark Thomas made changes
          Field: Workflow; Original Value: Default workflow, editable Closed status [ 12562660 ]; New Value: jira [ 12583598 ]

          People

          • Assignee: Michael McCandless
          • Reporter: Michael McCandless
          • Votes: 0
          • Watchers: 0
