Lucene - Core / LUCENE-2732

Fix charset problems in XML loading in HyphenationCompoundWordTokenFilter (also Solr's loader from schema)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9.4, 3.0.3, 3.1, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      As noted in LUCENE-2731, the XML handling in HyphenationCompoundWordTokenFilter is broken and violates the XML 1.0 (5th edition) spec. You should never supply a Reader to an XML API unless the character data is internal (e.g. created programmatically), because a Reader fixes the character decoding before the parser can detect the encoding from the XML declaration. You should also supply a system ID, because otherwise resolving external entities does not work. The file-based loader is even more broken: it always opens the file as a Reader and then passes that to InputSource. Instead it should pass the file location directly to InputSource as the system ID.

      This issue will fix it in trunk and use InputSource in Solr. Previous versions will still offer the Reader option, but deprecated.
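To illustrate the point about system IDs, here is a minimal sketch that hands the parser a file location instead of a Reader, so the parser reads the raw bytes and picks up the charset from the XML declaration. It is not the code from the patch: the class name `SystemIdParsing`, the `parse` helper and the sample document are assumptions made up for this example, and it only uses the standard JAXP API.

```java
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of Lucene or of the attached patch.
public class SystemIdParsing {

  /**
   * Parses a file by system ID only. No Reader is involved, so the XML
   * parser opens the bytes itself and honours the declared encoding.
   */
  static Document parse(File file) throws Exception {
    InputSource source = new InputSource(file.toURI().toASCIIString());
    return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(source);
  }

  public static void main(String[] args) throws Exception {
    // Write a small ISO-8859-1 document containing a non-ASCII character.
    File f = File.createTempFile("hyph", ".xml");
    f.deleteOnExit();
    String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><patterns>\u00e4</patterns>";
    Files.write(f.toPath(), xml.getBytes(StandardCharsets.ISO_8859_1));

    // The character survives because the parser decoded the bytes itself.
    Document doc = parse(f);
    System.out.println(doc.getDocumentElement().getTextContent().equals("\u00e4"));
  }
}
```

Opening the same file through a Reader with a charset that differs from the one declared in the file would decode the bytes before the parser ever sees the declaration, which is the failure mode described above.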

      1. LUCENE-2732.patch
        19 kB
        Uwe Schindler
      2. LUCENE-2732.patch
        22 kB
        Uwe Schindler

          Activity

          Uwe Schindler added a comment -

          Patch that fixes XML parsing. To apply it, first move hyphenation.dtd to the src/resource/ folder.

          This patch also removes the hardcoded DTD from the parser and moves it to the resources folder (loaded by classloader). Solr is fixed to use the InputSource API, but it should really use a URL, so ResourceLoader in Solr should be fixed to also supply URLs, like ClassLoader!
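As a rough illustration of loading the DTD through the classloader rather than hardcoding it in the parser, the sketch below resolves any system ID ending in hyphenation.dtd to a classpath URL. This is not the code from the patch: `ClasspathDtdDemo`, `classpathDtdResolver` and the tiny DTD written to a temporary directory are assumptions invented for the example, which uses only the standard `EntityResolver` and `ClassLoader.getResource` APIs.

```java
import java.io.StringReader;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of Lucene or of the attached patch.
public class ClasspathDtdDemo {

  /** Maps system IDs ending in "hyphenation.dtd" to a classpath resource URL. */
  static EntityResolver classpathDtdResolver(ClassLoader loader) {
    return (publicId, systemId) -> {
      if (systemId == null || !systemId.endsWith("hyphenation.dtd")) {
        return null; // not ours: let the parser use its default resolution
      }
      URL url = loader.getResource("hyphenation.dtd");
      // Hand back a URL-based system ID so the parser opens the bytes itself.
      return url == null ? null : new InputSource(url.toExternalForm());
    };
  }

  public static void main(String[] args) throws Exception {
    // Stand-in for the resources folder: a temp dir holding a tiny DTD.
    Path dir = Files.createTempDirectory("resources");
    String dtd = "<!ELEMENT patterns (#PCDATA)>\n<!ENTITY sample \"a1b\">\n";
    Files.write(dir.resolve("hyphenation.dtd"), dtd.getBytes(StandardCharsets.UTF_8));

    try (URLClassLoader loader = new URLClassLoader(new URL[] {dir.toUri().toURL()})) {
      DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
      builder.setEntityResolver(classpathDtdResolver(loader));

      // Internal character data, so a Reader is acceptable here.
      String xml = "<!DOCTYPE patterns SYSTEM \"hyphenation.dtd\"><patterns>&sample;</patterns>";
      Document doc = builder.parse(new InputSource(new StringReader(xml)));
      System.out.println(doc.getDocumentElement().getTextContent());
    }
  }
}
```

Returning a URL-based InputSource, rather than an already opened stream, keeps the system ID intact for anything the DTD itself might reference, which is presumably why a URL-supplying ResourceLoader is suggested for Solr.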

          Uwe Schindler added a comment -

          Updated patch, uses Locale.ENGLISH as noted by Robert. It also leaves the DTD in place for the Solr and Lucene tests. The DTD is never parsed there; it only matters if you view the test XML in your favourite XML reader.

          Uwe Schindler added a comment -

          Committed trunk revision: 1029345

          Now backporting...

          Uwe Schindler added a comment -

          Committed branch 3.x revision: 1029350

          Backporting bugfix only to 3.0/2.9!

          Uwe Schindler added a comment -

          Committed 3.0 revision: 1029374
          Committed 2.9 revision: 1029375


            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Uwe Schindler
            • Votes:
              0
              Watchers:
              0
