Lucene - Core / LUCENE-589

Demo HTML parser doesn't work for international documents

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/examples
    • Labels: None

    Description

      JavaCC assumes ASCII input, so the demo HTML parser does not work with, say, Japanese documents. Ideally it would read the charset from the HTML markup, but that can be tricky. For now, assuming Unicode input would do the trick:

      Add the following line marked with a + to HTMLParser.jj:

      options {
      STATIC = false;
      OPTIMIZE_TOKEN_MANAGER = true;
      //DEBUG_LOOKAHEAD = true;
      //DEBUG_TOKEN_MANAGER = true;
      + UNICODE_INPUT = true;
      }
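The "read the charset from the HTML markup" idea mentioned above could look roughly like the following. This is only a sketch and not part of the attached patch: the class name `CharsetSniffer`, the method `sniff`, and the UTF-8 fallback are all assumptions made for illustration. It scans the first bytes of a document for a `charset=` declaration inside a `<meta>` tag and ignores byte-order marks, HTTP headers, and other signals a real detector would consider.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or of the LUCENE-589 patch.
public class CharsetSniffer {
    // Matches charset=NAME inside a <meta ...> tag. Covers both
    // <meta charset="..."> and
    // <meta http-equiv="Content-Type" content="text/html; charset=...">.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
        Pattern.CASE_INSENSITIVE);

    // Returns the charset declared in the given leading bytes of an HTML
    // document, or UTF-8 (an assumed default) if none is declared or the
    // declared name is not supported by this JVM.
    public static Charset sniff(byte[] head) {
        // ISO-8859-1 maps every byte to exactly one char, so this decode
        // cannot fail and leaves the ASCII markup readable.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(text);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: use the default.
            }
        }
        return StandardCharsets.UTF_8;
    }
}
```

The detected charset would then be used to wrap the document's input stream in a `java.io.InputStreamReader`, so that the parser receives decoded characters rather than raw bytes. Whether the demo parser accepts a `Reader` directly depends on how HTMLParser.jj is configured, so treat that step as an assumption as well.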

      Attachments

        1. LUCENE-589.patch
          23 kB
          Robert Muir

          People

            Assignee: Robert Muir (rcmuir)
            Reporter: Curtis d'Entremont (curtispd)