Lucene - Core / LUCENE-589

Demo HTML parser doesn't work for international documents

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/examples
    • Labels:
      None

      Description

      JavaCC assumes ASCII input by default, so the generated parser won't work with, say, Japanese documents. Ideally it would read the charset from the HTML markup, but that can be tricky. For now, assuming Unicode would do the trick:

      Add the following line marked with a + to HTMLParser.jj:

      options {
      STATIC = false;
      OPTIMIZE_TOKEN_MANAGER = true;
      //DEBUG_LOOKAHEAD = true;
      //DEBUG_TOKEN_MANAGER = true;
      + UNICODE_INPUT = true;
      }
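
      The "read the charset from the HTML markup" idea mentioned above could look roughly like the sketch below. This is only an illustration, not code from the Lucene demo: the class CharsetSniffer, its methods, the 4096-byte limit, and the UTF-8 fallback are all assumptions made for the example. It ignores byte-order marks and HTTP headers, and it cannot detect UTF-16 documents, which is part of why this approach is tricky.

      ```java
      import java.io.ByteArrayInputStream;
      import java.io.InputStreamReader;
      import java.io.Reader;
      import java.nio.charset.Charset;
      import java.nio.charset.StandardCharsets;
      import java.util.regex.Matcher;
      import java.util.regex.Pattern;

      /** Hypothetical helper (not part of the Lucene demo): guesses a charset from a meta tag. */
      final class CharsetSniffer {
          // Matches both <meta charset="..."> and
          // <meta http-equiv="Content-Type" content="text/html; charset=...">.
          private static final Pattern CHARSET = Pattern.compile(
              "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
              Pattern.CASE_INSENSITIVE);

          // Assumption: only the first few KB are inspected for a declaration.
          private static final int SNIFF_LIMIT = 4096;

          /** Returns the declared charset, or UTF-8 if none is declared or the name is unusable. */
          static Charset sniff(byte[] html) {
              int n = Math.min(html.length, SNIFF_LIMIT);
              // ISO-8859-1 maps every byte to exactly one char, so this decode cannot fail.
              String head = new String(html, 0, n, StandardCharsets.ISO_8859_1);
              Matcher m = CHARSET.matcher(head);
              if (m.find()) {
                  try {
                      return Charset.forName(m.group(1));
                  } catch (IllegalArgumentException e) {
                      // Illegal or unsupported charset name: fall through to the default.
                  }
              }
              return StandardCharsets.UTF_8;
          }

          /** Opens a Reader over the bytes using the sniffed charset. */
          static Reader open(byte[] html) {
              return new InputStreamReader(new ByteArrayInputStream(html), sniff(html));
          }
      }
      ```

      Assuming the generated parser exposes a Reader-based constructor, the Reader returned by open could be passed to it, so that the parser sees decoded characters rather than raw bytes.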


          People

          • Assignee: Robert Muir
          • Reporter: Curtis d'Entremont
          • Votes: 0
          • Watchers: 1
