Lucene - Core / LUCENE-589

Demo HTML parser doesn't work for international documents

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/examples
    • Labels:
      None

      Description

      JavaCC assumes ASCII, so the parser won't work with, say, Japanese documents. Ideally it would read the charset from the HTML markup, but that can be tricky. For now, assuming Unicode would do the trick:

      Add the following line marked with a + to HTMLParser.jj:

      options {
      STATIC = false;
      OPTIMIZE_TOKEN_MANAGER = true;
      //DEBUG_LOOKAHEAD = true;
      //DEBUG_TOKEN_MANAGER = true;
      + UNICODE_INPUT = true;
      }
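
      The "read the charset from the HTML markup" idea mentioned above could be sketched roughly as follows. This is only an illustration, not part of the demo or of the attached patch: the class name CharsetSniffer and the method sniff are hypothetical, and the approach assumes an ASCII-compatible encoding (it would not detect UTF-16, for example). It scans the first bytes of the document for a charset declaration inside a <meta> tag and falls back to a caller-supplied default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the Lucene demo.
final class CharsetSniffer {
    // Matches charset=... inside a <meta> tag, e.g.
    //   <meta charset="utf-8">
    //   <meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
        Pattern.CASE_INSENSITIVE);

    // head: the first bytes of the document (assumed to cover the <head>).
    // fallback: the charset to use when no usable declaration is found.
    static Charset sniff(byte[] head, Charset fallback) {
        // ISO-8859-1 maps every byte to exactly one char, so the ASCII
        // markup survives decoding even if the real encoding differs.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(text);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: use the fallback.
            }
        }
        return fallback;
    }
}
```

      A caller could then wrap the input stream in an InputStreamReader built with the detected charset before handing it to the parser, which only helps once the grammar itself accepts Unicode input as proposed above.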

        Activity

        Grant Ingersoll added a comment -

        Bulk close for 3.1

        Robert Muir added a comment -

        Committed revision 1031460, 1031462 (3x)

        Robert Muir added a comment -

        Attached is a patch; it also fixes LUCENE-2246.

        Grant Ingersoll added a comment -

        Decrease priority, mark as improvement, since it only affects demo. Also, I'm not sure we need to support other languages as this code should not be used in production anyway.


          People

          • Assignee: Robert Muir
          • Reporter: Curtis d'Entremont
          • Votes: 0
          • Watchers: 1

            Dates

            • Created:
              Updated:
              Resolved:
