[LUCENE-589] Demo HTML parser doesn't work for international documents - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.0.0
Fix Version/s: 3.1, 4.0-ALPHA
Component/s: modules/examples
Labels: None

Description

JavaCC assumes ASCII input, so the parser won't work with, say, Japanese documents. Ideally it would read the charset from the HTML markup, but that can be tricky. For now, assuming Unicode input would do the trick.

Add the following line marked with a + to HTMLParser.jj:

options {
  STATIC = false;
  OPTIMIZE_TOKEN_MANAGER = true;
  //DEBUG_LOOKAHEAD = true;
  //DEBUG_TOKEN_MANAGER = true;
+ UNICODE_INPUT = true;
}
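
For the "read the charset from the HTML markup" idea mentioned above, a minimal sketch might look like the following. It is an illustration only, not the change made in the attached patch: the class name CharsetSniffer and its methods are hypothetical, the regex handles only simple <meta ... charset=...> declarations in ASCII-compatible encodings, and falling back to UTF-8 is an assumption. The resulting Reader could then be handed to a parser constructor that accepts a java.io.Reader, which JavaCC-generated parsers typically provide.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of the Lucene demo or of LUCENE-589.patch.
class CharsetSniffer {
    // Deliberately simplistic: finds charset=NAME inside a <meta> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
        Pattern.CASE_INSENSITIVE);

    // Returns the declared charset, or UTF-8 (an assumed default) if none is usable.
    static Charset sniff(byte[] html) {
        // ISO-8859-1 maps every byte to exactly one char, so the ASCII markup
        // at the top of the file stays readable regardless of the real encoding.
        int n = Math.min(html.length, 1024);
        String head = new String(html, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: fall through to the default.
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Wraps the bytes in a Reader that decodes with the sniffed charset.
    static Reader open(byte[] html) {
        return new InputStreamReader(new ByteArrayInputStream(html), sniff(html));
    }

    // Convenience for small documents: decode everything to a String.
    static String decode(byte[] html) {
        return new String(html, sniff(html));
    }
}
```

This covers only the easy cases. Documents encoded in UTF-16, documents whose declaration sits beyond the first 1024 bytes, or documents that declare the wrong charset would need more care, which is presumably why the description calls this approach tricky.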

Attachments

LUCENE-589.patch
05/Nov/10 07:31
23 kB
Robert Muir

People

Assignee: Robert Muir

Reporter: Curtis d'Entremont

Votes: 0

Watchers: 1

Dates

Created: 07/Jun/06 22:25

Updated: 28/Aug/22 11:28

Resolved: 05/Nov/10 07:45