Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
6.6
-
None
-
None
-
Configuration of the analyzer:
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.HyphenationCompoundWordTokenFilterFactory"
hyphenator="lang/hyph_de_DR.xml" encoding="iso-8859-1"
dictionary="lang/wordlist_de.txt"
onlyLongestMatch="true"/>
-
New
Description
The HyphenationCompoundWordTokenFilter creates overlapping tokens even if onlyLongestMatch is enabled.
Example:
Dictionary: gesellschaft, schaft
Hyphenator: de_DR.xml //from Apache OFFO
onlyLongestMatch: true
text | gesellschaft | gesellschaft | schaft |
raw_bytes | [67 65 73 65 6c 6c 73 63 68 61 66 74] | [67 65 73 65 6c 6c 73 63 68 61 66 74] | [73 63 68 61 66 74] |
start | 0 | 0 | 0 |
end | 12 | 12 | 12 |
positionLength | 1 | 1 | 1 |
type | word | word | word |
position | 1 | 1 | 1 |
IMHO this output includes 2 unexpected tokens:
- the 2nd 'gesellschaft' as it duplicates the original token
- the 'schaft' token, as it is a sub-token of 'gesellschaft', which is itself present in the dictionary
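The expectation described above can be written down as a small, self-contained sketch. This is not Lucene's implementation: it is a simplified model that assumes every substring of the token is a candidate (the real filter only considers substrings between hyphenation points), and the helper names `dictionary_matches` and `longest_only` are hypothetical. It only illustrates which tokens I would expect when onlyLongestMatch is true.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets within the token


def dictionary_matches(token: str, dictionary: Set[str]) -> List[Span]:
    """Return the span of every substring of `token` that is a dictionary entry.

    Simplification (assumption): all substrings are candidates, whereas the
    real filter only looks at substrings bounded by hyphenation points.
    """
    n = len(token)
    return [
        (start, end)
        for start in range(n)
        for end in range(start + 1, n + 1)
        if token[start:end] in dictionary
    ]


def longest_only(token: str, dictionary: Set[str]) -> List[str]:
    """Return the original token followed by the sub-tokens expected
    when only the longest dictionary matches are kept."""
    spans = dictionary_matches(token, dictionary)
    kept: List[str] = []
    for start, end in spans:
        # A match covering the whole token would only duplicate the original.
        if (start, end) == (0, len(token)):
            continue
        # Drop any match that lies inside a different, longer match.
        inside_longer = any(
            other != (start, end) and other[0] <= start and end <= other[1]
            for other in spans
        )
        if not inside_longer:
            kept.append(token[start:end])
    return [token] + kept
```

With the dictionary from this report, `longest_only("gesellschaft", {"gesellschaft", "schaft"})` yields only the original token, because 'schaft' lies inside the longer match 'gesellschaft' and the full-length match would be a duplicate.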
Attachments
Issue Links
- is related to: LUCENE-3022 DictionaryCompoundWordTokenFilter Flag onlyLongestMatch has no affect (Open)