[LUCENE-3022] DictionaryCompoundWordTokenFilter Flag onlyLongestMatch has no effect - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 2.9.4, 3.1
Fix Version/s: 4.9, 6.0
Component/s: modules/analysis
Labels:
- dead

Lucene Fields:

New

Description

When using the DictionaryCompoundWordTokenFilter with a German dictionary, I saw strange behaviour:
The German word "streifenbluse" (blouse with stripes) was decompounded into "streifen" (stripe) and "reifen" (tire), which makes no sense at all.
I expected the onlyLongestMatch flag to fix this, because "streifen" is longer than "reifen", but it had no effect.
So I reviewed the source code and found the problem:
[code]
protected void decomposeInternal(final Token token) {
  // Only words longer than minWordSize get processed
  if (token.length() < this.minWordSize) {
    return;
  }

  char[] lowerCaseTermBuffer = makeLowerCaseCopy(token.buffer());

  for (int i = 0; i < token.length() - this.minSubwordSize; ++i) {
    Token longestMatchToken = null;
    for (int j = this.minSubwordSize - 1; j < this.maxSubwordSize; ++j) {
      if (i + j > token.length()) {
        break;
      }
      if (dictionary.contains(lowerCaseTermBuffer, i, j)) {
        if (this.onlyLongestMatch) {
          if (longestMatchToken != null) {
            if (longestMatchToken.length() < j) {
              longestMatchToken = createToken(i, j, token);
            }
          } else {
            longestMatchToken = createToken(i, j, token);
          }
        } else {
          tokens.add(createToken(i, j, token));
        }
      }
    }
    if (this.onlyLongestMatch && longestMatchToken != null) {
      tokens.add(longestMatchToken);
    }
  }
}
[/code]

should be changed to

[code]
protected void decomposeInternal(final Token token) {
  // Only words longer than minWordSize get processed
  if (token.termLength() < this.minWordSize) {
    return;
  }

  char[] lowerCaseTermBuffer = makeLowerCaseCopy(token.termBuffer());

  Token longestMatchToken = null;
  for (int i = 0; i < token.termLength() - this.minSubwordSize; ++i) {
    for (int j = this.minSubwordSize - 1; j < this.maxSubwordSize; ++j) {
      if (i + j > token.termLength()) {
        break;
      }
      if (dictionary.contains(lowerCaseTermBuffer, i, j)) {
        if (this.onlyLongestMatch) {
          if (longestMatchToken != null) {
            if (longestMatchToken.termLength() < j) {
              longestMatchToken = createToken(i, j, token);
            }
          } else {
            longestMatchToken = createToken(i, j, token);
          }
        } else {
          tokens.add(createToken(i, j, token));
        }
      }
    }
  }
  if (this.onlyLongestMatch && longestMatchToken != null) {
    tokens.add(longestMatchToken);
  }
}
[/code]

With this change, only the longest token is actually indexed, and the onlyLongestMatch flag does what its name suggests.
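
The difference between tracking the longest match per start offset and tracking it across the whole token can be illustrated outside Lucene. The following is a minimal, self-contained sketch and not the filter's actual code: the class LongestMatchSketch, its method names, the three-word dictionary and the length bounds 3 and 15 are assumptions made for illustration, and the loop bounds are simplified compared to decomposeInternal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the decompounding loop; not part of Lucene.
final class LongestMatchSketch {

    // Longest match is tracked separately for every start offset,
    // so overlapping subwords such as "streifen" and "reifen" both survive.
    static List<String> longestPerStart(String word, Set<String> dict, int minLen, int maxLen) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start + minLen <= word.length(); start++) {
            String longest = null; // reset for each start offset
            for (int len = minLen; len <= maxLen && start + len <= word.length(); len++) {
                String candidate = word.substring(start, start + len);
                if (dict.contains(candidate)) {
                    longest = candidate; // len only grows, so the last hit is the longest
                }
            }
            if (longest != null) {
                out.add(longest);
            }
        }
        return out;
    }

    // Longest match is tracked across all start offsets,
    // so at most one subword is emitted for the whole word.
    static List<String> longestOverall(String word, Set<String> dict, int minLen, int maxLen) {
        String longest = null; // kept across all start offsets
        for (int start = 0; start + minLen <= word.length(); start++) {
            for (int len = minLen; len <= maxLen && start + len <= word.length(); len++) {
                String candidate = word.substring(start, start + len);
                if (dict.contains(candidate)
                        && (longest == null || candidate.length() > longest.length())) {
                    longest = candidate;
                }
            }
        }
        return longest == null ? List.of() : List.of(longest);
    }

    public static void main(String[] args) {
        // Assumed dictionary; the report only names "streifen" and "reifen".
        Set<String> dict = Set.of("streifen", "reifen", "bluse");
        System.out.println(longestPerStart("streifenbluse", dict, 3, 15)); // [streifen, reifen, bluse]
        System.out.println(longestOverall("streifenbluse", dict, 3, 15));  // [streifen]
    }
}
```

In this sketch the global variant emits exactly one subword per input token, which is also what a single longestMatchToken kept across all start offsets implies.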

Attachments

- LUCENE-3022.patch (6 kB), 14/Apr/11 09:36, Johann Höchtl
- LUCENE-3022.patch (6 kB), 14/Apr/11 16:05, Robert Muir

Issue Links

relates to

LUCENE-8183 HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled

Open

People

Assignee: Robert Muir

Reporter: Johann Höchtl

Votes: 1

Watchers: 0

Dates

Created: 12/Apr/11 09:15

Updated: 28/Aug/22 12:44

Time Tracking

Estimated:

Remaining:

Logged:

Not Specified