[SOLR-5921] WordDelimiterFilterFactory splits up hyphenated terms although splitOnNumerics, generateWordParts and generateNumberParts are set to 0 (false) - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Invalid
Affects Version/s: 4.7
Fix Version/s: 4.7.1
Component/s: Schema and Analysis
Labels: None

Description

WordDelimiterFilterFactory generates word parts although the splitting options are deactivated.

This is the fieldType setup from my schema:

		<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
			<analyzer type="index">
				<tokenizer class="solr.WhitespaceTokenizerFactory" />
				<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" enablePositionIncrements="true" />
				<filter class="solr.WordDelimiterFilterFactory" stemEnglishPossessive="0" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="1"/>
				<filter class="solr.LowerCaseFilterFactory" />
			</analyzer>
			<analyzer type="query">
				<tokenizer class="solr.WhitespaceTokenizerFactory" />
				<filter class="solr.SynonymFilterFactory" synonyms="lang/synonyms_de.txt" ignoreCase="true" expand="true" />
				<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" enablePositionIncrements="true" />
				<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"  preserveOriginal="1"/>
				<filter class="solr.LowerCaseFilterFactory" />
			</analyzer>
		</fieldType>

The given search term is: X-002-99-495

WordDelimiterFilterFactory indexes the following word parts:

X-002-99-495
X (shouldn't be there)
00299495 (shouldn't be there)
X00299495

But the 'X' should not be indexed or queried as a single term. You can see that splitting is completely deactivated in the schema.

I can move the character part around in the search term:

Searching for 002-abc-99-495 gives me:

002-abc-99-495
002 (shouldn't be there)
abc (shouldn't be there)
99495 (shouldn't be there)
002abc99495

Even if the term has the following content, WDF splits it up (F00-22-761):

F00-22-761
F00 (shouldn't be there)
22761 (shouldn't be there)
F0022761

Please have a look at the screenshot.
This is not what I expect from the configuration! I think this must be a bug.
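
A variation that may help narrow this down, offered as an untested sketch: the configuration above keeps catenateWords and catenateNumbers enabled. It is an assumption here, not something confirmed in this report, that those options emit each run of adjacent word parts or number parts as its own token, which would account for X and 00299495. Switching them off while keeping preserveOriginal="1" should show whether only the original token remains:

```xml
<!-- Sketch only: same index-time filter as above, with the catenate options disabled.
     Assumption: the extra tokens come from catenateWords / catenateNumbers. -->
<filter class="solr.WordDelimiterFilterFactory"
        stemEnglishPossessive="0"
        generateWordParts="0"
        generateNumberParts="0"
        catenateWords="0"
        catenateNumbers="0"
        catenateAll="0"
        splitOnCaseChange="0"
        splitOnNumerics="0"
        preserveOriginal="1"/>
```

If the fully joined form (X00299495) is still wanted, catenateAll="1" is the option to try. After reloading the core, the term can be run through the Analysis screen of the Solr Admin UI again to compare the output.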

Attachments

2014-03-27 09_50_33-Solr Admin.png (27/Mar/14 08:50, 64 kB, Malte Hübner)
2014-03-27 10_43_24-Solr Admin.png (27/Mar/14 09:43, 30 kB, Malte Hübner)

People

Assignee: Unassigned
Reporter: Malte Hübner
Votes: 0
Watchers: 3

Dates

Created: 27/Mar/14 08:50
Updated: 02/Apr/14 15:03
Resolved: 27/Mar/14 14:54