  1. Solr
  2. SOLR-5921

WordDelimiterFilterFactory splits up hyphenated terms although splitOnNumerics, generateWordParts and generateNumberParts are set to 0 (false)

Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Invalid
    • Affects Version/s: 4.7
    • Fix Version/s: 4.7.1
    • Component/s: Schema and Analysis
    • Labels: None

    Description

      WordDelimiterFilterFactory generates word parts even though every splitting option in the configuration is disabled.

      This is the fieldType setup from my schema:

      		<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      			<analyzer type="index">
      				<tokenizer class="solr.WhitespaceTokenizerFactory" />
      				<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" enablePositionIncrements="true" />
      				<filter class="solr.WordDelimiterFilterFactory" stemEnglishPossessive="0" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="1"/>
      				<filter class="solr.LowerCaseFilterFactory" />
      			</analyzer>
      			<analyzer type="query">
      				<tokenizer class="solr.WhitespaceTokenizerFactory" />
      				<filter class="solr.SynonymFilterFactory" synonyms="lang/synonyms_de.txt" ignoreCase="true" expand="true" />
      				<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" enablePositionIncrements="true" />
      				<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"  preserveOriginal="1"/>
      				<filter class="solr.LowerCaseFilterFactory" />
      			</analyzer>
      		</fieldType>
      

      The given search term is: X-002-99-495

      WordDelimiterFilterFactory indexes the following word parts:

      • X-002-99-495
      • X (shouldn't be there)
      • 00299495 (shouldn't be there)
      • X00299495

      But the 'X' should not be indexed or queried as a single term. You can see that splitting is completely deactivated in the schema.

      The same happens when I move the character part to a different position in the search term:

      Searching for 002-abc-99-495 gives me

      • 002-abc-99-495
      • 002 (shouldn't be there)
      • abc (shouldn't be there)
      • 99495 (shouldn't be there)
      • 002abc99495

      Even a term like F00-22-761 is split up by WDF:

      • F00-22-761
      • F00 (shouldn't be there)
      • 22761 (shouldn't be there)
      • F0022761

      Please have a look at the screenshot.
      This is not what I expect from the configuration! I think this must be a bug.
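
      For comparison, here is a minimal sketch of the index-time filter with the catenate options switched off as well. It rests on an assumption that I have not verified: that the unexpected tokens come from catenateWords/catenateNumbers (which are set to 1 above) rather than from the generate*/splitOn* flags. If that assumption holds, only the preserved original token should remain in the analysis output.

      ```xml
      <!-- Hypothetical test variant of the index-time filter, not the schema in use.
           Assumption: catenateWords/catenateNumbers cause the extra tokens,
           so both are set to 0 here; all other attributes are unchanged. -->
      <filter class="solr.WordDelimiterFilterFactory"
              stemEnglishPossessive="0"
              generateWordParts="0"
              generateNumberParts="0"
              catenateWords="0"
              catenateNumbers="0"
              catenateAll="0"
              splitOnCaseChange="0"
              splitOnNumerics="0"
              preserveOriginal="1"/>
      ```

      Running X-002-99-495 through the Analysis screen with this variant would show whether the catenate options are responsible for the 'X' and '00299495' tokens.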

      Attachments

        1. 2014-03-27 10_43_24-Solr Admin.png
          30 kB
          Malte Hübner
        2. 2014-03-27 09_50_33-Solr Admin.png
          64 kB
          Malte Hübner

          People

            Assignee: Unassigned
            Reporter: Malte Hübner (mhuebner)
            Votes: 0
            Watchers: 3
