SOLR-257

Add ability for WordDelimiterFilter to ignore case changes

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 1.2
    • Fix Version/s: 1.3
    • Component/s: update
    • Labels:
      None

      Description

      This patch adds an ignoreCaseChange option to WordDelimiterFilter. I have been using it, and it may be more generally useful.

        Activity

        Hoss Man added a comment -

        For reference, can you clarify how this differs from generateWordParts?

        I'm also wondering if the option shouldn't be inverted (i.e., delimitOnCaseChange) and defaulted to true. All of the existing options are "positive" in nature: they cause the filter to "do more" when true. The semantics of this option would be to "do less" when it's true, which may be a bit confusing for people.

        Yonik Seeley added a comment -

        I like splitOnCaseChange=false (default would be true)

        Mike Klaas added a comment -

        The difference from generateWordParts (gWP) is as follows: gWP splits adjacent subwords when both are alphabetic, regardless of how they are delimited. So if gWP=0, then PowerShot=power-shot=Power-Shot=powershot. If gWP=1 and ignoreCaseChange=1, then PowerShot=powershot, but Power-Shot=power-shot=power shot.

        For us, case changes were too "weak" a delimiter, and high idf subwords were inappropriately driving up relevancy on certain docs.
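
        The distinction described above can be sketched with a toy model. This is not Solr's WordDelimiterFilter code: the function name `split_subwords` and its `split_on_case_change` parameter are hypothetical, and the model only assumes that explicit delimiters always split while lower-to-upper case transitions split only when enabled. Output is lowercased purely to make the comparisons easy to read.

```python
import re

def split_subwords(token, split_on_case_change=True):
    """Toy model of subword splitting (hypothetical, not the Solr implementation)."""
    # Explicit delimiters (anything non-alphanumeric) always split.
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", token) if p]
    if split_on_case_change:
        # Additionally split at each lower-to-upper transition, e.g. "PowerShot".
        refined = []
        for p in parts:
            refined.extend(re.sub(r"(?<=[a-z])(?=[A-Z])", " ", p).split())
        parts = refined
    # Lowercase only so that the results below are easy to compare.
    return [p.lower() for p in parts]

# With case-change splitting disabled, only explicit delimiters split.
no_case_split = split_subwords("PowerShot", split_on_case_change=False)
hyphen_split = split_subwords("Power-Shot", split_on_case_change=False)
# With case-change splitting enabled, "PowerShot" splits as well.
case_split = split_subwords("PowerShot", split_on_case_change=True)
```

        In this model, disabling case-change splitting keeps "PowerShot" as a single subword while "Power-Shot", "power-shot" and "power shot" still split into two, which matches the equivalences listed in the comment.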

        J.J. Larrea added a comment -

        Just last week I considered adding a parameter to WDF to suppress splitting on case change; I lazed out by simply throwing a LowerCaseFilter in front of the WDF in my Analyzer chain, but at the time I was thinking that it would get me into trouble if I ever wanted to run the output of WDF into a case-dependent stopword or synonym table. So this is useful and should be committed... though I agree the parameter should be "positive" and I like Yonik's naming suggestion.
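
        The concern with the lowercase-first workaround can be illustrated with a small sketch. Everything here is hypothetical: `split_on_delimiters` stands in for a delimiter-only splitter with case-change splitting turned off, and `SYNONYMS` is a made-up case-sensitive synonym table. It is not Solr's analyzer chain.

```python
import re

def split_on_delimiters(token):
    # Hypothetical stand-in for a splitter with case-change splitting disabled:
    # split only on non-alphanumeric characters and keep the original case.
    return [p for p in re.split(r"[^A-Za-z0-9]+", token) if p]

# Made-up case-sensitive synonym table, for illustration only.
SYNONYMS = {"PowerShot": "Canon PowerShot"}

def apply_synonyms(tokens):
    # Replace a token only when it matches a table key exactly (case-sensitive).
    return [SYNONYMS.get(t, t) for t in tokens]

# Workaround: lowercase before splitting. Case splitting is avoided,
# but the case-sensitive lookup downstream no longer matches.
lowercased_first = apply_synonyms(split_on_delimiters("PowerShot".lower()))

# Dedicated option: no case splitting, original case preserved for later stages.
case_preserved = apply_synonyms(split_on_delimiters("PowerShot"))
```

        Under these assumptions, the lowercased token passes through the synonym stage unchanged, while the case-preserving variant still matches the table entry.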

        Yonik Seeley added a comment -

        Do people think the example "text" fieldType should default to splitOnCaseChange="false"?
        Many people use these fieldType definitions unchanged, until they run into a problem.

        Mike Klaas added a comment -

        Might be a good idea. Case-based splitting is a relatively aggressive default (then again, I'd say the same about stemming).

        I'll leave it out of the patch, and we can always change it later.

        Mike Klaas added a comment -

        Committed in r545597 with inverted logic and Yonik's name suggestion.


          People

          • Assignee: Mike Klaas
          • Reporter: Mike Klaas