Solr / SOLR-257

Add ability for WordDelimiterFilter to ignore case changes

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 1.2
    • Fix Version/s: 1.3
    • Component/s: update
    • Labels:
      None

      Description

      This patch adds an ignoreCaseChange option to WordDelimiterFilter. I have used it myself, and it may be more generally useful.

        Activity

        klaasm Mike Klaas added a comment -

        Committed in r545597 with inverted logic and Yonik's name suggestion.

        klaasm Mike Klaas added a comment -

        Might be a good idea. Case-based splitting is a relatively aggressive default (then again, I'd say the same about stemming).

        I'll leave it out of the patch, and we can always change it later.

        yseeley@gmail.com Yonik Seeley added a comment -

        Do people think the example "text" fieldType should default to splitOnCaseChange="false"?
        Many people use these fieldType definitions unchanged, until they run into a problem.

        skeptikos J.J. Larrea added a comment -

        Just last week I considered adding a parameter to WDF to suppress splitting on case change; I lazed out by simply throwing a LowerCaseFilter in front of the WDF in my Analyzer chain, but at the time I was thinking that it would get me into trouble if I ever wanted to run the output of WDF into a case-dependent stopword or synonym table. So this is useful and should be committed... though I agree the parameter should be "positive" and I like Yonik's naming suggestion.

        klaasm Mike Klaas added a comment -

        The difference from generateWordParts (gWP) is as follows: gWP looks at adjacent subwords and splits them when both are alphabetic, regardless of how they are delimited. So if gWP=0, then PowerShot=power-shot=Power-Shot=powershot. If gWP=1 and ignoreCaseChange=1, then PowerShot=powershot, but Power-Shot=power-shot=power shot.

        For us, case changes were too "weak" a delimiter, and high-IDF subwords were inappropriately driving up relevancy on certain docs.
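
        To make the comparison above concrete, here is a toy Python sketch of the idea. It is not Solr's implementation: the function name `word_parts`, the `split_on_case_change` flag, and the final lowercasing step (standing in for a later LowerCaseFilter) are all assumptions made for illustration.

```python
import re

def word_parts(token, split_on_case_change=True):
    """Toy model of word-part generation (hypothetical, not Solr code).

    Always splits on non-alphanumeric delimiters; splits on a
    lowercase-to-uppercase transition only when the flag is set.
    """
    parts = []
    for chunk in re.split(r"[^A-Za-z0-9]+", token):
        if not chunk:
            continue
        if split_on_case_change:
            # Break wherever a lowercase letter is followed by an uppercase one.
            pieces = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", chunk).split()
        else:
            pieces = [chunk]
        # Assumption: lowercase afterwards so variants can be compared.
        parts.extend(p.lower() for p in pieces)
    return parts

for text in ("PowerShot", "Power-Shot", "power shot"):
    print(text, word_parts(text, True), word_parts(text, False))
```

        With the flag off, only PowerShot stays a single token in this sketch; explicitly delimited forms such as Power-Shot are still split, which is the distinction drawn above.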

        yseeley@gmail.com Yonik Seeley added a comment -

        I like splitOnCaseChange=false (default would be true)

        hossman Hoss Man added a comment -

        For reference, can you clarify how this differs from generateWordParts?

        I'm also wondering if the option shouldn't be inverted (i.e. delimitOnCaseChange) and defaulted to true. All of the existing options are "positive" in nature: they cause the filter to "do more" when true. The semantics of this option would be to "do less" when it's true, which may be a bit confusing for people.


          People

          • Assignee:
            klaasm Mike Klaas
            Reporter:
            klaasm Mike Klaas
