Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.7, 5.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      Patch Available

      Description

      We should offer a standard way to force upper-case tokens. I understand that lowercase is safer for general search quality because some uppercase characters can represent multiple lowercase ones.

      However, having upper-case tokens is often nice for faceting (consider normalizing to standard acronyms).
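
One way to see the kind of ambiguity the description alludes to is Greek sigma: two distinct lowercase letters share a single uppercase form, so upper-casing loses a distinction that lower-casing cannot restore. The sketch below is only an illustration using the JDK's `Character` methods; the class and method names are hypothetical.

```java
final class CaseCollapseDemo {
    // Simple (one-to-one) Unicode case mappings, independent of any locale.
    static int upper(int codePoint) { return Character.toUpperCase(codePoint); }
    static int lower(int codePoint) { return Character.toLowerCase(codePoint); }

    public static void main(String[] args) {
        int sigma = 0x03C3;        // GREEK SMALL LETTER SIGMA
        int finalSigma = 0x03C2;   // GREEK SMALL LETTER FINAL SIGMA
        int capitalSigma = 0x03A3; // GREEK CAPITAL LETTER SIGMA

        // Both lowercase forms map to the same uppercase letter...
        System.out.println(upper(sigma) == capitalSigma);
        System.out.println(upper(finalSigma) == capitalSigma);
        // ...while the uppercase letter maps back to only one of them.
        System.out.println(lower(capitalSigma) == sigma);
    }
}
```

The same collapse happens for Latin `i` and dotless `ı` (U+0131), which both upper-case to `I`.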

        Activity

        ASF subversion and git services added a comment -

        Commit 1556644 from Ryan McKinley in branch 'dev/trunk'
        [ https://svn.apache.org/r1556644 ]

        LUCENE-5369: missing eol:style (merge from 4x)

        ASF subversion and git services added a comment -

        Commit 1556643 from Ryan McKinley in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1556643 ]

        LUCENE-5369: missing eol:style

        Uwe Schindler added a comment -

        Yes Character.toUpperCase is fine and locale invariant.

        Shawn Heisey added a comment -

        Ryan McKinley, this fails precommit because the new files are missing svn:eol-style.

        I actually ran the precommit because I was worried that it would fail the forbidden-apis check. Looks like that only fails on String#toUpperCase if you don't include a locale. Javadocs for Character say that Character#toUpperCase uses Unicode information, so I guess it's OK – and precommit passed just fine after I added svn:eol-style native to the new files.
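
The difference between the two methods can be sketched as follows. This is a minimal illustration, not code from the patch, and the class and method names are hypothetical: `String#toUpperCase` takes the locale into account, so a Turkish locale turns `i` into dotted capital `İ` (U+0130), while `Character#toUpperCase` applies the Unicode character data only and ignores the locale.

```java
import java.util.Locale;

final class LocaleCaseDemo {
    // Locale-sensitive: the result depends on the locale passed in.
    static String upperWithLocale(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    // Locale-invariant: simple per-character Unicode mapping.
    static char upperChar(char c) {
        return Character.toUpperCase(c);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        // Print code points in hex to avoid depending on console encoding.
        System.out.println(Integer.toHexString(upperWithLocale("i", turkish).codePointAt(0)));
        System.out.println(Integer.toHexString(upperWithLocale("i", Locale.ROOT).codePointAt(0)));
        System.out.println(Integer.toHexString(upperChar('i')));
    }
}
```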

        ASF subversion and git services added a comment -

        Commit 1556618 from Ryan McKinley in branch 'dev/trunk'
        [ https://svn.apache.org/r1556618 ]

        LUCENE-5369: Added an UpperCaseFilter to make UPPERCASE tokens (merge from 4x)

        ASF subversion and git services added a comment -

        Commit 1556617 from Ryan McKinley in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1556617 ]

        LUCENE-5369: Added an UpperCaseFilter to make UPPERCASE tokens

        Yonik Seeley added a comment -

        +1, looks fine.

        Ryan McKinley added a comment -

        Unless I hear objections, I would like to commit in the next few weeks.

        thanks
        ryan

        Ryan McKinley added a comment -

        "Maybe add a boolean option in the factory/filter? To remove code duplication?"

        Are you suggesting adding a flag to LowerCaseFilter? I think that is more confusing than having a distinct UpperCaseFilter, and the code duplication is essentially the minimum code required for a functioning Filter.

        "to me the analysis chain is not really the best tool to do the job of cleaning up faceting labels"

        I understand and often agree that other tools are more appropriate. But there are lots of cases where the search analysis chain gets you so close to the desired display that duplicating things to a specific facet field seems redundant.

        This is the analyzer I am working with:

        <analyzer>
          <charFilter class="solr.MappingCharFilterFactory" mapping="normalize-my-field-chars.txt"/>
          <tokenizer class="solr.KeywordTokenizerFactory"/>
          <filter class="solr.TrimFilterFactory"/>
          <filter class="solr.ASCIIFoldingFilterFactory"/>
          <filter class="xxx.UpperCaseFilterFactory"/>
          <filter class="solr.SynonymFilterFactory" synonyms="path/to/synonyms.txt" ignoreCase="false" expand="false"/>
        </analyzer>
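
As a rough picture of what this chain does to a single field value, here is a plain-Java sketch with no Lucene or Solr dependency. Everything in it is an assumption made for illustration: the class name, the synonym entries, and the folding step, which uses NFD decomposition plus removal of combining marks as a crude stand-in for ASCIIFoldingFilter. The MappingCharFilter step is omitted because its mapping file is not shown.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

final class FacetLabelSketch {
    // Hypothetical synonym table. Keys are already upper-case because the
    // lookup runs after upper-casing and is case-sensitive (ignoreCase="false").
    private static final Map<String, String> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("U.S.A.", "USA");
        SYNONYMS.put("UNITED STATES", "USA");
    }

    // Treats the whole value as one token, as KeywordTokenizer would.
    static String normalize(String raw) {
        // Trim surrounding whitespace.
        String value = raw.trim();
        // Approximate ASCII folding: decompose, then drop combining marks.
        value = Normalizer.normalize(value, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
        // Upper-case one code point at a time, independent of locale.
        StringBuilder upper = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); ) {
            int cp = value.codePointAt(i);
            upper.appendCodePoint(Character.toUpperCase(cp));
            i += Character.charCount(cp);
        }
        // Replace the token with its canonical form if one is defined.
        String key = upper.toString();
        return SYNONYMS.getOrDefault(key, key);
    }

    public static void main(String[] args) {
        System.out.println(normalize("  United States "));
        System.out.println(normalize("u.s.a."));
    }
}
```

Placing the upper-casing step before the synonym lookup means the synonym file only needs one casing of each variant.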
        
        Robert Muir added a comment -

        My only thoughts are the usual ones: to me the analysis chain is not really the best tool to do the job of cleaning up faceting labels?

        These tasks typically don't require tokenization and work on whole values, and may require stuff like extracting values from one field into another. While it's true you can do some of this cleanup (casing, trimming, etc.) in the analysis chain by (ab)using the fact that fieldcache uninverts indexed values and using keywordtokenizer and using filters like this, it's not very intuitive, and you can't do all of it, whereas using something like Solr's updateprocessor chain might be a better place to have this support. There is already overlap, e.g. it can trim field contents as well.

        Uwe Schindler added a comment -

        Maybe add a boolean option in the factory/filter? To remove code duplication?

        Ryan McKinley added a comment -

        Uwe Schindler or Robert Muir, any thoughts on this?

        thanks
        ryan

        Ryan McKinley added a comment -

        Here is a patch that adds UpperCaseFilter

        There are a few others out there:
        http://svn.apache.org/repos/asf/uima/addons/trunk/Lucas/src/main/java/org/apache/uima/lucas/indexer/analysis/UpperCaseFilter.java

        https://github.ugent.be/Universiteitsbibliotheek/lludss-solr-java/blob/master/src/main/java/lludss/solr/analysis/UpperCaseFilter.java

        --------

        Given that we would want to steer people to LowerCase, perhaps this should be in a different package

        I'll wait for +1 from someone who knows more about this than me
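
For readers without the patch at hand, the per-token transformation can be sketched in plain Java, leaving out the Lucene TokenFilter plumbing. This is an illustration under stated assumptions, not the patch itself: the class and method names are hypothetical, and the sketch upper-cases by code point with `Character.toUpperCase(int)` so that characters outside the Basic Multilingual Plane are handled too.

```java
final class UpperCaseSketch {
    // Returns the token with every code point replaced by its simple
    // (one-to-one, locale-invariant) Unicode uppercase mapping.
    static String upperCaseToken(CharSequence token) {
        StringBuilder out = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); ) {
            int cp = Character.codePointAt(token, i);
            out.appendCodePoint(Character.toUpperCase(cp));
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(upperCaseToken("nasa"));
    }
}
```

Because the mapping is one-to-one, a character such as `ß` is left unchanged here, whereas `String#toUpperCase` would expand it to `SS`.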


          People

          • Assignee:
            Ryan McKinley
            Reporter:
            Ryan McKinley
          • Votes:
            0
            Watchers:
            5
