Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3
    • Component/s: None
    • Labels:
      None

      Description

      For tokens that are used in faceting, it is nice to have standard capitalization.

      I want "Aerial views" and "Aerial Views" to both be: "Aerial Views"

      1. SOLR-248-CapitalizationFilter.patch
        12 kB
        Ryan McKinley
      2. SOLR-248-CapitalizationFilter.patch
        12 kB
        Ryan McKinley
      3. SOLR-248-CapitalizationFilter.patch
        12 kB
        Ryan McKinley

        Activity

        Ryan McKinley added a comment -

        Implementation and test...

        <filter class="solr.CapitalizationFilterFactory" onlyFirstWord="false" keep="and or the is my or de" maxTokenLength="40" maxWordCount="4" okPrefix="McK" forceFirstLetter="true" />

        onlyFirstWord="false" – this capitalizes every word

        keep="and or the is my or de" – don't change capitalization for these words

        forceFirstLetter="true" – capitalize the first letter of the Token (not word) even if it is in the "keep" list

        maxTokenLength="40" – if the token is longer than 40 chars, don't even try to capitalize it

        maxWordCount="4" – if there are more than 4 words, don't try capitalizing
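The options described above can be sketched in plain Java. This is a hypothetical standalone class, not the code from the attached patch: the class name, the exact-match lookup in the keep set, and the choice to lowercase later words when onlyFirstWord is true are all assumptions made for illustration.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of the described options; not the patch's implementation. */
final class CapitalizationSketch {
    private final boolean onlyFirstWord;
    private final Set<String> keep;
    private final boolean forceFirstLetter;
    private final int maxTokenLength;
    private final int maxWordCount;

    CapitalizationSketch(boolean onlyFirstWord, Set<String> keep, boolean forceFirstLetter,
                         int maxTokenLength, int maxWordCount) {
        this.onlyFirstWord = onlyFirstWord;
        this.keep = keep;
        this.forceFirstLetter = forceFirstLetter;
        this.maxTokenLength = maxTokenLength;
        this.maxWordCount = maxWordCount;
    }

    /** Capitalizes one token, which may contain several space-separated words. */
    String process(String token) {
        if (token.length() > maxTokenLength) {
            return token; // too long: leave untouched
        }
        String[] words = token.split(" ", -1);
        if (words.length > maxWordCount) {
            return token; // too many words: leave untouched
        }
        StringBuilder out = new StringBuilder(token.length());
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(processWord(words[i], i));
        }
        // forceFirstLetter applies to the first character of the token, not of each word.
        if (forceFirstLetter && out.length() > 0) {
            out.setCharAt(0, Character.toUpperCase(out.charAt(0)));
        }
        return out.toString();
    }

    private String processWord(String word, int index) {
        // Assumption: keep words are matched exactly as written.
        if (word.isEmpty() || keep.contains(word)) {
            return word;
        }
        String lower = word.toLowerCase(Locale.ROOT);
        // Assumption: with onlyFirstWord, later words are simply lowercased.
        if (onlyFirstWord && index > 0) {
            return lower;
        }
        return Character.toUpperCase(lower.charAt(0)) + lower.substring(1);
    }

    public static void main(String[] args) {
        CapitalizationSketch f =
                new CapitalizationSketch(false, Set.of("of", "the"), true, 40, 10);
        System.out.println(f.process("Aerial views"));
        System.out.println(f.process("the arts"));
    }
}
```

The demo uses a keep set of "of" and "the" and a maxWordCount of 10; both values are chosen only so that short multi-word tokens are processed.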

        Hoss Man added a comment -

        1) would it make sense for the keep option to refer to a file, using the same format as StopFilter ... that way it's easy to reuse the same file (which seems like it would be a common case).

        2) what is the point of forceFirstLetter="true" ? ... if you want to force capitalization, what's the point of making the keep list?

        3) is okPrefix going to force the case for things that have that prefix in an alternate case, or only allow that casing to remain (i.e., if I index McKeen, Mckeen, mckeen and MCKEEN, what tokens do I wind up with?)

        Ryan McKinley added a comment -

        >
        > 1) would it make sense for the keep option to refer to a file, using the same format as StopFilter ... that way it's easy to reuse the same file (which seems like it would be a common case).
        >

        Probably. That is a good idea.

        > 2) what is the point of forceFirstLetter="true" ? ... if you want to force capitalization, what's the point of making the keep list?
        >

        This is one that came out of necessity!

        With keep="the ..." and input:
        "Grand army of the Republic", "the arts"

        I want: "Grand Army of the Republic" and "The Arts"

        "forceFirstLetter" only applies to the first character in the token, not to each word.

        > 3) is okPrefix going to force the case for things that have that prefix in an alternate case, or only allow that casing to remain (i.e., if I index McKeen, Mckeen, mckeen and MCKEEN, what tokens do I wind up with?)
        >

        As written, if the prefix matches, it assumes the word capitalization is correct. For my input data, this is sufficient – but it should probably do something smarter.

        So, if you index "McKeen, Mckeen, mckeen, MCKEEN and McKEEN", you would get:

        "McKeen, Mckeen, Mckeen, Mckeen And McKEEN"

        If "okPrefix" were treated as the capitalization for input where the lowercased prefix matches "mck", it would give:

        "McKeen, McKeen, McKeen, McKeen And McKeen"
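The two okPrefix behaviors compared above can be sketched per word. This is a hypothetical illustration, not the patch code: the class and method names are invented, and the second method is only one possible reading of "treating okPrefix as the capitalization".

```java
import java.util.Locale;

/** Hypothetical per-word sketch of the two okPrefix behaviors; not the patch code. */
final class OkPrefixSketch {

    /** Uppercases the first letter and lowercases the rest. */
    static String capitalize(String word) {
        if (word.isEmpty()) {
            return word;
        }
        String lower = word.toLowerCase(Locale.ROOT);
        return Character.toUpperCase(lower.charAt(0)) + lower.substring(1);
    }

    /** "As written": a case-sensitive prefix match means the word is trusted as-is. */
    static String asWritten(String word, String okPrefix) {
        if (word.startsWith(okPrefix)) {
            return word;
        }
        return capitalize(word);
    }

    /**
     * Alternative: match the prefix ignoring case, then impose the prefix's own
     * casing and lowercase the remainder (an assumed interpretation).
     */
    static String prefixAsCasing(String word, String okPrefix) {
        if (word.regionMatches(true, 0, okPrefix, 0, okPrefix.length())) {
            String rest = word.substring(okPrefix.length()).toLowerCase(Locale.ROOT);
            return okPrefix + rest;
        }
        return capitalize(word);
    }

    public static void main(String[] args) {
        String[] words = {"McKeen", "Mckeen", "mckeen", "MCKEEN", "and", "McKEEN"};
        for (String w : words) {
            System.out.println(w + " -> " + asWritten(w, "McK")
                    + " | " + prefixAsCasing(w, "McK"));
        }
    }
}
```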

        Yonik Seeley added a comment -

        Hmmm, this feels slightly strange implementing at the indexing level.
        What are the advantages/disadvantages vs. just lowercasing for indexing and capitalizing at the presentation/application layer?

        Ryan McKinley added a comment -

        It is a little strange, but (in my case anyway) I think it makes sense...

        I am indexing a bunch of metadata from a bunch of libraries (OAI-PMH) – I want to display the data exactly as it came from the source, but for faceted browsing I need to normalize capitalization.

        Implemented at the indexing level, I can have different values for the stored value and indexed terms. Also, at the indexing level I can leverage existing Tokenizers and Filters to build the tokens that need capitalization. It keeps all the configuration in schema.xml and lets the OAI -> Solr XML conversion be a simple transformation. This way, whoever takes care of this need only learn Solr configuration, not ryan+solr configuration.

        If it is not generally useful I can keep it elsewhere - that is why we have the nice plugin framework!

        Yonik Seeley added a comment -

        > Implemented at the indexing level, I can have different values for the stored value and indexed terms.
        One downside is that it complicates certain things like wildcard or prefix queries (capitalizing the first letter and lowercasing the second is something that the QueryParser does not support).

        You could still store the values verbatim, and index as all lowercase.
        Then the application could capitalize the results it gets back as it sees fit.
        I do see value pushing this type of logic back to the search engine though.
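The application-side alternative (index lowercase terms, capitalize when displaying) might look like the following sketch. The class, method names, and facet counts are hypothetical; nothing here is a Solr API, and the map simply stands in for facet results already fetched by the application.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical presentation-layer helper; not part of Solr. */
final class FacetLabelSketch {

    /** Uppercases the first letter of each space-separated word in a lowercase term. */
    static String toLabel(String term) {
        StringBuilder sb = new StringBuilder(term.length());
        boolean startOfWord = true;
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            sb.append(startOfWord ? Character.toUpperCase(c) : c);
            startOfWord = (c == ' ');
        }
        return sb.toString();
    }

    /** Rewrites facet keys for display, preserving the order and the counts. */
    static Map<String, Integer> relabel(Map<String, Integer> facetCounts) {
        Map<String, Integer> labeled = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : facetCounts.entrySet()) {
            labeled.put(toLabel(e.getKey()), e.getValue());
        }
        return labeled;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("aerial views", 12); // made-up counts for illustration
        counts.put("maps", 3);
        System.out.println(relabel(counts));
    }
}
```

Because the indexed terms stay lowercase in this approach, prefix and wildcard queries need no special handling; the cost is that every client has to apply the same labeling rules.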

        Of course, I think this might be a more general problem in faceting... what to actually use as a label for display purposes vs what the terms in the index were (think price formatting, labels for more complex facet queries, etc).

        Ryan McKinley added a comment -

        >
        >> Implemented at the indexing level, I can have different values for the stored value and indexed terms.
        > One downside is that it complicates certain things like wildcard or prefix queries
        >

        Currently I'm using copyField and doing the prefix query on a different field... not great, but it works!

        >
        > Of course, I think this might be a more general problem in faceting... what to actually use as a label for display purposes vs what the terms in the index were (think price formatting, labels for more complex facet queries, etc).
        >

        Interesting. I could index with a lowercase filter, then reformat the facet results... I'll take a look at that after the deadline passes.

        J.J. Larrea added a comment -

        While I fully agree that faceting does raise some odd issues stemming from the display of normally-invisible indexed values to humans, and that it theoretically should be the responsibility of the front-end to translate index values into human-readable values, there are great practical advantages in both efficiency and convenience to making the indexed values "pretty", and to centralizing as much of that as possible in the Analysis stage.

        In particular, I will try this and am very likely to put this into use this weekend, so thank you Ryan! So I'm +1 to adding it to the Solr distribution, though to avoid confusing people it should have a JavaDoc comment explaining that the main use is in faceting to avoid having to introduce such common logic into the presentation-layer.

        Regarding the implementation,

        1. For 'keep' and 'okPrefix' (and were it not for reverse-compatibility issues, for 'words' in StopFilter), it would be nice to have a means to specify either a direct list or a filename in the same parameter. A simple approach might be something like keep="word word word..." vs. keep="<file", or even keep="<file <file word word" (with the requirement for backslash-escaping spaces in either)... Or alternately something like txt:filename (vs. xml:filename, json:filename, etc.) with an unescaped : being significant.

        2. Why is so much of the logic in the Factory? This drags Solr-specific stuff in when a user might want to use just the Analyzer in a non-Solr context. Wouldn't it be better in general for Solr Analyzers to be self-complete, with the Factory merely being an adaptor between SolrParams & external resources and the Analyzer's constructor?

        Also, why is keep in a synchronized map, since there is no mutator? (I know, picky picky...)

        Good luck with the deadline!

        Yonik Seeley added a comment -

        > Why is so much of the logic in the Factory?

        I haven't looked at this specific code, but this is my preference in general. Multiple TokenFilters are created per-field instance on the index side, and per-query-term on the search side, so it's better to pull all the setup you can out of the Filter for performance reasons.

        Ryan McKinley added a comment -

        > Why is so much of the logic in the Factory?

        It seemed silly to copy the same things over and over for each time the type is indexed or queried...

        > why is keep in a synchronized map,

        I'm not sure it needs to be, but I was being cautious... the map is only created once (and never edited) but could be accessed by many threads simultaneously.
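The two points above (do the setup once in the factory, and share a never-edited collection across threads) can be sketched as follows. These are hypothetical stand-in classes, not Solr or Lucene types. The sketch assumes the collection is a set of keep words and relies on the standard Java guarantee that an object reachable through a final field, and not modified after construction, can be read concurrently without synchronization.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for a filter factory; not a Solr class. */
final class KeepFactorySketch {
    // Parsed once at construction and never modified afterwards.
    private final Set<String> keep;

    KeepFactorySketch(String keepAttribute) {
        Set<String> words = new HashSet<>(Arrays.asList(keepAttribute.trim().split("\\s+")));
        // The unmodifiable wrapper rejects any later attempt to mutate the set.
        this.keep = Collections.unmodifiableSet(words);
    }

    /** Cheap per-use object: it borrows the factory's set instead of copying it. */
    KeepFilterSketch create() {
        return new KeepFilterSketch(keep);
    }
}

/** Hypothetical stand-in for the per-field / per-query-term filter. */
final class KeepFilterSketch {
    private final Set<String> keep;

    KeepFilterSketch(Set<String> keep) {
        this.keep = keep;
    }

    boolean isKept(String word) {
        return keep.contains(word); // read-only access, no locking needed
    }

    boolean sharesKeepSetWith(KeepFilterSketch other) {
        return keep == other.keep;
    }
}
```

Every filter created this way costs one small allocation, and no synchronized wrapper is required because nothing writes to the set after the factory is built.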

        Ryan McKinley added a comment -

        applies with trunk

        Ryan McKinley added a comment -

        1. Added better javadocs explaining the configuration.
        2. Removed the synchronized map.
        3. Put the Filter as a package-private class in the Factory file – since the filter relies on the factory, it is not particularly useful outside Solr.

        I would like to add this soon.

        Ryan McKinley added a comment -

        added a while ago

        Hoss Man added a comment -

        This bug was modified as part of a bulk update using the criteria...

        • Marked "Resolved" and "Fixed"
        • Had no "Fix Version" versions
        • Was listed in the CHANGES.txt for 1.3 as of today 2008-03-15

        The Fix Version for all 29 issues found was set to 1.3, email notification was suppressed to prevent excessive email.

        For a list of all the issues modified, search jira comments for this (hopefully) unique string: batch20070315hossman1


          People

          • Assignee:
            Ryan McKinley
            Reporter:
            Ryan McKinley