SOLR-2059

Allow customizing how WordDelimiterFilter tokenizes text.

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      By default, WordDelimiterFilter assigns 'types' to each character (computed from Unicode Properties).
      Based on these types and the options provided, it splits and concatenates text.
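
      As a rough illustration of the idea above, here is a minimal sketch of type-based splitting. It is not the filter's actual code: the class name, the numeric type codes, and the two boundary rules (lowercase followed by uppercase, and letter/digit changes) are simplifying assumptions, and the real filter has many more options.

```java
import java.util.ArrayList;
import java.util.List;

public class CharTypeSplitSketch {
    // Hypothetical type codes, chosen only for this illustration.
    static final int LOWER = 1, UPPER = 2, ALPHA = 3, DIGIT = 4, SUBWORD_DELIM = 8;

    // Derive a type from Unicode character properties.
    static int typeOf(char c) {
        if (Character.isLowerCase(c)) return LOWER;
        if (Character.isUpperCase(c)) return UPPER;
        if (Character.isDigit(c)) return DIGIT;
        if (Character.isLetter(c)) return ALPHA; // letters without case
        return SUBWORD_DELIM;                    // everything else splits
    }

    // Split text at delimiters and at simple type transitions.
    static List<String> split(String text) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int prev = SUBWORD_DELIM;
        for (char c : text.toCharArray()) {
            int type = typeOf(c);
            boolean isDelim = type == SUBWORD_DELIM;
            boolean boundary = isDelim
                    || (prev == LOWER && type == UPPER)        // e.g. "rS" in PowerShot
                    || ((prev == DIGIT) != (type == DIGIT));   // letter <-> digit change
            if (boundary && current.length() > 0) {
                parts.add(current.toString());
                current.setLength(0);
            }
            if (!isDelim) {
                current.append(c); // delimiters are dropped, not emitted
            }
            prev = type;
        }
        if (current.length() > 0) {
            parts.add(current.toString());
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("PowerShot-500"));
        System.out.println(split("$1,000.50"));
    }
}
```

      Under these simplified rules, PowerShot-500 comes out as Power, Shot, 500, while $1,000.50 is broken into 1, 000, 50 because '$', ',' and '.' all count as delimiters.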

      In some circumstances, you might need to tweak this behavior.
      It seems the filter already had this in mind, since you can pass in a custom byte[] type table.
      But it's not exposed in the factory.

      I think you should be able to customize the defaults with a configuration file:

      # A customized type mapping for WordDelimiterFilterFactory
      # the allowable types are: LOWER, UPPER, ALPHA, DIGIT, ALPHANUM, SUBWORD_DELIM
      # 
      # the default for any character without a mapping is always computed from 
      # Unicode character properties
      
      # Map the $, %, '.', and ',' characters to DIGIT 
      # This might be useful for financial data.
      $ => DIGIT
      % => DIGIT
      . => DIGIT
      \u002C => DIGIT
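
      To make the proposed file format concrete, here is a minimal sketch of how such rules could be parsed into per-character type overrides. This is not the code from the attached patch: the class and method names are hypothetical, the numeric type codes are placeholders, and the result is a Map rather than the byte[] table the filter accepts. Characters without an entry would fall back to the Unicode-derived default.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TypeMappingSketch {
    // Placeholder type codes (an assumption; the filter's real constants may differ).
    static final Map<String, Byte> TYPES = Map.of(
            "LOWER", (byte) 0x01,
            "UPPER", (byte) 0x02,
            "ALPHA", (byte) 0x03,
            "DIGIT", (byte) 0x04,
            "ALPHANUM", (byte) 0x07,
            "SUBWORD_DELIM", (byte) 0x08);

    // Accept a single literal character or a \\uXXXX escape such as \\u002C.
    static char parseChar(String s) {
        if (s.length() == 1) {
            return s.charAt(0);
        }
        if (s.length() == 6 && s.startsWith("\\u")) {
            return (char) Integer.parseInt(s.substring(2), 16);
        }
        throw new IllegalArgumentException("Invalid character: " + s);
    }

    // Parse "char => TYPE" rules, skipping blank lines and '#' comments.
    static Map<Character, Byte> parse(List<String> lines) {
        Map<Character, Byte> overrides = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] sides = line.split("=>", 2);
            if (sides.length != 2) {
                throw new IllegalArgumentException("Invalid rule: " + line);
            }
            Byte type = TYPES.get(sides[1].trim());
            if (type == null) {
                throw new IllegalArgumentException("Unknown type in rule: " + line);
            }
            overrides.put(parseChar(sides[0].trim()), type);
        }
        return overrides;
    }

    public static void main(String[] args) {
        List<String> rules = List.of(
                "# Map the $, %, '.', and ',' characters to DIGIT",
                "$ => DIGIT",
                "% => DIGIT",
                ". => DIGIT",
                "\\u002C => DIGIT");
        System.out.println(parse(rules));
    }
}
```

      A real implementation would then copy these overrides into the type table passed to the filter. Note that in this sketch a line starting with '#' is always treated as a comment, so '#' itself would have to be written as an escape.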
      
      1. SOLR-2059.patch
        11 kB
        Robert Muir


          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir