SOLR-2059

Allow customizing how WordDelimiterFilter tokenizes text.

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      By default, WordDelimiterFilter assigns 'types' to each character (computed from Unicode Properties).
      Based on these types and the options provided, it splits and concatenates text.
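
As a rough illustration of the default behaviour described above, the sketch below derives a type for each character from its Unicode general category. It is written in Python for brevity rather than the filter's own Java. The function name `default_type`, the bit values, and the exact category-to-type mapping are assumptions made for illustration, not a copy of WordDelimiterFilter.

```python
import unicodedata

# Illustrative bit flags (assumed values); ALPHA and ALPHANUM are combinations.
LOWER, UPPER, DIGIT, SUBWORD_DELIM = 0x01, 0x02, 0x04, 0x08
ALPHA = LOWER | UPPER
ALPHANUM = ALPHA | DIGIT


def default_type(ch):
    """Return an assumed default type for one character, based on its Unicode category."""
    category = unicodedata.category(ch)
    if category == "Lu":
        return UPPER
    if category == "Ll":
        return LOWER
    if category[0] in ("L", "M"):  # other letters and combining marks
        return ALPHA
    if category[0] == "N":  # decimal digits and other numbers
        return DIGIT
    # Punctuation, symbols, whitespace, etc. act as delimiters by default.
    return SUBWORD_DELIM


types = [default_type(c) for c in "Wi-Fi4"]
```

Under these assumptions `$` and `,` fall into the delimiter bucket, which is why a financial value such as `$1,000.50` would be broken apart unless the defaults can be overridden.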

      In some circumstances, you might need to tweak this behavior.
      The filter seems to have been designed with this in mind, since you can pass in a custom byte[] type table.
      But it's not exposed in the factory.

      I think you should be able to customize the defaults with a configuration file:

      # A customized type mapping for WordDelimiterFilterFactory
      # the allowable types are: LOWER, UPPER, ALPHA, DIGIT, ALPHANUM, SUBWORD_DELIM
      # 
      # the default for any character without a mapping is always computed from 
      # Unicode character properties
      
      # Map the $, %, '.', and ',' characters to DIGIT 
      # This might be useful for financial data.
      $ => DIGIT
      % => DIGIT
      . => DIGIT
      \u002C => DIGIT
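
To make the proposed file format concrete, here is a minimal sketch of a parser that turns such a file into a character-to-type table. It is Python, not the patch's Java, and `parse_type_rules`, `unescape`, and the numeric type values are hypothetical names and values chosen for this example. The sketch assumes that lines starting with `#` are comments, so a rule for the `#` character itself has to be written with an escape such as `\#` or `\u0023`; whether the committed parser behaves the same way is not stated in this issue.

```python
import re

# Assumed numeric values for the six allowable type names.
TYPES = {
    "LOWER": 0x01,
    "UPPER": 0x02,
    "ALPHA": 0x03,
    "DIGIT": 0x04,
    "ALPHANUM": 0x07,
    "SUBWORD_DELIM": 0x08,
}

RULE = re.compile(r"^(.*?)\s*=>\s*(\S+)\s*$")


def unescape(token):
    """Decode a \\uXXXX escape or a single backslash-escaped character."""
    match = re.fullmatch(r"\\u([0-9A-Fa-f]{4})", token)
    if match:
        return chr(int(match.group(1), 16))
    if len(token) == 2 and token[0] == "\\":
        return token[1]
    return token


def parse_type_rules(lines):
    """Build a {character: type} table from 'char => TYPE' lines."""
    table = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):  # blank line or comment
            continue
        match = RULE.match(line)
        if not match:
            raise ValueError(f"invalid mapping rule: {line!r}")
        char = unescape(match.group(1))
        type_name = match.group(2)
        if len(char) != 1:
            raise ValueError(f"rule must map exactly one character: {line!r}")
        if type_name not in TYPES:
            raise ValueError(f"unknown type {type_name!r} in rule: {line!r}")
        table[char] = TYPES[type_name]
    return table
```

Any character missing from the resulting table would keep its default, Unicode-derived type, matching the comment in the file above.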
      
      1. SOLR-2059.patch
        11 kB
        Robert Muir

        Activity

        Peter Karich added a comment -

        Robert,

        thanks for this work! I have a different application for this patch: in a Twitter search, # and @ shouldn't be removed. Instead I will handle them like ALPHA, I think.

        Would you mind updating the patch for the latest version of trunk? I hit a problem with WordDelimiterIterator at line 254 when using https://svn.apache.org/repos/asf/lucene/dev/trunk/solr, and a missing-file problem (line 37) with http://svn.apache.org/repos/asf/solr

        Robert Muir added a comment -

        Hi Peter:

        that's a great example. My own use case wasn't the one in the description either; I was just trying to give a good general example.

        What do you think of the file format, is it ok for describing these categories?
        This format/parser is simply borrowed from MappingCharFilterFactory; it seemed unambiguous and is already in use.

        As far as applying the patch, you need to apply it to https://svn.apache.org/repos/asf/lucene/dev/trunk, not https://svn.apache.org/repos/asf/lucene/dev/trunk/solr

        This is because it has to modify a file in modules, too.

        Peter Karich added a comment - - edited

        Oops, my mistake ... this helped!

        > What do you think of the file format, is it ok for describing these categories?

        I think it is OK. I even had a simpler patch before stumbling over yours: handleAsChar="@#". Your approach is more powerful IMHO:

         
        @ => ALPHA
        # => ALPHA
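
The effect of such overrides on a Twitter-style query can be sketched as follows. This is a deliberately simplified Python model, not the real filter: it only splits on delimiter characters, ignores case-change and letter/digit transitions, and treats whitespace as just another delimiter. The function `split_on_delims` and its `overrides` parameter are hypothetical names for this example.

```python
def split_on_delims(text, overrides=None):
    """Split text at delimiter characters; overrides maps a character to a type name."""
    overrides = overrides or {}
    tokens, current = [], []
    for ch in text:
        # Simplified default: letters and digits are kept, everything else delimits.
        default = "ALPHANUM" if ch.isalnum() else "SUBWORD_DELIM"
        kind = overrides.get(ch, default)
        if kind == "SUBWORD_DELIM":
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


tweet = "@solr #lucene rocks"
without_overrides = split_on_delims(tweet)
with_overrides = split_on_delims(tweet, {"@": "ALPHA", "#": "ALPHA"})
```

In this model the default run drops `@` and `#`, while the run with overrides keeps `@solr` and `#lucene` as whole tokens, which is the behaviour wanted for mentions and hashtags.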
        
        Robert Muir added a comment -

        Thanks for the feedback. I'd like to commit (to trunk and 3x) in a few days if no one objects.

        Robert Muir added a comment -

        Committed revision 990451 (trunk), 990456 (3x)

        Grant Ingersoll added a comment -

        Bulk close for 3.1.0 release


          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            1
            Watchers:
            1
