Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.5, 3.1, 4.0-ALPHA
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      Add a new CharFilter that uses a regular expression to select the text to be replaced in the char stream.

      Usage:

      schema.xml
      <fieldType name="textCharNorm" class="solr.TextField" positionIncrementGap="100" >
        <analyzer>
          <charFilter class="solr.PatternReplaceCharFilterFactory"
                      groupedPattern="([nN][oO]\.)\s*(\d+)"
                      replaceGroups="1,2" blockDelimiters=":;"/>
          <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        </analyzer>
      </fieldType>
      
      1. SOLR-1653.patch
        22 kB
        Koji Sekiguchi
      2. SOLR-1653.patch
        17 kB
        Koji Sekiguchi

        Activity

        Grant Ingersoll added a comment -

        Bulk close for 3.1.0 release

        Hoss Man added a comment -

        Correcting Fix Version based on CHANGES.txt, see this thread for more details...

        http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

        Koji Sekiguchi added a comment -

        Thanks, Paul! I've just committed revision 897357.

        Paul taylor added a comment -

        Hi, I'm using this outside Solr in an analyzer, and I think there may be a performance issue because you cannot pass a compiled Pattern. In the reusableTokenStream() method you cannot reset a CharFilter the way you can a Tokenizer, so it has to recompile the pattern every time.

        i.e.
        public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
            SavedStreams streams = (SavedStreams) getPreviousTokenStream();
            if (streams == null) {
                streams = new SavedStreams();
                setPreviousTokenStream(streams);
                streams.tokenStream = new StandardTokenizer(Version.LUCENE_CURRENT,
                        new PatternReplaceCharFilter("(no\\.) ([0-9]+)", "$1$2", reader));
                // The filter chain starts from the tokenizer created above.
                streams.filteredTokenStream = new StandardFilter(streams.tokenStream);
                streams.filteredTokenStream = new AccentFilter(streams.filteredTokenStream);
                streams.filteredTokenStream = new LowercaseFilter(streams.filteredTokenStream);
            } else {
                // A new char filter is built on every reuse, so the pattern string is compiled again here.
                streams.tokenStream.reset(new PatternReplaceCharFilter("(no\\.) ([0-9]+)", "$1$2", reader));
            }
            return streams.filteredTokenStream;
        }
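
        The concern above, that a pattern given as a string is compiled on every reuse, can be illustrated with plain java.util.regex. This is only a sketch: the class name PrecompiledPatternSketch and the method normalize are hypothetical, and it applies the regex to a String rather than to a Lucene char stream. It is not the PatternReplaceCharFilter API.

```java
import java.util.regex.Pattern;

final class PrecompiledPatternSketch {
    // Compiled once when the class is loaded. Pattern instances are immutable
    // and thread-safe, so the same instance can serve every call.
    private static final Pattern NO_DOT_NUMBER = Pattern.compile("(no\\.) ([0-9]+)");

    // Applies the shared pattern; only a lightweight Matcher is created per call.
    static String normalize(String text) {
        return NO_DOT_NUMBER.matcher(text).replaceAll("$1$2");
    }

    public static void main(String[] args) {
        System.out.println(normalize("track no. 7 and no. 12")); // prints "track no.7 and no.12"
    }
}
```

        A char filter that accepted a Pattern instead of a String could share the compiled instance across calls in the same way.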

        Koji Sekiguchi added a comment -

        Committed revision 890798. Thanks Shalin and Noble for taking time to review the patch.

        Shalin Shekhar Mangar added a comment -

        If there are no objections, I'll commit later today.

        +1

        Thanks Koji!

        Koji Sekiguchi added a comment -

        I see that the existing "PatternReplaceFilter" (not CharFilter) uses "pattern". But it uses "replacement", not "replaceWith". I think I'll use "pattern" and "replacement".

        Noble Paul added a comment -

        In Solr we refer to regular expression strings as 'regex'. If you think 'pattern' is OK, go ahead.

        Koji Sekiguchi added a comment -

        Excuse me. Because I tried to correct offsets per group within a match when I started the first patch, I introduced my own syntax. But, yes, I've now implemented the offset correction per match, so I can use the standard syntax. Here is the new patch.

        Usage:

        schema.xml
        <fieldType name="textCharNorm" class="solr.TextField" positionIncrementGap="100" >
          <analyzer>
            <charFilter class="solr.PatternReplaceCharFilterFactory"
                        pattern="([nN][oO]\.)\s*(\d+)"
                        replaceWith="$1$2"/>
            <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          </analyzer>
        </fieldType>
        

        If there are no objections, I'll commit later today.

        Noble Paul added a comment -

        I need to process one match at a time.

        I guess regex can process one match at a time.

        The most important point is that we don't need to educate users on this new syntax (I am still not clear about the syntax). There is no need to write any parsing code and maintain it.

        Koji Sekiguchi added a comment -

        I guess this can be achieved with the matcher#replaceAll() directly

        You're right, if we don't correct the offsets of the output char stream. To correct them, I need to process one match at a time.
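
        Processing one match at a time so that offsets can be corrected might look roughly like the following. This is a sketch under assumptions, not the code in the patch: the class PerMatchOffsets, the Result holder, and the {outputOffset, cumulativeDiff} representation are all hypothetical, and it works on a String instead of a char stream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PerMatchOffsets {

    /** Replaced text plus one offset correction per match (hypothetical holder type). */
    static final class Result {
        final String text;
        // Each entry is {outputOffset, cumulativeDiff}: for output positions at or
        // after outputOffset, inputPosition = outputPosition + cumulativeDiff.
        final List<int[]> corrections;

        Result(String text, List<int[]> corrections) {
            this.text = text;
            this.corrections = corrections;
        }
    }

    static Result replace(Pattern pattern, String replacement, String input) {
        StringBuffer out = new StringBuffer();
        List<int[]> corrections = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        int cumulativeDiff = 0;
        while (m.find()) {
            // Where this match starts in the output, given earlier corrections.
            int outStart = m.start() - cumulativeDiff;
            // Appends the unmatched gap, then the expanded replacement ($1, $2, ...).
            m.appendReplacement(out, replacement);
            int matchedLen = m.end() - m.start();
            int replacedLen = out.length() - outStart;
            if (matchedLen != replacedLen) {
                cumulativeDiff += matchedLen - replacedLen;
                corrections.add(new int[] {out.length(), cumulativeDiff});
            }
        }
        m.appendTail(out);
        return new Result(out.toString(), corrections);
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("([nN][oO]\\.)\\s*(\\d+)");
        Result r = replace(p, "$1$2", "No. 1 no.  22");
        System.out.println(r.text);
        for (int[] c : r.corrections) {
            System.out.println("from output offset " + c[0] + ": add " + c[1]);
        }
    }
}
```

        A single replaceAll() call would produce the same text, but it would not expose where each match began and ended, which is what the per-match loop records.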

        Noble Paul added a comment -

        I guess this can be achieved with the matcher#replaceAll() directly

        input = see-ing looking
        regex = (\w+)(ing)
        replaceWith = $1

        input = abc=1234=5678
        regex = (\w+)=(\d+)=(\d+)
        replaceWith = $3=$1=$2
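
        The two cases above can be checked directly with java.util.regex. This is a small sketch; the class name ReplaceAllExamples and the helper apply are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;

final class ReplaceAllExamples {
    // Compiles the regex and replaces every match using standard $n group references.
    static String apply(String regex, String replaceWith, String input) {
        return Pattern.compile(regex).matcher(input).replaceAll(replaceWith);
    }

    public static void main(String[] args) {
        // Drops a trailing "ing"; "see-ing" is untouched because "-" is not a word character.
        System.out.println(apply("(\\w+)(ing)", "$1", "see-ing looking"));
        // Reorders the three groups.
        System.out.println(apply("(\\w+)=(\\d+)=(\\d+)", "$3=$1=$2", "abc=1234=5678"));
    }
}
```

        The first call prints "see-ing look" and the second prints "5678=abc=1234".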

        Koji Sekiguchi added a comment - - edited

        OK. I'll show you some samples:

        INPUT            | groupedPattern     | replaceGroups | OUTPUT        | comment
        see-ing looking  | (\w+)(ing)         | 1             | see-ing look  | remove "ing" from the end of word
        see-ing looking  | (\w+)ing           | 1             | see-ing look  | same as above; the 2nd parentheses can be omitted
        No.1 NO. no. 543 | [nN][oO]\.\s*(\d+) | {#},1         | #1 NO. #543   | sample for a literal; do not forget to set blockDelimiters to something other than a period when you use a period in groupedPattern
        abc=1234=5678    | (\w+)=(\d+)=(\d+)  | 3,{=},1,{=},2 | 5678=abc=1234 | change the order of the groups
        Shalin Shekhar Mangar added a comment -

        Koji, even after reading through the test, I do not understand how to use it. Are the characters in curly braces written down for non-groups only? What if I want to remove one particular group?

        It is always good to write a use-case and an example in the issue description itself.

        Koji Sekiguchi added a comment -

        I'll commit in a few days.


          People

          • Assignee:
            Koji Sekiguchi
            Reporter:
            Koji Sekiguchi