LUCENE-1466: CharFilter - normalize characters before tokenizer

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4
    • Fix Version/s: 2.9
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New, Patch Available

      Description

      This issue proposes importing the CharFilter that was introduced in Solr 1.4.

      Please see SOLR-822 for the details.
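
      For readers unfamiliar with the feature, here is a minimal, hedged usage
      sketch of the API this issue adds (NormalizeCharMap, MappingCharFilter
      and CharReader, as they landed in Lucene 2.9); the accent-folding
      mapping is an arbitrary example, not taken from the patch:

        import java.io.Reader;
        import org.apache.lucene.analysis.CharReader;
        import org.apache.lucene.analysis.CharStream;
        import org.apache.lucene.analysis.MappingCharFilter;
        import org.apache.lucene.analysis.NormalizeCharMap;
        import org.apache.lucene.analysis.Tokenizer;
        import org.apache.lucene.analysis.WhitespaceTokenizer;

        // Sketch: normalize characters *before* tokenization, so the
        // tokenizer sees the mapped text while offsets are corrected back
        // to the original input.
        public class CharFilterSketch {
          public static Tokenizer build(Reader reader) {
            NormalizeCharMap map = new NormalizeCharMap();
            map.add("\u00e9", "e"); // fold "é" to "e" (arbitrary mapping)
            CharStream stream = new MappingCharFilter(map, CharReader.get(reader));
            return new WhitespaceTokenizer(stream); // CharStream extends Reader
          }
        }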

      Attachments

      1. LUCENE-1466.patch
        40 kB
        Koji Sekiguchi
      2. LUCENE-1466.patch
        28 kB
        Michael McCandless
      3. LUCENE-1466.patch
        29 kB
        Koji Sekiguchi
      4. LUCENE-1466.patch
        29 kB
        Koji Sekiguchi
      5. LUCENE-1466.patch
        28 kB
        Koji Sekiguchi
      6. LUCENE-1466-back.patch
        2 kB
        Michael McCandless
      7. LUCENE-1466-TestCharFilter.patch
        3 kB
        Koji Sekiguchi

        Activity

        Michael McCandless added a comment -

        > an additional test for CharFilter that I forgot to move from Solr... Mike, can you commit this? Thank you.

        Done... thanks!

        Koji Sekiguchi added a comment -

        An additional test for CharFilter that I forgot to move from Solr... Mike, can you commit this? Thank you.

        Michael McCandless added a comment -

        Whoops, sorry. I did indeed see your new patch and applied it, but then failed to svn add the new files. OK, I just committed them. Thanks!

        Koji Sekiguchi added a comment -

        Thank you, Mike, for committing this! I'll open a ticket for Solr soon. BTW, I cannot see TestMappingCharFilter, which is in the latest patch I attached. Is there a problem with the test, or did it just slip through?

        Michael McCandless added a comment -

        OK, I just committed this. Thanks, Koji! Can you open a Solr issue and work out a patch so Solr can cut over to this? Thanks.

        Koji Sekiguchi added a comment -

        Added TestMappingCharFilter test case (copied from Solr).

        Koji Sekiguchi added a comment -

        > Solr has already committed CharFilter, and had to work around it not being in Lucene with classes like CharStreamAwareTokenizer, etc. Koji, are you planning to work out a patch for Solr to switch to Lucene's impl?

        Yeah, why not!

        Michael McCandless added a comment -

        OK, thanks, Koji. I'll add a bit more to the javadocs of BaseCharFilter about the performance caveats.

        I plan to commit in a day or two.

        Thanks for persisting, Koji!

        Solr has already committed CharFilter, and had to work around it not being in Lucene with classes like CharStreamAwareTokenizer, etc. Koji, are you planning to work out a patch for Solr to switch to Lucene's impl?

        Koji Sekiguchi added a comment -

        Oops. Thanks for the updated patch, Mike!

        > • Can you add a CHANGES entry describing this new feature, as well
        >   as the change in type of Tokenizer.input?
        > • Can we rename NormalizeMap -> NormalizeCharMap?
        > • Could you add some javadocs to NormalizeCharMap,
        >   MappingCharFilter, BaseCharFilter?

        Your patch looks good!

        > • The BaseCharFilter correct method looks spookily costly (it has a for
        >   loop, going backwards over all added mappings). It seems like in
        >   practice it should not be costly, because typically one corrects
        >   the offset only for the "current" token? And, one could always
        >   build their own CharFilter (e.g. using arrays of ints or something)
        >   if they needed a more efficient mapping.

        Yes, users can create their own CharFilter if they need a more efficient mapping.

        > • MappingCharFilter's match method is recursive. But I think the
        >   depth of that recursion equals the length of the character sequence
        >   that's being mapped, right? So the risk of stack overflow should be
        >   basically zero, unless someone does some insanely long character
        >   string mappings?

        You are correct.

        > I think we should make an exception to back-compat here, and simply
        > change Tokenizer.input from Reader to CharStream (a subclass of
        > Reader). Properly respecting back-compat will be a lot of work, and,
        > if external subclasses are directly assigning to input, they really
        > ought to be using reset(Reader) instead.

        I agree with you, Mike.

        Michael McCandless added a comment -

        I think we should make an exception to back-compat here, and simply
        change Tokenizer.input from Reader to CharStream (a subclass of
        Reader). Properly respecting back-compat will be a lot of work, and,
        if external subclasses are directly assigning to input, they really
        ought to be using reset(Reader) instead.

        I updated the patch to address the above issues, fixed some whitespace
        issues, added Tokenizer.reset(CharStream), and patched back-compat.

        Michael McCandless added a comment -

        Thanks for the update, Koji!

        The patch looks good. Some questions:

        • Can you add a CHANGES entry describing this new feature, as well
          as the change in type of Tokenizer.input?
        • Can we rename NormalizeMap -> NormalizeCharMap?
        • Could you add some javadocs to NormalizeCharMap,
          MappingCharFilter, BaseCharFilter?
        • The BaseCharFilter correct method looks spookily costly (it has a for
          loop, going backwards over all added mappings). It seems like in
          practice it should not be costly, because typically one corrects
          the offset only for the "current" token? And, one could always
          build their own CharFilter (e.g. using arrays of ints or something)
          if they needed a more efficient mapping. (See the sketch after this
          list.)
        • MappingCharFilter's match method is recursive. But I think the
          depth of that recursion equals the length of the character sequence
          that's being mapped, right? So the risk of stack overflow should be
          basically zero, unless someone does some insanely long character
          string mappings?
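
        To make the cost discussion in the fourth bullet concrete, here is a
        hedged, minimal sketch of the kind of cumulative offset bookkeeping
        being described; it is an illustration in the spirit of
        BaseCharFilter.correct(), not the actual Lucene code:

          import java.util.ArrayList;
          import java.util.List;

          // Sketch of cumulative offset correction. Each recorded entry says:
          // "from this corrected offset on, add this cumulative diff to get
          // back to the original input offset."
          class OffsetCorrectionSketch {
            private final List<int[]> corrections = new ArrayList<int[]>();

            void addOffCorrectMap(int off, int cumulativeDiff) {
              corrections.add(new int[] { off, cumulativeDiff });
            }

            // The backwards scan over all added mappings is the linear cost
            // noted above; typically only the current token is corrected.
            int correct(int currentOff) {
              for (int i = corrections.size() - 1; i >= 0; i--) {
                int[] c = corrections.get(i);
                if (currentOff >= c[0]) {
                  return currentOff + c[1];
                }
              }
              return currentOff;
            }
          }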

        I have some back-compat concerns. First, I see these 2 failures in
        "ant test-tag":

        [junit] Testcase: testExclusiveLowerNull(org.apache.lucene.search.TestRangeQuery):	Caused an ERROR
        [junit] input
        [junit] java.lang.NoSuchFieldError: input
        [junit] 	at org.apache.lucene.search.TestRangeQuery$SingleCharAnalyzer$SingleCharTokenizer.incrementToken(TestRangeQuery.java:247)
        [junit] 	at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:160)
        [junit] 	at org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldConsumersPerField.java:36)
        [junit] 	at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:234)
        [junit] 	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:773)
        [junit] 	at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:751)
        [junit] 	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2354)
        [junit] 	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2328)
        [junit] 	at org.apache.lucene.search.TestRangeQuery.insertDoc(TestRangeQuery.java:306)
        [junit] 	at org.apache.lucene.search.TestRangeQuery.initializeIndex(TestRangeQuery.java:289)
        [junit] 	at org.apache.lucene.search.TestRangeQuery.testExclusiveLowerNull(TestRangeQuery.java:317)
        [junit] 
        [junit] 
        [junit] Testcase: testInclusiveLowerNull(org.apache.lucene.search.TestRangeQuery):	Caused an ERROR
        [junit] input
        [junit] java.lang.NoSuchFieldError: input
        [junit] 	at org.apache.lucene.search.TestRangeQuery$SingleCharAnalyzer$SingleCharTokenizer.incrementToken(TestRangeQuery.java:247)
        [junit] 	at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:160)
        [junit] 	at org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldConsumersPerField.java:36)
        [junit] 	at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:234)
        [junit] 	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:773)
        [junit] 	at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:751)
        [junit] 	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2354)
        [junit] 	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2328)
        [junit] 	at org.apache.lucene.search.TestRangeQuery.insertDoc(TestRangeQuery.java:306)
        [junit] 	at org.apache.lucene.search.TestRangeQuery.initializeIndex(TestRangeQuery.java:289)
        [junit] 	at org.apache.lucene.search.TestRangeQuery.testInclusiveLowerNull(TestRangeQuery.java:351)
        

        These are JAR drop-in-ability failures: the type of Tokenizer.input
        changed from Reader to CharStream. Since CharStream subclasses Reader,
        references to Tokenizer.input would be fixed with a simple recompile.

        However, assignments to "input" by external subclasses of Tokenizer
        will result in a compilation error. You have to replace such
        assignments with this.input = CharReader.get(input). Since input
        is protected, any subclass can simply assign to it. The good news is
        that this would be a catastrophic compilation error (vs. something
        silent at runtime); the bad news is that it's [unfortunately] against
        our back-compat policies. (A sketch of the migration follows below.)
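
        As an illustration of this migration, here is a hedged before/after
        sketch for a hypothetical external subclass (MyTokenizer is invented
        for this example; only the assignment changes):

          import java.io.Reader;
          import org.apache.lucene.analysis.CharReader;
          import org.apache.lucene.analysis.Tokenizer;

          public class MyTokenizer extends Tokenizer {
            public MyTokenizer(Reader in) {
              // Lucene 2.4 style -- compiled when input was a Reader:
              //   this.input = in;
              // Lucene 2.9 style -- input is now a CharStream, so wrap it:
              this.input = CharReader.get(in);
            }
            // token-producing methods (next()/incrementToken()) omitted
          }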

        Any ideas on how we can "migrate" to CharStream without breaking
        back-compat?

        Koji Sekiguchi added a comment (edited) -

        Updated patch attached:

        • synced with trunk (Smart Chinese analyzer (LUCENE-1629), etc.)
        • added a useful idiom for obtaining a CharStream, and made the
          CharReader constructor private (see the sketch below)
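
        The "idiom" is presumably a static factory that wraps a Reader only
        when necessary; a hedged sketch of that shape (the actual
        CharReader.get implementation may differ):

          // Inside CharReader, whose constructor is now private:
          // reuse an existing CharStream, otherwise wrap the plain Reader.
          public static CharStream get(Reader input) {
            return (input instanceof CharStream)
                ? (CharStream) input
                : new CharReader(input);
          }
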
        Michael McCandless added a comment -

        I'd like to get this in for 2.9.

        Koji Sekiguchi added a comment -

        If I can vote for it, I want it to be part of 2.9. I know of several (at least five) companies using this feature in production. We use it via Solr (SOLR-822), but we hope it will become part of the Lucene core.

        Robert Muir added a comment -

        Just as an alternative, I have a different mechanism as part of the LUCENE-1488 patch I am working on. But maybe it's good to have options, as mine depends on the ICU library.

        Below is an excerpt from the javadoc:

        A TokenFilter that transforms text with ICU.

        ICU provides text-transformation functionality via its Transliteration API.
        Although script conversion is its most common use, a transliterator can actually perform a more general class of tasks.
        ...
        Some useful transformations for search are built in:

        • Conversion from Traditional to Simplified Chinese characters
        • Conversion from Hiragana to Katakana
        • Conversion from Fullwidth to Halfwidth forms
        ...
        Example usage:

          stream = new ICUTransformFilter(stream, Transliterator.getInstance("Traditional-Simplified"));
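
        For context, here is a hedged sketch of how such a filter might be
        wired into an analysis chain, assuming the
        ICUTransformFilter(TokenStream, Transliterator) constructor shown in
        the excerpt (from the in-progress LUCENE-1488 patch, so its package is
        not final) and ICU4J's Transliterator API:

          import java.io.Reader;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.WhitespaceTokenizer;
          import com.ibm.icu.text.Transliterator;

          class ICUTransformExample {
            // Fold Traditional Chinese to Simplified before further filtering.
            // ICUTransformFilter comes from the LUCENE-1488 patch.
            TokenStream build(Reader reader) {
              TokenStream stream = new WhitespaceTokenizer(reader);
              stream = new ICUTransformFilter(stream,
                  Transliterator.getInstance("Traditional-Simplified"));
              return stream;
            }
          }
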
        Mark Miller added a comment -

        Anyone want to step up for this one, or should we push it off to 3.0?

        Koji Sekiguchi added a comment -

        Renamed correctPosition() to correct(), because it corrects token offsets, not token positions.

        Koji Sekiguchi added a comment -

        A patch is attached.


          People

          • Assignee: Michael McCandless
          • Reporter: Koji Sekiguchi
          • Votes: 4
          • Watchers: 2
