Lucene - Core
LUCENE-6879

Allow defining custom CharTokenizers using Java 8 lambdas/method references

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.0
    • Fix Version/s: 6.0
    • Component/s: modules/analysis
    • Lucene Fields: New

      Description

      As a follow-up to LUCENE-6874, I thought about how to create custom CharTokenizers without subclassing. I have needed this quite often and was a bit annoyed that you had to create a subclass every time.

      This issue uses the same pattern as ThreadLocal and many collection methods in Java 8: you have the (abstract) base class and define a factory method named fromXxxPredicate (like ThreadLocal.withInitial(() -> value)):

      public static CharTokenizer fromTokenCharPredicate(java.util.function.IntPredicate predicate)
      

      This would allow defining a new CharTokenizer with a single-line statement using any predicate:

      // long variant with lambda:
      Tokenizer tok = CharTokenizer.fromTokenCharPredicate(c -> !UCharacter.isUWhiteSpace(c));
      
      // method reference for separator char predicate + normalization by uppercasing:
      Tokenizer tok = CharTokenizer.fromSeparatorCharPredicate(UCharacter::isUWhiteSpace, Character::toUpperCase);
      
      // method reference to custom function:
      private boolean myTestFunction(int c) {
       return (crazy condition);
      }
      Tokenizer tok = CharTokenizer.fromTokenCharPredicate(this::myTestFunction);
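
      For illustration, here is a minimal sketch of how such a factory might look internally (an assumed shape, not the exact patch code): it simply returns an anonymous CharTokenizer subclass that delegates isTokenChar(int) to the given predicate.

      // hypothetical sketch of the factory, relying only on CharTokenizer's
      // abstract isTokenChar(int) contract:
      public static CharTokenizer fromTokenCharPredicate(java.util.function.IntPredicate predicate) {
        java.util.Objects.requireNonNull(predicate, "predicate must not be null");
        return new CharTokenizer() {
          @Override
          protected boolean isTokenChar(int c) {
            return predicate.test(c); // per-char decision delegated to the predicate
          }
        };
      }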
      

      I know this would not help Solr users who want to define the Tokenizer in a config file, but for plain Lucene users this Java 8 way would be easy and elegant to use. It is fast as hell, as it is just a reference to a method, and Java 8 is optimized for that.

      The inverted factories fromSeparatorCharPredicate() are provided to allow quick definitions using method references without negated lambdas. In many cases, like WhitespaceTokenizer, the predicate is on the separator chars (isWhitespace(int)), so with the second set of factories you can define them without the counter-intuitive negation. Internally it just uses IntPredicate#negate(), as sketched below.
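
      A minimal sketch of that delegation (again an assumed shape, not the exact patch code):

      // hypothetical sketch: negate the separator predicate and reuse the
      // token-char factory
      public static CharTokenizer fromSeparatorCharPredicate(java.util.function.IntPredicate separatorCharPredicate) {
        return fromTokenCharPredicate(separatorCharPredicate.negate());
      }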

      The factories also allow passing a normalization function; e.g., to lowercase, you can just pass Character::toLowerCase as an IntUnaryOperator reference.
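
      For example, the following one-liner behaves like the classic LowerCaseTokenizer (letters form tokens, output is lowercased); the same combination is used in the benchmark further down:

      // letters-only tokens, lowercased on output:
      Tokenizer tok = CharTokenizer.fromTokenCharPredicate(Character::isLetter, Character::toLowerCase);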


      Attachments

      1. LUCENE-6879.patch
        11 kB
        Uwe Schindler
      2. LUCENE-6879.patch
        7 kB
        Uwe Schindler


          Activity

          thetaphi Uwe Schindler added a comment -

          Patch using Java 8's new functional APIs. Very cool and simple to define a new Tokenizer.

          The only thing I don't like is that CharTokenizer is in the oal.analysis.util package. Maybe we should move the factories to a separate class in the oal.analysis.core package.

          The patch also has some tests showing how you would use them.

          rcmuir Robert Muir added a comment -

          I think the tests are nice examples, and I like the separator vs. token-char methods (it can be hard to think in opposites).

          Good improvement for Java 8 on trunk.

          thetaphi Uwe Schindler added a comment -

          We can improve the Javadocs by adding the examples; I just wanted to write the patch quickly to demonstrate what it could look like. We can also discuss the method names: the pattern follows the naming convention used for all functional interfaces in Java 8, but we can make it more readable. I am open to suggestions.

          In Lucene trunk we can also remove all the separate implementations like LetterTokenizer and just allow them to be produced by factories. This would be a slight break, but we could still provide the Solr/CustomAnalyzer factories as usual. The Tokenizer for ICU in LUCENE-6874 could then also be a one-liner provided just by the Solr factory, without an actual Tokenizer class.

          We could also provide a one-for-all Solr/CustomAnalyzer factory using an enum of predicate/normalizer functions, chosen by a string parameter; a rough idea is sketched below.
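
          A purely hypothetical sketch of that idea (names invented here, not from the patch), assuming the two-argument fromTokenCharPredicate factory from this issue:

          // hypothetical sketch, not part of the patch: predicate/normalizer
          // pairs selectable by name from a single factory
          enum PredefinedTokenizer {
            LETTER(Character::isLetter, c -> c),
            LOWERCASE(Character::isLetter, Character::toLowerCase),
            WHITESPACE(c -> !Character.isWhitespace(c), c -> c);

            private final java.util.function.IntPredicate tokenCharPredicate;
            private final java.util.function.IntUnaryOperator normalizer;

            PredefinedTokenizer(java.util.function.IntPredicate p, java.util.function.IntUnaryOperator n) {
              this.tokenCharPredicate = p;
              this.normalizer = n;
            }

            Tokenizer create() {
              return CharTokenizer.fromTokenCharPredicate(tokenCharPredicate, normalizer);
            }
          }

          // a Solr factory could then do, e.g.:
          // return PredefinedTokenizer.valueOf(nameParam.toUpperCase(Locale.ROOT)).create();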

          dweiss Dawid Weiss added a comment -

          Pretty cool, Uwe!

          "It is fast as hell"

          I always thought hell was about slow and endless suffering?

          thetaphi Uwe Schindler added a comment -

          "I always thought hell was about slow and endless suffering?"

          Um, yes.

          But this video tells you otherwise: https://www.youtube.com/watch?v=Uqa8MFSXZHM
          If you need to burn fat, fast as hell: http://www.amazon.com/ULTIMATE-CUTS-SECRETS-English-Edition-ebook/dp/B00HMQS8TA

          dsmiley David Smiley added a comment -

          +1 Nice Uwe.

          thetaphi Uwe Schindler added a comment -

          New patch with improved Javadocs. Will commit this soon.

          jira-bot ASF subversion and git services added a comment -

          Commit 1712682 from Uwe Schindler in branch 'dev/trunk'
          [ https://svn.apache.org/r1712682 ]

          LUCENE-6879: Allow to define custom CharTokenizer instances without subclassing using Java 8 lambdas or method references

          thetaphi Uwe Schindler added a comment -

          Thanks for the review!

          thetaphi Uwe Schindler added a comment -

          Just FYI: I did a quick microbenchmark like this:

          // init & warmup
          String text = "Tokenizer(Test)FooBar";
          String[] result = new String[] { "tokenizer", "test", "foobar" };
          final Tokenizer tokenizer1 = CharTokenizer.fromTokenCharPredicate(Character::isLetter, Character::toLowerCase);
          for (int i = 0; i < 10000; i++) {
            tokenizer1.setReader(new StringReader(text));
            assertTokenStreamContents(tokenizer1, result);
          }
          final Tokenizer tokenizer2 = new LowerCaseTokenizer();
          for (int i = 0; i < 10000; i++) {
            tokenizer2.setReader(new StringReader(text));
            assertTokenStreamContents(tokenizer2, result);
          }
          
          // speed test
          long [] lens1 = new long[100], lens2 = new long[100]; 
          for (int j = 0; j < 100; j++) {
            System.out.println("Run: " + j);
            long start1 = System.currentTimeMillis();
            for (int i = 0; i < 1000000; i++) {
              tokenizer1.setReader(new StringReader(text));
              assertTokenStreamContents(tokenizer1, result);
            }
            lens1[j] = System.currentTimeMillis() - start1;
            
            long start2 = System.currentTimeMillis();
            for (int i = 0; i < 1000000; i++) {
              tokenizer2.setReader(new StringReader(text));
              assertTokenStreamContents(tokenizer2, result);
            }
            lens2[j] = System.currentTimeMillis() - start2;
          }
          
          System.out.println("Time Lambda: " + Arrays.stream(lens1).summaryStatistics());
          System.out.println("Time Old: " + Arrays.stream(lens2).summaryStatistics());
          

          I was not able to find any speed difference after warmup:

          • Time Lambda: LongSummaryStatistics {count=100, sum=58267, min=562, average=582.670000, max=871}
          • Time Old: LongSummaryStatistics {count=100, sum=61489, min=600, average=614.890000, max=721}
          jira-bot ASF subversion and git services added a comment -

          Commit 1713098 from Uwe Schindler in branch 'dev/trunk'
          [ https://svn.apache.org/r1713098 ]

          LUCENE-6879: Add missing null checks for parameters

          jira-bot ASF subversion and git services added a comment -

          Commit 1713099 from Uwe Schindler in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1713099 ]

          Merge additional null check from LUCENE-6879


            People

            • Assignee: thetaphi Uwe Schindler
            • Reporter: thetaphi Uwe Schindler
            • Votes: 0
            • Watchers: 3
