Solr
SOLR-330

Use new Lucene Token APIs (reuse and char[] buff)

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Lucene is getting new Token APIs for better performance:

      • token reuse
      • char[] offset + len instead of String (see the sketch below)

      This requires a new version of Lucene.
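
      For context, here is a minimal sketch of what the reuse-oriented API
      looks like (method names follow the Lucene 2.3-era Token class; the
      filter itself is hypothetical and not part of this patch):

        import java.io.IOException;
        import org.apache.lucene.analysis.Token;
        import org.apache.lucene.analysis.TokenFilter;
        import org.apache.lucene.analysis.TokenStream;

        // Old style: next() allocated a fresh Token whose text was a String.
        // New style: the caller passes in a Token to reuse, and the text lives
        // in a char[] buffer addressed by an explicit length.
        public class LowerCaseSketchFilter extends TokenFilter {
          protected LowerCaseSketchFilter(TokenStream input) {
            super(input);
          }

          public Token next(Token result) throws IOException {
            result = input.next(result);      // reuse the caller's Token
            if (result == null) return null;
            char[] buf = result.termBuffer(); // operate on the buffer directly
            int len = result.termLength();
            for (int i = 0; i < len; i++) {
              buf[i] = Character.toLowerCase(buf[i]);
            }
            return result;                    // no new Token, no new String
          }
        }
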
      Attachments

      1. SOLR-330.patch (39 kB) - Grant Ingersoll
      2. SOLR-330.patch (54 kB) - Grant Ingersoll
      3. token_filter.patch (1 kB) - Yonik Seeley

          Activity

          Grant Ingersoll added a comment -

          Solr may also benefit from the new TeeTokenFilter and SinkTokenizer. Intelligent use of these by the copy field mechanism MAY speed things up significantly by reducing the amount of redundant analysis work.
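
          A minimal sketch of the idea (hedged: this assumes the Lucene
          2.3-era TeeTokenFilter/SinkTokenizer classes and Field's TokenStream
          constructor; it is an illustration, not code from Solr):

            import java.io.StringReader;
            import java.util.ArrayList;
            import org.apache.lucene.analysis.Analyzer;
            import org.apache.lucene.analysis.SinkTokenizer;
            import org.apache.lucene.analysis.TeeTokenFilter;
            import org.apache.lucene.analysis.TokenStream;
            import org.apache.lucene.document.Document;
            import org.apache.lucene.document.Field;

            // Analyze the source text once; the tee copies each token into the
            // sink, which then backs the copy field without re-running analysis.
            public class CopyFieldSketch {
              static Document build(Analyzer analyzer, String value) {
                SinkTokenizer sink = new SinkTokenizer(new ArrayList());
                TokenStream source = new TeeTokenFilter(
                    analyzer.tokenStream("text", new StringReader(value)), sink);

                Document doc = new Document();
                // The teed field must be consumed first so the sink gets filled.
                doc.add(new Field("text", source));
                doc.add(new Field("text_copy", sink)); // replays captured tokens
                return doc;
              }
            }
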

          Grant Ingersoll added a comment -

          First draft of a patch that updates the various TokenFilters, etc. in Solr to use the new Lucene reuse API. Notes on implementation below:

          Also cleans up some of the javadocs in various files.

          Added a test for the Porter stemmer.

          Cleaned up some string literals to be constants so that they can be safely referred to in the tests.

          In the PatternTokenFilter, it would be cool if there were a way to operate directly on the char array, but I don't see that the Pattern/Matcher API supports it.
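
          For what it's worth, one possible workaround (an assumption on my
          part, not something in the patch): Matcher accepts any CharSequence,
          and java.nio.CharBuffer.wrap can present a char[] range as one
          without copying it to a String:

            import java.nio.CharBuffer;
            import java.util.regex.Matcher;
            import java.util.regex.Pattern;

            public class CharArrayRegex {
              public static void main(String[] args) {
                // In a filter, termBuffer() and termLength() would supply these.
                char[] buf = {'f', 'o', 'o', '1', '2', '3'};
                int len = 6;

                Pattern pattern = Pattern.compile("\\d+");
                // CharBuffer.wrap exposes the range as a CharSequence,
                // so no intermediate String is created.
                Matcher m = pattern.matcher(CharBuffer.wrap(buf, 0, len));
                System.out.println(m.find()); // prints: true
              }
            }
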

          The same goes for the PhoneticTokenFilter.

          I'm not sure yet whether the BufferedTokenStream subclasses can take advantage of reuse, so I have left them alone for now, other than some minor doc fixes. I will think about this some more.

          In RemoveDuplicatesTF, I only converted to using termBuffer, not Token reuse. I removed the "IN" and "OUT" loop labels, as I don't see what functionality they provide.

          Added an ArraysUtils class and test to provide a bit more functionality than Arrays.java offers for comparing two char arrays. This could be expanded at some point to cover other primitive comparisons.
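
          The helper's shape might look roughly like this (a hypothetical
          reconstruction for illustration; the real class is in the patch):

            // java.util.Arrays.equals compares whole arrays only, so a
            // range-based comparison is needed for termBuffer contents.
            public class ArraysUtils {
              /** Compares left[offset..offset+length) with right[0..length). */
              public static boolean equals(char[] left, int offset,
                                           char[] right, int length) {
                if (left == null || right == null) return false;
                if (offset + length > left.length || length > right.length) return false;
                for (int i = 0; i < length; i++) {
                  if (left[offset + i] != right[i]) return false;
                }
                return true;
              }
            }
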

          My understanding of the new reusableTokenStream is that we can't use it in the SolrAnalyzer.

          On the TrimFilter, it is not clear to me that a token would ever be all whitespace. However, since the test handles it, I wonder why a Token of " ", when offset updating is enabled, reports the offsets as the end and not the start. Just a minor nit, but it seems like the start/end offsets should be 0, not the end of the token.
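
          A simplified sketch of buffer-based trimming (not the patch
          verbatim) also suggests where that offset behavior comes from:

            import java.io.IOException;
            import org.apache.lucene.analysis.Token;
            import org.apache.lucene.analysis.TokenFilter;
            import org.apache.lucene.analysis.TokenStream;

            public class TrimSketchFilter extends TokenFilter {
              private final boolean updateOffsets;

              protected TrimSketchFilter(TokenStream input, boolean updateOffsets) {
                super(input);
                this.updateOffsets = updateOffsets;
              }

              public Token next(Token result) throws IOException {
                result = input.next(result);
                if (result == null) return null;

                char[] buf = result.termBuffer();
                int len = result.termLength();
                int start = 0;
                while (start < len && Character.isWhitespace(buf[start])) start++;
                int end = len;
                while (end > start && Character.isWhitespace(buf[end - 1])) end--;

                // Shift the kept region to the front of the shared buffer.
                if (start > 0) System.arraycopy(buf, start, buf, 0, end - start);
                result.setTermLength(end - start);

                if (updateOffsets) {
                  // For an all-whitespace token, 'start' scans clear to 'len',
                  // so both offsets land at the END of the token - which may be
                  // exactly the behavior questioned above.
                  int origStart = result.startOffset();
                  result.setStartOffset(origStart + start);
                  result.setEndOffset(origStart + end);
                }
                return result;
              }
            }
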

          I'm not totally sure about the WordDelimiterFilter, as there is a fair amount of new token creation. Also, I think the newTok() method doesn't set the position increment based on the original position increment, so I added that.

          I'm also not completely sure how to handle FieldType's DefaultAnalyzer.next(). It seems like it could reuse the token.

          I'm also not sure why the MultiValueTokenStream code is duplicated in HighlighterUtils and SolrHighlighter, so I left the highlighter TokenStreams alone.

          Grant Ingersoll added a comment -

          Note: this patch also includes SOLR-468.

          Yonik Seeley added a comment -

          > WordDelimiterFilter, as there is a fair amount of new token creation, Also, I think, the newTok() method doesn't set the position increment based on the original position increment, so I added that.

          Careful, that might introduce a bug... the position increments in WDF are generally set elsewhere, and setting it in newTok might cause more than one token to get that increment.
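
          To make the hazard concrete (a hypothetical illustration, not the
          actual WDF code): suppose a stopword was removed just before
          "Wi-Fi", so it arrives with positionIncrement == 2, and WDF splits
          it into "Wi" and "Fi".

            public class PosIncrHazard {
              public static void main(String[] args) {
                int prev = 1;          // position of the token before "Wi-Fi"

                // Correct: only the first sub-token carries the gap.
                int wi = prev + 2;     // "Wi" lands at position 3
                int fi = wi + 1;       // "Fi" lands at position 4

                // Buggy: newTok() copies posIncr=2 into every sub-token.
                int wiBad = prev + 2;  // "Wi" at position 3
                int fiBad = wiBad + 2; // "Fi" at position 5; later tokens shift too

                System.out.println(wi + "," + fi + " vs " + wiBad + "," + fiBad);
              }
            }
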

          Grant Ingersoll added a comment -

          OK, I removed the WDF setPositionIncrement addition; otherwise the patch is as before, minus the Capitalization stuff that has already been committed.

          I plan to commit tomorrow or Monday.

          Grant Ingersoll added a comment -

          Committed revision 643465.

          Yonik Seeley added a comment -

          Attaching token_filter.patch, a minor update to the synonym filter and WDF to prevent extra token creation.


            People

            • Assignee: Grant Ingersoll
            • Reporter: Yonik Seeley
            • Votes: 0
            • Watchers: 0
