Lucene - Core
  1. Lucene - Core
  2. LUCENE-2372

Replace deprecated TermAttribute by new CharTermAttribute

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 3.1
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      After LUCENE-2302 is merged to trunk with flex, we need to carry over all tokenizers and consumers of the TokenStreams to the new CharTermAttribute.

      We should also think about adding a AttributeFactory that creates a subclass of CharTermAttributeImpl that returns collation keys in toBytesRef() accessor. CollationKeyFilter is then obsolete, instead you can simply convert every TokenStream to indexing only CollationKeys by changing the attribute implementation.

      1. ASF.LICENSE.NOT.GRANTED--LUCENE-2372.patch
        87 kB
        Uwe Schindler
      2. ASF.LICENSE.NOT.GRANTED--LUCENE-2372.patch
        88 kB
        Uwe Schindler
      3. LUCENE-2372_contrib_solr.patch
        274 kB
        Robert Muir
      4. LUCENE-2372_contrib.patch
        246 kB
        Robert Muir
      5. LUCENE-2372_contrib.patch
        222 kB
        Robert Muir
      6. LUCENE-2372_contrib.patch
        133 kB
        Robert Muir
      7. LUCENE-2372_contrib.patch
        64 kB
        Robert Muir
      8. LUCENE-2372.patch
        83 kB
        Uwe Schindler
      9. LUCENE-2372.patch
        82 kB
        Uwe Schindler
      10. LUCENE-2372.patch
        28 kB
        Uwe Schindler

        Issue Links

          Activity

          Hide
          Uwe Schindler added a comment -

          Here a first patch for the core tokenstreams. Tests not yet changed.

          The following things were additionally fixed:

          • StandardAnalyzer was made final (backwards break, we forgot to made it final in the 3.0 TS finalization issue). This enabled me to subclass StopwordAnalyzerBase and remove heavy code duplication. The original code also contained a bug in the tokenStream method (no setReplaceInvalidAcronym) which was correctin reusableTokenStream. Now it is correct.

          I will post further patches for core.

          Show
          Uwe Schindler added a comment - Here a first patch for the core tokenstreams. Tests not yet changed. The following things were additionally fixed: StandardAnalyzer was made final (backwards break, we forgot to made it final in the 3.0 TS finalization issue). This enabled me to subclass StopwordAnalyzerBase and remove heavy code duplication. The original code also contained a bug in the tokenStream method (no setReplaceInvalidAcronym) which was correctin reusableTokenStream. Now it is correct. I will post further patches for core.
          Hide
          Uwe Schindler added a comment -

          Patch that removes deprecated usage of TermAttribute from Lucene Core completely, all tests also fixed.

          Show
          Uwe Schindler added a comment - Patch that removes deprecated usage of TermAttribute from Lucene Core completely, all tests also fixed.
          Hide
          Uwe Schindler added a comment -

          Small updates.

          Just one question: The only non-final Analyzer left is KeywordAnalyzer. If I make it final and also use ReusableTokenizerBase, we can remove the overridesTokenStream check completely? The question is, whoever wants to override this class.

          StandardAnalyzer was made final in this patch, why not also this one?

          Show
          Uwe Schindler added a comment - Small updates. Just one question: The only non-final Analyzer left is KeywordAnalyzer. If I make it final and also use ReusableTokenizerBase, we can remove the overridesTokenStream check completely? The question is, whoever wants to override this class. StandardAnalyzer was made final in this patch, why not also this one?
          Hide
          Mark Miller added a comment -

          If I make it final and

          +1 - lets just remember to add these breaks to the CHANGES BW break section...

          Show
          Mark Miller added a comment - If I make it final and +1 - lets just remember to add these breaks to the CHANGES BW break section...
          Hide
          Michael McCandless added a comment -

          +1 to making KeywordAnalyzer final.

          Show
          Michael McCandless added a comment - +1 to making KeywordAnalyzer final.
          Hide
          Uwe Schindler added a comment -

          Did it already for StandardAna (see patch).

          Show
          Uwe Schindler added a comment - Did it already for StandardAna (see patch).
          Hide
          Uwe Schindler added a comment -

          One more: PerFieldAnalyzerWrapper - Sorry

          Show
          Uwe Schindler added a comment - One more: PerFieldAnalyzerWrapper - Sorry
          Hide
          Uwe Schindler added a comment -

          Updated patch, now also KeywordAnalyzer and PerFieldAnalyzerWrapper made final and the backwards layer removed.

          I will commit this later this day and proceed with contrib. Robert, we should talk who does which one!

          Show
          Uwe Schindler added a comment - Updated patch, now also KeywordAnalyzer and PerFieldAnalyzerWrapper made final and the backwards layer removed. I will commit this later this day and proceed with contrib. Robert, we should talk who does which one!
          Hide
          Uwe Schindler added a comment -

          Updated patch after last commit.

          Show
          Uwe Schindler added a comment - Updated patch after last commit.
          Hide
          Uwe Schindler added a comment -

          Committed core part in revision: 932749

          Show
          Uwe Schindler added a comment - Committed core part in revision: 932749
          Hide
          Robert Muir added a comment -

          here is a start for the remaining ones (what was in contrib/analyzers). theres a few more i didnt do yet, but most are done.

          Show
          Robert Muir added a comment - here is a start for the remaining ones (what was in contrib/analyzers). theres a few more i didnt do yet, but most are done.
          Hide
          Robert Muir added a comment -

          i converted some more usages here... there is still a lot more to do.

          separately, we should open another issue (if there isnt one) for use of other preflex apis in contrib, as i see some uses of TermEnum etc just fixing these up.

          Show
          Robert Muir added a comment - i converted some more usages here... there is still a lot more to do. separately, we should open another issue (if there isnt one) for use of other preflex apis in contrib, as i see some uses of TermEnum etc just fixing these up.
          Hide
          Michael McCandless added a comment -

          separately, we should open another issue (if there isnt one) for use of other preflex apis in contrib, as i see some uses of TermEnum etc just fixing these up.

          We have LUCENE-2378 open for this – I gotta get to it!

          Show
          Michael McCandless added a comment - separately, we should open another issue (if there isnt one) for use of other preflex apis in contrib, as i see some uses of TermEnum etc just fixing these up. We have LUCENE-2378 open for this – I gotta get to it!
          Hide
          Robert Muir added a comment -

          ok, this is the rest of it, minus anything tricky.

          i will leave the hard stuff to uwe

          Show
          Robert Muir added a comment - ok, this is the rest of it, minus anything tricky. i will leave the hard stuff to uwe
          Hide
          Robert Muir added a comment -

          i updated the patch with the problematic ones... we can fix their other problems on LUCENE-2378

          for example, convert MemoryIndex to use byte[], in this patch i only switch it to consume with BytesRef, but it calls utf8ToString() since the whole thing still works with String.

          Show
          Robert Muir added a comment - i updated the patch with the problematic ones... we can fix their other problems on LUCENE-2378 for example, convert MemoryIndex to use byte[], in this patch i only switch it to consume with BytesRef, but it calls utf8ToString() since the whole thing still works with String.
          Hide
          Robert Muir added a comment -

          i updated the patch to remove all uses of deprecated CharTermAttributeImpl methods, such as via Token.term() etc.

          with this patch, we can then remove TermAttribute from trunk completely, by removing deprecations and the sophisticated backwards layer in the indexer.

          Show
          Robert Muir added a comment - i updated the patch to remove all uses of deprecated CharTermAttributeImpl methods, such as via Token.term() etc. with this patch, we can then remove TermAttribute from trunk completely, by removing deprecations and the sophisticated backwards layer in the indexer.
          Hide
          Robert Muir added a comment -

          Committed LUCENE-2372_contrib_solr.patch revisions 950008 (trunk) / 950026 (3x)

          Show
          Robert Muir added a comment - Committed LUCENE-2372 _contrib_solr.patch revisions 950008 (trunk) / 950026 (3x)
          Hide
          Robert Muir added a comment -

          everything is cutover to the new attribute now

          Show
          Robert Muir added a comment - everything is cutover to the new attribute now
          Hide
          Grant Ingersoll added a comment -

          Bulk close for 3.1

          Show
          Grant Ingersoll added a comment - Bulk close for 3.1

            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Uwe Schindler
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development