Lucene - Core
  LUCENE-2302

Replacement for TermAttribute+Impl with extended capabilities (byte[] support, CharSequence, Appendable)

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      For flexible indexing, terms can be simple byte[] arrays, while the current TermAttribute only supports char[]. This is fine for plain text, but e.g. NumericTokenStream should work directly on the byte[] array.
      TermAttribute also lacks some interfaces that would make it simpler for users to work with terms: Appendable and CharSequence.

      I propose to create a new interface "CharTermAttribute" with a clean new API that concentrates on CharSequence and Appendable.
      The implementation class will simply support both the old and the new interface, working on the same term buffer. DEFAULT_ATTRIBUTE_FACTORY will take care of this: if somebody adds a TermAttribute, he will get an implementation class that can also be used as a CharTermAttribute. Because both attributes create the same impl instance, both calls to addAttribute return the same object. So a TokenFilter that adds CharTermAttribute to the source will work with the same instance as a Tokenizer that requested the (deprecated) TermAttribute.

      To also support byte[]-only terms, as Collation or NumericField need, a separate getter-only interface will be added that returns a reusable BytesRef, e.g. BytesRefGetterAttribute. The default implementation class will also support this interface. For backwards compatibility with old self-made TermAttribute implementations, the indexer will check with hasAttribute() whether the BytesRef getter interface is present; if not, it will wrap the old-style TermAttribute in a deprecated wrapper class: new BytesRefGetterAttributeWrapper(TermAttribute), which the indexer then uses.
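      The proposed CharSequence/Appendable surface can be sketched as follows. This is a minimal illustrative sketch only, not Lucene's actual CharTermAttributeImpl (the class name and growth policy are invented): one char[] term buffer, exposed for reading through CharSequence and for in-place writing through Appendable.

```java
// Illustrative sketch -- NOT Lucene's CharTermAttributeImpl.
// A single char[] term buffer with a read view (CharSequence)
// and a write view (Appendable), as the issue proposes.
class CharTermSketch implements CharSequence, Appendable {
    private char[] buf = new char[16];
    private int len = 0;

    private void grow(int extra) {
        if (len + extra > buf.length) {
            buf = java.util.Arrays.copyOf(buf, Math.max(buf.length * 2, len + extra));
        }
    }

    // Appendable: write directly into the shared term buffer.
    public CharTermSketch append(CharSequence csq) {
        if (csq == null) csq = "null";            // Appendable contract for null
        return append(csq, 0, csq.length());
    }
    public CharTermSketch append(CharSequence csq, int start, int end) {
        if (csq == null) csq = "null";
        grow(end - start);
        for (int i = start; i < end; i++) buf[len++] = csq.charAt(i);
        return this;
    }
    public CharTermSketch append(char c) {
        grow(1);
        buf[len++] = c;
        return this;
    }

    // Reset the length but keep the buffer allocated for reuse.
    CharTermSketch setEmpty() { len = 0; return this; }

    // CharSequence: read-only view over the valid region.
    public int length() { return len; }
    public char charAt(int index) { return buf[index]; }
    public CharSequence subSequence(int start, int end) {
        return new String(buf, start, end - start);  // independent copy
    }
    @Override public String toString() { return new String(buf, 0, len); }
}
```

      A filter could then build a term with chained append() calls and pass the attribute anywhere a CharSequence is accepted, without intermediate String allocations.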

      1. LUCENE-2302-toString.patch
        8 kB
        Uwe Schindler
      2. LUCENE-2302.patch
        53 kB
        Uwe Schindler
      3. LUCENE-2302.patch
        52 kB
        Uwe Schindler
      4. LUCENE-2302.patch
        44 kB
        Uwe Schindler
      5. LUCENE-2302.patch
        44 kB
        Uwe Schindler
      6. LUCENE-2302.patch
        44 kB
        Uwe Schindler

          Activity

          Uwe Schindler added a comment -

          Committed revision: 932369

          Uwe Schindler added a comment -

          Patch that fixes the toString() problems in Token and adds missing CHANGES.txt, fixes backwards tests and updates javadocs to document the "backwards" break.

          Deprecating Token should be done in another issue.

          I will commit this soon, to be able to go forward with tokenstream conversion!

          Robert Muir added a comment -

          I will create a patch with option #2 and lots of documentation and changed backwards tests.

          +1. what do we think of deprecating it too?

          I complained about supporting an old "token-like" api easily a while ago, but you added AttributeSource.copyTo,
          which works well (see Solr SynonymsFilter) and does not have Token's limitations.

          Uwe Schindler added a comment -

          I will create a patch with option #2 and lots of documentation and changed backwards tests.

          Robert Muir added a comment -

          Token now implements CharSequence but violates its contract

          I don't think this is correct. I don't care at all about Token's .toString method, but I care that the analysis API isn't broken.
          If we do this, then the analysis API is completely wrong when using a TokenAttributeFactory.
          In my opinion we should do one of the following two things in the backwards compatibility section, but not break the analysis API:

          1. Token and TokenAttributeFactory was completely removed due to its backwards compatibility problems.
          2. Token's toString method was changed to match the CharSequence interface.
          Uwe Schindler added a comment -

          Will add the javadocs and think about the CharSequence problems again. It's tricky.

          I have less time at the moment; will hopefully get to it by the weekend. The same for LUCENE-2354, which needs some test rewriting.

          Michael McCandless added a comment -

          Uwe is this issue done?

          Uwe Schindler added a comment -

          Add note to backwards compatibility section:

          • TermAttribute's toString() behaviour has changed
          • Token now implements CharSequence but violates its contract
          Uwe Schindler added a comment -

          Was accidentally committed with a merge. Sorry.

          Revision: 924791

          Uwe Schindler added a comment -

          How about instead of TermToBytesRefAttribute we name it TermBytesAttribute? (Ie, drop the "To" and "Ref").

          This attribute is special: it only has this getter for the BytesRef.

          If we need a real "BytesTermAttribute" it should be explicitly defined. Still open: NumericTokenStream and so on.

          Michael McCandless added a comment -

          I like the name CharTermAttribute.

          How about instead of TermToBytesRefAttribute we name it TermBytesAttribute? (Ie, drop the "To" and "Ref").

          Uwe Schindler added a comment -

          Updated patch for current flex HEAD. The backwards tests still need to be fixed.

          How do we want to proceed?

          • Name of the new Attribute?
          • Is the new CharSequence/Appendable API fine?
          • setEmpty()?

          Thanks for reviewing!

          Michael McCandless added a comment -

          Patch looks great Uwe! Great simplification that indexer only deals
          in byte[] now for a term. It's agnostic as to whether those bytes are
          utf8 or something else. And it means analyzer chains can now do
          direct binary terms (eg NumericTokenStream).

          Some day... we should also try to have indexer not be responsible for
          creating the TokenStream. Ie it should simply receive, always, an
          AttrSource for a field that needs to be indexed. This puts nice
          distance b/w indexer core and analysis – indexer is then fully
          agnostic to how that AttrSource came to be.

          I see "noncommit" – can you rename to "nocommit"? Let's try to be
          consistent.

          Maybe reword "The given AttributeSource has no term attribute" -->
          "Could not find a term attribute (that implements
          TermToBytesAttribute) in the AttributeSource"?

          I think we should rename TermsHashPerField's utf8 var (and in the per
          thread) – it's now just bytes, not necessarily utf8. Maybe termBytes?

          When you temporarily override the length of a too-long term, maybe
          restore it in a try/finally?
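          The try/finally suggestion above could look like this minimal sketch (TermBuffer and indexClamped are hypothetical names for illustration, not indexer code): the term's length is clamped only for the duration of the indexing step, and the finally block guarantees the original length is restored even when indexing throws.

```java
// Hypothetical sketch of the suggested pattern: temporarily override the
// length of a too-long term, and restore it in a finally block so the
// original value survives even if the indexing step throws.
class ClampSketch {
    static class TermBuffer {
        char[] buf = new char[512];
        int length;
    }

    static void indexClamped(TermBuffer term, int maxLen, Runnable indexStep) {
        final int savedLength = term.length;
        if (term.length > maxLen) term.length = maxLen;  // temporary override
        try {
            indexStep.run();            // sees the (possibly truncated) term
        } finally {
            term.length = savedLength;  // always restored
        }
    }
}
```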

          Uwe Schindler added a comment -

          Removed usage of deprecated API in Token, also javadocs updated.

          Uwe Schindler added a comment -

          Another update: removed the whole lazy-init stuff, so initTermBuffer() and the additional checks are gone.

          Uwe Schindler added a comment -

          New patch, updated for trunk:

          • CharTermAttributeImpl.subSequence now uses new String instead of CharBuffer.wrap, as it must return a "new sequence", not a wrapped one
          • Optimized append() to use CharBuffer's backing array; otherwise it just iterates over the sequence.
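          The subSequence change matters because CharBuffer.wrap returns a live view of the term buffer, while new String takes a snapshot. A small self-contained demo (class name is illustrative):

```java
import java.nio.CharBuffer;

// Demonstrates why subSequence must return an independent sequence:
// a CharBuffer.wrap view reflects later mutations of the underlying
// term buffer, while new String does not.
class SubSequenceDemo {
    static String[] run() {
        char[] buf = {'h', 'e', 'l', 'l', 'o'};
        CharSequence view = CharBuffer.wrap(buf, 0, 3);  // live view over buf
        CharSequence copy = new String(buf, 0, 3);       // independent snapshot
        buf[0] = 'X';  // the token stream reuses/overwrites the buffer
        return new String[] { view.toString(), copy.toString() };
    }
}
```

          After the mutation the wrapped view reads "Xel" while the String copy still reads "hel", which is why subSequence must copy.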
          Uwe Schindler added a comment -

          Here a first patch.

          To discuss:

          • Names of classes/interfaces
          • API finalization

          The following applies to the patch:

          • All tests pass
          • You can use old and new TermAttribute in the same TokenStream!
          • If somebody implemented an own TermAttributeImpl (old style), TermsHashPerField wraps the old attribute using a helper class (per thread). – Tests are still missing, but this worked when commenting out some code that used the new API.

          Currently backwards tests fail because of some checks that are not valid anymore (toString() output of TermAttribute and TestAttributeSource instanceof checks).

          Michael McCandless added a comment -

          A CollationFilter will not be needed anymore after that change, as any Tokenizer-Chain that wants to use collation can simply supply a special AttributeFactory to the ctor, that creates a special TermAttributeImpl class with modified getBytesRef().

          Mike M. noted on LUCENE-1435 that the way to do "internal-to-indexing" collation is to store the original string in the term dictionary, sorted via user-specifiable collation.

          This works in flex now – every field can specify its own Comparator (via the Codec).

          But collation during indexing also works... and it'd work better with these changes (no need to use indexable binary strings).

          Michael McCandless added a comment -

          I like that this change would mean indexer has a single getBytes interface for getting the terms data.

          It'd mean the UTF16->UTF8 encoding it now does would move into CharTermAttr, hidden to the indexer.

          So the indexer only ever works with opaque byte[] data for terms.

          And it'd mean others can make their own term attrs – maybe my terms are shorts and I send 2 bytes to indexer per term.

          Steve Rowe added a comment -

          A CollationFilter will not be needed anymore after that change, as any Tokenizer-Chain that wants to use collation can simply supply a special AttributeFactory to the ctor, that creates a special TermAttributeImpl class with modified getBytesRef().

          Mike M. noted on LUCENE-1435 that the way to do "internal-to-indexing" collation is to store the original string in the term dictionary, sorted via user-specifiable collation.

          Michael Busch added a comment -

          Hmm maybe this is too much magic? Wouldn't it be simpler to have two completely separate attributes? E.g. CharTermAttribute and ByteTermAttribute. Plus an API in the indexer that specifies which one to use?

          Uwe Schindler added a comment -

          A CollationFilter will not be needed anymore after that change, as any Tokenizer-Chain that wants to use collation can simply supply a special AttributeFactory to the ctor, that creates a special TermAttributeImpl class with modified getBytesRef().


            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Uwe Schindler
            • Votes: 0
            • Watchers: 0
