Lucene - Core
LUCENE-5989

Allow StringField to take BytesRef value, to index a single binary token

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.2, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      5 years ago (LUCENE-1458) we "enabled" fully binary terms in the
      lowest levels of Lucene (the codec APIs), yet today, actually adding an
      arbitrary byte[] binary term during indexing is far from simple: you
      must make a custom Field with a custom TokenStream and a custom
      TermToBytesRefAttribute, as far as I know.

      This is supremely expert, I wonder if anyone out there has succeeded
      in doing so?

      I think we should make indexing a single byte[] as simple as indexing
      a single String.

      This is a precursor for issues like LUCENE-5596 (encoding IPv6
      address as byte[16]) and LUCENE-5879 (encoding native numeric values
      in their simple binary form).
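      For context on the LUCENE-5596 use case: obtaining the byte[16] itself is trivial with the JDK alone; the hard part this issue targets is getting that byte[] into IndexWriter. A minimal sketch (the class name is hypothetical, not part of Lucene):

      ```java
      import java.net.InetAddress;

      public class Ipv6Bytes {
          // InetAddress parses both IPv4 (4 bytes) and IPv6 (16 bytes) literals
          // and exposes the raw address bytes directly.
          public static byte[] encode(String address) throws Exception {
              return InetAddress.getByName(address).getAddress();
          }

          public static void main(String[] args) throws Exception {
              System.out.println(Ipv6Bytes.encode("2001:db8::1").length); // 16
          }
      }
      ```

      Those 16 bytes are exactly the "single binary token" this issue wants to index as easily as a String.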

      1. LUCENE-5989.patch
        34 kB
        Michael McCandless
      2. LUCENE-5989.patch
        16 kB
        Michael McCandless

        Activity

        Michael McCandless added a comment -

        Patch, adding BinaryField... the field types are the same as StringField so
        we could maybe instead just add a ctor for StringField taking
        BytesRef...

        Shai Erera added a comment - edited

        we could maybe instead just add a ctor for StringField taking BytesRef...

        We could also rename StringField to KeywordField, making it more obvious that this field isn't tokenized. Then a KeywordField can take a String or BytesRef in ctors.

        Uwe Schindler added a comment -

        We could also rename StringField to KeywordField, making it more obvious that this field isn't tokenized. Then a KeywordsField can take a String or BytesRef in ctors.

        +1

        Patch, adding BinaryField

        I don't like the violation that clear() is a no-op in BytesTermAttribute. In a correct world, this should null the bytesref and the TokenStream should set the BytesRef after clearAttributes.

        This is not urgent here, but it violates the contract. I know NumericTermAttribute does similar things...
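        The contract Uwe describes can be illustrated with a hypothetical stand-in (these are not Lucene's actual classes): clear() must wipe the attribute's state, and the stream re-populates it on each incrementToken().

        ```java
        public class ClearContract {
            // Toy attribute: clear() drops state, honoring the contract.
            public static class BytesAttr {
                public byte[] bytes;
                public void clear() { bytes = null; }
                public void set(byte[] b) { bytes = b; }
            }

            // Toy single-token stream: re-instates the value after clearing.
            public static class OneTokenStream {
                public final BytesAttr attr = new BytesAttr();
                private final byte[] pending;
                private boolean exhausted;
                public OneTokenStream(byte[] value) { pending = value; }
                public void reset() { exhausted = false; }
                public boolean incrementToken() {
                    attr.clear();          // honor the contract first
                    if (exhausted) return false;
                    attr.set(pending);     // then re-populate for this token
                    exhausted = true;
                    return true;
                }
            }

            public static void main(String[] args) {
                OneTokenStream ts = new OneTokenStream(new byte[]{1, 2, 3});
                ts.reset();
                System.out.println(ts.incrementToken()); // true, attr populated
                System.out.println(ts.incrementToken()); // false, attr cleared
            }
        }
        ```

        A no-op clear() skips the first step, which is the violation being pointed out.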

        Jack Krupansky added a comment -

        rename StringField to KeywordField, making it more obvious that this field isn't tokenized. Then a KeywordsField can take a String or BytesRef in ctors.

        Both Lucene and Solr are suffering from a conflation of the two concepts of treating an input stream as a single token ("a keyword") and as a sequence of tokens ("sequence of keywords"). We have the "KeywordTokenizer" that does NOT tokenize the input stream into "a sequence of keywords". The term "keyword search" is commonly used to describe the ability of search engines to find "individual keywords" in extended streams of "text" - a clear reference to "keyword" in a tokenized stream.

        So, I don't understand how it is claimed that naming StringField to KeywordField is making anything "obvious" - it seems to me to be adding to the existing confusion rather than clarifying anything. I mean, the term "keyword" should be treated more as a synonym for "token" or "term", NOT as synonym for "string" or "raw character sequence".

        I agree that we need a term for "raw, uninterpreted character sequence", but it seems to me that "string" is a more "obvious" candidate than "keyword".

        There has been some grumbling at the Solr level that KeywordTokenizer should be renamed to... something, anything, but just not KeywordTokenizer, which "obviously" implied that the input stream will be tokenized into a sequence of keywords, which it does not.

        In an effort to resolve this ongoing confusion, can somebody provide some historical background as to how KeywordTokenizer got its name, and why a subset of people continue to refer to an uninterpreted sequence of characters as a "keyword" rather than a string? I checked the Javadoc, Jira, and even the source code, but came up empty.

        In short, it is a real eye-opener to see a claim that the term "keyword" in any way makes it "obvious" that input is not tokenized!!

        Maybe we could fix this for 5.0 to have a cleaner set of terminology going forward. At a minimum, we should have some clarifying language in the Javadoc. And hopefully we can refrain from making the confusion/conflation worse by renaming StringField to KeywordField.

        Then a KeywordsField can take a String

        Is that simply a typo or is the intent to have both a KeywordField (singular) and a KeywordsField (plural)? I presume it is a typo, but... maybe it's a Freudian slip and highlights this semantic difficulty that persists in the Lucene terminology (and hence infects Solr terminology as well.)

        Robert Muir added a comment -

        This is supremely expert, I wonder if anyone out there has succeeded
        in doing so?

        So the solution is to proceed and make matters worse by requiring the user to also deal with the .document API?

        If the user wants their field to work with various query-time features (queryparser, morelikethis, whatever), then they must deal with the tokenstream side anyway, so adding *Field doesn't help anything. It just adds yet another place they must plug in "schema" information (as opposed to only being once in Analyzer). Sure, it's easier to get past IndexWriter maybe, but you win the battle and lose the war.

        I'm not going to try to block the change, just please, please, please think about it.

        Shai Erera added a comment -

        Is that simply a typo

        Yes, fixed.

        The term 'keyword' is of course overloaded here. When I propose KeywordField, I am following the existing Keyword* classes that we have: KeywordTokenizer, KeywordAnalyzer, KeywordAttribute. And from what I remember, when users ask how to parse 'keywords' they indexed as StringFields, we often tell them to use PerFieldAnalyzerWrapper with a KeywordAnalyzer for that field. That's why I feel that KeywordField fits better with the overall Keyword* tokenstream API.

        David Smiley added a comment -

        This is supremely expert, I wonder if anyone out there has succeeded in doing so?

        org.apache.lucene.spatial.prefix.CellTokenStream, though this doesn't count since it's in Lucene.

        +1 to make this easier via a BinaryField. With BinaryField and auto-prefixing, CellTokenStream won't be needed for indexing a point. But it's needed for other shapes and to support heat-map style faceting.

        Jack's opinion about the "Keyword" name being far from obvious really resonated with me. Despite Shai's reasonable explanation, it doesn't seem to me that changing the status-quo to anything non-obvious is helpful. And it wouldn't seem like the text equivalent of BinaryField – for that the current name is perfect, I think. But I do like the idea of simply having StringField taking a byte[] too such that there is no BinaryField. Either way.

        Michael McCandless added a comment -

        I have also never understood the origin of "keyword" meaning the
        entire string is treated as one token. I don't think it's obvious.
        It is consistent with the existing KeywordAnalyzer/Tokenizer, but I
        don't think that's a good justification to further propagate non-obvious
        naming. I would rather rename KeywordTokenizer/Analyzer to
        something else...

        I guess net/net I would prefer here that we not add BinaryField and
        instead keep the name StringField, just giving it another ctor to take
        byte[]/BytesRef. Added classes have an API cost higher than just an
        added ctor, and the "purpose" of these two is exactly the same...

        I don't like the violation that clear() is a no-op in BytesTermAttribute. In a correct world, this should null the bytesref and the TokenStream should set the BytesRef after clearAttributes.

        Thanks Uwe, I'll add a nocommit to somehow fix it ... seems like
        ByteTermAttributeImpl.clear must null out its copy of the bytes, and
        then BinaryTokenStream.reset must re-instate the next one (pulling it
        via the previous setValue call?). I guess I must add
        BinaryTokenStream.bytes too? Our analysis APIs are ... challenging.

        So the solution is to proceed and make matters worse by requiring the user to also deal with the .document API?

        But if you can't even figure out how to get your IPv6 byte[]
        (LUCENE-5596) or your numeric value encoded as byte[4] or byte[8]
        (LUCENE-5879) into Lucene's IndexWriter in the first place, how will
        you even have any hope of querying it?

        Robert Muir added a comment -

        Why jump to the conclusion that a user would have a byte[] already for an IP address? That's a horrible representation. Why wouldn't they pass java.net.Inet6Address?

        I'm just saying that if they then go do the following in queryparser, why can't it please work? (ranges too)

        ... AND address:"1.2::3.4" 
        

        Otherwise, if we don't want to make binary/numeric/etc fields "first class", and only treat them as bastardizations of text fields, then please, do this consistently everywhere, parse them as text everywhere, so that they will work correctly everywhere.

        Michael McCandless added a comment -

        Why jump to the conclusion that a user would have a byte[] already for an IP address? Thats a horrible representation Why wouldn't they pass java.net.Inet6Address?

        I agree: taking Inet6Address would be great, but under the hood it should be indexed as a byte[] (which this issue is trying to enable), right? I'll go reopen LUCENE-5596...

        I'm just saying that if they then go do the following in queryparser, why can't it please work? (ranges too)

        +1, but that's really out of scope here? I mean I don't think we can solve all of Lucene's "schema" issues here. Seems like LUCENE-5596 should make it easy to do IP address range querying...

        Michael McCandless added a comment -

        I think now that LUCENE-5879 is in, as a dark/unusable feature, we should add BinaryField (this issue) to at least shed a bit of light on it.

        With BinaryField, apps can efficiently prefix and range search anything they can convert to/from byte[], e.g. BigInteger/Decimal, InetAddress (LUCENE-5596), int/long/float/double, etc. On the LUCENE-6005 branch there is also a half-precision float (2 bytes).

        I don't think Lucene's lack of schema is a justifiable reason to block progress here.
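        As an aside on the "convert to/from byte[]" point: for signed numeric types the usual trick (sketched here with hypothetical helpers, not Lucene's actual encoding classes) is to flip the sign bit so the big-endian byte encoding sorts in numeric order when compared as unsigned bytes, which is the property a binary term needs for range and prefix search.

        ```java
        public class SortableLong {
            // XOR with Long.MIN_VALUE flips the sign bit, mapping the signed
            // range onto the unsigned range while preserving numeric order.
            public static byte[] encode(long v) {
                long flipped = v ^ Long.MIN_VALUE;
                byte[] out = new byte[8];
                for (int i = 7; i >= 0; i--) {
                    out[i] = (byte) (flipped & 0xFF);
                    flipped >>>= 8;
                }
                return out;
            }

            // Unsigned lexicographic comparison, as term bytes are compared.
            public static int compareUnsigned(byte[] a, byte[] b) {
                for (int i = 0; i < 8; i++) {
                    int d = (a[i] & 0xFF) - (b[i] & 0xFF);
                    if (d != 0) return d;
                }
                return 0;
            }

            public static void main(String[] args) {
                // -5 must sort before 3 in byte order, just as it does numerically.
                System.out.println(compareUnsigned(encode(-5L), encode(3L)) < 0); // true
            }
        }
        ```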

        Michael McCandless added a comment -

        I'll switch to just adding a StringField ctor that takes a BytesRef ... less API.

        Robert Muir added a comment -

        If we fix this .document api to allow a StringField to have a binary value, maybe it could help with merge code.

        Currently the StoredFieldsVisitor returns strings as java.lang.String, which is wasteful for the default merge implementation (it must decode/re-encode). If we could remove this and let the visitor deal with it, default merge could avoid this decode/re-encode and we might be able to even nuke some specialized bulk merge logic that we have solely for reasons like this (at the least we will speed up the worst case). I tried to look at this recently and the .document api stopped me.

        Not something we have to fix here, but just something related to think about when looking at how to change it.

        Michael McCandless added a comment -

        If we fix this .document api to allow a StringField to have a binary value, maybe it could help with merge code.

        This would be very nice ... I struggled some with it, but got stuck with StorableField.stringValue() returning String. I think we need to keep that because that's also the API apps use to retrieve their stored fields.

        But the default merging operates on StorableDocument/StorableField, so I'm not sure how to separate the two. Really there are two concepts: the "schema" for this doc (did it store a binary or string value for this field), and what's used to represent a string value (byte[] vs String), and both concepts are being smooshed together into this API.

        Maybe we could baby step here, and just change StoredFieldVisitor.stringField to take byte[]? I know this doesn't help all the stupid work we do during default merge to decode/encode but at least it's a start ...
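        The overhead being discussed amounts to a UTF-8 decode/re-encode round trip per stored string during merge. A minimal illustration of what the visitor-based default path effectively does (standalone sketch, not Lucene's merge code):

        ```java
        import java.nio.charset.StandardCharsets;
        import java.util.Arrays;

        public class MergeRoundTrip {
            // What default merge effectively does per stored string today:
            // decode the on-disk bytes to java.lang.String, then re-encode
            // them back to UTF-8 when writing the merged segment.
            public static byte[] roundTrip(byte[] onDisk) {
                String decoded = new String(onDisk, StandardCharsets.UTF_8);
                return decoded.getBytes(StandardCharsets.UTF_8);
            }

            public static void main(String[] args) {
                byte[] onDisk = "stored value".getBytes(StandardCharsets.UTF_8);
                // The output bytes are identical, so the round trip is pure
                // overhead -- passing byte[] through would skip both steps.
                System.out.println(Arrays.equals(onDisk, roundTrip(onDisk))); // true
            }
        }
        ```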

        Michael McCandless added a comment -

        New patch, just adding another ctor to StringField taking BytesRef.

        I also cutover StoredFieldVisitor.stringField from String -> byte[] and put TODOs to try to eliminate the conversion that default stored fields merge impl does.

        I tried to fix BinaryTokenStream's attr to be "proper" as Uwe Schindler suggested, but ran into problems because this BytesRef is pre-shared up front to consumers, so we can't null it in clear...

        I think it's ready.

        Robert Muir added a comment -

        Maybe we could baby step here, and just change StoredFieldVisitor.stringField to take byte[]? I know this doesn't help all the stupid work we do during default merge to decode/encode but at least it's a start ...

        Thanks for looking into it. Maybe we can remove the smooshing in a separate issue. I think it's really bad that our default merge impl creates so many strings.

        Uwe Schindler added a comment -

        I tried to fix BinaryTokenStream's attr to be "proper" as Uwe Schindler suggested, but ran into problems because this BytesRef is pre-shared up front to consumers, so we can't null it in clear...

        I don't think this is a problem here, because the TokenStream is only used internally and is never visible to the outside (is it?). Another thing is that Attribute's copyTo() does not deep clone, but this is also not an issue (because nobody has the chance to copy this tokenstream anywhere else). Shai Erera and I fixed TokenStreams in another issue, where payloads were not cloned (see changelog, don't have issue number).

        In general we should fix the TermToBytesRefAttribute and remove the horrible fillBytesRef, which was needed in Lucene 4.x because of some early Lucene 3 compatibility. But it makes it hard to use, so we should get rid of it. TermToBytesRefAttribute should only have a single method: getBytesRef() that returns the BytesRef.

        Generally I am fine. The issues Robert mentioned should be done in a separate issue.

        ASF subversion and git services added a comment -

        Commit 1672781 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1672781 ]

        LUCENE-5989: allow passing BytesRef to StringField to make it easier to index arbitrary binary tokens

        ASF subversion and git services added a comment -

        Commit 1672842 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1672842 ]

        LUCENE-5989: allow passing BytesRef to StringField to make it easier to index arbitrary binary tokens

        ASF subversion and git services added a comment -

        Commit 1672843 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1672843 ]

        LUCENE-5989: fix CHANGES entry

        Anshum Gupta added a comment -

        Bulk close for 5.2.0.


          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0
            Watchers:
            6
