Lucene - Core / LUCENE-2125

Ability to store and retrieve attributes in the inverted index

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.9, 5.0
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Now that we have the cool attribute-based TokenStream API and also the
      great new flexible indexing features, the next logical step is to
      allow storing the attributes inline in the posting lists. Currently
      this is only supported for the PayloadAttribute.

      The flex search APIs already provide an AttributeSource, so there will
      be a very clean and performant symmetry. It should be seamlessly
      possible for the user to define a new attribute, add it to the
      TokenStream, and then retrieve it from the flex search APIs.

      What I'm planning to do is to add additional methods to the token
      attributes (e.g. by adding a new class TokenAttributeImpl, which
      extends AttributeImpl and is the super class of all impls in
      o.a.l.a.tokenattributes):

      • void serialize(DataOutput)
      • void deserialize(DataInput)
      • boolean storeInIndex()

      The indexer will only call the serialize method of a
      TokenAttributeImpl if its storeInIndex() returns true.

      The big advantage here is the ease-of-use: A user can implement in one
      place everything necessary to add the attribute to the index.
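The proposed contract can be sketched roughly as follows. This is only an illustration under stated assumptions: java.io.DataOutput and java.io.DataInput stand in for the Lucene classes proposed in this issue, and WeightAttributeImpl, ScratchAttributeImpl, writePosition and readWeight are hypothetical names, not existing Lucene API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

final class StoredAttributeSketch {

  /** Hypothetical base class carrying the three proposed methods. */
  abstract static class TokenAttributeImpl {
    abstract void serialize(DataOutput out) throws IOException;
    abstract void deserialize(DataInput in) throws IOException;
    /** Assumed default: attributes are not stored unless they opt in. */
    boolean storeInIndex() { return false; }
  }

  /** Hypothetical user attribute that opts in to being stored. */
  static final class WeightAttributeImpl extends TokenAttributeImpl {
    int weight;
    @Override void serialize(DataOutput out) throws IOException { out.writeInt(weight); }
    @Override void deserialize(DataInput in) throws IOException { weight = in.readInt(); }
    @Override boolean storeInIndex() { return true; }
  }

  /** Hypothetical analysis-only attribute; the indexer must never serialize it. */
  static final class ScratchAttributeImpl extends TokenAttributeImpl {
    @Override void serialize(DataOutput out) { throw new IllegalStateException("not stored"); }
    @Override void deserialize(DataInput in) { throw new IllegalStateException("not stored"); }
  }

  /** Indexer side: serialize only the attributes whose storeInIndex() is true. */
  static byte[] writePosition(List<TokenAttributeImpl> attributes) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      for (TokenAttributeImpl attribute : attributes) {
        if (attribute.storeInIndex()) {
          attribute.serialize(out);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** Search side: restore the stored attribute from the bytes written above. */
  static int readWeight(byte[] stored) {
    WeightAttributeImpl attribute = new WeightAttributeImpl();
    try {
      attribute.deserialize(new DataInputStream(new ByteArrayInputStream(stored)));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return attribute.weight;
  }
}
```

The user-defined class holds the value, its encoding, and the opt-in flag in one place, which is the ease-of-use argument made above.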

      Btw: I'd like to introduce DataOutput and DataInput as super classes
      of IndexOutput and IndexInput. They will contain methods like
      readByte(), readVInt(), etc., but methods such as close(),
      getFilePointer() etc. will stay in IndexOutput and IndexInput.
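A rough sketch of that split, assuming the encoding primitives move into the new base classes while file-level methods live only on the file-oriented classes. The classes here are simplified, array-backed stand-ins nested in a hypothetical DataIoSketch holder, not the real Lucene classes; the VInt layout shown (7 bits per byte, low-order group first, high bit as continuation flag) is the usual variable-length int scheme.

```java
import java.util.Arrays;

final class DataIoSketch {

  /** Base class: encoding primitives only, no notion of a file. */
  abstract static class DataOutput {
    abstract void writeByte(byte b);

    /** Writes 7 bits per byte, low-order group first; a set high bit means more bytes follow. */
    void writeVInt(int i) {
      while ((i & ~0x7F) != 0) {
        writeByte((byte) ((i & 0x7F) | 0x80));
        i >>>= 7;
      }
      writeByte((byte) i);
    }
  }

  /** Base class: decoding primitives only. */
  abstract static class DataInput {
    abstract byte readByte();

    int readVInt() {
      byte b = readByte();
      int i = b & 0x7F;
      for (int shift = 7; (b & 0x80) != 0; shift += 7) {
        b = readByte();
        i |= (b & 0x7F) << shift;
      }
      return i;
    }
  }

  /** File-oriented class: adds getFilePointer() and close(). Backed by an array for this sketch. */
  static final class IndexOutput extends DataOutput implements AutoCloseable {
    private byte[] buffer = new byte[16];
    private int filePointer;

    @Override void writeByte(byte b) {
      if (filePointer == buffer.length) {
        buffer = Arrays.copyOf(buffer, buffer.length * 2);
      }
      buffer[filePointer++] = b;
    }

    long getFilePointer() { return filePointer; }
    byte[] toByteArray() { return Arrays.copyOf(buffer, filePointer); }
    @Override public void close() { /* nothing to release in this sketch */ }
  }

  /** File-oriented counterpart for reading. */
  static final class IndexInput extends DataInput implements AutoCloseable {
    private final byte[] buffer;
    private int filePointer;

    IndexInput(byte[] buffer) { this.buffer = buffer; }

    @Override byte readByte() { return buffer[filePointer++]; }
    long getFilePointer() { return filePointer; }
    @Override public void close() { /* nothing to release in this sketch */ }
  }
}
```

An attribute's serialize and deserialize methods would then only need the base types, so they could be handed an in-memory buffer or a file without knowing which.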

      Currently the payload concept is hardcoded in
      TermsHashPerField and FreqProxTermsWriterPerField. These classes take
      care of copying the contents of the PayloadAttribute over into the
      intermediate in-memory posting list representation and reading it
      again. Ideally these classes should not know about specific
      attributes, but only call serialize() on those attributes that shall
      be stored in the posting list.

      We also need to change the PositionsEnum and PositionsConsumer APIs to
      deal with attributes instead of payloads.

      I think the new codecs should all support storing attributes. Only the
      preflex one should be hardcoded to only take the PayloadAttribute into
      account.

      We'll possibly need another extension point that allows us to influence
      compression across multiple postings. Today we use the
      length-compression trick for the payloads: if the previous payload had
      the same length as the current one, we don't store the length
      explicitly again, but only set a bit in the shifted position VInt. Since
      often all payloads of one posting list have the same length, this
      results in effective compression.
      Now an advanced user might want to implement a similar encoding, where
      it's not enough to just control serialization of a single value, but
      where e.g. the previous position can be taken into account to decide
      how to encode a value.
      I'm not sure yet what this extension point should look like. Maybe the
      flex APIs are actually already sufficient.
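The length-compression trick described above can be illustrated with a small stand-alone sketch. It is a simplification under stated assumptions, not the actual postings writer: positions are delta-encoded as VInts, the low bit of the shifted delta flags a payload length change, and all class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

final class PayloadLengthTrickSketch {

  /** Minimal cursor over the encoded bytes. */
  static final class Reader {
    final byte[] data;
    int pos;

    Reader(byte[] data) { this.data = data; }

    int readVInt() {
      byte b = data[pos++];
      int i = b & 0x7F;
      for (int shift = 7; (b & 0x80) != 0; shift += 7) {
        b = data[pos++];
        i |= (b & 0x7F) << shift;
      }
      return i;
    }

    byte[] readBytes(int length) {
      byte[] out = Arrays.copyOfRange(data, pos, pos + length);
      pos += length;
      return out;
    }
  }

  static void writeVInt(ByteArrayOutputStream out, int i) {
    while ((i & ~0x7F) != 0) {
      out.write((i & 0x7F) | 0x80);
      i >>>= 7;
    }
    out.write(i);
  }

  /** Encodes ascending positions with one payload each. */
  static byte[] encode(int[] positions, byte[][] payloads) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int lastPosition = 0;
    int lastLength = -1; // forces the first length to be written
    for (int k = 0; k < positions.length; k++) {
      int delta = positions[k] - lastPosition;
      lastPosition = positions[k];
      int length = payloads[k].length;
      if (length != lastLength) {
        writeVInt(out, (delta << 1) | 1); // low bit set: a new length follows
        writeVInt(out, length);
        lastLength = length;
      } else {
        writeVInt(out, delta << 1); // low bit clear: reuse the previous length
      }
      out.write(payloads[k], 0, length);
    }
    return out.toByteArray();
  }

  /** Decodes positionsOut.length entries written by encode(). */
  static void decode(byte[] data, int[] positionsOut, byte[][] payloadsOut) {
    Reader in = new Reader(data);
    int position = 0;
    int length = 0;
    for (int k = 0; k < positionsOut.length; k++) {
      int code = in.readVInt();
      position += code >>> 1;
      if ((code & 1) != 0) {
        length = in.readVInt();
      }
      positionsOut[k] = position;
      payloadsOut[k] = in.readBytes(length);
    }
  }
}
```

Note that the flag lives inside the position delta, so the encoder needs state from the previous posting and control over how positions are written. That is the kind of cross-posting access a per-value serialize() call alone would not provide.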

      One major goal of this feature is performance: It ought to be more
      efficient to e.g. define an attribute that writes and reads a single
      VInt than storing that VInt as a payload. The payload has the overhead
      of converting the data into a byte array first. An attribute on the other
      hand should be able to call 'int value = dataInput.readVInt();' directly
      without the byte[] indirection.

      After this part is done I'd like to use a very similar approach for
      column-stride fields.

        Activity

        Uwe Schindler added a comment -

        Move issue to Lucene 4.9.

        Steve Rowe added a comment -

        Bulk move 4.4 issues to 4.5 and 5.0

        Michael McCandless added a comment -

        One question...

        Say I make an attr and it serializes to variable number of bytes, per
        position.

        How can we design serializer API so that this attr can do the same
        encoding trick we do with payload today?

        Ie where we steal 1 bit from the pos-delta to state whether the length
        changed from last time?

        If we can do this then I think we should remove payload from the
        flex postings API and just move directly to attrs?

        Though: how, also, should we encode more than 1 attr per position?
        Can we somehow make this the responsibility of the serializer (if we
        can somehow get one serializer for all attrs that are serializable in
        the source)?

        This way a user could make their own serializer if they know
        interesting things about the attrs they need to serialize. EG maybe
        either A or B needs to be serialized but never both... or C only is
        serialized if A is not null... etc.

        If we do this then from Lucene's standpoint the serialization will
        "feel" just like payload feels today – an optional byte[] that may or
        may not be variable length.

        Michael McCandless added a comment -

        We may need to allow for stateful serializers?

        EG (contrived example) imagine an attr that stays the same for most
        docs, so, attr writes 1 byte for "it's the same or not" and then many
        bytes when there is a change. The serializer will want to remember
        last value it wrote? (Hmm though I guess attr could also eg keep a
        bit inside noting that it had changed on the last call to .next(), as
        well). (The payload encoding length only when length changes is a
        similar example, but, this encoding "takes advantage" of being deeply
        tied to the codec since that bit is merged with the position length
        delta.)
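A minimal sketch of such a stateful serializer, assuming a long-valued attribute and a one-byte "changed or not" flag. The Serializer and Deserializer classes are hypothetical, and java.io streams stand in for the DataOutput and DataInput discussed in this issue.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class StatefulSerializerSketch {

  /** Remembers the last value written so repeats cost a single byte. */
  static final class Serializer {
    private boolean hasLast;
    private long last;

    void serialize(long value, DataOutputStream out) throws IOException {
      if (hasLast && value == last) {
        out.writeByte(0); // same as the previous value
      } else {
        out.writeByte(1); // changed: the full value follows
        out.writeLong(value);
        last = value;
        hasLast = true;
      }
    }
  }

  /** Mirrors the serializer's state on the read side. */
  static final class Deserializer {
    private long last;

    long deserialize(DataInputStream in) throws IOException {
      if (in.readByte() != 0) {
        last = in.readLong();
      }
      return last;
    }
  }

  static byte[] encode(long[] values) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    Serializer serializer = new Serializer();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      for (long value : values) {
        serializer.serialize(value, out);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  static long[] decode(byte[] data, int count) {
    long[] values = new long[count];
    Deserializer deserializer = new Deserializer();
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      for (int i = 0; i < count; i++) {
        values[i] = deserializer.deserialize(in);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return values;
  }
}
```

Unlike the payload-length trick, the flag here costs a whole byte per entry because the serializer cannot borrow a bit from the position delta.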

        Or imagine writing strings to the index, but the strings have dups,
        yet you don't know the full universe of strings up front. So you make
        a dict as you go (first time you see a string you assign it the next
        int). This case goes beyond first one because this dict must be
        saved on .close() (maybe optionally taking a different DataOutput to
        save its state to), and, codec must remember which file that attr had
        been .close()d on so that at read time it can seek there and init the
        stateful deserializer (which should be lazy... ie if you don't request
        the attr it shouldn't load the dict).
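The dictionary case might look roughly like this. It is a sketch only: the Writer and Reader classes are hypothetical, ordinals are written as fixed 4-byte ints for simplicity, and byte arrays stand in for the separate postings and dictionary outputs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

final class DictionarySerializerSketch {

  /** Write side: assigns the next int to each string the first time it is seen. */
  static final class Writer {
    private final Map<String, Integer> ordinals = new LinkedHashMap<>();
    private final ByteArrayOutputStream postingBytes = new ByteArrayOutputStream();
    private final DataOutputStream postings = new DataOutputStream(postingBytes);

    void write(String value) {
      Integer ordinal = ordinals.get(value);
      if (ordinal == null) {
        ordinal = ordinals.size();
        ordinals.put(value, ordinal);
      }
      try {
        postings.writeInt(ordinal);
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }

    byte[] postings() { return postingBytes.toByteArray(); }

    /** Called on close: the dictionary goes to its own output, apart from the postings. */
    byte[] closeAndGetDictionary() {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (DataOutputStream out = new DataOutputStream(bytes)) {
        out.writeInt(ordinals.size());
        for (String value : ordinals.keySet()) { // insertion order equals ordinal order
          out.writeUTF(value);
        }
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
      return bytes.toByteArray();
    }
  }

  /** Read side: the dictionary is decoded only when the attribute is first requested. */
  static final class Reader {
    private final byte[] dictionaryBytes;
    private final DataInputStream postings;
    private String[] dictionary; // stays null until the first call to next()

    Reader(byte[] postingBytes, byte[] dictionaryBytes) {
      this.postings = new DataInputStream(new ByteArrayInputStream(postingBytes));
      this.dictionaryBytes = dictionaryBytes;
    }

    boolean dictionaryLoaded() { return dictionary != null; }

    String next() {
      try {
        if (dictionary == null) {
          DataInputStream in = new DataInputStream(new ByteArrayInputStream(dictionaryBytes));
          dictionary = new String[in.readInt()];
          for (int i = 0; i < dictionary.length; i++) {
            dictionary[i] = in.readUTF();
          }
        }
        return dictionary[postings.readInt()];
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }
  }
}
```

In a real codec the location of the saved dictionary would have to be recorded somewhere so the reader can seek to it, which this sketch sidesteps by passing the bytes in directly.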

        Also: codec would need to know if serialization is fixed width... or
        maybe expose a .skip() method on deserializer? EG I may be enuming
        only docs/positions but not attrs (say, running a normal PhraseQuery),
        and I want to just skip (like how we skip payload today when it's not
        read).
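One possible shape for the skip hook, sketched under assumptions: AttributeDeserializer and UtfDeserializer are hypothetical, the attribute is a string written with DataOutputStream.writeUTF (a 2-byte length followed by the bytes), and positions are stored as plain ints to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;

final class SkippableAttributeSketch {

  /** Hypothetical deserializer contract with a skip() hook. */
  interface AttributeDeserializer {
    String deserialize(DataInput in) throws IOException;
    void skip(DataInput in) throws IOException;
  }

  /** Variable-width attribute: skip() reads only the length prefix, never the content. */
  static final class UtfDeserializer implements AttributeDeserializer {
    int decoded; // counts full decodes, to show that skip() avoids them

    @Override public String deserialize(DataInput in) throws IOException {
      decoded++;
      return in.readUTF();
    }

    @Override public void skip(DataInput in) throws IOException {
      int remaining = in.readUnsignedShort();
      while (remaining > 0) {
        int skipped = in.skipBytes(remaining);
        if (skipped <= 0) {
          throw new EOFException();
        }
        remaining -= skipped;
      }
    }
  }

  /** Writes each position followed by its attribute. */
  static byte[] encode(int[] positions, String[] attributes) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      for (int i = 0; i < positions.length; i++) {
        out.writeInt(positions[i]);
        out.writeUTF(attributes[i]);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** Enumerates positions only, the way a plain PhraseQuery would, skipping every attribute. */
  static int[] positionsOnly(byte[] data, int count, AttributeDeserializer deserializer) {
    int[] positions = new int[count];
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      for (int i = 0; i < count; i++) {
        positions[i] = in.readInt();
        deserializer.skip(in);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return positions;
  }
}
```

A fixed-width attribute could implement skip() as a constant-size jump, which is why the codec might also want to know the width up front.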

        I wonder if StandardCodec should inline serialized attrs into existing
        postings lists, or, make separate file to hold them...?

        Uwe Schindler added a comment -

        I would prefer to not extend AttributeImpl but more make the attribute simply extend another interface: SerializableAttribute that provides input/output methods. Docinverter can then just check with instanceof, if the attribute is to be stored in index.

        This would also help with ProxyAttributes (LUCENE-2154).
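A sketch of the interface-based alternative, assuming the indexing chain simply tests each attribute with instanceof. Attribute and SerializableAttribute here are simplified stand-ins declared locally; ScoreAttribute, TransientAttribute and invert are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

final class SerializableAttributeSketch {

  /** Stand-in for the attribute marker interface. */
  interface Attribute {}

  /** Opt-in interface: implementing it is what marks an attribute as stored. */
  interface SerializableAttribute extends Attribute {
    void write(DataOutput out) throws IOException;
    void read(DataInput in) throws IOException;
  }

  /** Hypothetical attribute that exists only during analysis. */
  static final class TransientAttribute implements Attribute {
    int scratch;
  }

  /** Hypothetical attribute that is written to the index. */
  static final class ScoreAttribute implements SerializableAttribute {
    short score;
    @Override public void write(DataOutput out) throws IOException { out.writeShort(score); }
    @Override public void read(DataInput in) throws IOException { score = in.readShort(); }
  }

  /** Inverter side: no storeInIndex() flag, just a type check. */
  static byte[] invert(List<Attribute> attributes) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      for (Attribute attribute : attributes) {
        if (attribute instanceof SerializableAttribute) {
          ((SerializableAttribute) attribute).write(out);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }
}
```

Compared with a common base class, this keeps existing AttributeImpl hierarchies untouched and lets any implementation, including a proxy, take part by implementing one interface.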

        Michael McCandless added a comment -

        Yes, and then I can also close LUCENE-1585!

        Actually, flex now gives your codec a start, here – the merge has been refactored onto the Fields/Terms/Docs/PositionsEnum base classes. This means you can make a codec that overrides how positions are merged, to change what's done with the payloads.

        But, the solution proposed in this issue takes it further (better) – you shouldn't have to override all of positions merging just because one attr (payloads, or another) wants control over how it's merged.

        Michael McCandless added a comment -

        I wonder if we need to allow codecs to store data into SegmentInfo/FieldInfo for this (we don't now).

        IMO we definitely do. E.g. for backwards-compatibility: if users switch the encoding
        of an attribute, then they need a way to determine in which format it is stored in a
        given segment.

        And we need to open up FieldInfo too: it has to store which and in what order the
        attributes are stored.

        I'm sure these are the things you had in mind too?

        Well... some stuff should be written into the header of each file, so eg a switch to encoding could be handled by the simple versioning the Codec API gives you (Codec.writeHeader/Codec.checkHeader).

        But, yeah, for other stuff I've been assuming we need to open up Segment/FieldInfo.

        So eg "omitTermFreqAndPositions" is something we could conceivably put under codec control, ie, Lucene core shouldn't need to know this attr even exists. But, then we'd need extensibility of Field as well. We've discussed splitting this setting, to separately control whether the freq is written and whether the positions are written, which makes complete sense. It'd be great if such a change could be cleanly handled by simply creating a new version of the codec. Likewise, "hasProx", which is derived from the omitTFAPs of all fields within the segment, should be computed/managed entirely within the codec.

        Michael Busch added a comment -

        BTW probably the attribute should include a "merge" operation, somehow, to be efficient (simply byte[] copying instead of decode/encode) in the merge case.

        Yes, and then I can also close LUCENE-1585!

        Michael McCandless added a comment -

        I think it makes sense to not treat payloads specially in flex, ie, make it an attr.

        Hmm, so the concern is that people have to make the switch to the flex APIs
        after upgrading to the next Lucene version if they want to create indexes with
        good old payloads?

        Well, not really – if you stick payloads into your tokens during analysis, presumably the standard (= default) codec would recognize the new payload attr, and store it like normal. Then, any existing queries that do interesting things w/ payloads (PayloadNear/TermQuery), we'd cutover to the new API, and your custom Similarity would still be invoked?

        It's only if you directly access TermPositions's payload API today, that you'd have to migrate to the new API?

        But, even then, flex does back compat emulation, so a new index written with the standard codec could be accessed via the old API.

        BTW probably the attribute should include a "merge" operation, somehow, to be efficient (simply byte[] copying instead of decode/encode) in the merge case.

        Michael Busch added a comment -

        So you'd remove the explicit payload methods in PositionsEnum? Ie,
        users on migrating to flex would have to switch to the payloads
        attribute?

        I think that would make sense? Payloads don't have to be treated specially anymore,
        if any attribute can be stored in the posting lists.

        Note that the preflex codec only has a reader (FieldsProducer), not a
        writer. Ie you can read the old index format but not write it.

        Hmm, so the concern is that people have to make the switch to the flex APIs
        after upgrading to the next Lucene version if they want to create indexes with
        good old payloads?

        Ideally the serialize/unserialize could efficiently handle the
        fixed-length case without using up the 1 bit in the index.

        Yes!

        I wonder if we need to allow codecs to store data into
        SegmentInfo/FieldInfo for this (we don't now).

        IMO we definitely do. E.g. for backwards-compatibility: if users switch the encoding
        of an attribute, then they need a way to determine in which format it is stored in a
        given segment.

        And we need to open up FieldInfo too: it has to store which and in what order the
        attributes are stored.

        I'm sure these are the things you had in mind too?

        Michael McCandless added a comment -

        This sounds great – and is the logical next step for flex.

        So you'd remove the explicit payload methods in PositionsEnum? Ie,
        users on migrating to flex would have to switch to the payloads
        attribute?

        Note that the preflex codec only has a reader (FieldsProducer), not a
        writer. Ie you can read the old index format but not write it.

        Ideally the serialize/unserialize could efficiently handle the
        fixed-length case without using up the 1 bit in the index.

        I wonder if we need to allow codecs to store data into
        SegmentInfo/FieldInfo for this (we don't now).


          People

          • Assignee: Michael Busch
          • Reporter: Michael Busch
          • Votes: 0
          • Watchers: 2
