Lucene - Core / LUCENE-2125

Ability to store and retrieve attributes in the inverted index


    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.9, 6.0
    • Component/s: core/index
    • Labels: None
    • Lucene Fields: New


      Now that we have the cool attribute-based TokenStream API and also the
      great new flexible indexing features, the next logical step is to
      allow storing the attributes inline in the posting lists. Currently
      this is only supported for the PayloadAttribute.

      The flex search APIs already provide an AttributeSource, so there will
      be a very clean and performant symmetry. It should be seamlessly
      possible for the user to define a new attribute, add it to the
      TokenStream, and then retrieve it from the flex search APIs.

      What I'm planning to do is to add additional methods to the token
      attributes (e.g. by adding a new class TokenAttributeImpl, which
      extends AttributeImpl and is the super class of all impls in

      • void serialize(DataOutput)
      • void deserialize(DataInput)
      • boolean storeInIndex()

      The indexer will only call the serialize method of a
      TokenAttributeImpl if its storeInIndex() returns true.

      The big advantage here is the ease-of-use: A user can implement in one
      place everything necessary to add the attribute to the index.
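
      A minimal sketch of what such an attribute could look like, assuming
      the three methods listed above. TokenAttributeImpl does not exist in
      Lucene, WeightAttributeImpl is a made-up example attribute, and
      java.io.DataOutput / java.io.DataInput stand in for the proposed
      Lucene classes of the same name. The AttributeImpl superclass is left
      out so the sketch compiles on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical base class from the proposal; not an existing Lucene class.
abstract class TokenAttributeImpl {
    abstract void serialize(DataOutput out) throws IOException;

    abstract void deserialize(DataInput in) throws IOException;

    // Assumption: attributes are not stored unless they opt in.
    boolean storeInIndex() {
        return false;
    }
}

// Hypothetical user-defined attribute carrying one int per token.
final class WeightAttributeImpl extends TokenAttributeImpl {
    int weight;

    @Override
    void serialize(DataOutput out) throws IOException {
        out.writeInt(weight);
    }

    @Override
    void deserialize(DataInput in) throws IOException {
        weight = in.readInt();
    }

    @Override
    boolean storeInIndex() {
        return true;
    }

    // Writes the attribute to a byte buffer and reads it back.
    static int roundTrip(int weight) {
        try {
            WeightAttributeImpl written = new WeightAttributeImpl();
            written.weight = weight;
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            written.serialize(new DataOutputStream(bytes));

            WeightAttributeImpl read = new WeightAttributeImpl();
            read.deserialize(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return read.weight;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

      Everything needed to store the value lives in the one attribute
      class, which is the ease-of-use point made above.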

      Btw: I'd like to introduce DataOutput and DataInput as super classes
      of IndexOutput and IndexInput. They will contain methods like
      readByte(), readVInt(), etc., but methods such as close(),
      getFilePointer() etc. will stay in the subclasses IndexOutput and
      IndexInput.
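
      The split could be sketched roughly as follows. The Sketch* classes
      and the ByteArray* implementations are hypothetical stand-ins written
      for this illustration, not Lucene's actual classes. The VInt coding
      shown is the usual 7-bits-per-byte scheme with a continuation bit.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical stand-in for the proposed DataOutput: pure encoding methods.
abstract class SketchDataOutput {
    abstract void writeByte(byte b);

    // Writes an int in 1 to 5 bytes, low 7 bits first, high bit = "more follows".
    void writeVInt(int i) {
        while ((i & ~0x7F) != 0) {
            writeByte((byte) ((i & 0x7F) | 0x80));
            i >>>= 7;
        }
        writeByte((byte) i);
    }
}

// Hypothetical stand-in for the proposed DataInput: pure decoding methods.
abstract class SketchDataInput {
    abstract byte readByte();

    int readVInt() {
        byte b = readByte();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = readByte();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }
}

// File-oriented methods stay in the subclasses.
abstract class SketchIndexOutput extends SketchDataOutput {
    abstract long getFilePointer();

    abstract void close();
}

abstract class SketchIndexInput extends SketchDataInput {
    abstract long getFilePointer();

    abstract void close();
}

final class ByteArrayIndexOutput extends SketchIndexOutput {
    private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();

    @Override void writeByte(byte b) { bytes.write(b); }
    @Override long getFilePointer() { return bytes.size(); }
    @Override void close() { /* nothing to release for an in-memory buffer */ }

    byte[] toByteArray() { return bytes.toByteArray(); }
}

final class ByteArrayIndexInput extends SketchIndexInput {
    private final byte[] data;
    private int pos;

    ByteArrayIndexInput(byte[] data) { this.data = data; }

    @Override byte readByte() { return data[pos++]; }
    @Override long getFilePointer() { return pos; }
    @Override void close() { /* nothing to release for an in-memory buffer */ }
}
```

      With this shape, an attribute's serialize method only needs the
      narrow encoding interface and never sees close() or
      getFilePointer().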

      Currently the payload concept is hardcoded in
      TermsHashPerField and FreqProxTermsWriterPerField. These classes take
      care of copying the contents of the PayloadAttribute over into the
      intermediate in-memory posting list representation and reading it
      again. Ideally these classes should not know about specific
      attributes, but only call serialize() on those attributes that shall
      be stored in the posting list.
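
      A rough sketch of the attribute-agnostic write path described above.
      StorableAttribute and PostingAttributeWriter are hypothetical names
      for this illustration, and java.io.DataOutput again stands in for the
      proposed Lucene class. The real TermsHashPerField and
      FreqProxTermsWriterPerField code is considerably more involved.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

// Hypothetical view of an attribute as seen by the indexer.
interface StorableAttribute {
    boolean storeInIndex();

    void serialize(DataOutput out) throws IOException;
}

final class PostingAttributeWriter {
    // Serializes every attribute that opts in; knows nothing about payloads.
    static byte[] writePosition(List<StorableAttribute> attributes) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (StorableAttribute attribute : attributes) {
                if (attribute.storeInIndex()) {
                    attribute.serialize(out);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Test helper: an attribute that writes a single fixed byte.
    static StorableAttribute fixedByte(int value, boolean store) {
        return new StorableAttribute() {
            @Override
            public boolean storeInIndex() {
                return store;
            }

            @Override
            public void serialize(DataOutput out) throws IOException {
                out.writeByte(value);
            }
        };
    }
}
```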

      We also need to change the PositionsEnum and PositionsConsumer APIs to
      deal with attributes instead of payloads.

      I think the new codecs should all support storing attributes. Only the
      preflex one should be hardcoded to only take the PayloadAttribute into
      account.
      We'll possibly need another extension point that allows us to influence
      compression across multiple postings. Today we use the
      length-compression trick for the payloads: if the previous payload had
      the same length as the current one, we don't store the length
      explicitly again, but only set a bit in the shifted position VInt. Since
      often all payloads of one posting list have the same length, this
      results in effective compression.
      Now an advanced user might want to implement a similar encoding, where
      it's not enough to just control serialization of a single value, but
      where e.g. the previous position can be taken into account to decide
      how to encode a value.
      I'm not sure yet what this extension point should look like. Maybe the
      flex APIs are actually already sufficient.
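
      For reference, the length-compression trick for payloads can be
      sketched like this. PayloadLengthCodec is a hypothetical,
      self-contained illustration of the idea (position delta shifted left
      by one, low bit set only when the payload length changes), not the
      actual codec code.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

final class PayloadLengthCodec {
    private static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    private static final class Reader {
        final byte[] data;
        int pos;

        Reader(byte[] data) { this.data = data; }

        int readVInt() {
            int value = 0;
            int shift = 0;
            byte b;
            do {
                b = data[pos++];
                value |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            return value;
        }
    }

    // Positions must be ascending; payloads[i] belongs to positions[i].
    static byte[] encode(int[] positions, byte[][] payloads) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastPosition = 0;
        int lastLength = -1;
        for (int i = 0; i < positions.length; i++) {
            int delta = positions[i] - lastPosition;
            lastPosition = positions[i];
            int length = payloads[i].length;
            if (length != lastLength) {
                // Low bit set: an explicit payload length follows.
                writeVInt(out, (delta << 1) | 1);
                writeVInt(out, length);
                lastLength = length;
            } else {
                // Low bit clear: reuse the previous payload length.
                writeVInt(out, delta << 1);
            }
            out.write(payloads[i], 0, length);
        }
        return out.toByteArray();
    }

    // Decodes count postings; fills positionsOut and returns the payloads.
    static byte[][] decode(byte[] data, int count, int[] positionsOut) {
        Reader in = new Reader(data);
        byte[][] payloads = new byte[count][];
        int position = 0;
        int length = 0;
        for (int i = 0; i < count; i++) {
            int code = in.readVInt();
            position += code >>> 1;
            if ((code & 1) != 0) {
                length = in.readVInt();
            }
            positionsOut[i] = position;
            payloads[i] = Arrays.copyOfRange(data, in.pos, in.pos + length);
            in.pos += length;
        }
        return payloads;
    }
}
```

      Note that the decision depends on state carried over from the
      previous posting (lastLength), which is exactly what a
      per-value serialize() method cannot see on its own.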

      One major goal of this feature is performance: It ought to be more
      efficient to e.g. define an attribute that writes and reads a single
      VInt than storing that VInt as a payload. The payload has the overhead
      of converting the data into a byte array first. An attribute on the other
      hand should be able to call 'int value = dataInput.readVInt();' directly
      without the byte[] indirection.

      After this part is done I'd like to use a very similar approach for
      column-stride fields.




            michaelbusch Michael Busch