[LUCENE-2125] Ability to store and retrieve attributes in the inverted index - ASF JIRA

Details

Type: New Feature
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 4.0-ALPHA
Fix Version/s: 4.9, 6.0
Component/s: core/index
Labels: None
Lucene Fields: New

Description

Now that we have the cool attribute-based TokenStream API and also the
great new flexible indexing features, the next logical step is to
allow storing the attributes inline in the posting lists. Currently
this is only supported for the PayloadAttribute.

The flex search APIs already provide an AttributeSource, so there will
be a very clean and performant symmetry. It should be seamlessly
possible for the user to define a new attribute, add it to the
TokenStream, and then retrieve it from the flex search APIs.

What I'm planning to do is to add additional methods to the token
attributes (e.g. by adding a new class TokenAttributeImpl, which
extends AttributeImpl and is the super class of all impls in
o.a.l.a.tokenattributes):

void serialize(DataOutput)
void deserialize(DataInput)
boolean storeInIndex()

The indexer will call the serialize method of a TokenAttributeImpl
only if its storeInIndex() returns true.

The big advantage here is the ease-of-use: A user can implement in one
place everything necessary to add the attribute to the index.
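A minimal sketch of how this could look from the user's side. All class names here (TokenAttributeSketch, WeightAttributeSketch, ScratchAttributeSketch, IndexerSketch) are hypothetical, and java.io.DataOutput/DataInput only stand in for the classes proposed in this issue; nothing below is existing Lucene API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical base class carrying the three proposed methods.
abstract class TokenAttributeSketch {
    abstract void serialize(DataOutput out) throws IOException;
    abstract void deserialize(DataInput in) throws IOException;
    // Attributes are not stored unless they opt in.
    boolean storeInIndex() { return false; }
}

// Hypothetical user attribute that wants to live in the posting list.
final class WeightAttributeSketch extends TokenAttributeSketch {
    int weight;
    @Override void serialize(DataOutput out) throws IOException { out.writeInt(weight); }
    @Override void deserialize(DataInput in) throws IOException { weight = in.readInt(); }
    @Override boolean storeInIndex() { return true; }
}

// Hypothetical analysis-only attribute; keeps the default storeInIndex() == false.
final class ScratchAttributeSketch extends TokenAttributeSketch {
    long scratch;
    @Override void serialize(DataOutput out) throws IOException { out.writeLong(scratch); }
    @Override void deserialize(DataInput in) throws IOException { scratch = in.readLong(); }
}

// Stand-in for the indexer: it knows nothing about concrete attribute types.
final class IndexerSketch {
    static byte[] writePosition(TokenAttributeSketch... attrs) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (TokenAttributeSketch a : attrs) {
                if (a.storeInIndex()) a.serialize(out); // skip attributes that did not opt in
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static void readPosition(byte[] data, TokenAttributeSketch... attrs) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            for (TokenAttributeSketch a : attrs) {
                if (a.storeInIndex()) a.deserialize(in); // same order as on write
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch the user touches only WeightAttributeSketch; the indexer stand-in writes four bytes for it and nothing for the scratch attribute.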

Btw: I'd like to introduce DataOutput and DataInput as super classes
of IndexOutput and IndexInput. They will contain methods like
readByte(), readVInt(), etc., but methods such as close(),
getFilePointer() etc. will stay in IndexOutput and IndexInput.
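A rough sketch of the split on the output side, assuming the encoding methods move up and the file-oriented ones stay on the IndexOutput level. The *Sketch class names and the in-memory implementation are made up for illustration; the VInt layout shown (7 data bits per byte, high bit as continuation flag) is the usual variable-length int encoding.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical stand-in for the proposed DataOutput: encoding methods only.
abstract class DataOutputSketch {
    abstract void writeByte(byte b);

    // Variable-length int: 7 data bits per byte, high bit set means "more bytes follow".
    final void writeVInt(int i) {
        while ((i & ~0x7F) != 0) {
            writeByte((byte) ((i & 0x7F) | 0x80));
            i >>>= 7;
        }
        writeByte((byte) i);
    }
}

// Hypothetical stand-in for IndexOutput: adds the file-oriented methods.
abstract class IndexOutputSketch extends DataOutputSketch implements AutoCloseable {
    abstract long getFilePointer();
    @Override public abstract void close();
}

// In-memory implementation, for illustration only.
final class ByteArrayIndexOutputSketch extends IndexOutputSketch {
    private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    @Override void writeByte(byte b) { bytes.write(b); }
    @Override long getFilePointer() { return bytes.size(); }
    @Override public void close() { /* nothing to release for an in-memory buffer */ }
    byte[] toByteArray() { return bytes.toByteArray(); }
}
```

An attribute's serialize method would then need only the DataOutputSketch type and could not close or seek the underlying file.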

Currently the payload concept is hardcoded in
TermsHashPerField and FreqProxTermsWriterPerField. These classes take
care of copying the contents of the PayloadAttribute over into the
intermediate in-memory posting list representation and reading it
again. Ideally these classes should not know about specific
attributes, but only call serialize() on those attributes that shall
be stored in the posting list.

We also need to change the PositionsEnum and PositionsConsumer APIs to
deal with attributes instead of payloads.

I think the new codecs should all support storing attributes. Only the
preflex one should be hardcoded to only take the PayloadAttribute into
account.

We'll possibly need another extension point that allows us to influence
compression across multiple postings. Today we use the
length-compression trick for the payloads: if the previous payload had
the same length as the current one, we don't store the length
explicitly again, but only set a bit in the shifted position VInt. Since
often all payloads of one posting list have the same length, this
results in effective compression.
Now an advanced user might want to implement a similar encoding, where
it's not enough to just control serialization of a single value, but
where e.g. the previous position can be taken into account to decide
how to encode a value.
I'm not sure yet what this extension point should look like. Maybe the
flex APIs are actually already sufficient.
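To make the length-compression trick concrete, here is a simplified sketch of an encoder that takes the previous posting into account. The class and method names are hypothetical and the byte layout is only illustrative; the real codec format differs in details.

```java
import java.io.ByteArrayOutputStream;

final class PayloadLengthTrickSketch {
    // Variable-length int: 7 data bits per byte, high bit set means "more bytes follow".
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    // positions must be ascending; payloads[k] belongs to positions[k].
    static byte[] encode(int[] positions, byte[][] payloads) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastPosition = 0;
        int lastLength = -1; // forces the first length to be written
        for (int k = 0; k < positions.length; k++) {
            int delta = positions[k] - lastPosition;
            int length = payloads[k].length;
            if (length != lastLength) {
                writeVInt(out, (delta << 1) | 1); // low bit set: an explicit length follows
                writeVInt(out, length);
                lastLength = length;
            } else {
                writeVInt(out, delta << 1);       // low bit clear: reuse the previous length
            }
            out.write(payloads[k], 0, length);
            lastPosition = positions[k];
        }
        return out.toByteArray();
    }
}
```

With three two-byte payloads, only the first posting pays for the length byte. An extension point for attributes would need to expose similar "previous value" state to user code.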

One major goal of this feature is performance: It ought to be more
efficient to e.g. define an attribute that writes and reads a single
VInt than storing that VInt as a payload. The payload has the overhead
of converting the data into a byte array first. An attribute, on the
other hand, should be able to call 'int value = dataInput.readVInt();'
directly without the byte[] indirection.
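The difference between the two read paths can be sketched as follows. Cursor is a hypothetical stand-in for the proposed DataInput, and the payload layout (a VInt length followed by the payload bytes) is an assumption made for the example.

```java
final class VIntReadPathsSketch {
    // Minimal cursor over a byte[], standing in for the proposed DataInput.
    static final class Cursor {
        final byte[] buf;
        int pos;
        Cursor(byte[] buf) { this.buf = buf; }

        byte readByte() { return buf[pos++]; }

        // Variable-length int: 7 data bits per byte, high bit set means "more bytes follow".
        int readVInt() {
            byte b = readByte();
            int i = b & 0x7F;
            for (int shift = 7; (b & 0x80) != 0; shift += 7) {
                b = readByte();
                i |= (b & 0x7F) << shift;
            }
            return i;
        }
    }

    // Payload-style path: copy the value's bytes into a fresh array, then decode from it.
    static int readViaPayload(Cursor in) {
        int length = in.readVInt();
        byte[] payload = new byte[length];
        System.arraycopy(in.buf, in.pos, payload, 0, length);
        in.pos += length;
        return new Cursor(payload).readVInt();
    }

    // Attribute-style path: decode straight from the input, no intermediate array.
    static int readDirect(Cursor in) {
        return in.readVInt();
    }
}
```

Both paths return the same value; the payload-style path additionally stores a length and allocates and copies an array per position.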

After this part is done I'd like to use a very similar approach for
column-stride fields.

People

Assignee: Michael Busch

Reporter: Michael Busch

Votes: 0

Watchers: 2

Dates

Created: 07/Dec/09 08:47

Updated: 28/Aug/22 12:16