Details
Type: New Feature
Status: Closed
Priority: Minor
Resolution: Duplicate
Description
This new feature has been proposed and discussed here:
http://markmail.org/search/?q=per-document+payloads#query:per-document%20payloads+page:1+mid:jq4g5myhlvidw3oc+state:results
Currently it is possible in Lucene to store data as stored fields or as payloads.
Stored fields provide good performance if you want to load all fields for one
document, because this is a sequential I/O operation.
If, however, you want to load the data from one field for a large number of
documents, then stored fields perform quite badly, because lots of I/O seeks
might have to be performed.
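To make the difference in access pattern concrete, here is a minimal sketch that counts how many separate contiguous reads are needed to load one field for every document. It assumes a deliberately simplified model in which every value has the same size; the class and method names are hypothetical, and neither layout is Lucene's actual file format.

```java
/** Hypothetical model of two on-disk layouts; not Lucene's real file formats. */
final class AccessPatternSketch {

    /** Row layout: all fields of document 0, then all fields of document 1, and so on. */
    static long[] rowLayoutOffsets(int numDocs, int numFields, int field, int valueLength) {
        long[] offsets = new long[numDocs];
        for (int doc = 0; doc < numDocs; doc++) {
            offsets[doc] = ((long) doc * numFields + field) * valueLength;
        }
        return offsets;
    }

    /** Column layout: all values of field 0, then all values of field 1, and so on. */
    static long[] columnLayoutOffsets(int numDocs, int field, int valueLength) {
        long[] offsets = new long[numDocs];
        for (int doc = 0; doc < numDocs; doc++) {
            offsets[doc] = ((long) field * numDocs + doc) * valueLength;
        }
        return offsets;
    }

    /** Counts contiguous runs; each new run stands for one potential seek. */
    static int countRuns(long[] offsets, int valueLength) {
        int runs = 0;
        long expectedNext = -1;
        for (long offset : offsets) {
            if (offset != expectedNext) {
                runs++;
            }
            expectedNext = offset + valueLength;
        }
        return runs;
    }

    public static void main(String[] args) {
        long[] row = rowLayoutOffsets(1000, 5, 2, 4);
        long[] col = columnLayoutOffsets(1000, 2, 4);
        System.out.println("row layout runs:    " + countRuns(row, 4));
        System.out.println("column layout runs: " + countRuns(col, 4));
    }
}
```

In this model, reading one of five fields for 1000 documents touches 1000 separate regions in the row layout but a single region when the values of a field are stored together.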
A better way to do this is using payloads. By creating a "special" posting list
that has one posting with a payload for each document you can "simulate" a
column-stride field. The performance is significantly better compared to stored
fields, but still not optimal. The reason is that for each document the freq
value, which in this particular case is always 1, has to be decoded, and one
position value, which is always 0, has to be loaded as well.
As a solution we want to add real column-stride fields to Lucene. A possible
format for the new data structure could look like this (CSD stands for
column-stride data; once we decide on a final name for this feature we can
change this):
CSDList --> FixedLengthList | <VariableLengthList, SkipList>
FixedLengthList --> <Payload>^SegSize
VariableLengthList --> <DocDelta, PayloadLength?, Payload>
Payload --> Byte^PayloadLength
PayloadLength --> VInt
SkipList --> see frq.file
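As a rough illustration of the grammar above, the following sketch encodes both list types into byte arrays. It is only a sketch under stated assumptions: the class and method names are hypothetical and not part of any Lucene API, the SkipList is left out, and PayloadLength is always written even though the grammar marks it as optional.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the two proposed layouts; not a Lucene API. */
final class CsdListSketch {

    /** Writes a non-negative int as a VInt: 7 data bits per byte, high bit set on every byte but the last. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** FixedLengthList: one payload per document, in docId order, with no lengths and no doc ids. */
    static byte[] writeFixed(List<byte[]> payloads) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] payload : payloads) {
            out.write(payload, 0, payload.length);
        }
        return out.toByteArray();
    }

    /** With a fixed length, the value of a document is located by arithmetic alone. */
    static byte[] readFixed(byte[] list, int payloadLength, int docId) {
        int start = docId * payloadLength;
        return Arrays.copyOfRange(list, start, start + payloadLength);
    }

    /**
     * VariableLengthList: one <DocDelta, PayloadLength, Payload> entry per document.
     * Assumes docIds is sorted ascending and the first delta is taken relative to 0.
     */
    static byte[] writeVariable(int[] docIds, List<byte[]> payloads) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastDoc = 0;
        for (int i = 0; i < docIds.length; i++) {
            byte[] payload = payloads.get(i);
            writeVInt(out, docIds[i] - lastDoc);   // DocDelta
            writeVInt(out, payload.length);        // PayloadLength
            out.write(payload, 0, payload.length); // Payload
            lastDoc = docIds[i];
        }
        return out.toByteArray();
    }
}
```

The fixed-length case needs no per-document metadata at all, which is why a reader can jump straight to `docId * payloadLength`; the variable-length case has to be scanned (or skipped through via the SkipList) to find a given document.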
We distinguish here between the fixed-length and the variable-length cases. To
allow flexibility, Lucene could automatically pick the "right" data structure.
This could work like this: when the DocumentsWriter writes a segment, it checks
whether all values of a field have the same length. If they do, it stores them
as a FixedLengthList; if not, as a VariableLengthList. When the SegmentMerger
merges two or more segments, it checks whether all segments have a
FixedLengthList with the same length for a column-stride field. If not, it
writes a VariableLengthList to the new segment.
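The selection rule described above could be sketched roughly as follows. The class, the `VARIABLE` marker, and both method names are hypothetical. The sketch also assumes that every document has a value for the field and that an empty input falls back to the variable-length format; the proposal does not address either case.

```java
import java.util.List;

/** Hypothetical sketch of the format choice at flush and merge time; not a Lucene API. */
final class CsdFormatChooser {

    /** Marker meaning "use a VariableLengthList". */
    static final int VARIABLE = -1;

    /** Flush time: returns the shared value length if all values agree, otherwise VARIABLE. */
    static int chooseAtFlush(List<byte[]> values) {
        if (values.isEmpty()) {
            return VARIABLE; // assumption: nothing to infer a fixed length from
        }
        int length = values.get(0).length;
        for (byte[] value : values) {
            if (value.length != length) {
                return VARIABLE;
            }
        }
        return length;
    }

    /**
     * Merge time: each entry is one segment's fixed length, or VARIABLE.
     * The merged segment stays fixed-length only if every segment is
     * fixed-length with the same length.
     */
    static int chooseAtMerge(int[] segmentLengths) {
        if (segmentLengths.length == 0) {
            return VARIABLE;
        }
        int length = segmentLengths[0];
        for (int segmentLength : segmentLengths) {
            if (segmentLength == VARIABLE || segmentLength != length) {
                return VARIABLE;
            }
        }
        return length;
    }
}
```

For example, merging two segments that both store 4-byte values keeps the fixed-length format, while merging a 4-byte segment with an 8-byte one falls back to a VariableLengthList.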
Once this feature is implemented, we should think about making the
column-stride fields updateable, similar to the norms. This would be a very
powerful feature that could, for example, be used for low-latency tagging of
documents.
Other use cases:
- replace norms
- allow boost values to be stored separately from norms
- as input for the FieldCache, thus providing significantly improved loading
performance (see LUCENE-831)
Things that need to be done here:
- Decide on a name for this feature. I think "column-stride fields" was
liked better than "per-document payloads".
- Design an API for this feature. We should keep in mind here that these
fields are supposed to be updateable.
- Define data structures.
I would like to get this feature into 2.4. Feedback about the open questions
is very welcome so that we can finalize the design soon and start
implementing.
Issue Links
- incorporates: LUCENE-2186 First cut at column-stride fields (index values storage) (Reopened)
- is related to: LUCENE-831 Complete overhaul of FieldCache API/Implementation (Open)