  1. Lucene - Core
  2. LUCENE-1231

Column-stride fields (aka per-document Payloads)

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/index
    • Labels: None
    • Lucene Fields: New

    Description

      This new feature has been proposed and discussed here:
      http://markmail.org/search/?q=per-document+payloads#query:per-document%20payloads+page:1+mid:jq4g5myhlvidw3oc+state:results

      Currently it is possible in Lucene to store data as stored fields or as payloads.
      Stored fields provide good performance if you want to load all fields for one
      document, because this is a sequential I/O operation.

      If, however, you want to load the data from one field for a large number of
      documents, then stored fields perform quite badly, because lots of I/O seeks
      might have to be performed.

      A better way to do this is using payloads. By creating a "special" posting list
      that has one posting with a payload for each document, you can "simulate" a
      column-stride field. The performance is significantly better than with stored
      fields, but still not optimal: for each document the freq value, which in this
      particular case is always 1, has to be decoded, and one position value, which
      is always 0, has to be loaded.

      As a solution we want to add real column-stride fields to Lucene. A possible
      format for the new data structure could look like this (CSD stands for
      column-stride data; once we decide on a final name for this feature we can
      change this):

      CSDList --> FixedLengthList | <VariableLengthList, SkipList>
      FixedLengthList --> <Payload>^SegSize
      VariableLengthList --> <DocDelta, PayloadLength?, Payload>
      Payload --> Byte^PayloadLength
      PayloadLength --> VInt
      SkipList --> see frq.file
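
      To make the grammar concrete, here is a minimal Java sketch of how the two
      list types might be serialized. It is an illustration only, not Lucene code:
      the class CsdSketch and its methods are hypothetical names. It also assumes
      that DocDelta is a VInt gap to the previous document that has a value, that
      PayloadLength is always written, and that the SkipList is omitted.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical sketch of the proposed CSD format; none of these names are Lucene API.
final class CsdSketch {

    // Writes an int as a VInt: 7 data bits per byte, low-order group first,
    // with the high bit set on every byte except the last one.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // FixedLengthList --> <Payload>^SegSize
    // One payload per document, in docID order. The caller guarantees that all
    // payloads have the same length, so no per-document header is needed.
    static byte[] encodeFixed(byte[][] payloads) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] payload : payloads) {
            out.write(payload, 0, payload.length);
        }
        return out.toByteArray();
    }

    // VariableLengthList --> <DocDelta, PayloadLength?, Payload>
    // Assumption: documents without a value (null) are skipped, and DocDelta is
    // the gap to the previous document that had a value (docID itself for the first).
    static byte[] encodeVariable(byte[][] payloads) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastDoc = 0;
        for (int doc = 0; doc < payloads.length; doc++) {
            byte[] payload = payloads[doc];
            if (payload == null) {
                continue;
            }
            writeVInt(out, doc - lastDoc);
            writeVInt(out, payload.length);
            out.write(payload, 0, payload.length);
            lastDoc = doc;
        }
        return out.toByteArray();
    }

    // Random access into a FixedLengthList: the value of docId starts at docId * length.
    static byte[] readFixed(byte[] list, int length, int docId) {
        return Arrays.copyOfRange(list, docId * length, (docId + 1) * length);
    }
}
```

      In the fixed-length case a value can be located by plain arithmetic on the
      docID, with no freq or position value to decode. The variable-length case
      pays for a DocDelta and a PayloadLength per entry, which is why it would need
      the SkipList for efficient access.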

      We distinguish here between the fixed length and the variable length cases. To
      allow flexibility, Lucene could automatically pick the "right" data structure.
      This could work like this: When the DocumentsWriter writes a segment it checks
      whether all values of a field have the same length. If so, it stores them as a
      FixedLengthList; otherwise, as a VariableLengthList. When the SegmentMerger
      merges two or more segments it checks if all segments have a FixedLengthList
      with the same length for a column-stride field. If not, it writes a
      VariableLengthList to the new segment.
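
      The selection rule could be sketched roughly as follows. This is only an
      illustration of the idea, not a proposed implementation: CsdFormatChooser,
      VARIABLE, chooseAtFlush and chooseAtMerge are hypothetical names, and the
      treatment of documents without a value (they force the variable-length form)
      is an assumption that the description above does not settle.

```java
import java.util.List;

// Hypothetical sketch of the automatic format choice; not Lucene API.
final class CsdFormatChooser {

    // Marker meaning "write a VariableLengthList".
    static final int VARIABLE = -1;

    // Flush time: returns the common value length if every value of the field
    // has the same length (FixedLengthList), otherwise VARIABLE.
    static int chooseAtFlush(List<byte[]> values) {
        if (values.isEmpty()) {
            return VARIABLE;
        }
        byte[] first = values.get(0);
        if (first == null) {
            return VARIABLE; // assumption: a missing value rules out the fixed form
        }
        for (byte[] value : values) {
            if (value == null || value.length != first.length) {
                return VARIABLE;
            }
        }
        return first.length;
    }

    // Merge time: each entry is the result chosen for one source segment.
    // The merged segment stays fixed-length only if all source segments are
    // fixed-length with the same length.
    static int chooseAtMerge(int[] segmentFormats) {
        if (segmentFormats.length == 0) {
            return VARIABLE;
        }
        int first = segmentFormats[0];
        for (int format : segmentFormats) {
            if (format == VARIABLE || format != first) {
                return VARIABLE;
            }
        }
        return first;
    }
}
```

      With this rule, merging two segments that both store 4-byte values keeps the
      FixedLengthList, while merging a 4-byte segment with an 8-byte or a
      variable-length segment falls back to a VariableLengthList.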

      Once this feature is implemented, we should think about making the column-
      stride fields updateable, similar to the norms. This will be a very powerful
      feature that can for example be used for low-latency tagging of documents.

      Other use cases:

      • replace norms
      • allow boost values to be stored separately from norms
      • as input for the FieldCache, thus providing significantly improved loading
        performance (see LUCENE-831)

      Things that need to be done here:

      • Decide on a name for this feature. I think "column-stride fields" was
        liked better than "per-document payloads".
      • Design an API for this feature. We should keep in mind here that these
        fields are supposed to be updateable.
      • Define the data structures.

      I would like to get this feature into 2.4. Feedback about the open questions
      is very welcome so that we can finalize the design soon and start
      implementing.

            People

              Assignee: Simon Willnauer (simonw)
              Reporter: Michael Busch (michaelbusch)
              Votes: 8
              Watchers: 10
