Lucene - Core / LUCENE-9525

Better handle small documents with the new Lucene87StoredFieldsFormat

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 8.7
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

    Description

      Stored fields configure a maximum number of documents per block, whose goal is to ensure that you never decompress more than X documents to access a single one. However, this limit interacts poorly with the new format when documents are small.

      For instance, we use a 4kB dictionary and blocks of 60kB, with at most 512 documents per block. If your documents are very small, say 10 bytes each, the block will be only 512 × 10 = 5120 bytes overall: we first compress 4096 bytes independently, and then the remaining 5120 - 4096 = 1024 bytes with 4096 bytes of dictionary. In this case, training the dictionary takes more time than actually compressing the data, and it is not even clear that it is worth it, since only 1024 of the block's 5120 bytes get compressed with a preset dictionary.
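
      The arithmetic in this example can be checked with a small sketch. It is illustrative only and is not Lucene's code: the class and method names are hypothetical, and the only inputs are the figures quoted above (512 documents, 10 bytes each, a 4096-byte dictionary).

```java
// Illustrative arithmetic only; FixedDictSplit and its methods are hypothetical names.
final class FixedDictSplit {

    /** Bytes that serve as the dictionary: the first dictSize bytes, or the whole block if it is smaller. */
    static int dictBytes(int blockBytes, int dictSize) {
        return Math.min(blockBytes, dictSize);
    }

    /** Bytes left over after the dictionary, i.e. the only bytes compressed with a preset dictionary. */
    static int bytesUsingDict(int blockBytes, int dictSize) {
        return blockBytes - dictBytes(blockBytes, dictSize);
    }

    public static void main(String[] args) {
        int maxDocsPerBlock = 512; // document limit quoted in the description
        int docSize = 10;          // "very small" documents, in bytes
        int dictSize = 4096;       // 4kB dictionary

        int blockBytes = maxDocsPerBlock * docSize;
        int withDict = bytesUsingDict(blockBytes, dictSize);

        System.out.println("block bytes: " + blockBytes);
        System.out.println("compressed independently (dictionary): " + dictBytes(blockBytes, dictSize));
        System.out.println("compressed with preset dictionary: " + withDict);
        System.out.println("share benefiting from the dictionary: " + (100 * withDict / blockBytes) + "%");
    }
}
```

      With these inputs, only 1024 of the 5120 bytes (20%) are compressed with the dictionary, which is the imbalance described above.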

      I'm considering scaling both the dictionary size and the size of the compressed blocks (4kB and 60kB above) with the total size of the block of documents, in order to better handle such cases.
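
      One possible shape for such an adaptation is sketched below. This is an assumption for illustration, not necessarily the change that was committed: the class name, the method names, and the two constants (10 sub-blocks per block, and a dictionary one sixth the size of a sub-block) are all hypothetical choices.

```java
// A sketch of proportional sizing; all names and constants here are assumptions.
final class AdaptiveSizing {

    // Hypothetical tuning knobs: how many pieces to split a block into,
    // and how much smaller than one piece the dictionary should be.
    static final int NUM_SUB_BLOCKS = 10;
    static final int DICT_SIZE_FACTOR = 6;

    /** Dictionary length as a fixed fraction of the total block length. */
    static int dictLength(int totalLength) {
        return totalLength / (NUM_SUB_BLOCKS * DICT_SIZE_FACTOR);
    }

    /** Length of each sub-block: the bytes after the dictionary, split evenly and rounded up. */
    static int subBlockLength(int totalLength) {
        int remaining = totalLength - dictLength(totalLength);
        return (remaining + NUM_SUB_BLOCKS - 1) / NUM_SUB_BLOCKS;
    }

    public static void main(String[] args) {
        // 5120 bytes is the small-document block from the description.
        for (int total : new int[] {5120, 61440}) {
            System.out.println("total=" + total
                + " dict=" + dictLength(total)
                + " subBlock=" + subBlockLength(total));
        }
    }
}
```

      Under these assumed constants, the 5120-byte block gets an 85-byte dictionary and 504-byte sub-blocks, so nearly all of the block is compressed with the dictionary and very little data is spent on training it.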

            People

              Assignee: Unassigned
              Reporter: Adrien Grand (jpountz)

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 20m