Lucene - Core / LUCENE-4509

Make CompressingStoredFieldsFormat the new default StoredFieldsFormat impl

    Details

    • Type: Wish
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.1
    • Component/s: core/store
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      What would you think of making CompressingStoredFieldsFormat the new default StoredFieldsFormat?

      Stored fields compression has many benefits:

      • it makes the I/O cache work for us,
      • file-based index replication/backup becomes cheaper.

      Things to know:

      • even with incompressible data, there is less than 0.5% overhead with LZ4,
      • LZ4 compression requires ~ 16kB of memory and LZ4 HC compression requires ~ 256kB,
      • LZ4 uncompression has almost no memory overhead,
      • on my low-end laptop, the LZ4 impl in Lucene uncompresses at ~ 300 MB/s.

      I think we could use the same default parameters as in CompressingCodec:

      • LZ4 compression,
      • in-memory stored fields index that is very memory-efficient (less than 12 bytes per block of compressed docs) and uses binary search to locate documents in the fields data file,
      • 16 kB blocks (small enough so that there is no major slowdown when the whole index would fit into the I/O cache anyway, and large enough to provide interesting compression ratios; for example, Robert got a 0.35 compression ratio with the geonames.org database).

      Any concerns?
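      For illustration, here is a minimal sketch of what opting in to these parameters could look like by wrapping the default codec. It assumes the FilterCodec(name, delegate) constructor and a CompressingStoredFieldsFormat constructor taking (format name, CompressionMode, chunk size), which matches the post-4.1 API; the class was still moving while this issue was open, so check the javadoc of your Lucene version.

      import org.apache.lucene.codecs.Codec;
      import org.apache.lucene.codecs.FilterCodec;
      import org.apache.lucene.codecs.StoredFieldsFormat;
      import org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat;
      import org.apache.lucene.codecs.compressing.CompressionMode;

      // Sketch only: a codec that keeps all defaults except stored fields,
      // which use LZ4 (CompressionMode.FAST) and 16 kB chunks.
      public class CompressedStoredFieldsCodec extends FilterCodec {
        private final StoredFieldsFormat storedFields =
            new CompressingStoredFieldsFormat("SketchCompressedStoredFields",
                CompressionMode.FAST, 1 << 14);

        public CompressedStoredFieldsCodec() {
          super("CompressedStoredFieldsCodec", Codec.getDefault());
        }

        @Override
        public StoredFieldsFormat storedFieldsFormat() {
          return storedFields;
        }
      }

      Such a codec would be set on an IndexWriterConfig via setCodec; once this issue is resolved, the default Lucene41Codec provides the same behaviour out of the box.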

      Attachments

      1. LUCENE-4509.patch
        12 kB
        Adrien Grand
      2. LUCENE-4509.patch
        12 kB
        Adrien Grand

        Activity

        Robert Muir added a comment -

        I am a strong +1 for this idea.

        I only have one concern, about the defaults. How would this work with laaaarge documents (e.g. those massive Hathitrust book-documents) that might be > 16KB in size?

        Does this mean with the default CompressingStoredFieldsIndex setting that now he pays 12 bytes/doc in RAM (because docsize > blocksize)?
        If so, let's think of ways to optimize that case.

        Yonik Seeley added a comment -

        Nice timing Adrien... I was just going to ask how we could enable this easiest in Solr (or if it should in fact be the default).

        One data point: 100GB of compressed stored fields == 6.25M index entries == 75MB RAM
        That seems decent for a default.
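        For reference, the arithmetic behind that data point, assuming 16 kB chunks and the ~12 bytes per index entry upper bound discussed above:

        100 GB / 16 kB per chunk == 6.25M chunks (index entries)
        6.25M entries * 12 bytes/entry == 75 MB of RAM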

        Robert Muir added a comment -

        I think it's OK too. I just didn't know if we could do something trivial like store the offsets-within-the-blocks as packed ints,
        so that it optimizes for this case anyway (offset=0) and only takes 8 bytes + 1 bit instead of 12 bytes.

        But I don't have a real understanding of what this thing does when docsize > blocksize; I haven't dug in that much.

        In any case I think it should be the default: it's fast and works also for tiny documents with lots of fields.
        I think people expect the index to be compressed in some way, and the stored fields are really wasteful today.

        Robert Muir added a comment -

        I'd say that to make progress on the default we want to look at the following:

        • make a concrete impl of CompressingStoredFieldsFormat called Lucene41, hardwired to the defaults, and add file format docs?
          This way, we don't have to support all of the Compression options/layouts in the default codec (if someone wants that,
          encourage them to make their own codec with the Compressed settings they like). Back compat is much
          less costly as the parameters are fixed, and file format docs are easier.
        • should we s/uncompression/decompression/ across the board?
        • tests already look pretty good. I can try to work on some additional ones to try to break it like we did with BlockPF.
        • there is some scary stuff (literal decompressions etc.) uncovered by the clover report:
          https://builds.apache.org/job/Lucene-Solr-Clover-4.x/49/clover-report/org/apache/lucene/codecs/compressing/CompressionMode.html
          We should make sure any special cases are tested.
        Adrien Grand added a comment -

        How would this work with laaaarge documents that might be > 16KB in size?

        Actually, 16 kB is the minimum size of an uncompressed chunk of documents. CompressingStoredFieldsWriter fills a buffer with documents until its size is >= 16 kB, compresses it, and then flushes it to disk. If all documents are larger than 16 kB, then all chunks will contain exactly one document.

        It also means you could end up having a chunk that is made of 15 documents of 1 kB and 1 document of 256 kB. (And in this case there is no performance problem for the first 15 documents, given that uncompression stops as soon as enough data has been uncompressed.)

        Does this mean with the default CompressingStoredFieldsIndex setting that now he pays 12 bytes/doc in RAM (because docsize > blocksize)? If so, let's think of ways to optimize that case.

        Probably less than 12. The default CompressingStoredFieldsIndex impl uses two packed ints arrays of size numChunks (the number of chunks, <= numDocs). The first array stores the doc ID of the first document of the chunk while the second array stores the start offset of the chunk of documents in the fields data file.

        So if your fields data file is fdtBytes bytes, the actual memory usage is ~ numChunks * (ceil(log2(numDocs)) + ceil(log2(fdtBytes))) / 8.

        For example, if there are 10M documents of 16kB (fdtBytes ~= 160GB), we'll have numChunks == numDocs and a memory usage per document of (24 + 38) / 8 = 7.75 => ~ 77.5 MB of memory overall.
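        To make the estimate concrete, here is a small stand-alone sketch (the method name and signature are illustrative, not Lucene API):

        // memory ~= numChunks * (ceil(log2(numDocs)) + ceil(log2(fdtBytes))) / 8
        static long estimatedFieldsIndexBytes(long numChunks, long numDocs, long fdtBytes) {
          int docIdBits  = 64 - Long.numberOfLeadingZeros(numDocs - 1);  // ceil(log2(numDocs))
          int offsetBits = 64 - Long.numberOfLeadingZeros(fdtBytes - 1); // ceil(log2(fdtBytes))
          return numChunks * (docIdBits + offsetBits) / 8;
        }

        // e.g. 10M docs of 16 kB each (fdtBytes ~= 160 GB):
        // estimatedFieldsIndexBytes(10000000L, 10000000L, 160L << 30) ~= 77,500,000 bytes (~77.5 MB),
        // matching the figure above.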

        100GB of compressed stored fields == 6.25M index entries == 75MB RAM

        Thanks for the figures, Yonik! Did you use RamUsageEstimator to compute the amount of used memory?

        Adrien Grand added a comment -

        But if we worry about this worst-case (numDocs == numChunks), maybe we should just increase the chunk size (for example, ElasticSearch uses 65 kB by default).

        (Another option would be to change the compress+flush trigger to something like: chunk size >= 16 kB AND number of documents in the chunk >= 4.)
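        A sketch of what that alternative trigger could look like (the field names are illustrative, not the actual CompressingStoredFieldsWriter members):

        private boolean triggerFlush() {
          return bufferedBytes >= 16 * 1024  // chunk size reached...
              && numBufferedDocs >= 4;       // ...and at least 4 docs buffered, so huge docs share an index entry
        }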

        Robert Muir added a comment -

        Well, you say you use a separate packed ints structure for the offsets, right? So these would all be zero?

        Adrien Grand added a comment -

        should we s/uncompression/decompression/ across the board?

        If decompression sounds better, let's do this!

        there is some scary stuff (literal decompressions etc.) uncovered by the clover report. We should make sure any special cases are tested.

        I can work on it next week.

        Adrien Grand added a comment -

        Well, you say you use a separate packed ints structure for the offsets, right? So these would all be zero?

        These are absolute offsets in the fields data file. For example, when looking up a document, it first performs a binary search in the first array (the one that contains the first document IDs of every chunk). The resulting index is used to find the start offset of the chunk of compressed documents thanks to the second array. When you read data starting at this offset in the fields data file, there is first a packed ints array that stores the uncompressed length of every document in the chunk, and then the compressed data. I'll add file formats docs soon...
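        A minimal sketch of that lookup, using plain arrays in place of the packed ints structures (illustrative only, not the actual reader code):

        static long chunkStartPointer(int docID, int[] docBases, long[] startPointers) {
          // binary search for the last chunk whose first doc ID is <= docID
          int lo = 0, hi = docBases.length - 1;
          while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (docBases[mid] <= docID) {
              lo = mid + 1;
            } else {
              hi = mid - 1;
            }
          }
          return startPointers[hi]; // file pointer of the compressed chunk in the fields data file
        }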

        Robert Muir added a comment -

        No, I'm referring to the second packed ints structure (the start offset within a block).

        Adrien Grand added a comment -

        Committed:

        • trunk r1404215
        • branch 4.x r1404216
        Robert Muir added a comment -

        I think Adrien accidentally resolved the wrong issue.

        Adrien Grand added a comment -

        Here is a patch that adds a new Lucene41StoredFieldsFormat class with file format docs.

        Adrien Grand added a comment -

        I forgot to say: oal.codecs.compressing needs to be moved to lucene-core before applying this patch.

        Robert Muir added a comment -

        Do we know what's happening with the recent test failure?

        ant test -Dtestcase=TestCompressingStoredFieldsFormat -Dtests.method=testBigDocuments -Dtests.seed=37812FE503010D20 -Dtests.multiplier=3 -Dtests.nightly=true -Dtests.slow=true -Dtests.linedocsfile=/home/hudson/lucene-data/enwiki.random.lines.txt -Dtests.locale=es_PR -Dtests.timezone=America/Sitka
        Adrien Grand added a comment -

        I think I abuse atLeast to generate document sizes, and because the test ran with tests.multipliers=true and tests.nightly=true, documents got too big, hence the OOME. I'll commit a fix shortly.

        Robert Muir added a comment -

        In the fdt we write the docBase of the first document in the chunk: can you explain why this is needed?

        We already write this redundantly in the fdx, right? (Or in the DISK_DOC case it's implicit.)

        It seems to me in visitDocument() we should be getting the docBase and startPointer too from the index,
        since it knows both.

        Robert Muir added a comment -

        Actually I guess we don't know it for DISK_DOC. But it seems unnecessary for MEMORY_CHUNK?

        Adrien Grand added a comment -

        Right, the docBase could be known from the index with MEMORY_CHUNK, but on the other hand duplicating the information helps validate that we are at the right place in the fields data file (there are corruption tests that use this docBase). Given that the chunk starts with a doc base and the number of docs in the chunk, it gives the range of documents it contains. The overhead should be very small given that this VInt is repeated only once per chunk, i.e. once per compressed 16 kB of documents. But I have no strong feeling about it; if you think we should remove it, then let's do it.
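        As an illustration of the kind of check this enables, here is a hypothetical fragment (names are made up, not the actual CompressingStoredFieldsReader code), assuming fieldsData is positioned at the start of a chunk and that both header values are VInts:

        static void checkChunkContains(org.apache.lucene.store.IndexInput fieldsData, int docID)
            throws java.io.IOException {
          int chunkDocBase = fieldsData.readVInt(); // docBase duplicated in the chunk header
          int chunkDocs = fieldsData.readVInt();    // number of docs in the chunk
          if (docID < chunkDocBase || docID >= chunkDocBase + chunkDocs) {
            throw new org.apache.lucene.index.CorruptIndexException(
                "doc " + docID + " is outside chunk range ["
                + chunkDocBase + ", " + (chunkDocBase + chunkDocs) + ")");
          }
        }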

        Robert Muir added a comment -

        I don't feel strongly about it either... I was just reading the docs and noticed the redundancy.

        But you are right: it's just per-chunk anyway. And I like the corruption check...!

        Adrien Grand added a comment -

        Updated file format docs, you need to move lucene/codecs/src/java/org/apache/lucene/codecs/compressing to lucene/core/src/java/org/apache/lucene/codecs in addition to applying the patch.

        Robert Muir added a comment -

        Docs look good, +1 to commit.

        A few suggestions:

        • under known limitations, maybe replace "documents" with "individual documents" to make it clear you are talking about 2-gigabyte documents and not files? I think someone was a little confused about that already.
        • rather than repeating the formulas for signed vlong (zigzag), we could link to it? https://developers.google.com/protocol-buffers/docs/encoding#types
        • separately, if we find ourselves using this more often, maybe we should just add it to DataOutput/Input (the vlong version would be enough). We
          already use this in kuromoji's ConnectionCosts.java too...
        Adrien Grand added a comment -

        Thanks Robert for your comments, I replaced "documents" with "individual documents" and added a link to the protobuf docs.

        Committed:

        • trunk r1408762
        • branch 4.x r1408796
        Commit Tag Bot added a comment -

        [branch_4x commit] Adrien Grand
        http://svn.apache.org/viewvc?view=revision&revision=1416082

        Move oal.codec.compressing tests from lucene/codecs to lucene/core (should have been done as part of LUCENE-4509 when I moved the src folder).

        Commit Tag Bot added a comment -

        [branch_4x commit] Adrien Grand
        http://svn.apache.org/viewvc?view=revision&revision=1408796

        LUCENE-4509: Enable stored fields compression in Lucene41Codec (merged from r1408762).

        Commit Tag Bot added a comment -

        [branch_4x commit] Adrien Grand
        http://svn.apache.org/viewvc?view=revision&revision=1404276

        LUCENE-4509: New tests to try to break CompressingStoredFieldsFormat... (merged from r1404275)

        Commit Tag Bot added a comment -

        [branch_4x commit] Adrien Grand
        http://svn.apache.org/viewvc?view=revision&revision=1403032

        LUCENE-4509: improve test coverage of CompressingStoredFieldsFormat (merged from r1403027).


          People

          • Assignee:
            Adrien Grand
            Reporter:
            Adrien Grand
          • Votes:
            1
            Watchers:
            3
