LUCENE-4226: Efficient compression of small to medium stored fields

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.1, 5.0
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I've been doing some experiments with stored fields lately. It is very common for an index with stored fields enabled to have most of its space used by the .fdt index file. To prevent this .fdt file from growing too much, one option is to compress stored fields. Although compression works rather well for large fields, this is not the case for small fields, where the compression ratio can be very close to 100% even with efficient compression algorithms.

      In order to improve the compression ratio for small fields, I've written a StoredFieldsFormat that compresses several documents in a single chunk of data. To see how it behaves in terms of document deserialization speed and compression ratio, I've run several tests with different index compression strategies on 100,000 docs from Mike's 1K Wikipedia articles (title and text were indexed and stored):

      • no compression,
      • docs compressed with deflate (compression level = 1),
      • docs compressed with deflate (compression level = 9),
      • docs compressed with Snappy,
      • using the compressing StoredFieldsFormat with deflate (level = 1) and chunks of 6 docs,
      • using the compressing StoredFieldsFormat with deflate (level = 9) and chunks of 6 docs,
      • using the compressing StoredFieldsFormat with Snappy and chunks of 6 docs.

      For those who don't know Snappy, it is a compression algorithm from Google which trades away some compression ratio in order to compress and decompress data very quickly.

      Format           Compression ratio     IndexReader.document time
      ————————————————————————————————————————————————————————————————
      uncompressed     100%                  100%
      doc/deflate 1     59%                  616%
      doc/deflate 9     58%                  595%
      doc/snappy        80%                  129%
      index/deflate 1   49%                  966%
      index/deflate 9   46%                  938%
      index/snappy      65%                  264%
      

      (doc = doc-level compression, index = index-level compression)

      I find it interesting because it allows trading speed for space (with deflate, the .fdt file shrinks by a factor of 2, much better than with doc-level compression). One other interesting thing is that index/snappy is almost as compact as doc/deflate while being more than 2x faster at retrieving documents from disk.

      These tests have been done on a hot OS cache, which is the worst case for compressed fields (one can expect better results for formats that have a high compression ratio since they probably require fewer read/write operations from disk).
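      (For readers who want to reproduce this kind of measurement, the sketch below shows the general shape of such a timing loop; it is not the attached CompressionBenchmark.java, and the method name and setup are assumptions. DirectoryReader.open and IndexReader.document are standard Lucene APIs; the reported percentages are the ratio of this time against the uncompressed baseline.)

      import java.io.IOException;
      import org.apache.lucene.index.DirectoryReader;
      import org.apache.lucene.store.Directory;

      // Hypothetical helper: time how long it takes to load every stored document.
      static long timeDocumentRetrieval(Directory dir) throws IOException {
        DirectoryReader reader = DirectoryReader.open(dir);
        try {
          long start = System.nanoTime();
          for (int doc = 0; doc < reader.maxDoc(); doc++) {
            reader.document(doc); // loads all stored fields of this document
          }
          return System.nanoTime() - start;
        } finally {
          reader.close();
        }
      }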

      1. CompressionBenchmark.java
        8 kB
        Adrien Grand
      2. CompressionBenchmark.java
        11 kB
        Adrien Grand
      3. LUCENE-4226.patch
        111 kB
        Adrien Grand
      4. LUCENE-4226.patch
        110 kB
        Adrien Grand
      5. LUCENE-4226.patch
        114 kB
        Adrien Grand
      6. LUCENE-4226.patch
        114 kB
        Adrien Grand
      7. LUCENE-4226.patch
        109 kB
        Adrien Grand
      8. LUCENE-4226.patch
        83 kB
        Adrien Grand
      9. LUCENE-4226.patch
        85 kB
        Adrien Grand
      10. LUCENE-4226.patch
        55 kB
        Adrien Grand
      11. SnappyCompressionAlgorithm.java
        4 kB
        Adrien Grand


          Activity

          Adrien Grand added a comment - edited

          Patch (applies against trunk and does not include the snappy codec).

          See org.apache.lucene.codecs.compressing.CompressedStoredFieldsFormat javadocs for the format description.

          CompressionBenchmark.java and SnappyCompressionAlgorithm.java are the source files I used to compute differences in compression ratio and speed. To run it, you will need snappy-java from http://code.google.com/p/snappy-java/.

          The patch is currently only for testing purposes: it hasn't been tested well and duplicates code from Lucene40's StoredFieldsFormat.

          Adrien Grand added a comment -

          New patch as well as the code I used to benchmark.

          Documents are still compressed into chunks, but I removed the ability to select the compression algorithm on a per-field basis in order to make the patch simpler and to handle cross-field compression.

          I also added an index in front of compressed data using packed ints, so that uncompressors can stop uncompressing when enough data has been uncompressed.
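          (As a minimal illustration of why this index helps, using hypothetical names rather than the patch's actual code: the per-document lengths stored in front of a chunk tell the reader how many decompressed bytes are needed for a given document, so decompression can stop early.)

          // Illustrative sketch only: compute how many decompressed bytes of a chunk
          // are needed to read the document at index docInChunk. A real implementation
          // stores docLengths as packed ints and feeds this bound to the uncompressor.
          static int bytesNeeded(int[] docLengths, int docInChunk) {
            int needed = 0;
            for (int i = 0; i <= docInChunk; i++) {
              needed += docLengths[i]; // everything up to and including the wanted doc
            }
            return needed;
          }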

          The JDK only includes a moderately fast compression algorithm (deflate), but for this kind of use case, we would probably be more interested in fast compression and uncompression algorithms such as LZ4 (http://code.google.com/p/lz4/) or Snappy (http://code.google.com/p/snappy/). Since lucene-core has no dependencies, I ported LZ4 to Java (included in the patch, see o.a.l.util.compress).

          LZ4 has a very fast uncompressor and two compression modes:

          • fast scan, which looks for the last offset in the stream that has at least 4 common bytes (using a hash table) and adds a reference to it (the match-finding idea is sketched right after this list),
          • high compression, which looks for the last 256 offsets in the stream that have at least 4 common bytes, takes the one that has the longest common sequence, and then performs trade-offs between overlapping matches in order to improve the compression ratio.
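          (A minimal, self-contained sketch of the hash-based match finding behind the fast-scan mode; this is illustrative only and is neither the LZ4 block format nor the Java port included in the patch.)

          import java.util.Arrays;

          // Illustrative only: find back-references of at least 4 identical bytes
          // using a hash table of the most recent position for each 4-byte hash.
          public class FastScanSketch {

            static int hash(byte[] b, int i) {
              int v = (b[i] & 0xFF) | (b[i + 1] & 0xFF) << 8
                    | (b[i + 2] & 0xFF) << 16 | (b[i + 3] & 0xFF) << 24;
              return (v * -1640531535) >>> 20; // 12-bit bucket index
            }

            public static void main(String[] args) {
              byte[] data = "abcdefgh_abcdefgh_abcdefgh".getBytes();
              int[] lastOffset = new int[1 << 12];
              Arrays.fill(lastOffset, -1);
              for (int i = 0; i + 4 <= data.length; i++) {
                int h = hash(data, i);
                int candidate = lastOffset[h];
                lastOffset[h] = i; // remember the latest position for this hash
                if (candidate >= 0
                    && data[candidate] == data[i] && data[candidate + 1] == data[i + 1]
                    && data[candidate + 2] == data[i + 2] && data[candidate + 3] == data[i + 3]) {
                  // A real compressor would emit (literals, offset, match length) here.
                  System.out.println("match at " + i + " -> back-reference to " + candidate);
                }
              }
            }
          }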

          (In case you are curious about LZ4, I did some benchmarking with other compression algorithms in http://blog.jpountz.net/post/28092106032/wow-lz4-is-fast, unfortunately the high-compression Java impl is not included in the benchmark.)

          I ran a benchmark similar to the one for my first patch, but this time I only compressed and stored the 1 kB text field (the title field was so small that including it was unfair to document-level compression with deflate). Here are the results:

          Format           Chunk size  Compression ratio     IndexReader.document time
          ————————————————————————————————————————————————————————————————————————————
          uncompressed                               100%                         100%
          doc/deflate 1                               58%                         579%
          doc/deflate 9                               57%                         577%
          index/deflate 1          4K                 50%                        1057%
          index/deflate 9          4K                 48%                        1037%
          index/lz4 scan           4K                 70%                         329%
          index/lz4 hc             4K                 66%                         321%
          index/deflate 1           1                 60%                         457%
          index/deflate 9           1                 59%                         454%
          index/lz4 scan            1                 81%                         171%
          index/lz4 hc              1                 79%                         176%
          

          NOTE: chunk size = 1 means that there was only one document in the chunk (there is a compress+flush every time the byte size of documents is >= the chunk size).

          NOTE: these numbers were computed with the whole index fitting in the I/O cache. The performance should be even more in favor of the compressing formats as soon as the index no longer fits in the I/O cache.

          There are still a few nocommits in the patch, but it should be easy to get rid of them. I'd be very happy to have some feedback.

          Dawid Weiss added a comment -

          Very cool. I skimmed through the patch, didn't look too carefully. This caught my attention:

          +  /**
          +   * Skip over the next <code>n</code> bytes.
          +   */
          +  public void skipBytes(long n) throws IOException {
          +    for (long i = 0; i < n; ++i) {
          +      readByte();
          +    }
          +  }
          

          you may want to use an array-based read here if there are a lot of skips; allocate a static, write-only buffer of 4 or 8kb once and just reuse it. A loop over readByte() is nearly always a performance killer, I've been hit by this too many times to count.
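          (A minimal sketch of the kind of buffer-based skip being suggested; the buffer name and size are illustrative and this is not the code that ended up in the patch.)

          private static final int SKIP_BUFFER_SIZE = 8192;
          private final byte[] skipBuffer = new byte[SKIP_BUFFER_SIZE]; // scratch, contents never read

          public void skipBytes(long n) throws IOException {
            long remaining = n;
            while (remaining > 0) {
              final int step = (int) Math.min(SKIP_BUFFER_SIZE, remaining);
              readBytes(skipBuffer, 0, step); // one bulk read instead of 'step' readByte() calls
              remaining -= step;
            }
          }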

          Also, lucene/core/src/java/org/apache/lucene/codecs/compressing/ByteArrayDataOutput.java – there seems to be a class for this in
          org.apache.lucene.store.ByteArrayDataOutput?

          Eks Dev added a comment -

          but I removed the ability to select the compression algorithm on a per-field basis in order to make the patch simpler and to handle cross-field compression.

          Maybe it is worth keeping it there for really short fields. Those general compression algorithms are great for bigger amounts of data, but for really short fields there is nothing like per-field compression.
          Thinking about database usage, e.g. fields with low cardinality or a restricted symbol set (only digits in a long UID field, for example): a zip code or a product color is perfectly compressed using something with a static-dictionary approach (a static Huffman coder with escape symbols, at the bit level, or a plain vanilla dictionary lookup), and both of these are insanely fast and compress heavily.

          Even a trivial utility for users is easily doable: index the data without compression, get the frequencies from the term dictionary, estimate e.g. a static Huffman code table, and reindex with this dictionary.

          Adrien Grand added a comment -

          Thanks Dawid and Eks for your feedback!

          allocate a static, write-only buffer of 4 or 8kb once and just reuse it

          Right, sounds like a better default impl!

          ByteArrayDataOutput.java – there seems to be a class for this in org.apache.lucene.store.ByteArrayDataOutput?

          I wanted to reuse this class, but I needed something that would grow when necessary (oal.store.BADO just throws an exception when you try to write past the end of the buffer). I could manage growth externally based on checks on the buffer length and calls to ArrayUtil.grow and BADO.reset but it was just as simple to rewrite a ByteArrayDataOutput that would manage it internally...
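          (For illustration, a growable variant along those lines could look like the sketch below; the class and field names are assumptions, not the class from the patch. ArrayUtil.grow is an existing Lucene utility.)

          import org.apache.lucene.store.DataOutput;
          import org.apache.lucene.util.ArrayUtil;

          // Hypothetical sketch of a DataOutput over a byte[] that grows on demand
          // instead of throwing like oal.store.ByteArrayDataOutput does.
          final class GrowableByteArrayDataOutput extends DataOutput {
            byte[] bytes;
            int length;

            GrowableByteArrayDataOutput(int initialSize) {
              bytes = new byte[initialSize];
            }

            @Override
            public void writeByte(byte b) {
              if (length >= bytes.length) {
                bytes = ArrayUtil.grow(bytes); // grow internally, no external bookkeeping
              }
              bytes[length++] = b;
            }

            @Override
            public void writeBytes(byte[] b, int off, int len) {
              bytes = ArrayUtil.grow(bytes, length + len);
              System.arraycopy(b, off, bytes, length, len);
              length += len;
            }
          }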

          Maybe it is worth to keep it there for really short fields. Those general compression algorithms are great for bigger amounts of data, but for really short fields there is nothing like per field compression. Thinking about database usage, e.g. fields with low cardinality, or fields with restricted symbol set (only digits in long UID field for example). Say zip code, product color... is perfectly compressed using something with static dictionary approach (static huffman coder with escape symbol-s, at bit level, or plain vanilla dictionary lookup), and both of them are insanely fast and compress heavily.

          Right, this is exactly why I implemented per-field compression first. Both per-field and cross-field compression have pros and cons. Cross-field compression allows less fine-grained tuning but on the other hand it would probably be a better default since the compression ratio would be better out of the box. Maybe we should implement both?

          I was also thinking that some codecs, such as this kind of per-field compression, and maybe even the bloom, memory, direct and pulsing postings formats, might deserve a separate "codecs" module where we could put these non-default "expert" codecs.

          Robert Muir added a comment -

          I was also thinking that some codecs, such as this kind of per-field compression, and maybe even the bloom, memory, direct and pulsing postings formats, might deserve a separate "codecs" module where we could put these non-default "expert" codecs.

          We have to do something about this soon!

          Do you want to open a separate issue for that (it need not block this issue)?

          I think we would try to get everything concrete we can out of core immediately
          (maybe saving only the default codec for that release), but use the other
          ones for testing. Still we should think about it.

          Adrien Grand added a comment -

          Do you want to open a separate issue for that (it need not block this issue)?

          I created LUCENE-4340.

          David Smiley added a comment -

          I just have a word of encouragement – this is awesome! Keep up the good work Adrien.

          Adrien Grand added a comment -

          Thanks for your kind words, David!

          Here is a new version of the patch. I've thought a lot about whether or not to let users configure per-field compression, but I think we should just try to provide something simple that improves the compression ratio by allowing cross-field and cross-document compression; people who have very specific needs can still implement their own StoredFieldsFormat.

          Moreover, I've had a discussion with Robert, who argued that we should limit the number of classes exposed via SPI because they add complexity (for example, Solr needs to reload the SPI registries every time it adds a core lib directory to the classpath). So I tried to make it simpler: there is no CompressionCodec anymore, and people can choose between 3 different compression modes:

          • FAST, which uses LZ4's fast compressors and uncompressors (for indices that have a high update rate),
          • HIGH_COMPRESSION, which uses deflate (for people who want stored fields to be as compact as possible, no matter what the performance penalty is),
          • FAST_UNCOMPRESSION, which spends more time compressing using LZ4's compress_HC method but still has very fast uncompression (for indices that have a reasonable update rate and need good read performance).

          I also added a test case and applied Dawid's advice to replace the default skipBytes implementation with a bulk-write into a write-only buffer.

          Here is a new benchmark that shows how this new codec can help compress stored fields. This time, I indexed some access.log files generated by the Apache HTTP server. A document consists of a line from the log file and is typically between 100 and 300 bytes. Because every line contains the date of the request, its path and the user-agent of the client, there is a lot of redundancy across documents.

          Format            Chunk size  Compression ratio     IndexReader.document time
          —————————————————————————————————————————————————————————————————————————————
          uncompressed                               100%                         100%
          doc/deflate 1                               90%                        1557%
          doc/deflate 9                               90%                        1539%
          index/FAST               512                50%                         197%
          index/HIGH_COMPRESSION   512                44%                        1545%
          index/FAST_UNCOMPRESSION 512                50%                         198%
          

          Because documents are very small, document-level compression doesn't work well and only makes the .fdt file 10% smaller while loading documents from disk is more than 15 times slower on a hot OS cache.

          However, with this kind of highly redundant input, CompressionMode.FAST looks very interesting as it divides the size of the .fdt file by 2 and only makes IndexReader.document twice slower.

          Adrien Grand added a comment -

          Otis shared a link to this issue on Twitter https://twitter.com/otisg/status/244996292743405571 and some people seem to wonder how it compares to ElasticSearch's block compression.

          ElasticSearch's block compression uses a similar idea: data is compressed into blocks (with fixed sizes that are independent from document sizes). It is based on a CompressedIndexInput/CompressedIndexOutput: Upon closing, CompressedIndexOutput writes a metadata table at the end of the wrapped output that contains the start offset of every compressed block. Upon creation, a CompressedIndexInput first loads this metadata table into memory and can then use it whenever it needs to seek. This is probably the best way to compress small docs with Lucene 3.x.

          With this patch, the size of blocks is not completely independent from document sizes: I make sure that documents don't spread across compressed blocks so that reading a document never requires more than one block to be uncompressed. Moreover, the LZ4 uncompressor (used by FAST and FAST_UNCOMPRESSION) can stop uncompressing whenever it has uncompressed enough data. So unless you need the last document of a compressed block, it is very likely that the uncompressor won't uncompress the whole block before returning.

          Therefore I expect this StoredFieldsFormat to have a similar compression ratio to ElasticSearch's block compression (provided that similar compression algorithms are used) but to be a little faster at loading documents from disk.

          Adrien Grand added a comment -

          New version of the patch. It contains a few enhancements:

          • Merge optimization: whenever possible the StoredFieldsFormat tries to copy compressed data instead of uncompressing it into a buffer before compressing back to an index output,
          • New options for the stored fields index: there are 3 strategies that allow different memory/perf trade-offs:
            • leaving it fully on disk (same as Lucene40, relying on the O/S cache),
            • loading the position of the start of the chunk for every document into memory (requires up to 8 * numDocs bytes, no disk access),
            • loading the position of the start of the chunk and the first doc ID it contains for every chunk (requires up to 12 * numChunks bytes, no disk access, interesting if you have large chunks of compressed data; see the rough sizing example after this list).
          • Improved memory usage and compression ratio (but a little slower) for CompressionMode.FAST (using packed ints).
          • Try to save 1 byte per field by storing the field number and the bits together.
          • More tests.
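          (Rough, hypothetical sizing example for the memory/perf trade-off above: for an index of 10 million documents, the per-document option needs at most 8 * 10,000,000 bytes, i.e. about 80 MB, whereas with an average of 16 documents per chunk there are 625,000 chunks and the per-chunk option needs at most 12 * 625,000 bytes, i.e. about 7.5 MB.)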

          So in the end, this StoredFieldsFormat tries to make disk seeks less likely by:

          • giving the ability to load the stored fields index into memory (you never need to seek to find the position of the chunk that contains your document),
          • reducing the size of the fields data file (.fdt) so that the O/S cache can cache more documents.

          Out of curiosity, I tested whether it could be faster for LZ4 to use intermediate buffers for compression and/or uncompression, and it is slower than accessing the index input/output directly (at least with MMapDirectory).

          I hope I'll have something committable soon.

          Adrien Grand added a comment -

          New patch:

          • improved documentation,
          • I added CompressingCodec to the list of automatically tested codecs in test-framework,
          • a few bug fixes.

          Please let me know if you would like to review this patch before I commit. Otherwise, I'll commit shortly...

          Robert Muir added a comment -

          im on the phone but i have some questions. give me a few

          Adrien Grand added a comment -

          Oh, I didn't mean THAT shortly

          Robert Muir added a comment -

          Shouldn't ByteArrayDataInput override skip to just bump its 'pos'?

          Can we plug various schemes into MockRandomCodec?

          Robert Muir added a comment -

          OK MockRandom is just postings now... I think we should have a MockRandomCodec too!

          Adrien Grand added a comment -

          Almost the same patch. I removed ByteArrayDataInput.skipBytes(int) and removed "throws IOException" from ByteArrayDataInput.skipBytes(long).

          I think we should have a MockRandomCodec too!

          Maybe we should fix it in a separate issue?

          Robert Muir added a comment -

          Yeah: I opened another issue to try to straighten this out. We can just bring these Frankenstein codecs up to speed there.

          Robert Muir added a comment -

          I'm not a fan of the skipBytes on DataInput. It's not actually necessary or used for this patch?

          And today DataInput is always forward-only; I don't like the "may or may not be bidirectional depending on whether the impl throws UOE".

          I removed it locally and just left it on IndexInput; I think this is cleaner.

          Adrien Grand added a comment -

          I'm not a fan of the skipBytes on DataInput. It's not actually necessary or used for this patch?

          You're right! I needed it in the first versions of the patch when I reused Lucene40StoredFieldsFormat, but it looks like it's not needed anymore. Let's get rid of it!

          Adrien Grand added a comment -

          New patch that removes DataInput.skipBytes; this patch no longer has any modifications in lucene-core.

          Adrien Grand added a comment -

          Slightly modified patch in order not to seek when writing the stored fields index.

          Adrien Grand added a comment -

          I just committed to trunk. I'll wait a couple of days to make sure Jenkins builds pass before backporting to 4.x. By the way, would it be possible to have one of the Jenkins servers run lucene-core tests with -Dtests.codec=Compressing for some time?

          Simon Willnauer added a comment -

          By the way, would it be possible to have one of the Jenkins servers run lucene-core tests with -Dtests.codec=Compressing for some time?

          FYI - http://builds.flonkings.com/job/Lucene-trunk-Linux-Java6-64-test-only-compressed/

          Adrien Grand added a comment -

          Thanks, Simon!

          Adrien Grand added a comment -

          lucene-core tests have passed the whole weekend, so I just committed to branch 4.x as well. Thank you again for the Jenkins job, Simon.

          Radim Kolar added a comment -

          Is there an example config provided?

          Simon Willnauer added a comment -

          @adrien I deleted the Jenkins job for this.

          Adrien Grand added a comment -

          @radim you can have a look at CompressingCodec in lucene/test-framework
          @Simon ok, thanks!
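          (As a rough illustration of the usual wiring, not taken from CompressingCodec itself: a custom codec can delegate everything to the default codec and swap only the StoredFieldsFormat, then be set on the IndexWriterConfig. The class name below and the choice of which format instance to return are assumptions.)

          import org.apache.lucene.codecs.Codec;
          import org.apache.lucene.codecs.FilterCodec;
          import org.apache.lucene.codecs.StoredFieldsFormat;

          // Hypothetical example codec: delegates to the default codec and overrides
          // only the stored fields format.
          public class MyCompressingCodec extends FilterCodec {
            public MyCompressingCodec() {
              super("MyCompressingCodec", Codec.getDefault());
            }

            @Override
            public StoredFieldsFormat storedFieldsFormat() {
              // Return the compressing StoredFieldsFormat from this issue here;
              // the delegate's format is used as a placeholder in this sketch.
              return delegate.storedFieldsFormat();
            }
          }

          // Usage (assumed): indexWriterConfig.setCodec(new MyCompressingCodec());

          (Note that for segments written with a custom codec to remain readable, the codec also has to be registered with Lucene's SPI lookup, i.e. listed in META-INF/services/org.apache.lucene.codecs.Codec.)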

          Commit Tag Bot added a comment -

          [branch_4x commit] Adrien Grand
          http://svn.apache.org/viewvc?view=revision&revision=1395491

          LUCENE-4226: Efficient stored fields compression (merged from r1394578).

          Uwe Schindler added a comment -

          Closed after release.


            People

            • Assignee:
              Adrien Grand
              Reporter:
              Adrien Grand
            • Votes:
              3
              Watchers:
              11
