Lucene - Core · LUCENE-4512

Additional memory savings in CompressingStoredFieldsIndex.MEMORY_CHUNK

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.1, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      Robert had a great idea to save memory with CompressingStoredFieldsIndex.MEMORY_CHUNK: instead of storing the absolute start pointers we could compute the mean number of bytes per chunk of documents and only store the delta between the actual value and the expected value (avgChunkBytes * chunkNumber).

      By applying this idea to every n(=1024?) chunks, we would even:

      • make sure to never hit the worst case (delta ~= maxStartPointer)
      • reduce memory usage at indexing time.
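
      A minimal sketch of this delta-from-average idea (hypothetical names, not the code from the attached patch; here the average is simply derived from the first and last start pointers):

        // Instead of storing absolute chunk start pointers, store only the signed
        // difference from the expected position avgChunkBytes * chunkNumber.
        class StartPointerDeltas {
          static long[] deltasFromAverage(long[] startPointers) {
            final int numChunks = startPointers.length;
            final long avgChunkBytes = numChunks <= 1
                ? 0
                : (startPointers[numChunks - 1] - startPointers[0]) / (numChunks - 1);
            final long[] deltas = new long[numChunks];
            for (int chunk = 0; chunk < numChunks; chunk++) {
              final long expected = startPointers[0] + avgChunkBytes * chunk;
              deltas[chunk] = startPointers[chunk] - expected; // small values need fewer bits
            }
            return deltas;
          }
        }

      Applying this per block of n chunks, with the average recomputed for each block, is what bounds the worst case noted above.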

      Attachments

      1. LUCENE-4512.patch (20 kB, Adrien Grand)
      2. LUCENE-4512.patch (22 kB, Adrien Grand)

        Activity

        Robert Muir added a comment -

        I do think we should use n=(some power of 2 or whatever) chunks, because e.g. just testing with that geonames dataset I saw the deltas grow quite large at points... this caused it to use 24 bits per value (still better than 29), but with a tiny bit of effort I think it could be significantly less.

        Adrien Grand added a comment -

        Patch. I removed MEMORY_DOC and modified MEMORY_CHUNK to apply the proposed changes:

        • it works with blocks of n=1024 (hard-wired) chunks,
        • for every block, doc bases and start pointers are compressed by only storing the delta from the average.

        Robert Muir added a comment -

        I tested this really fast on that geonames data again: 72 chunks with bpvs of 16-20 (avg 18 I think). So this is quite a bit more savings than 29 bpv with the trunk code.

        I didn't look at the code too much, but since we are computing the average at index-time (I think?), do you think it still makes sense to encode the deltas from the previous value, or should we just up-front encode them at index-time as deltas from the average (if it makes things simpler)?

        Adrien Grand added a comment -

        72 chunks with bpvs of 16-20 (avg 18 I think).

        That is good but I was expecting the distance from average (128 kB here) to be less than the chunk size (16 kB), which is clearly not the case. Is there anything in the dataset that could explain why chunk sizes vary so much? Or maybe we should just decrease the block size, or the average is wrongly computed...

        do you think it still makes sense to encode the deltas from the previous value

        Good question. Encoding deltas currently requires 14 or 15 bits per value (because it can grow a little larger than the chunk size, which is 2^14), so it is still a little more compact, and it is less prone to worst cases I think? There is some overhead at read time to build the packed ints array instead of just deserializing it, but I think this is negligible. If we manage to make bpvs smaller than 14 on "standard" datasets then I think it makes sense.
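
        A quick sketch of the arithmetic behind that 14-or-15 figure (assuming the ~16 kB chunk size discussed above):

          // Bits needed for consecutive-pointer deltas when a chunk is ~2^14 bytes.
          class DeltaBits {
            public static void main(String[] args) {
              long chunkSize = 16 * 1024; // 2^14 bytes
              // bits required to represent a value v: 64 - numberOfLeadingZeros(v)
              System.out.println(64 - Long.numberOfLeadingZeros(chunkSize));     // 15
              System.out.println(64 - Long.numberOfLeadingZeros(chunkSize - 1)); // 14
              // Deltas strictly below 2^14 fit in 14 bits; deltas that reach or slightly
              // exceed the chunk size need 15, hence 14 or 15 bits per value.
            }
          }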

        Robert Muir added a comment -

        That is good but I was expecting the distance from average (128 kB here) to be less than the chunk size (16 kB), which is clearly not the case. Is there anything in the dataset that could explain why chunk sizes vary so much? Or maybe we should just decrease the block size, or the average is wrongly computed...

        Probably, I bet rows from the same country and even provinces within a country are typically grouped together? Though before this JIRA issue, I did experiments randomizing the dataset with sort -r and it didn't make much difference...

        In all cases you can get it from http://download.geonames.org/export/dump/allCountries.zip
        It's UTF-8 and you can parse it with split("\t").
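
        For reference, reading that dump could look roughly like this (a sketch, assuming the zip has been extracted to allCountries.txt):

          import java.io.BufferedReader;
          import java.nio.charset.StandardCharsets;
          import java.nio.file.Files;
          import java.nio.file.Paths;

          public class GeonamesDump {
            public static void main(String[] args) throws Exception {
              try (BufferedReader reader = Files.newBufferedReader(
                  Paths.get("allCountries.txt"), StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                  String[] columns = line.split("\t"); // tab-separated, UTF-8, as noted above
                  // e.g. feed each row to the stored fields test as one document
                }
              }
            }
          }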

        Good question. Encoding deltas currently requires 14 or 15 bits per value (because it can grow a little larger than the chunk size, which is 2^14), so it is still a little more compact, and it is less prone to worst cases I think? There is some overhead at read time to build the packed ints array instead of just deserializing it, but I think this is negligible. If we manage to make bpvs smaller than 14 on "standard" datasets then I think it makes sense.

        Well I wasn't really thinking about a few smaller bits on disk... if we want that, LZ4 this "metadata stuff" too (just kidding!).

        I was just thinking simpler code in the reader.

        Robert Muir added a comment -

        And once you get all this baked in aren't you itching to do the vectors files too?

        Adrien Grand added a comment -

        I did some tests with the 1K docs from the Wikipedia dump:

        • always 16 or 17 bpvs for start pointers (my intuition was wrong!),
        • the CompressingStoredFieldsIndex instance is 185.3 KB (measured with RamUsageEstimator) for 1M docs (0.19 bytes per doc, 3.24 bytes per chunk).

        I tried some other block sizes:

        • 256: 189.2 KB
        • 4096: 204.8 KB

        1024 looks like a good setting.
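
        Roughly how a figure like the 185.3 KB above could be obtained with Lucene's RamUsageEstimator (the measured object below is just a placeholder):

          import org.apache.lucene.util.RamUsageEstimator;

          class MeasureRam {
            // Sketch: report the retained heap size of an in-memory index structure.
            static String retainedSize(Object fieldsIndex) {
              long bytes = RamUsageEstimator.sizeOf(fieldsIndex);
              return RamUsageEstimator.humanReadableUnits(bytes); // e.g. "185.3 KB"
            }
          }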

        I was just thinking simpler code in the reader.

        Hmm good point. It is true that it is already complex enough... Here is a new patch.

        And once you get all this baked in aren't you itching to do the vectors files too?

        I started thinking about it but I'm not very familiar with the term vectors file formats yet. There are probably other places that might benefit from compression (terms dictionary?).

        Robert Muir added a comment -

        This is the one you committed! Marking resolved.

        Commit Tag Bot added a comment -

        [branch_4x commit] Adrien Grand
        http://svn.apache.org/viewvc?view=revision&revision=1404216

        LUCENE-4512: Additional memory savings for CompressingStoredFieldsIndex.MEMORY_CHUNK (merged from r1404215).


          People

          • Assignee: Adrien Grand
          • Reporter: Adrien Grand
          • Votes: 0
          • Watchers: 2
