Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-738

read/write .del as d-gaps when the deleted bit vector is sufficiently sparse

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1
    • Fix Version/s: None
    • Component/s: core/store
    • Labels:
      None
    • Lucene Fields:
      Patch Available

      Description

      .del file of a segment maintains info on deleted documents in that segment. The file exists only for segments having deleted docs, so it does not exists for newly created segments (e.g. resulted from merge). Each time closing an index reader that deleted any document, the .del file is rewritten. In fact, since the lock-less commits change a new (generation of) .del file is created in each such occasion.

      For small indexes there is no real problem with current situation. But for very large indexes, each time such an index reader is closed, creating such new bit-vector seems like unnecessary overhead in cases that the bit vector is sparse (just a few docs were deleted). For instance, for an index with a segment of 1M docs, the sequence:

      {open reader; delete 1 doc from that segment; close reader;}

      would write a file of ~128KB. Repeat this sequence 8 times: 8 new files of total size of 1MB are written to disk.

      Whether this is a bottleneck or not depends on the application deletes pattern, but for the case that deleted docs are sparse, writing just the d-gaps would save space and time.

      I have this (simple) change to BitVector running and currently trying some performance tests to, yet, convince myself on the worthiness of this.

      1. del.dgap.patch.txt
        6 kB
        Doron Cohen
      2. FileFormatDoc.patch.txt
        3 kB
        Doron Cohen

        Activity

        Hide
        yseeley@gmail.com Yonik Seeley added a comment -

        Sounds like a good idea!

        Show
        yseeley@gmail.com Yonik Seeley added a comment - Sounds like a good idea!
        Hide
        doronc Doron Cohen added a comment -

        I tried two implementations:
        (1) writing d-gaps for ids of deleted docs, and
        (2) writing d-gaps for indexes of non zero bytes in the "byte bits []" array, plus the non-zero byte itself.

        I favor (2) because it needs less CPU and because its code is simpler.

        I ran the following performance test (Win XP, ~2GHz, 2GB RAM):

        • create an index with 2,000,000 short docs.
        • delete 4,000 docs in steps of 8: 0, 8, 16, 24, ...
          ("steps of 8" is worst case for the new code, because this way a single written byte never covers more than one deleted doc.)

        Results for original code:

        Operation......runCnt...recsPerRun........rec/s..elapsedSec
        Deletions_4000......1........12000........232.1.......51.70
        OpenReader.....4000........1......150.0.....26.67
        DeleteDoc........4000............1......1,542.6........2.59
        CloseReader.....4000........1......180.0...-..22.23

        Size of the .del file was 245KB.

        Results for new code:

        Operation......runCnt...recsPerRun........rec/s..elapsedSec
        Deletions_4000......1........12000........263.8.......45.49
        OpenReader.....4000........1......173.3.....23.08
        DeleteDoc........4000............1......1,679.3........2.38
        CloseReader.....4000........1......201.4...-..19.86

        Size of .del file was (at the end) 8KB.

        So the improvement in this somewhat artificial scenario was 13%.
        (I initially thought it would be higher).

        I noticed when running repeatedly that the new code run time is more steady, while the original code run time fluctuates - e.g. 58.8, 51.7, 56.0, 51.8 (reported above are the best times.)

        We should ask: is this improvement sufficient to justify making BitVector code less simple?
        Is a shorter .del file an advantage by itself?
        I tend to "yes" for both questions, but would like to hear more opinions.

        Change is backwards compatible - when opening a .del file (actually a bit-vector file) that was written with older version, it identifies that and reads it correctly. Existing backwards compatibility tests (of lock-less commits) have .del files that are opened correctly with new code.

        I added a test case to BitVectorTest - it tests the switch from writing d-gaps to writing bits as the number of set bits increases or decreases.
        This test also shows for what range of values d-gaps are written:

        • size=100: never
        • size=1,000: count < 6
        • size=10,000: count < 42
        • size=100,000: count < 417
        • size=1,000,000: count < 3125

        All tests pass.

        Show
        doronc Doron Cohen added a comment - I tried two implementations: (1) writing d-gaps for ids of deleted docs, and (2) writing d-gaps for indexes of non zero bytes in the "byte bits []" array, plus the non-zero byte itself. I favor (2) because it needs less CPU and because its code is simpler. I ran the following performance test (Win XP, ~2GHz, 2GB RAM): create an index with 2,000,000 short docs. delete 4,000 docs in steps of 8: 0, 8, 16, 24, ... ("steps of 8" is worst case for the new code, because this way a single written byte never covers more than one deleted doc.) Results for original code: Operation......runCnt...recsPerRun........rec/s..elapsedSec Deletions_4000......1........12000........232.1.......51.70 OpenReader. .. ..4000. .. .. .. .1. .. ...150.0. .. ..26.67 DeleteDoc........4000............1......1,542.6........2.59 CloseReader. ....4000. .. .. .. .1. .. ...180.0. ..-..22.23 Size of the .del file was 245KB. Results for new code: Operation......runCnt...recsPerRun........rec/s..elapsedSec Deletions_4000......1........12000........263.8.......45.49 OpenReader. .. ..4000. .. .. .. .1. .. ...173.3. .. ..23.08 DeleteDoc........4000............1......1,679.3........2.38 CloseReader. ....4000. .. .. .. .1. .. ...201.4. ..-..19.86 Size of .del file was (at the end) 8KB. So the improvement in this somewhat artificial scenario was 13%. (I initially thought it would be higher). I noticed when running repeatedly that the new code run time is more steady, while the original code run time fluctuates - e.g. 58.8, 51.7, 56.0, 51.8 (reported above are the best times.) We should ask: is this improvement sufficient to justify making BitVector code less simple? Is a shorter .del file an advantage by itself? I tend to "yes" for both questions, but would like to hear more opinions. Change is backwards compatible - when opening a .del file (actually a bit-vector file) that was written with older version, it identifies that and reads it correctly. Existing backwards compatibility tests (of lock-less commits) have .del files that are opened correctly with new code. I added a test case to BitVectorTest - it tests the switch from writing d-gaps to writing bits as the number of set bits increases or decreases. This test also shows for what range of values d-gaps are written: size=100: never size=1,000: count < 6 size=10,000: count < 42 size=100,000: count < 417 size=1,000,000: count < 3125 All tests pass.
        Hide
        yseeley@gmail.com Yonik Seeley added a comment -

        Doron, I agree with what you have described so far.
        One question... are d-gaps written as a 4 byte integer or as a variable vInt, and what
        drove the decision behind that?

        Show
        yseeley@gmail.com Yonik Seeley added a comment - Doron, I agree with what you have described so far. One question... are d-gaps written as a 4 byte integer or as a variable vInt, and what drove the decision behind that?
        Hide
        doronc Doron Cohen added a comment -

        Patch added: "del.dgap.patch.txt" for the above optn "(1) writing d-gaps for ids of deleted docs".

        Patch changes index format, but is backwards compatible.

        I still need to update the FileFormats document - will add that part of the patch later.

        Show
        doronc Doron Cohen added a comment - Patch added: "del.dgap.patch.txt" for the above optn "(1) writing d-gaps for ids of deleted docs". Patch changes index format, but is backwards compatible. I still need to update the FileFormats document - will add that part of the patch later.
        Hide
        doronc Doron Cohen added a comment -

        > are d-gaps written as a 4 byte integer or as a variable vInt,
        > and what drove the decision behind that?

        Vints are used, to save space. Vints (max) size is taken into acount in isSparse().

        Show
        doronc Doron Cohen added a comment - > are d-gaps written as a 4 byte integer or as a variable vInt, > and what drove the decision behind that? Vints are used, to save space. Vints (max) size is taken into acount in isSparse().
        Hide
        yseeley@gmail.com Yonik Seeley added a comment -

        Did a quick code review, everything looks good to me.
        +1

        Show
        yseeley@gmail.com Yonik Seeley added a comment - Did a quick code review, everything looks good to me. +1
        Hide
        doronc Doron Cohen added a comment -

        FileFormat document updated to reflect this format change.

        Show
        doronc Doron Cohen added a comment - FileFormat document updated to reflect this format change.
        Hide
        yseeley@gmail.com Yonik Seeley added a comment -

        Committed. Thanks Doron!

        Show
        yseeley@gmail.com Yonik Seeley added a comment - Committed. Thanks Doron!

          People

          • Assignee:
            doronc Doron Cohen
            Reporter:
            doronc Doron Cohen
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development