Lucene - Core
  1. Lucene - Core
  2. LUCENE-2086

When resolving deletes, IW should resolve in term sort order

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9.2, 3.0.1, 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      See java-dev thread "IndexWriter.updateDocument performance improvement".

      1. LUCENE-2086.patch
        4 kB
        Michael McCandless

        Activity

        Hide
        Michael McCandless added a comment -

        Attached patch.

        Show
        Michael McCandless added a comment - Attached patch.
        Hide
        Yonik Seeley added a comment -

        for better locality for the disk heads

        It's not just the disk heads, right?
        I assumed the majority of the benefit was due to the cached TermEnum scan (instead of a seek) in TermInfosReader.get()

        Show
        Yonik Seeley added a comment - for better locality for the disk heads It's not just the disk heads, right? I assumed the majority of the benefit was due to the cached TermEnum scan (instead of a seek) in TermInfosReader.get()
        Hide
        Michael McCandless added a comment -

        Ahh, you're right, so long as your deletes are within the same index block (128 terms in length), we avoid the binary search through the terms index and simply scan within the block. Though, you need relatively high density of deletions to see that. Also, no matter what when you cross an indexed term, the binary search will be done. I'll genericize the language in the CHANGES entry.

        And actually this reminds to go make sure the flex branch is doing this optimization too...

        Show
        Michael McCandless added a comment - Ahh, you're right, so long as your deletes are within the same index block (128 terms in length), we avoid the binary search through the terms index and simply scan within the block. Though, you need relatively high density of deletions to see that. Also, no matter what when you cross an indexed term, the binary search will be done. I'll genericize the language in the CHANGES entry. And actually this reminds to go make sure the flex branch is doing this optimization too...
        Hide
        Michael McCandless added a comment -

        OK I changed changes entry to:

        • LUCENE-2086: When resolving deleted terms, do so in term sort order
          for better performance (Bogdan Ghidireac via Mike McCandless)
        Show
        Michael McCandless added a comment - OK I changed changes entry to: LUCENE-2086 : When resolving deleted terms, do so in term sort order for better performance (Bogdan Ghidireac via Mike McCandless)
        Hide
        Yonik Seeley added a comment -

        Though, you need relatively high density of deletions to see that.

        Yep, but re-indexing a whole bunch of documents is a common case - and that will give a very high density (often consecutive terms).
        Also, it may be a complete index build / rebuild, where all the terms will be consecutive. Solr, for instance, can't tell if there are any duplicates when building an index from scratch, so it must use update(). There is a way to tell Solr not to enforce duplicate overwriting (overwrite=false param) but I doubt many people do that, and it risks user error (using it when there are dups in the docs being added, or already in the index).

        Show
        Yonik Seeley added a comment - Though, you need relatively high density of deletions to see that. Yep, but re-indexing a whole bunch of documents is a common case - and that will give a very high density (often consecutive terms). Also, it may be a complete index build / rebuild, where all the terms will be consecutive. Solr, for instance, can't tell if there are any duplicates when building an index from scratch, so it must use update(). There is a way to tell Solr not to enforce duplicate overwriting (overwrite=false param) but I doubt many people do that, and it risks user error (using it when there are dups in the docs being added, or already in the index).
        Hide
        Michael McCandless added a comment -

        Excellent, so this is an important optimization! I'll commit shortly...

        Show
        Michael McCandless added a comment - Excellent, so this is an important optimization! I'll commit shortly...
        Hide
        Tim Smith added a comment -

        any chance this can go into 3.0.0 or a 3.0.1?

        Show
        Tim Smith added a comment - any chance this can go into 3.0.0 or a 3.0.1?
        Hide
        Michael McCandless added a comment -

        any chance this can go into 3.0.0 or a 3.0.1?

        I don't think it should be backported (it's an optimization). Generally we [should] only back port bug fixes.

        Show
        Michael McCandless added a comment - any chance this can go into 3.0.0 or a 3.0.1? I don't think it should be backported (it's an optimization). Generally we [should] only back port bug fixes.
        Hide
        Tim Smith added a comment -

        i've seen the deletes dominating commit time quite often, so obviously it would be very useful to be able to absorb this optimization sooner than later (whats the timeframe for 3.1?)

        otherwise i'll have to override the classes involved and pull in this patch (never like this approach myself)

        Show
        Tim Smith added a comment - i've seen the deletes dominating commit time quite often, so obviously it would be very useful to be able to absorb this optimization sooner than later (whats the timeframe for 3.1?) otherwise i'll have to override the classes involved and pull in this patch (never like this approach myself)
        Hide
        Jason Rutherglen added a comment -

        I don't think it should be backported (it's an optimization). Generally we [should] only back port bug fixes.

        It doesn't break backwards compatibility and it's well under the hood so it seems like something that go into a sub decimal release?

        Show
        Jason Rutherglen added a comment - I don't think it should be backported (it's an optimization). Generally we [should] only back port bug fixes. It doesn't break backwards compatibility and it's well under the hood so it seems like something that go into a sub decimal release?
        Hide
        Michael McCandless added a comment -

        i've seen the deletes dominating commit time quite often, so obviously it would be very useful to be able to absorb this optimization sooner than later (whats the timeframe for 3.1?)

        I'm not sure how much gain you can expect from this patch (there are
        many factors involved) – maybe try it & report back?

        Not sure what the timeframe is for 3.1 at this point...

        otherwise i'll have to override the classes involved and pull in this patch (never like this approach myself)

        I understand... you could run with trunk (and report back!)

        It doesn't break backwards compatibility and it's well under the hood so it seems like something that go into a sub decimal release?

        I know it's tempting to do so, but I think it's important to hold the
        line on only back-porting important bug fixes... as innocent as this
        change looks, it's always possible I screwed something up, and so
        why risk a point release with that? Point releases are supposed to
        be as stable as we can possibly make them.

        This way the change has plenty of time to "bake" on trunk and if
        something is amiss we'll have much more time/attention to catch it.

        I'd rather see us release a 3.1 sooner rather than later, instead.

        Show
        Michael McCandless added a comment - i've seen the deletes dominating commit time quite often, so obviously it would be very useful to be able to absorb this optimization sooner than later (whats the timeframe for 3.1?) I'm not sure how much gain you can expect from this patch (there are many factors involved) – maybe try it & report back? Not sure what the timeframe is for 3.1 at this point... otherwise i'll have to override the classes involved and pull in this patch (never like this approach myself) I understand... you could run with trunk (and report back!) It doesn't break backwards compatibility and it's well under the hood so it seems like something that go into a sub decimal release? I know it's tempting to do so, but I think it's important to hold the line on only back-porting important bug fixes... as innocent as this change looks, it's always possible I screwed something up, and so why risk a point release with that? Point releases are supposed to be as stable as we can possibly make them. This way the change has plenty of time to "bake" on trunk and if something is amiss we'll have much more time/attention to catch it. I'd rather see us release a 3.1 sooner rather than later, instead.
        Hide
        Tim Smith added a comment -

        maybe try it & report back?

        i'll see if i can find some cycles to try this against the most painful use case i have

        I'd rather see us release a 3.1 sooner rather than later, instead.

        yes please.
        I would definitely like to see a more accelerated release cycle (even if less functionality gets into each minor release)

        Show
        Tim Smith added a comment - maybe try it & report back? i'll see if i can find some cycles to try this against the most painful use case i have I'd rather see us release a 3.1 sooner rather than later, instead. yes please. I would definitely like to see a more accelerated release cycle (even if less functionality gets into each minor release)
        Hide
        Tim Smith added a comment -

        Got some performance numbers:

        Description of test (NOTE: this is representative of actions that may occur in a running system (not a contrived test)):

        • feed 4 million operations (3/4 are deletes, 1/4 are updates (single field))
        • commit
        • feed 1 million updates (about 1/3 are updates, 2/3/ deletes (randomly selected))
        • commit

        Numbers:

        Desc Old New
        feed 4 million 56914ms 15698ms
        commit 4 million 9072ms 14291ms
        total (4 million) 65986ms 29989ms
        update 1 million 46096ms 11340ms
        commit 1 million 13501ms 9273ms
        total (1 million) 59597ms 20613ms

        This shows significant improvements with new patched data (1/3 the time for 1 million, about 1/2 the time for initial 4 million feed)

        This means i'm gonna definitely need to incorporate this patch while i'm still on 3.0 (will upgrade to 3.0 as soon as its out, then apply this fix)
        Ideally, a 3.0.1 would be forthcoming in the next month or so with this fix so i wouldn't have to maintain this patched overlay of code

        Show
        Tim Smith added a comment - Got some performance numbers: Description of test (NOTE: this is representative of actions that may occur in a running system (not a contrived test)): feed 4 million operations (3/4 are deletes, 1/4 are updates (single field)) commit feed 1 million updates (about 1/3 are updates, 2/3/ deletes (randomly selected)) commit Numbers: Desc Old New feed 4 million 56914ms 15698ms commit 4 million 9072ms 14291ms total (4 million) 65986ms 29989ms update 1 million 46096ms 11340ms commit 1 million 13501ms 9273ms total (1 million) 59597ms 20613ms This shows significant improvements with new patched data (1/3 the time for 1 million, about 1/2 the time for initial 4 million feed) This means i'm gonna definitely need to incorporate this patch while i'm still on 3.0 (will upgrade to 3.0 as soon as its out, then apply this fix) Ideally, a 3.0.1 would be forthcoming in the next month or so with this fix so i wouldn't have to maintain this patched overlay of code
        Hide
        Michael McCandless added a comment -

        Hmm these really are sizable gains! Thanks for testing.

        I guess since the change is so minor and the gains so sizable we should back port for 3.0.1. If nobody objects I'll do so in a day or two...

        Show
        Michael McCandless added a comment - Hmm these really are sizable gains! Thanks for testing. I guess since the change is so minor and the gains so sizable we should back port for 3.0.1. If nobody objects I'll do so in a day or two...
        Hide
        Uwe Schindler added a comment -

        No problem, you can commit to 3.0 branch as soon as the 3 votes for the release are there and I can release the current artifacts.

        Show
        Uwe Schindler added a comment - No problem, you can commit to 3.0 branch as soon as the 3 votes for the release are there and I can release the current artifacts.
        Hide
        Mark Miller added a comment -

        No objection, but its an awkward precedent - you can see any bleeding edge user always wanting the latest optimization in the next bug fix release (considering how long you'll likely have to wait for x.n).

        But as I'm one that was somewhat pro this for 2.9 (due to some being stuck on 1.4), I won't try and stop it here. I'm a big fan of case by case in general anyway. I agree with your previous comment about trying to keep a bug fix release as stable as possible - and this being a minor change, that would seemt to go along with it - but code is a funny beast, even when dealing with the simple ...

        Show
        Mark Miller added a comment - No objection, but its an awkward precedent - you can see any bleeding edge user always wanting the latest optimization in the next bug fix release (considering how long you'll likely have to wait for x.n). But as I'm one that was somewhat pro this for 2.9 (due to some being stuck on 1.4), I won't try and stop it here. I'm a big fan of case by case in general anyway. I agree with your previous comment about trying to keep a bug fix release as stable as possible - and this being a minor change, that would seemt to go along with it - but code is a funny beast, even when dealing with the simple ...
        Hide
        Michael McCandless added a comment -

        Yeah there's an exception to every rule Fast forward a month or two, if we find out I somehow broke 3.0.1 by back-porting this fix, I'll be real grumpy!

        Show
        Michael McCandless added a comment - Yeah there's an exception to every rule Fast forward a month or two, if we find out I somehow broke 3.0.1 by back-porting this fix, I'll be real grumpy!
        Hide
        Michael McCandless added a comment -

        No problem, you can commit to 3.0 branch as soon as the 3 votes for the release are there and I can release the current artifacts.

        BTW can't/won't you cut the tag back on the right revision, anyway? Ie why freeze the 3.0 branch from any backports?

        Show
        Michael McCandless added a comment - No problem, you can commit to 3.0 branch as soon as the 3 votes for the release are there and I can release the current artifacts. BTW can't/won't you cut the tag back on the right revision, anyway? Ie why freeze the 3.0 branch from any backports?
        Hide
        Uwe Schindler added a comment - - edited

        Just commit it. It is so simple, so if I respin, it can be in 3.0.0.

        I did'nt want to have commits in 3.0, because if I respin a release, I would not be able to only take some of the fixes into 3.0.0. That was the reason.

        Can you put this also in 2.9.2 if you remove the generics?

        Show
        Uwe Schindler added a comment - - edited Just commit it. It is so simple, so if I respin, it can be in 3.0.0. I did'nt want to have commits in 3.0, because if I respin a release, I would not be able to only take some of the fixes into 3.0.0. That was the reason. Can you put this also in 2.9.2 if you remove the generics?
        Hide
        Michael McCandless added a comment -

        I did'nt want to have commits in 3.0, because if I respin a release, I would not be able to only take some of the fixes into 3.0.0. That was the reason.

        Can you put this also in 2.9.2 if you remove the generics?

        OK I'll backport...

        Show
        Michael McCandless added a comment - I did'nt want to have commits in 3.0, because if I respin a release, I would not be able to only take some of the fixes into 3.0.0. That was the reason. Can you put this also in 2.9.2 if you remove the generics? OK I'll backport...
        Hide
        Michael McCandless added a comment -

        Backported to 3.0.x...

        2.9.x next.

        Show
        Michael McCandless added a comment - Backported to 3.0.x... 2.9.x next.
        Hide
        Michael McCandless added a comment -

        OK backported to 2.9.x.

        Show
        Michael McCandless added a comment - OK backported to 2.9.x.

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development