Lucene - Core / LUCENE-3424

Return sequence ids from IW update/delete/add/commit to allow total ordering outside of IW

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.9, 5.0
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Based on the discussion on the mailing list, IW should return sequence ids from update/delete/add and commit to allow ordering of events for consistent transaction logs and recovery.

      1. LUCENE-3424.patch
        59 kB
        Simon Willnauer

        Activity

        Simon Willnauer added a comment -

        Here is a first patch that adds sequence ids to the IndexWriter. The add, update and delete methods return a long sequence id which is incremented for each operation. For updates and deletes the sequence ids introduce a small overhead in the DeleteQueue since I have to add a long value to each item. However, for addDocument I now have to add an empty item to the queue so that the seq id also increases when you add a document. Since those queue items are very short-lived I think this is feasible.

        If that is too much overhead we can also disable this by default via IWC and make it optional; this is actually very straightforward.

        Reviews & comments are much appreciated.
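
        As an illustration of the idea in this comment, here is a minimal sketch of a linked queue in which every appended item carries a sequence id one higher than the current tail's, and a plain add appends an empty item just to advance the id. The class and method names (`SeqIdQueue`, `append`, `add`, `delete`) are hypothetical and are not taken from the patch or from Lucene's actual delete queue.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch of a sequence-numbered queue; not the code from the patch. */
final class SeqIdQueue {
    /** One queue entry. A null payload stands for the "empty item" appended by a plain add. */
    static final class Node {
        final Object payload;
        final long seqId;
        final AtomicReference<Node> next = new AtomicReference<>();

        Node(Object payload, long seqId) {
            this.payload = payload;
            this.seqId = seqId;
        }
    }

    // Sentinel node with seq id 0, so the first real item gets id 1.
    private final AtomicReference<Node> tail = new AtomicReference<>(new Node(null, 0L));

    /** Appends an item; its seq id is final once the previous tail's next pointer is set. */
    long append(Object payload) {
        while (true) {
            Node t = tail.get();
            Node n = new Node(payload, t.seqId + 1);
            if (t.next.compareAndSet(null, n)) {
                tail.compareAndSet(t, n); // best effort; other threads help if this fails
                return n.seqId;
            }
            // Another thread appended first: help advance the tail, then retry.
            Node successor = t.next.get();
            if (successor != null) {
                tail.compareAndSet(t, successor);
            }
        }
    }

    /** A delete carries its term as payload (assumed representation). */
    long delete(String term) {
        return append(term);
    }

    /** A plain add appends an empty item purely to obtain the next seq id. */
    long add() {
        return append(null);
    }
}
```

        In this sketch the only extra cost of an add is one short-lived node and one compare-and-set, which is the kind of overhead the comment argues is acceptable.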

        Michael McCandless added a comment -

        Patch looks great!

        The basic idea is every IW op (add/update/delete) returns a long
        seqID. This is a "transient" thing (only useful in RAM in your
        current IW session; never stored in the index nor in RAM), and the app
        can use it to know the precise order-of-ops inside IW, to know eg if a
        delete and add happen from two threads at once, which one "took".
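
        A rough sketch of how an application might use such ids is shown here. `FakeWriter` is a stand-in for a writer whose operations return a sequence id, and `TranslogSketch` with its methods is hypothetical; none of these names come from Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical application-side log keyed by the sequence ids a writer returns. */
final class TranslogSketch {
    /** Stand-in for an IndexWriter whose operations return a sequence id (assumption). */
    static final class FakeWriter {
        private final AtomicLong seq = new AtomicLong();

        long addDocument(String id) {
            return seq.incrementAndGet();
        }

        long deleteDocuments(String id) {
            return seq.incrementAndGet();
        }
    }

    final FakeWriter writer = new FakeWriter();
    // Sorted by seq id, so iteration order equals the order the writer applied the ops.
    final ConcurrentSkipListMap<Long, String> log = new ConcurrentSkipListMap<>();

    void add(String id) {
        log.put(writer.addDocument(id), "add " + id);
    }

    void delete(String id) {
        log.put(writer.deleteDocuments(id), "delete " + id);
    }

    /** Ops in the order the writer applied them, regardless of which thread logged first. */
    List<String> replay() {
        return new ArrayList<>(log.values());
    }

    /** The op with the highest seq id is the one that "took". */
    String lastOp() {
        return log.lastEntry().getValue();
    }
}
```

        Because the log is keyed by the returned id rather than by arrival time, two threads racing on the same document still produce an unambiguous replay order.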

        The seqID should never be the same for any 2 ops, even across threads,
        right? Will it ever have "holes" (ie, skip a given value), or must
        all values be accounted for?

        Commit doesn't incr the seqID right? It just returns the max seqID
        in the commit point, right? If you commit having made no "actual"
        changes (eg say you just called optimize), what seqID comes back?

        When an exc occurs is a seqID allocated and then skipped? (Maybe only
        for certain exceptions?).

        If an aborting-exc is hit... will we "lose" a bunch of seqIDs right?
        Like the next op against the IW will assign a previously used seqID?

        seqIDs have nothing to do with flushing? Ie, the app sees no change
        in the returned seqIDs just because a flush occurred under the hood?

        Cool that the new test case is able to use the
        ThreadedIndexingAndSearching base class!

        In general can you give a different name if the seqID was "coded" (<<
        1) vs not? (maybe codedSeqID or something)? Just to reduce chance of
        future errors...

        If the perf hit is negligible I don't think we need to add an IWC
        option?

        Simon Willnauer added a comment - edited

        Thanks Mike for taking the time, this stuff is hairy.

        The seqID should never be the same for any 2 ops, even across threads,
        right? Will it ever have "holes" (ie, skip a given value), or must
        all values be accounted for?

        A seqID will never be assigned twice. The seq ID is always taken from the current tail of the queue and is final once the tail's next pointer is assigned. Yet, in the current patch there is a possibility of holes, i.e. some seq ids are not used at all. Currently when I do a full flush (NRT reopen or commit) I need to cut over to a new delete queue, which means that two delete queues are active for a short amount of time. The old queue might still be in use by some DWPTs (currently in flight) and the new queue is used for incoming threads. To prevent double assignments I take the old queue's current max seq id and increment it by the number of active thread states (i.e. the max number of DWPTs possibly in flight). Deletes are no problem at that point since that is synced on DW just like flushAllThreads(). I need to think about how we could close those gaps, but I think we would need to block, i.e. the non-blocking DWPT swap would not work then.
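
        The cut-over arithmetic described here can be sketched as follows. This assumes each in-flight DWPT takes at most one more id from the old queue; `QueueCutoverSketch`, `Queue` and `cutOver` are hypothetical names, not code from the patch.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of starting a new queue past all ids the old one may still assign. */
final class QueueCutoverSketch {
    /** Minimal stand-in for a delete queue that hands out increasing seq ids. */
    static final class Queue {
        private final AtomicLong seq;

        Queue(long start) {
            this.seq = new AtomicLong(start);
        }

        long next() {
            return seq.incrementAndGet();
        }

        long max() {
            return seq.get();
        }
    }

    private QueueCutoverSketch() {}

    /**
     * The new queue starts at the old max plus the number of active thread states,
     * so ids still handed out by the old queue to in-flight DWPTs cannot collide.
     */
    static Queue cutOver(Queue old, int activeThreadStates) {
        return new Queue(old.max() + activeThreadStates);
    }
}
```

        For example, if the old queue's max is 3 and there are 4 active thread states, the new queue's first id is 8. If only one DWPT was actually in flight and takes id 4, then ids 5 to 7 are never used, which is the kind of hole mentioned in this reply.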

        Commit doesn't incr the seqID right? It just returns the max seqID
        in the commit point, right? If you commit having made no "actual"
        changes (eg say you just called optimize), what seqID comes back?

        Right, it would return the same seq id, or possibly a higher one due to the gaps I explained above.

        When an exc occurs is a seqID allocated and then skipped? (Maybe only
        for certain exceptions?).

        It's allocated as basically the last op in DWPT#updateDocument, so yes, if an exc occurs after that which breaks the DWPT (i.e. is aborting), the ids are skipped. If an exc happens in the same thread, i.e. during flush, it will stay assigned. This could be a problem, but if an exc occurs we are in an invalid state anyway, right?

        if an aborting-exc is hit... will we "lose" a bunch of seqIDs right?
        Like the next op against the IW will assign a previously used seqID?

        No, a previously assigned seqID should not be assigned again. The del queue is global, so once you have assigned an id it's gone; once an item is in the queue it should not change.

        seqIDs have nothing to do with flushing? Ie, the app sees no change
        in the returned seqIDs just because a flush occurred under the hood?

        Right, except for the full flush I mentioned above.

        In general can you give a different name if the seqID was "coded" (<<
        1) vs not? (maybe codedSeqID or something)? Just to reduce chance of
        future errors...

        Yeah, good point. I tried not to introduce a short-lived object here so I figured piggybacking on the seq id is fine, but yeah, we should name that differently.
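
        To illustrate why a distinct name helps, here is a small sketch of packing one extra bit into the low bit of a shifted seq id. The thread does not say what the extra bit means in the patch, so the `flag` parameter and the `CodedSeqId` helper are purely hypothetical.

```java
/** Hypothetical helper that keeps "coded" and plain seq ids apart by name. */
final class CodedSeqId {
    private CodedSeqId() {}

    /** Shifts the seq id left by one and stores the flag in the freed low bit. */
    static long encode(long seqId, boolean flag) {
        return (seqId << 1) | (flag ? 1L : 0L);
    }

    /** Recovers the plain seq id from a coded value. */
    static long seqId(long codedSeqId) {
        return codedSeqId >>> 1;
    }

    /** Recovers the flag from a coded value. */
    static boolean flag(long codedSeqId) {
        return (codedSeqId & 1L) != 0;
    }
}
```

        Naming parameters `codedSeqId` versus `seqId`, as suggested in the review, makes it harder to compare or return a shifted value by accident.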

        If the perf hit is negligible I don't think we need to add an IWC
        option?

        It's just like an update but we save the delete handling. It costs some extra CPU cycles, but since the other work is so much heavier I think it's OK.

        Steve Rowe added a comment -

        Bulk move 4.4 issues to 4.5 and 5.0

        Uwe Schindler added a comment -

        Move issue to Lucene 4.9.


          People

          • Assignee:
            Simon Willnauer
            Reporter:
            Simon Willnauer
          • Votes:
            0
            Watchers:
            0
