Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: Realtime Branch
    • Fix Version/s: 5.0
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Utilizing the sequence ids created via the update document
      methods, we will enable IndexReader deleted docs over a sequence
      id array.

      One of the decisions is what primitive type to use. We can start
      off with an int[], then possibly move to a short[] (for lower
      memory consumption) that wraps around.
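        A wrapping short[] needs an order comparison that survives overflow. A minimal sketch of that comparison (serial-number arithmetic in the style of RFC 1982; the class and method names here are hypothetical): it works as long as two live sequence ids are never more than half the short range (32768) apart.

```java
// Sketch: wrap-around ordering for short sequence ids. Hypothetical names.
public class SeqIdMath {
  // True if a logically precedes b in wrap-around order.
  // Subtraction is done in int, then narrowed back to short, so the
  // sign of the difference encodes the wrapped ordering.
  public static boolean precedes(short a, short b) {
    return (short) (a - b) < 0;
  }
}
```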

        Activity

        Jason Rutherglen added a comment -

        I tried to start on this; however, nothing can be deleted until the terms dictionary and the terms docs are working, since they are needed to obtain the doc ids to delete.

        Michael McCandless added a comment -

        Resolving deleted terms -> doc IDs doesn't require a sorted terms dict right? Ie a simple hash lookup suffices?
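        As a sketch of that idea (all names here are made up, and a String stands in for a real term), an exact-match hash lookup resolves a deleted term to its doc ids without any sorted terms dict:

```java
// Sketch: term -> doc ids via a plain hash probe, no sorted dict needed.
import java.util.HashMap;
import java.util.Map;

public class TermDocIds {
  private final Map<String, int[]> postings = new HashMap<>();

  public void add(String term, int[] docIds) {
    postings.put(term, docIds);
  }

  // Delete-by-term only needs exact match: one O(1) hash probe
  // instead of a seek through a sorted terms dictionary.
  public int[] docsToDelete(String term) {
    return postings.getOrDefault(term, new int[0]);
  }
}
```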

        Jason Rutherglen added a comment -

        Resolving deleted terms -> doc IDs doesn't require a
        sorted terms dict right? Ie a simple hash lookup suffices?

        True, however I figured it'd be best to eat our own dog food and use our own APIs. I think the main issue right now is the concurrency of the
        *BlockPools from LUCENE-2575. Then we should be able to
        implement deleting, which doesn't require skip lists. If
        we really wanted to, we could simply buffer delete terms and only apply
        them in getReader; getReader would block any writes that could
        be altering the *BlockPools. Maybe this is a good first step? Is there
        any reason we need to apply deletes in the actual updateDoc and
        deleteDoc methods?
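        A rough sketch of that buffer-then-apply idea (the class and method names are invented, and a plain list of terms stands in for the real buffered-delete structures):

```java
// Sketch: buffer delete terms; resolve them only when a reader is opened.
import java.util.ArrayList;
import java.util.List;

public class BufferedDeletes {
  private final List<String> pendingTerms = new ArrayList<>();
  private final Object lock = new Object(); // writes and getReader are mutually exclusive

  public void deleteByTerm(String term) {
    synchronized (lock) {
      pendingTerms.add(term); // buffered only; nothing is resolved yet
    }
  }

  // Stand-in for getReader: while the lock is held, no write can be
  // altering the in-memory structures, so deletes can be applied safely.
  public List<String> getReaderApplyingDeletes() {
    synchronized (lock) {
      List<String> applied = new ArrayList<>(pendingTerms);
      pendingTerms.clear(); // deletes are applied at reader-open time
      return applied;
    }
  }
}
```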

        Jason Rutherglen added a comment -

        I'm implementing a basic doc id iterator per DWPT, which will allow us to implement delete-by-term and the deleted docs sequence ids. Is this needed for merging of segments? Since we're using readers to do the merging, it may not be useful there.

        Jason Rutherglen added a comment -

        For the deleted docs sequence id array, perhaps I'm a little bit
        confused, but how will we signify in the sequence id array that a
        document is deleted? I believe we need a secondary sequence id
        array for deleted docs that is init'd to -1. When a document is
        deleted, the sequence id is set for that doc in the
        del-docs-seq-arr. When the deleted docs Bits is accessed
        for a given doc, we'll compare the IR's seq-id-up-to with the
        del-docs-seq-id, and if the IR seq-id is greater than or equal
        to it, the Bits.get method will return true, meaning the document
        is deleted.

        I am forgetting how concurrency will work in this case, ie,
        ensuring multi-threaded visibility under the JMM. Actually,
        because we're pausing the writes/deletes when getReader is
        called on the DWPT, JMM concurrency should be OK.
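        The scheme above can be sketched like this (all names are hypothetical; readerSeqId plays the role of the IR's seq-id-up-to):

```java
// Sketch: secondary del-docs sequence id array, init'd to -1.
import java.util.Arrays;

public class DelDocsSeqIds {
  private final int[] delSeqIds;

  public DelDocsSeqIds(int maxDoc) {
    delSeqIds = new int[maxDoc];
    Arrays.fill(delSeqIds, -1); // -1 == never deleted
  }

  public void markDeleted(int docId, int seqId) {
    delSeqIds[docId] = seqId; // record the seq id of the delete
  }

  // Bits.get semantics: deleted iff the reader's seq-id-up-to is
  // greater than or equal to the seq id at which the doc was deleted.
  public boolean isDeleted(int docId, int readerSeqId) {
    int del = delSeqIds[docId];
    return del != -1 && readerSeqId >= del;
  }
}
```

        A reader opened before a delete (readerSeqId below the recorded seq id) still sees the document as live, which is the point-in-time behavior the snapshot needs.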

        Jason Rutherglen added a comment -

        If we implement deletes via sequence id across all segments, then the .del file should probably remain the same (a set of bits)? Also, when we load up the BV on IW start, then I guess we'll need to init the array appropriately.

        Michael McCandless added a comment -

        We could also [someday] move deletes to a stacked model... where we only write "deltas" (newly deleted docs in the current session) against the segment, and on open we coalesce these. Merging would also periodically coalesce and write a new full vector...
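        The stacked model might look roughly like this (a java.util.BitSet stands in for Lucene's deletion vector; the coalesce name is invented):

```java
// Sketch: per-session delta vectors OR'd into a full deletion vector
// on open or at merge time.
import java.util.BitSet;
import java.util.List;

public class StackedDeletes {
  public static BitSet coalesce(BitSet base, List<BitSet> deltas) {
    BitSet full = (BitSet) base.clone();
    for (BitSet delta : deltas) {
      full.or(delta); // a doc is deleted if any generation deleted it
    }
    return full;
  }
}
```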

        Jason Rutherglen added a comment -

        In regards to the deltas, when they're in RAM (ie, for norm and DF updates), I'm guessing we'd need to place the updates into a hash map (that hopefully uses primitives instead of objects to save RAM)? We could instantiate a new array when the map reached a certain size?
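        A sketch of a primitive-keyed map along those lines (open addressing over parallel int arrays, so no boxed Integer objects; purely illustrative, with no resizing, and not Lucene's implementation):

```java
// Sketch: minimal open-addressing int -> int map backed by primitives.
public class IntIntMap {
  private final int[] keys;
  private final int[] vals;
  private final boolean[] used;

  public IntIntMap(int capacity) {
    // Power-of-two capacity so the slot index is a cheap mask.
    int cap = Integer.highestOneBit(Math.max(2, capacity * 2 - 1)) * 2;
    keys = new int[cap];
    vals = new int[cap];
    used = new boolean[cap];
  }

  private int slot(int key) {
    return ((key * 0x9E3779B9) >>> 1) & (keys.length - 1);
  }

  public void put(int key, int value) {
    int i = slot(key);
    while (used[i] && keys[i] != key) {
      i = (i + 1) & (keys.length - 1); // linear probing
    }
    used[i] = true;
    keys[i] = key;
    vals[i] = value;
  }

  public int get(int key, int missing) {
    int i = slot(key);
    while (used[i]) {
      if (keys[i] == key) return vals[i];
      i = (i + 1) & (keys.length - 1);
    }
    return missing;
  }
}
```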

        Michael McCandless added a comment -

        In regards to the deltas, when they're in RAM (ie, for norm and DF updates), I'm guessing we'd need to place the updates into a hash map (that hopefully uses primitives instead of objects to save RAM)? We could instantiate a new array when the map reached a certain size?

        Actually I think all lookups for a del doc should still be against the BV.

        The "generations"/replay log would only be used to properly do the recycling of an old BV (ie, so you know which parts of the log to "replay" against this BV).

        And, for saving the new deletes in the directory (though this is not really important for the RT case).
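        The replay idea could be sketched as follows (all names are hypothetical; gen numbers stand in for the "generations" mentioned above, and a BitSet for the BV):

```java
// Sketch: recycle an old BV by replaying only the log entries newer
// than the generation the BV was last synced to.
import java.util.BitSet;
import java.util.List;

public class ReplayLog {
  public static class Entry {
    final long gen;
    final int docId;

    public Entry(long gen, int docId) {
      this.gen = gen;
      this.docId = docId;
    }
  }

  // Returns the new generation the BV is synced to.
  public static long replay(BitSet bv, long bvGen, List<Entry> log) {
    long newGen = bvGen;
    for (Entry e : log) {
      if (e.gen > bvGen) {   // only the parts of the log this BV missed
        bv.set(e.docId);
        newGen = Math.max(newGen, e.gen);
      }
    }
    return newGen;
  }
}
```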

        hao yan added a comment -

        Does anybody know where to checkout the realtime branch? I am very interested in it! Thanks!

        Simon Willnauer added a comment -

        Does anybody know where to checkout the realtime branch? I am very interested in it! Thanks!

        There is no realtime branch open right now. We had to delete it after we re-integrated it for DocumentsWriterPerThread (SVN requires that once you have re-integrated). However, there is no development happening along those lines right now, and we haven't decided whether to move forward, since for general-purpose use the NRT features we have are reasonably fast. Anyway, I think there is still a need for this if we can provide it as a non-default option?


          People

          • Assignee:
            Unassigned
            Reporter:
            Jason Rutherglen
          • Votes:
            0
            Watchers:
            3

            Dates

            • Created:
              Updated:
