Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: Realtime Branch
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      With per-thread DocumentsWriters sharing doc stores across segments doesn't make much sense anymore.

      See also LUCENE-2324.

      1. lucene-2555.patch
        92 kB
        Michael Busch
      2. lucene-2555.patch
        88 kB
        Michael Busch

        Activity

        Hide
        Michael Busch added a comment -

        What shall we do about index backward-compatibility?

        I guess 4.0 has to be able to read shared doc stores? So a lot of that code we can't remove?

        Show
        Michael Busch added a comment - What shall we do about index backward-compatibility? I guess 4.0 has to be able to read shared doc stores? So a lot of that code we can't remove?
        Hide
        Jason Rutherglen added a comment -

        Maybe we should break backwards-compatibility for the RT branch? Or just ship an RT specific JAR to keep things simple?

        Show
        Jason Rutherglen added a comment - Maybe we should break backwards-compatibility for the RT branch? Or just ship an RT specific JAR to keep things simple?
        Hide
        Michael McCandless added a comment -

        The reading side of shared doc stores is quite trivial; I think we should keep it (keep back compat).

        Show
        Michael McCandless added a comment - The reading side of shared doc stores is quite trivial; I think we should keep it (keep back compat).
        Hide
        Shai Erera added a comment -

        What are the performance implications of removing shared doc stores? From what I understand, if several segments share the same doc store, then when they are merged, the doc stores aren't merged. Which is a great benefit, especially if you intend to store large fields.

        I understand (mostly from the discussion on the PTDW) that with the move to a per-thread approach, the doc stores cannot be shared between segments created by different threads, but what about segments created by the same thread? Are we losing that functionality?

        Show
        Shai Erera added a comment - What are the performance implications of removing shared doc stores? From what I understand, if several segments share the same doc store, then when they are merged, the doc stores aren't merged. Which is a great benefit, especially if you intend to store large fields. I understand (mostly from the discussion on the PTDW) that with the move to a per-thread approach, the doc stores cannot be shared between segments created by different threads, but what about segments created by the same thread? Are we losing that functionality?
        Hide
        Jason Rutherglen added a comment -

        Shai, I think Mike has outlined the pros and cons in other
        postings see:

        https://issues.apache.org/jira/browse/LUCENE-2324?focusedCommentId=12891256&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acommenttabpanel#action_12891256

        Basically when going to DWPTs we're losing shared doc stores
        completely, and sharing between threads probably doesn't make
        sense. However we can keep the reading of shared doc stores for
        back compat. I think the confusing part in the code is the
        writing of shared doc stores, and I'm glad that's going away. In
        addition the DWPT code completely streamlines some of the most
        confusing parts of the IndexWriter class tree (the wait notify,
        and per thread logic in particular). Overall this will help
        future folks when they're trying to customize IndexWriter, and
        in addition, remove a layer of complexity, as we add yet another
        layer of complexity with the RT code.

        Show
        Jason Rutherglen added a comment - Shai, I think Mike has outlined the pros and cons in other postings see: https://issues.apache.org/jira/browse/LUCENE-2324?focusedCommentId=12891256&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acommenttabpanel#action_12891256 Basically when going to DWPTs we're losing shared doc stores completely, and sharing between threads probably doesn't make sense. However we can keep the reading of shared doc stores for back compat. I think the confusing part in the code is the writing of shared doc stores, and I'm glad that's going away. In addition the DWPT code completely streamlines some of the most confusing parts of the IndexWriter class tree (the wait notify, and per thread logic in particular). Overall this will help future folks when they're trying to customize IndexWriter, and in addition, remove a layer of complexity, as we add yet another layer of complexity with the RT code.
        Hide
        Michael Busch added a comment -

        I understand (mostly from the discussion on the PTDW) that with the move to a per-thread approach, the doc stores cannot be shared between segments created by different threads, but what about segments created by the same thread? Are we losing that functionality?

        We discussed that in LUCENE-2324 (close to the bottom). The problem is that doc stores only help you if you merge segments that all share the same store. With DWPT that's extremely unlikely.

        What are the performance implications of removing shared doc stores?

        I agree we have to test this when this patch is complete. My hope is that we save in other places (removing the interleaving step of the per-thread postings, no wait queue that serializes writing to doc stores) so that overall we won't be slower.

        Show
        Michael Busch added a comment - I understand (mostly from the discussion on the PTDW) that with the move to a per-thread approach, the doc stores cannot be shared between segments created by different threads, but what about segments created by the same thread? Are we losing that functionality? We discussed that in LUCENE-2324 (close to the bottom). The problem is that doc stores only help you if you merge segments that all share the same store. With DWPT that's extremely unlikely. What are the performance implications of removing shared doc stores? I agree we have to test this when this patch is complete. My hope is that we save in other places (removing the interleaving step of the per-thread postings, no wait queue that serializes writing to doc stores) so that overall we won't be slower.
        Hide
        Michael McCandless added a comment -

        What are the performance implications of removing shared doc stores?

        I agree we have to test this when this patch is complete. My hope is that we save in other places (removing the interleaving step of the per-thread postings, no wait queue that serializes writing to doc stores) so that overall we won't be slower.

        Also, remember that shared doc stores is not as good an opto as it used to be, because we are now able to bulk-copy both stored fields and term vectors during merging.

        However, bulk merging only happens if the field name -> number mapping is congruent, b/w the merged segment and the one segment being merged.

        Unfortunately, you can easily unexpectedly break this (see LUCENE-1737) but eg adding diff't fields to your docs, or adding same fields just in a different order.

        Show
        Michael McCandless added a comment - What are the performance implications of removing shared doc stores? I agree we have to test this when this patch is complete. My hope is that we save in other places (removing the interleaving step of the per-thread postings, no wait queue that serializes writing to doc stores) so that overall we won't be slower. Also, remember that shared doc stores is not as good an opto as it used to be, because we are now able to bulk-copy both stored fields and term vectors during merging. However, bulk merging only happens if the field name -> number mapping is congruent, b/w the merged segment and the one segment being merged. Unfortunately, you can easily unexpectedly break this (see LUCENE-1737 ) but eg adding diff't fields to your docs, or adding same fields just in a different order.
        Hide
        Shai Erera added a comment -

        Thanks for the explanation. Let's remember though that not all apps are multi-threaded, but I think most are, so designing for the most is better than making the other few more performing. I'm generally ok with that, just wanted to better understand the reasons.

        Show
        Shai Erera added a comment - Thanks for the explanation. Let's remember though that not all apps are multi-threaded, but I think most are, so designing for the most is better than making the other few more performing. I'm generally ok with that, just wanted to better understand the reasons.
        Hide
        Michael Busch added a comment -

        Checkpointing what I have so far:

        • Removed writing part of shared doc stores
        • Reading part is kept for backwards compatibility. TestBackwardsCompatibility passes.
        • Removed DocumentsWriterPerThread.DocWriter and all its subclasses

        Some testcases fail still. Next I will look into removing PerDocBuffer, exception handling,
        and then fix the failing tests.

        Show
        Michael Busch added a comment - Checkpointing what I have so far: Removed writing part of shared doc stores Reading part is kept for backwards compatibility. TestBackwardsCompatibility passes. Removed DocumentsWriterPerThread.DocWriter and all its subclasses Some testcases fail still. Next I will look into removing PerDocBuffer, exception handling, and then fix the failing tests.
        Hide
        Jason Rutherglen added a comment -

        Michael, nice! A lot is cleaned up.

        Show
        Jason Rutherglen added a comment - Michael, nice! A lot is cleaned up.
        Hide
        Michael Busch added a comment -

        Changed the patch to also remove PerDocBuffer. It changes StoredFieldsWriter and TermVectorsTermsWriter to write the data directly to the final IndexOutput, without buffering it in a temporary PerDocBuffer.

        Several tests still fail due to exception handling or thread-safety problems (which is expected - haven't tried very hard to fix them yet). I will commit this patch to the realtime branch and work on fixing the tests with a separate issue.

        Show
        Michael Busch added a comment - Changed the patch to also remove PerDocBuffer. It changes StoredFieldsWriter and TermVectorsTermsWriter to write the data directly to the final IndexOutput, without buffering it in a temporary PerDocBuffer. Several tests still fail due to exception handling or thread-safety problems (which is expected - haven't tried very hard to fix them yet). I will commit this patch to the realtime branch and work on fixing the tests with a separate issue.
        Hide
        Michael Busch added a comment -

        Committed revision 978805.

        Show
        Michael Busch added a comment - Committed revision 978805.

          People

          • Assignee:
            Michael Busch
            Reporter:
            Michael Busch
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development