Solr
  1. Solr
  2. SOLR-4997

The splitshard api doesn't call commit on new sub shards

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.3, 4.3.1
    • Fix Version/s: 4.4
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      The splitshard api doesn't call commit on new sub shards but it happily sets them to active state which means on a successful split, the documents are not visible to searchers unless an explicit commit is called on the cluster.

      The coreadmin split api will still not call commit on targetCores. That is by design and we're not going to change that.

      1. SOLR-4997.patch
        21 kB
        Shalin Shekhar Mangar
      2. SOLR-4997.patch
        11 kB
        Shalin Shekhar Mangar

        Activity

        Hide
        Shalin Shekhar Mangar added a comment -

        Hmm, we don't know the name of the update handler inside the Overseer Collection Processor so we can do one of the following:

        1. Assume it is /update
        2. Pass in the update handler name in the OverseerCollectionProcessor constructor
        3. Change CoreAdmin request apply updates action to accept a commit param
        Show
        Shalin Shekhar Mangar added a comment - Hmm, we don't know the name of the update handler inside the Overseer Collection Processor so we can do one of the following: Assume it is /update Pass in the update handler name in the OverseerCollectionProcessor constructor Change CoreAdmin request apply updates action to accept a commit param
        Hide
        Mark Miller added a comment - - edited

        1. -1

        Something along 2 or 3 lines seems way preferable.

        Show
        Mark Miller added a comment - - edited 1. -1 Something along 2 or 3 lines seems way preferable.
        Hide
        Shalin Shekhar Mangar added a comment -

        Agreed. I am going to use 3 just because it is something I control. Configurable handlers for such basic things are a pain.

        Show
        Shalin Shekhar Mangar added a comment - Agreed. I am going to use 3 just because it is something I control. Configurable handlers for such basic things are a pain.
        Hide
        Shalin Shekhar Mangar added a comment -

        There is no way to match the last commit point on the parent exactly. The best we can do is to commit right before we switch states. So users must still call a commit after splitshard but at least they will get more up to date results.

        Show
        Shalin Shekhar Mangar added a comment - There is no way to match the last commit point on the parent exactly. The best we can do is to commit right before we switch states. So users must still call a commit after splitshard but at least they will get more up to date results.
        Hide
        Mis Tigi added a comment -

        "at least they will get more up to date results."

        Does it mean that if there is update activity outside of CoreAdmin API during SPLISTSHARD call the resulting splitted shard may be in inconsistent state ?

        In other words, if I am actively indexing on the collection while I am performing SPLITSHARD call, some of the indexing operation may not be reflected in my collection after SPLITSHARD completes ?

        Show
        Mis Tigi added a comment - "at least they will get more up to date results." Does it mean that if there is update activity outside of CoreAdmin API during SPLISTSHARD call the resulting splitted shard may be in inconsistent state ? In other words, if I am actively indexing on the collection while I am performing SPLITSHARD call, some of the indexing operation may not be reflected in my collection after SPLITSHARD completes ?
        Hide
        Shalin Shekhar Mangar added a comment -

        In other words, if I am actively indexing on the collection while I am performing SPLITSHARD call, some of the indexing operation may not be reflected in my collection after SPLITSHARD completes ?

        For Solr 4.3.1 – no operations are reflected on subshards until you call commit after splitshard. After this issue is fixed (with Solr 4.4), all indexing operations performed till the completion of splitshard will be reflected in sub shards.

        Show
        Shalin Shekhar Mangar added a comment - In other words, if I am actively indexing on the collection while I am performing SPLITSHARD call, some of the indexing operation may not be reflected in my collection after SPLITSHARD completes ? For Solr 4.3.1 – no operations are reflected on subshards until you call commit after splitshard. After this issue is fixed (with Solr 4.4), all indexing operations performed till the completion of splitshard will be reflected in sub shards.
        Hide
        Yonik Seeley added a comment -

        In other words, if I am actively indexing on the collection while I am performing SPLITSHARD call, some of the indexing operation may not be reflected in my collection after SPLITSHARD completes ?

        The phrase "may not be reflected" is bound to be confusing to others reading this issue.
        You won't lose any updates... this is more about update visibility.

        Show
        Yonik Seeley added a comment - In other words, if I am actively indexing on the collection while I am performing SPLITSHARD call, some of the indexing operation may not be reflected in my collection after SPLITSHARD completes ? The phrase "may not be reflected" is bound to be confusing to others reading this issue. You won't lose any updates... this is more about update visibility.
        Hide
        Shalin Shekhar Mangar added a comment -

        You won't lose any updates... this is more about update visibility.

        Yes, exactly. A side effect of this bug is that the newly created sub shards are inconsistent in terms of update visibility. The leader of the sub shard will not have new updates visible (depending on auto commit settings) but all other replicas will have new updates visible because they are created after the split.

        Show
        Shalin Shekhar Mangar added a comment - You won't lose any updates... this is more about update visibility. Yes, exactly. A side effect of this bug is that the newly created sub shards are inconsistent in terms of update visibility. The leader of the sub shard will not have new updates visible (depending on auto commit settings) but all other replicas will have new updates visible because they are created after the split.
        Hide
        Shalin Shekhar Mangar added a comment - - edited

        I found another bug while investigating this. Any core starting up as part of a "construction" shard was set to skip log replay but it also skipped recovery. So a subshard replica would publish itself as active without having any documents.

        I also wasted a lot of time trying to figure out why a commit request would not work until I realized that HttpShardHandler is hardcoded to send QueryRequest so it cannot be used to issue a commit.

        1. Fix for above issue – Only skip log replay but not recovery for cores belonging to "construction" state shards
        2. Call distrib commit before we switch shard states
        3. Tests for shard consistency

        The test currently fails with sub shard leader having more docs than its replicas and I'm still trying to figure out why.

        Show
        Shalin Shekhar Mangar added a comment - - edited I found another bug while investigating this. Any core starting up as part of a "construction" shard was set to skip log replay but it also skipped recovery. So a subshard replica would publish itself as active without having any documents. I also wasted a lot of time trying to figure out why a commit request would not work until I realized that HttpShardHandler is hardcoded to send QueryRequest so it cannot be used to issue a commit. Fix for above issue – Only skip log replay but not recovery for cores belonging to "construction" state shards Call distrib commit before we switch shard states Tests for shard consistency The test currently fails with sub shard leader having more docs than its replicas and I'm still trying to figure out why.
        Hide
        Shalin Shekhar Mangar added a comment -

        I think I have found the cause of the failures. It has to do with sub shard replication. I'm working on a test + fix.

        Show
        Shalin Shekhar Mangar added a comment - I think I have found the cause of the failures. It has to do with sub shard replication. I'm working on a test + fix.
        Hide
        Shalin Shekhar Mangar added a comment -

        This took some time to track down.

        Changes

        1. Overseer collection processor calls commit after all replicas are created and before it switches shard states
        2. ZkController is modified to have only sub shard leaders skip log recovery on startup. Sub shard replicas will recover from leader before they publish themselves as active
        3. Changes in DistributedUpdateProcessor to detect when the current node is a sub shard leader and it should forward updates to its replicas
        4. ShardSplitTests tests for shard consistency as well before calling a global commit and testing for final correctness
        5. Fixed SolrCmdDistributor.syncRequest to log the correct exception

        All tests pass.

        This patch fixes three (related) bugs in total.

        1. We don't commit on sub shards because of which sub shards don't have all docs visible
        2. Sub shard replicas skip recovering from leader because of which they show no docs upon creation
        3. Sub shard replicas can lose updates from leader between the time they are created and the time the shard becomes active

        The first two have easy fixes. The last one required invasive changes.

        I'll put up this patch in case someone wants to review and commit it tomorrow morning my time (IST).

        I'd like to include this in 4.4 since it fixes major bugs but I'm not sure if we have enough time to let this bake. Perhaps if we cut the RC on Monday instead of Friday, we can let jenkins test it for a while? Review comments are welcome.

        Show
        Shalin Shekhar Mangar added a comment - This took some time to track down. Changes Overseer collection processor calls commit after all replicas are created and before it switches shard states ZkController is modified to have only sub shard leaders skip log recovery on startup. Sub shard replicas will recover from leader before they publish themselves as active Changes in DistributedUpdateProcessor to detect when the current node is a sub shard leader and it should forward updates to its replicas ShardSplitTests tests for shard consistency as well before calling a global commit and testing for final correctness Fixed SolrCmdDistributor.syncRequest to log the correct exception All tests pass. This patch fixes three (related) bugs in total. We don't commit on sub shards because of which sub shards don't have all docs visible Sub shard replicas skip recovering from leader because of which they show no docs upon creation Sub shard replicas can lose updates from leader between the time they are created and the time the shard becomes active The first two have easy fixes. The last one required invasive changes. I'll put up this patch in case someone wants to review and commit it tomorrow morning my time (IST). I'd like to include this in 4.4 since it fixes major bugs but I'm not sure if we have enough time to let this bake. Perhaps if we cut the RC on Monday instead of Friday, we can let jenkins test it for a while? Review comments are welcome.
        Hide
        Yonik Seeley added a comment -

        Perhaps if we cut the RC on Monday instead of Friday

        I believe that was Steve's original plan anyway.

        Show
        Yonik Seeley added a comment - Perhaps if we cut the RC on Monday instead of Friday I believe that was Steve's original plan anyway.
        Hide
        ASF subversion and git services added a comment -

        Commit 1502458 from shalin@apache.org
        [ https://svn.apache.org/r1502458 ]

        SOLR-4997: The splitshard api doesn't call commit on new sub shards before switching shard states. Multiple bugs related to sub shard recovery and replication are also fixed.

        Show
        ASF subversion and git services added a comment - Commit 1502458 from shalin@apache.org [ https://svn.apache.org/r1502458 ] SOLR-4997 : The splitshard api doesn't call commit on new sub shards before switching shard states. Multiple bugs related to sub shard recovery and replication are also fixed.
        Hide
        ASF subversion and git services added a comment -

        Commit 1502460 from shalin@apache.org
        [ https://svn.apache.org/r1502460 ]

        SOLR-4997: The splitshard api doesn't call commit on new sub shards before switching shard states. Multiple bugs related to sub shard recovery and replication are also fixed.

        Show
        ASF subversion and git services added a comment - Commit 1502460 from shalin@apache.org [ https://svn.apache.org/r1502460 ] SOLR-4997 : The splitshard api doesn't call commit on new sub shards before switching shard states. Multiple bugs related to sub shard recovery and replication are also fixed.
        Hide
        Erick Erickson added a comment -

        Shalin:

        I see there's a 4_4 branch already, are we sure it's on that branch too? Just don't want this to be lost in the shuffle.

        Show
        Erick Erickson added a comment - Shalin: I see there's a 4_4 branch already, are we sure it's on that branch too? Just don't want this to be lost in the shuffle.
        Hide
        ASF subversion and git services added a comment -

        Commit 1503019 from shalin@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1503019 ]

        SOLR-4997: Call commit before checking shard consistency

        Show
        ASF subversion and git services added a comment - Commit 1503019 from shalin@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1503019 ] SOLR-4997 : Call commit before checking shard consistency
        Hide
        ASF subversion and git services added a comment -

        Commit 1503020 from shalin@apache.org in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1503020 ]

        SOLR-4997: Call commit before checking shard consistency

        Show
        ASF subversion and git services added a comment - Commit 1503020 from shalin@apache.org in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1503020 ] SOLR-4997 : Call commit before checking shard consistency
        Hide
        ASF subversion and git services added a comment -

        Commit 1503130 from shalin@apache.org in branch 'dev/branches/lucene_solr_4_4'
        [ https://svn.apache.org/r1503130 ]

        SOLR-4997: The splitshard api doesn't call commit on new sub shards before switching shard states. Multiple bugs related to sub shard recovery and replication are also fixed.

        Show
        ASF subversion and git services added a comment - Commit 1503130 from shalin@apache.org in branch 'dev/branches/lucene_solr_4_4' [ https://svn.apache.org/r1503130 ] SOLR-4997 : The splitshard api doesn't call commit on new sub shards before switching shard states. Multiple bugs related to sub shard recovery and replication are also fixed.
        Hide
        ASF subversion and git services added a comment -

        Commit 1503131 from shalin@apache.org in branch 'dev/branches/lucene_solr_4_4'
        [ https://svn.apache.org/r1503131 ]

        SOLR-4997: Call commit before checking shard consistency

        Show
        ASF subversion and git services added a comment - Commit 1503131 from shalin@apache.org in branch 'dev/branches/lucene_solr_4_4' [ https://svn.apache.org/r1503131 ] SOLR-4997 : Call commit before checking shard consistency
        Hide
        ASF subversion and git services added a comment -

        Commit 1503328 from shalin@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1503328 ]

        SOLR-4997: Skip log recovery for sub shard leaders only

        Show
        ASF subversion and git services added a comment - Commit 1503328 from shalin@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1503328 ] SOLR-4997 : Skip log recovery for sub shard leaders only
        Hide
        ASF subversion and git services added a comment -

        Commit 1503331 from shalin@apache.org in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1503331 ]

        SOLR-4997: Skip log recovery for sub shard leaders only

        Show
        ASF subversion and git services added a comment - Commit 1503331 from shalin@apache.org in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1503331 ] SOLR-4997 : Skip log recovery for sub shard leaders only
        Hide
        ASF subversion and git services added a comment -

        Commit 1503332 from shalin@apache.org in branch 'dev/branches/lucene_solr_4_4'
        [ https://svn.apache.org/r1503332 ]

        SOLR-4997: Skip log recovery for sub shard leaders only

        Show
        ASF subversion and git services added a comment - Commit 1503332 from shalin@apache.org in branch 'dev/branches/lucene_solr_4_4' [ https://svn.apache.org/r1503332 ] SOLR-4997 : Skip log recovery for sub shard leaders only
        Hide
        Shalin Shekhar Mangar added a comment -

        I fixed a bug that I had introduced which skipped log recovery on startup for all leaders instead of only sub shard leaders. I caught this only because I was doing another line-by-line review of all my changes. We should have a test which catches such a condition.

        Show
        Shalin Shekhar Mangar added a comment - I fixed a bug that I had introduced which skipped log recovery on startup for all leaders instead of only sub shard leaders. I caught this only because I was doing another line-by-line review of all my changes. We should have a test which catches such a condition.
        Hide
        Mark Miller added a comment -

        We should have a test which catches such a condition.

        yeah, scary.

        Show
        Mark Miller added a comment - We should have a test which catches such a condition. yeah, scary.
        Hide
        Shalin Shekhar Mangar added a comment -

        I opened SOLR-5041 for the test task.

        Show
        Shalin Shekhar Mangar added a comment - I opened SOLR-5041 for the test task.
        Hide
        Steve Rowe added a comment -

        Bulk close resolved 4.4 issues

        Show
        Steve Rowe added a comment - Bulk close resolved 4.4 issues

          People

          • Assignee:
            Shalin Shekhar Mangar
            Reporter:
            Shalin Shekhar Mangar
          • Votes:
            1 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development