Solr
  1. Solr
  2. SOLR-5860

Logging around core wait for state during startup / recovery is confusing

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.8, 6.0
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      I'm seeing some log messages like this:

      I was asked to wait on state recovering for HOST:8984_solr but I still do not see the requested state. I see state: recovering live:true

      This is very confusing because from the log, it seems like it's waiting to see the state it's in ... After digging through the code, it appears that it is really waiting for a leader to become active so that it has a leader to recover from.

      I'd like to improve the logging around this critical wait loop to give better context to what is happening.

      Also, I would like to change the following so that we force state updates every 15 seconds for the entire wait period.

      • if (retry == 15 || retry == 60) {
        + if (retry % 15 == 0) {

      As-is, it's waiting 120 seconds but only forcing the state to update twice, once after 15 seconds and again after 60 … might be good to force updates for the full wait period.

      Lastly, I think it would be good to use the leaderConflictResolveWait setting (from ZkController) here as well since 120 may not be enough for a leader to become active in a busy cluster, esp. after the node the Overseer is running on. Maybe leaderConflictResolveWait + 5 seconds?

      1. SOLR-5860.patch
        4 kB
        Timothy Potter

        Activity

        Hide
        Timothy Potter added a comment -

        Here's a patch that does a couple of things:

        1) uses the leaderConflictResolveWait configuration property (from ZkController) as the max wait for this loop as it seems related to the cluster waiting to get the leader resolved for this core

        2) forces a refresh of state from ZooKeeper every 15 seconds for the duration of the wait loop

        3) logs it's activity every 15 seconds as well (so we know it is still waiting, esp when using a 3 minute timeout)

        4) tries to get some information from ClusterState about the leader when generating the exception message ... this might not be useful but I wanted to give more context in the error message so that messages like "still waiting on state recovering, i see state recovering" make more sense

        I didn't see in the existing tests where I could test this behavior explicitly. I did run the ChaosMonkeyNothingIsSafeTest and it invokes this code and passes.

        Show
        Timothy Potter added a comment - Here's a patch that does a couple of things: 1) uses the leaderConflictResolveWait configuration property (from ZkController) as the max wait for this loop as it seems related to the cluster waiting to get the leader resolved for this core 2) forces a refresh of state from ZooKeeper every 15 seconds for the duration of the wait loop 3) logs it's activity every 15 seconds as well (so we know it is still waiting, esp when using a 3 minute timeout) 4) tries to get some information from ClusterState about the leader when generating the exception message ... this might not be useful but I wanted to give more context in the error message so that messages like "still waiting on state recovering, i see state recovering" make more sense I didn't see in the existing tests where I could test this behavior explicitly. I did run the ChaosMonkeyNothingIsSafeTest and it invokes this code and passes.
        Hide
        Mark Miller added a comment -

        I've touched on this area working on SOLR-5884 as well - this is more thoughtful stuff though, so would be great to get it in before I commit SOLR-5884.

        Show
        Mark Miller added a comment - I've touched on this area working on SOLR-5884 as well - this is more thoughtful stuff though, so would be great to get it in before I commit SOLR-5884 .
        Hide
        Shalin Shekhar Mangar added a comment -

        Yes, the patch looks good to me. I'll commit after running all tests.

        Show
        Shalin Shekhar Mangar added a comment - Yes, the patch looks good to me. I'll commit after running all tests.
        Hide
        Shalin Shekhar Mangar added a comment -

        I'm seeing some test failures with the patch. Ran it twice already. I have to call it a day but if nobody else gets to it first, I'll debug tomorrow and commit.

        [junit4] Tests with failures:
        [junit4] - org.apache.solr.handler.component.TermVectorComponentDistributedTest.testDistribSearch
        [junit4] - org.apache.solr.handler.component.DistributedExpandComponentTest.testDistribSearch
        [junit4] - org.apache.solr.handler.component.DistributedSuggestComponentTest.testDistribSearch
        [junit4] - org.apache.solr.TestDistributedGrouping.testDistribSearch
        [junit4] - org.apache.solr.handler.component.DistributedTermsComponentTest.testDistribSearch
        [junit4] - org.apache.solr.handler.component.DistributedSpellCheckComponentTest.testDistribSearch
        [junit4] - org.apache.solr.TestDistributedMissingSort.testDistribSearch
        [junit4] - org.apache.solr.TestDistributedSearch.testDistribSearch
        [junit4] - org.apache.solr.handler.component.DistributedQueryComponentCustomSortTest.testDistribSearch
        [junit4]

        Show
        Shalin Shekhar Mangar added a comment - I'm seeing some test failures with the patch. Ran it twice already. I have to call it a day but if nobody else gets to it first, I'll debug tomorrow and commit. [junit4] Tests with failures: [junit4] - org.apache.solr.handler.component.TermVectorComponentDistributedTest.testDistribSearch [junit4] - org.apache.solr.handler.component.DistributedExpandComponentTest.testDistribSearch [junit4] - org.apache.solr.handler.component.DistributedSuggestComponentTest.testDistribSearch [junit4] - org.apache.solr.TestDistributedGrouping.testDistribSearch [junit4] - org.apache.solr.handler.component.DistributedTermsComponentTest.testDistribSearch [junit4] - org.apache.solr.handler.component.DistributedSpellCheckComponentTest.testDistribSearch [junit4] - org.apache.solr.TestDistributedMissingSort.testDistribSearch [junit4] - org.apache.solr.TestDistributedSearch.testDistribSearch [junit4] - org.apache.solr.handler.component.DistributedQueryComponentCustomSortTest.testDistribSearch [junit4]
        Hide
        Shalin Shekhar Mangar added a comment -

        Okay, the failures were unrelated. I just got two clean test passes. I'll commit this shortly.

        Show
        Shalin Shekhar Mangar added a comment - Okay, the failures were unrelated. I just got two clean test passes. I'll commit this shortly.
        Hide
        ASF subversion and git services added a comment -

        Commit 1579954 from shalin@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1579954 ]

        SOLR-5860: Use leaderConflictResolveWait in WaitForState during recovery/startup, improve logging and force refresh cluster state every 15 seconds

        Show
        ASF subversion and git services added a comment - Commit 1579954 from shalin@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1579954 ] SOLR-5860 : Use leaderConflictResolveWait in WaitForState during recovery/startup, improve logging and force refresh cluster state every 15 seconds
        Hide
        ASF subversion and git services added a comment -

        Commit 1579956 from shalin@apache.org in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1579956 ]

        SOLR-5860: Use leaderConflictResolveWait in WaitForState during recovery/startup, improve logging and force refresh cluster state every 15 seconds

        Show
        ASF subversion and git services added a comment - Commit 1579956 from shalin@apache.org in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1579956 ] SOLR-5860 : Use leaderConflictResolveWait in WaitForState during recovery/startup, improve logging and force refresh cluster state every 15 seconds
        Hide
        Shalin Shekhar Mangar added a comment -

        Thanks Tim!

        Show
        Shalin Shekhar Mangar added a comment - Thanks Tim!
        Hide
        Uwe Schindler added a comment -

        Close issue after release of 4.8.0

        Show
        Uwe Schindler added a comment - Close issue after release of 4.8.0

          People

          • Assignee:
            Shalin Shekhar Mangar
            Reporter:
            Timothy Potter
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development