Solr / SOLR-14210

Add replica state option for HealthCheckHandler

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 8.5
    • Fix Version/s: 8.6
    • Component/s: None
    • Labels: None

    Description

      Background

      As was brought up in SOLR-13055, in order to run Solr in a more cloud-native way, we need some additional features around node-level healthchecks.

      As in Kubernetes, we need 'liveness' and 'readiness' probes, as explained in https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/, to determine whether a node is live and ready to serve live traffic.

       

      However, there are issues with Kubernetes managing its own rolling restarts. With the current healthcheck setup, it's easy to envision a scenario in which a Solr node reports itself as "healthy" when all of its replicas are actually recovering. Kubernetes, seeing a healthy pod, would then go on to restart the next Solr node. This can continue until all replicas are "recovering" and none are healthy. (The last node restarted may have replicas that are "down" instead, but either way there are no "active" replicas.)

      Proposal

      I propose we make an additional healthcheck handler that returns whether all replicas hosted by that Solr node are healthy and "active". That way we will be able to use the default Kubernetes rolling restart logic with Solr.
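
      As a rough illustration of the check proposed above, the following sketch reports a node as ready only when every replica it hosts is "active". The class, enum, and method names are hypothetical and are not taken from Solr's HealthCheckHandler; the replica states are simply the ones named in this description. It also assumes that a node hosting no replicas counts as ready.

```java
import java.util.List;

public class NodeReadiness {
    // Hypothetical enum; states mirror the ones named in the issue text.
    public enum ReplicaState { ACTIVE, RECOVERING, DOWN }

    /**
     * Returns true only if every replica hosted on this node is ACTIVE.
     * Assumption: an empty list (node hosts no replicas) counts as ready.
     */
    public static boolean allReplicasActive(List<ReplicaState> localReplicaStates) {
        return localReplicaStates.stream().allMatch(s -> s == ReplicaState.ACTIVE);
    }

    public static void main(String[] args) {
        // All replicas active: the node may be reported as ready.
        System.out.println(allReplicasActive(
                List.of(ReplicaState.ACTIVE, ReplicaState.ACTIVE)));
        // One replica still recovering: the node should not be reported as ready.
        System.out.println(allReplicasActive(
                List.of(ReplicaState.ACTIVE, ReplicaState.RECOVERING)));
    }
}
```

      With a check like this behind the readiness probe, a rolling restart would wait for a restarted node's replicas to finish recovering before moving on to the next node.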

      To add on to Jan's point here, this handler should be friendlier to other Content-Types and should use better HTTP response statuses.
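
      One possible reading of "better HTTP response statuses" is sketched below: the health result is mapped onto the status code itself, so that a probe does not need to parse the response body. The specific codes (200 and 503) are an assumption and are not stated in this issue; the class and method names are hypothetical. Kubernetes HTTP probes treat any status from 200 up to but not including 400 as success.

```java
public class HealthStatusCode {
    /** Assumed mapping: 200 OK when healthy, 503 Service Unavailable otherwise. */
    public static int httpStatusFor(boolean healthy) {
        return healthy ? 200 : 503;
    }

    /** Mirrors the Kubernetes HTTP probe rule: 200 <= status < 400 is a success. */
    public static boolean probeSucceeds(int status) {
        return status >= 200 && status < 400;
    }

    public static void main(String[] args) {
        System.out.println(probeSucceeds(httpStatusFor(true)));
        System.out.println(probeSucceeds(httpStatusFor(false)));
    }
}
```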

          People

            Assignee: Jan Høydahl (janhoy)
            Reporter: Houston Putman (houston)
            Votes: 1
            Watchers: 9

              Time Tracking

              Estimated: Not Specified
              Remaining: 0h
              Logged: 4.5h
