Kudu / KUDU-3026

tserver: refuse to host another tablet replica if the number of open file descriptors is close to the limit


    Details

      Description

      When replicas are distributed evenly across all available nodes, once one tablet server hits the maximum number of open file descriptors and goes down (e.g., upon hosting another tablet replica), the system automatically re-replicates the replicas it hosted onto the remaining tablet servers, most likely bringing those tablet servers down as well. That's a cascading failure scenario that nobody wants.

      It would be great to change the behavior of tablet servers so that they refuse to host another tablet replica when they sense that one of their resources is almost exhausted. The number of open file descriptors is a good, concrete first step towards that goal. This is similar to the existing rejection behavior induced by memory pressure, but for a different kind of resource.
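
The check proposed here could look roughly like the following. This is a minimal sketch, not Kudu code: it assumes a Linux host where a process's open descriptors are listed under `/proc/self/fd`, and the function names and the `max_usage_ratio` threshold of 0.9 are hypothetical, chosen only for illustration.

```python
import os
import resource


def count_open_fds():
    """Count this process's open file descriptors (Linux-specific).

    The listing briefly opens a descriptor of its own, so the result
    may be one higher than the steady-state count.
    """
    return len(os.listdir("/proc/self/fd"))


def fd_soft_limit():
    """Return the soft RLIMIT_NOFILE limit for this process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


def should_reject_new_replica(open_fds, soft_limit, max_usage_ratio=0.9):
    """Return True if hosting another replica looks too risky.

    max_usage_ratio is a hypothetical threshold, not an existing Kudu flag.
    """
    if soft_limit == resource.RLIM_INFINITY:
        # No limit configured, so descriptors cannot be exhausted this way.
        return False
    return open_fds >= soft_limit * max_usage_ratio
```

A tablet server would evaluate something like `should_reject_new_replica(count_open_fds(), fd_soft_limit())` before accepting a request to host a new replica, and return an error instead of crashing later.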

      The system catalog (master) and other related components should be updated to react appropriately upon receiving a rejection of a request to host an additional tablet replica. In addition, extra provisions for monitoring the number of open file descriptors against the limit (KUDU-3025) should be implemented to help detect and prevent such issues proactively.
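
One way the master side could react to such rejections is sketched below. This is only an illustration under stated assumptions, not a description of how the Kudu master places replicas: `pick_replica_host`, its arguments, and the "least-loaded eligible server" policy are all hypothetical.

```python
def pick_replica_host(candidates, replica_counts, rejected):
    """Pick the least-loaded tablet server that has not rejected placement.

    candidates:     iterable of tablet server ids
    replica_counts: dict mapping server id -> number of hosted replicas
    rejected:       set of server ids that recently refused a new replica

    Returns None when no server is eligible, so the caller can leave the
    tablet under-replicated instead of overloading a server.
    """
    eligible = [ts for ts in candidates if ts not in rejected]
    if not eligible:
        return None
    # Break ties by server id so that the choice is deterministic.
    return min(eligible, key=lambda ts: (replica_counts.get(ts, 0), ts))
```

Returning `None` rather than falling back to a server that rejected the request is the design point: staying temporarily under-replicated is preferable to triggering the cascading failure described above.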


            People

            • Assignee: Unassigned
            • Reporter: Alexey Serbin (aserbin)
