Uploaded image for project: 'ZooKeeper'
  1. ZooKeeper
  2. ZOOKEEPER-3037

Add JvmPauseMonitor to ZooKeeper

VotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Minor
    • Resolution: Fixed
    • 3.4.0, 3.5.0
    • 3.6.0
    • contrib

    Description

      After a ZK crash, or client timeout sometimes it's hard to determine from the logs what happened. Knowing if ZK was responsive at the time would help a lot. For example, ZK might spend a lot of time waiting on GC (there is still some misconception that ZK is a storage).

      To help detect this, HADOOP already has a great tool called JVM Pause Monitor. (As the name suggest, it can be also used for monitoring, but it also helps post-mortem in a lot of cases). Basically it has a daemon that sleeps for one second, and if the sleep time exceeds the 1s by more than the threshold (1s: INFO, 10s: WARN by default - this can be configurable in our case, see below), it will alert/make a log entry. It can also monitor the time GC took.

      The class implementing this is in HADOOP-common, but ZK should not depend on this package. Since this is a straightforward implementation, and in the past five years the few commits it had is nothing really serious, I think we could just copy this class in ZooKeeper, and introduce it as a configurable feature, by default it can be off.

      The class:
      https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/JvmPauseMonitor.java

      Task:

      • Create a class in ZK (under zookeeper/server/util/) called JvmPauseMonitor.
      • Make feature configurable, by default: OFF
      • Make sleep time and threshold time configurable
      • Update documentation
      • Add [current size of the heap OR % of heap used] in the log entry whenever sleep threshold had exceeded by a lot (10s)

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            nkalmar Norbert Kalmár
            nkalmar Norbert Kalmár
            Votes:
            2 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated:
                Original Estimate - Not Specified
                Not Specified
                Remaining:
                Remaining Estimate - 0h
                0h
                Logged:
                Time Spent - 2.5h
                2.5h

                Slack

                  Issue deployment