  1. Hadoop Distributed Data Store
  2. HDDS-4071

Add long-held lock logging and metrics to monitor unexpected behavior.

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.0.0
    • Fix Version/s: None
    • Component/s: Ozone Manager
    • Labels:
      None

      Description

      Sometimes a single request can hold a volume or bucket lock for minutes. This can leave the Ozone cluster in a low-performance state until the lock is released.

      If we could monitor lock hold time, we could tell which operation held a lock for an unexpectedly long time.

      We can monitor lock hold time in the following ways:

      • Specify a threshold for long lock hold time and record the time at which the lock is acquired. Before unlocking, calculate the hold time; if it exceeds the threshold, emit a warning log that contains the lock context and record a metric for alerting.
      • Create a monitor thread that scans the active locks and produces a report on each pass, listing any long-held locks it found.
      • Record the lock hold time of each operation in the audit log, to cover the case where a large number of related operations each hold the lock only briefly but together consume a lot of lock time.
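The first option above (threshold, start time, warning log, counter) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the Ozone Manager implementation: the class name `TimedLock`, its constructor parameters, and the use of `System.err` and an `AtomicLong` in place of a real logger and metrics registry are all hypothetical. The clock is injected so that the behavior can be checked without sleeping.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.LongSupplier;

/** Hypothetical wrapper (not an Ozone class) that times how long a lock is held. */
class TimedLock {
  private final ReentrantLock lock = new ReentrantLock();
  private final String name;
  private final long thresholdMillis;
  private final LongSupplier nanoClock;                 // e.g. System::nanoTime
  private final AtomicLong longHoldCount = new AtomicLong(); // stand-in for a metric
  private long acquiredAtNanos;                         // guarded by lock
  private String operation;                             // guarded by lock

  TimedLock(String name, long thresholdMillis, LongSupplier nanoClock) {
    this.name = name;
    this.thresholdMillis = thresholdMillis;
    this.nanoClock = nanoClock;
  }

  /** Acquires the lock and remembers the start time of the outermost acquisition. */
  void lock(String operation) {
    lock.lock();
    if (lock.getHoldCount() == 1) {
      this.acquiredAtNanos = nanoClock.getAsLong();
      this.operation = operation;
    }
  }

  /**
   * Releases the lock. Returns the hold time in milliseconds for the
   * outermost release, or -1 for a nested (reentrant) release.
   */
  long unlock() {
    long heldMillis = -1;
    String op = null;
    if (lock.getHoldCount() == 1) {
      // Compute the hold time while still holding the lock.
      heldMillis = TimeUnit.NANOSECONDS.toMillis(
          nanoClock.getAsLong() - acquiredAtNanos);
      op = operation;
    }
    lock.unlock();
    // Report after releasing, so reporting does not lengthen the hold.
    if (heldMillis > thresholdMillis) {
      longHoldCount.incrementAndGet();
      System.err.printf("WARN lock %s held for %d ms by %s (threshold %d ms)%n",
          name, heldMillis, op, thresholdMillis);
    }
    return heldMillis;
  }

  long getLongHoldCount() {
    return longHoldCount.get();
  }
}
```

In production code the clock would typically be `System::nanoTime`, the warning would go through the project's logger, and the counter would be exported through the metrics system so that alerts can be defined on it.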

            People

            • Assignee:
              maobaolong Baolong Mao
              Reporter:
              maobaolong Baolong Mao