Flink / FLINK-10761

MetricGroup#getAllVariables can deadlock

      Description

      AbstractMetricGroup#getAllVariables acquires the lock of the current group and of every parent group while assembling the variables map. This can lead to a deadlock when metrics are registered concurrently on a child and its parent, the child registration is applied first, and a reporter calls this method (which many do).

      Assume we have a MetricGroup Mc(hild) and Mp(arent).

      Two separate threads, Tc and Tp, each register a metric on their respective group, acquiring that group's lock.
      Let's assume that Tc has a slight head start.
      Tc will now call MetricRegistry#register first, acquiring the MetricRegistry (MR) lock.
      Tp will block on this lock while still holding the lock on Mp.

      Tc now iterates over all reporters, calling MetricReporter#notifyOfAddedMetric on each. Assume that within this method Tc calls MetricGroup#getAllVariables on Mc.
      Tc still holds the lock on Mc and attempts to acquire the lock on Mp.
      The lock on Mp, however, is still held by Tp, which is waiting for Tc to release the MR lock.

      Thus a deadlock is created: Tc holds the MR lock and waits for Mp, while Tp holds Mp and waits for the MR lock. This may deadlock anything, be it minor threads, tasks, or entire components.
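
      The interleaving above can be reproduced outside Flink with a minimal sketch. Registry and Group below are hypothetical stand-ins for MetricRegistry and AbstractMetricGroup, not the actual Flink classes; the two latches exist only to force the ordering described (Tc gets the MR lock first, Tp holds Mp), and the reporter callback is reduced to a direct getAllVariables call. The threads are daemons so the JVM can exit even though they stay blocked.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;

class MetricGroupDeadlockSketch {

    // Latches only force the interleaving from the description; they are not part of Flink.
    private final CountDownLatch registryHeldByTc = new CountDownLatch(1);
    private final CountDownLatch parentHeldByTp = new CountDownLatch(1);

    /** Hypothetical stand-in for MetricRegistry; its monitor plays the role of the MR lock. */
    class Registry {
        synchronized void register(Group group) throws InterruptedException {
            if (group.parent != null) {
                registryHeldByTc.countDown(); // Tc now owns the MR lock
                parentHeldByTp.await();       // wait until Tp owns the lock on Mp
            }
            // Stand-in for a reporter's notifyOfAddedMetric calling getAllVariables.
            group.getAllVariables();
        }
    }

    /** Hypothetical stand-in for AbstractMetricGroup. */
    class Group {
        final String name;
        final Group parent;
        final Registry registry;

        Group(String name, Group parent, Registry registry) {
            this.name = name;
            this.parent = parent;
            this.registry = registry;
        }

        // Registering a metric holds this group's lock while calling the registry.
        synchronized void addMetric() throws InterruptedException {
            if (parent == null) {
                parentHeldByTp.countDown(); // Tp now owns the lock on Mp
            }
            registry.register(this);
        }

        // Holds this group's lock and, recursively, every parent's lock.
        synchronized Map<String, String> getAllVariables() {
            Map<String, String> variables = new HashMap<>();
            if (parent != null) {
                variables.putAll(parent.getAllVariables());
            }
            variables.put("<" + name + ">", name);
            return variables;
        }
    }

    interface Action {
        void run() throws InterruptedException;
    }

    static Thread startDaemon(String name, Action action) {
        Thread thread = new Thread(() -> {
            try {
                action.run();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, name);
        thread.setDaemon(true);
        thread.start();
        return thread;
    }

    /** Returns true if the JVM reports a monitor deadlock within roughly five seconds. */
    static boolean reproduce() {
        MetricGroupDeadlockSketch sketch = new MetricGroupDeadlockSketch();
        Registry registry = sketch.new Registry();
        Group mp = sketch.new Group("parent", null, registry);
        Group mc = sketch.new Group("child", mp, registry);

        startDaemon("Tc", mc::addMetric);
        startDaemon("Tp", () -> {
            sketch.registryHeldByTc.await(); // give Tc its head start
            mp.addMetric();
        });

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (int attempt = 0; attempt < 100; attempt++) {
            if (threads.findMonitorDeadlockedThreads() != null) {
                return true;
            }
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("deadlock detected: " + reproduce());
    }
}
```

      In this sketch the cycle is Tc (holds MR, waits for Mp) and Tp (holds Mp, waits for MR), matching the description. Removing either nested acquisition, for example by not holding the group lock while calling into the registry, would break the cycle in the sketch; whether that is appropriate for the real classes is a separate question.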

      This has not surfaced so far because metrics are usually no longer added to a group once its children have been created (component initialization is complete by that point).

      People

      Assignee: Chesnay Schepler (chesnay)
      Reporter: Chesnay Schepler (chesnay)