Uploaded image for project: 'Apache Storm'
  1. Apache Storm
  2. STORM-3622

Race Condition in CachedThreadStatesGaugeSet registered at SystemBolt

VotersWatch issueWatchersLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    Description

      We noticed that with the change in https://github.com/apache/storm/pull/3242, there is a race condition causing NPE.

      2020-04-14 18:22:12.997 o.a.s.u.Utils Thread-17-__acker-executor[16, 16] [ERROR] Async loop died!
      java.lang.RuntimeException: java.lang.NullPointerException
       at org.apache.storm.executor.Executor.accept(Executor.java:291) ~[storm-client-2.2.0.y.jar:2.2.0.y]
       at org.apache.storm.utils.JCQueue.consumeImpl(JCQueue.java:131) ~[storm-client-2.2.0.y.jar:2.2.0.y]
       at org.apache.storm.utils.JCQueue.consume(JCQueue.java:111) ~[storm-client-2.2.0.y.jar:2.2.0.y]
       at org.apache.storm.executor.bolt.BoltExecutor$1.call(BoltExecutor.java:172) ~[storm-client-2.2.0.y.jar:2.2.0.y]
       at org.apache.storm.executor.bolt.BoltExecutor$1.call(BoltExecutor.java:159) ~[storm-client-2.2.0.y.jar:2.2.0.y]
       at org.apache.storm.utils.Utils$1.run(Utils.java:434) [storm-client-2.2.0.y.jar:2.2.0.y]
       at java.lang.Thread.run(Thread.java:748) [?:1.8.0_242]
      Caused by: java.lang.NullPointerException
       at com.codahale.metrics.jvm.ThreadStatesGaugeSet.getThreadCount(ThreadStatesGaugeSet.java:95) ~[metrics-jvm-3.2.6.jar:3.2.6]
       at com.codahale.metrics.jvm.ThreadStatesGaugeSet.access$000(ThreadStatesGaugeSet.java:20) ~[metrics-jvm-3.2.6.jar:3.2.6]
       at com.codahale.metrics.jvm.ThreadStatesGaugeSet$1.getValue(ThreadStatesGaugeSet.java:56) ~[metrics-jvm-3.2.6.jar:3.2.6]
       at org.apache.storm.executor.Executor.addV2Metrics(Executor.java:344) ~[storm-client-2.2.0.y.jar:2.2.0.y]
       at org.apache.storm.executor.Executor.metricsTick(Executor.java:320) ~[storm-client-2.2.0.y.jar:2.2.0.y]
       at org.apache.storm.executor.bolt.BoltExecutor.tupleActionFn(BoltExecutor.java:218) ~[storm-client-2.2.0.y.jar:2.2.0.y]
       at org.apache.storm.executor.Executor.accept(Executor.java:287) ~[storm-client-2.2.0.y.jar:2.2.0.y]
       ... 6 more
      

      This is due to a race condition in CachedGauge https://github.com/dropwizard/metrics/blob/v3.2.6/metrics-core/src/main/java/com/codahale/metrics/CachedGauge.java#L49-L53

      There are two issues here.
      The first one is STORM-3623.
      This makes all the executors to get values for all the metrics. So multiple threads will access the same metric.

      So the threads gauges are now accessed by multiple threads. But in CachedGauge,

       @Override
          public T getValue() {
              if (shouldLoad()) {
                  this.value = loadValue();
              }
              return value;
          }
      

      this method is not thread-safe. Two threads can reach to getValue at the same time.
      The first thread reaching shouldLoad knows it needs to reload, so it calls the next line this.value=loadValue()
      The second thread is a little bit late so shouldLoad returns false. Then it returns the value directly.

      There is a race condition between first thread calling loadValue() and the second thread returning value.

      If the first thread finishes loadValue() first, both values returned to the threads are the same value (and current value). But if the second thread returns earlier, the second thread gets the original value (which is null ), hence NPE.

      To summarize, the second issue is CachedThreadStatesGaugeSet is not thread-safe

      To fix this NPE, we should avoid using CachedThreadStatesGaugeSet.

      But we still need to fix STORM-3623 to avoid unnecessary computations and redundant metrics.

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            ethanli Ethan Li
            ethanli Ethan Li
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated:
                Original Estimate - Not Specified
                Not Specified
                Remaining:
                Remaining Estimate - 0h
                0h
                Logged:
                Time Spent - 0.5h
                0.5h

                Slack

                  Issue deployment