Uploaded image for project: 'Apache Storm'
  1. Apache Storm
  2. STORM-3763

Backpressure message ignored by the receiver caused the topology to not progress

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 2.0.0, 2.1.0, 2.2.0
    • 2.3.0, 2.2.1
    • None
    • None

    Description

      We have noticed a case where topology is stuck due to the mis-interpretation of backpressure messge:

      At beginning, the topology ran fine but a downstream component had backpressure, so it sent backpressure signal to its upstream component, and the upstream component paused sending data to the downstream bolt.
      Then the downstream component restarted (due to any reason, for example, killed by supervisor due to heartbeat timeout). When it came back up, it sends backpressure message to the upstream bolt. However, the upstream component didn't know how to interpret the backpressure message so it logs the below error and ignores the message.

      2021-01-28 19:41:37.175 o.a.s.m.n.SaslStormClientHandler client-worker-1 [ERROR] Unexpected message from server: {worker=4c38160a-3c66-4eff-8572-2d0c493bd6c1, bpStatusId=254, bpTasks=[], nonBpTasks=[546, 790, 863]}
      

      Then the downstream component will not receive any data from the upstream component, so it won't have any backpressure (since no data is sent to it), hence it won't send any backpressure update message to the upstream component. This leads to a dead situation that the upstream component thinks the downstream has backpressure so it paused sending data to it, while the downstream doesn't have backpressure but can't receive any data from upstream. The topology is stuck because of it.

      Let's look at the code:

      When the connection between the downstream (server) and upstream (client) is established,
      server invokes

      https://github.com/apache/storm/blob/2.2.x-branch/storm-client/src/jvm/org/apache/storm/messaging/netty/StormServerHandler.java#L39-L41

      https://github.com/apache/storm/blob/2.2.x-branch/storm-client/src/jvm/org/apache/storm/daemon/worker/WorkerState.java#L237

      which sends backpressure messages to the client.

      This is because in this pipeline, "StormServerHandler" is the only one with that implemented "channelActive()" method.

      https://github.com/apache/storm/blob/2.2.x-branch/storm-client/src/jvm/org/apache/storm/messaging/netty/StormServerPipelineFactory.java#L56

      However, the Client side expects authentication messages.

      https://github.com/apache/storm/blob/2.2.x-branch/storm-client/src/jvm/org/apache/storm/messaging/netty/SaslStormClientHandler.java#L70-L75

      so the client can't interpret the backpressure message at the beginning, hence "unexpected message".

      This can be supported with an example. I have a wordcount topology running. At the startup, the client tries to connect to the server. Once connected, it sends a "SASL_TOKEN_MESSAGE_REQUEST".

      client log

      021-01-29 19:03:21.355 o.a.s.m.n.SaslStormClientHandler client-worker-1 [DEBUG] SASL credentials for storm topology wc is -8603731884381183101:-9091319821854384981
      2021-01-29 19:03:21.359 o.a.s.m.n.Client client-worker-1 [DEBUG] successfully connected to openstorm14blue-n3.blue.ygrid.yahoo.com/10.215.73.209:6702, [id: 0x29da2e9c, L:/10.215.73.209:45870 - R:openstorm14blue-n3.blue.ygrid.yahoo.com/10.215.73.209:6702] [attempt 12]
      2021-01-29 19:03:21.359 o.a.s.m.n.SaslStormClientHandler client-worker-1 [INFO] Connection established from /10.215.73.209:45870 to openstorm14blue-n3.blue.ygrid.yahoo.com/10.215.73.209:6702
      ...
      2021-01-29 19:03:21.362 o.a.s.m.n.SaslStormClientHandler client-worker-1 [DEBUG] Creating saslNettyClient now for channel: [id: 0x29da2e9c, L:/10.215.73.209:45870 - R:openstorm14blue-n3.blue.ygrid.yahoo.com/10.215.73.209:6702]
      2021-01-29 19:03:21.363 o.a.s.m.n.SaslNettyClient client-worker-1 [DEBUG] SaslNettyClient: Creating SASL DIGEST-MD5 client to authenticate to server
      2021-01-29 19:03:21.368 o.a.s.m.n.SaslStormClientHandler client-worker-1 [DEBUG] Sending SASL_TOKEN_MESSAGE_REQUEST
      
      ...
      
      2021-01-29 19:03:21.632 o.a.s.m.n.SaslStormClientHandler client-worker-1 [DEBUG] send/recv time (ms): 277
      2021-01-29 19:03:21.633 o.a.s.m.n.SaslStormClientHandler client-worker-1 [ERROR] Unexpected message from server: {worker=cdf6f963-678c-45a4-91d2-e1067a9a8516, bpStatusId=1, bpTasks=[], nonBpTasks=[17, 1, 18, 3, 4, 22, 7, 8, 9, 12,
      13]}
      

      But the server sends the backpressure message first, before it deals with the SASL_TOKEN_MESSAGE_REQUEST message

      server log

      2021-01-29 19:03:21.473 o.a.s.m.n.SaslStormServerHandler Netty-server-localhost-6702-worker-1 [DEBUG] SASL credentials for storm topology wc is -8603731884381183101:-9091319821854384981
      2021-01-29 19:03:21.482 o.a.s.u.Utils main [DEBUG] Using storm.yaml from resources
      2021-01-29 19:03:21.490 o.a.s.d.w.WorkerState Netty-server-localhost-6702-worker-1 [INFO] Sending BackPressure status to new client. BPStatus: {worker=cdf6f963-678c-45a4-91d2-e1067a9a8516, bpStatusId=1, bpTasks=[], nonBpTasks=[17,
      1, 18, 3, 4, 22, 7, 8, 9, 12, 13]}
      2021-01-29 19:03:21.510 o.a.s.m.n.SaslStormClientHandler client-worker-1 [DEBUG] SASL credentials for storm topology wc is -8603731884381183101:-9091319821854384981
      2021-01-29 19:03:21.572 o.a.s.s.i.n.u.Recycler Netty-server-localhost-6702-worker-1 [DEBUG] -Dio.netty.recycler.maxCapacityPerThread: 4096
      2021-01-29 19:03:21.573 o.a.s.s.i.n.u.Recycler Netty-server-localhost-6702-worker-1 [DEBUG] -Dio.netty.recycler.maxSharedCapacityFactor: 2
      2021-01-29 19:03:21.573 o.a.s.s.i.n.u.Recycler Netty-server-localhost-6702-worker-1 [DEBUG] -Dio.netty.recycler.linkCapacity: 16
      2021-01-29 19:03:21.574 o.a.s.s.i.n.u.Recycler Netty-server-localhost-6702-worker-1 [DEBUG] -Dio.netty.recycler.ratio: 8
      2021-01-29 19:03:21.575 o.a.s.v.ConfigValidation main [WARN] topology.backpressure.enable is a deprecated config please see class org.apache.storm.Config.TOPOLOGY_BACKPRESSURE_ENABLE for more information.
      2021-01-29 19:03:21.593 o.a.s.s.i.n.b.AbstractByteBuf Netty-server-localhost-6702-worker-1 [DEBUG] -Dorg.apache.storm.shade.io.netty.buffer.checkAccessible: true
      2021-01-29 19:03:21.594 o.a.s.s.i.n.b.AbstractByteBuf Netty-server-localhost-6702-worker-1 [DEBUG] -Dorg.apache.storm.shade.io.netty.buffer.checkBounds: true
      2021-01-29 19:03:21.594 o.a.s.s.i.n.u.ResourceLeakDetectorFactory Netty-server-localhost-6702-worker-1 [DEBUG] Loaded default ResourceLeakDetector: org.apache.storm.shade.io.netty.util.ResourceLeakDetector@524a134b
      
      
      ....
      
      2021-01-29 19:03:21.695 o.a.s.m.n.SaslStormServerHandler Netty-server-localhost-6702-worker-1 [DEBUG] No saslNettyServer for [id: 0x6fa65bc5, L:/10.215.73.209:6702 - R:/10.215.73.209:45870] yet; creating now, with topology token: wc
      2021-01-29 19:03:21.697 o.a.s.m.n.SaslNettyServer Netty-server-localhost-6702-worker-1 [DEBUG] SaslNettyServer: Topology token is: wc with authmethod DIGEST-MD5
      2021-01-29 19:03:21.698 o.a.s.m.n.SaslNettyServer Netty-server-localhost-6702-worker-1 [DEBUG] SaslDigestCallback: Creating SaslDigestCallback handler with topology token: wc
      

      https://github.com/apache/storm/blob/2.2.x-branch/storm-client/src/jvm/org/apache/storm/messaging/netty/SaslStormServerHandler.java#L46-L52

      This is a bug likely introduced in STORM-2306 (https://github.com/apache/storm/pull/2502). This willl happen on every topology when "storm.messaging.netty.authentication" is set true (It is false by default)

      Attachments

        Issue Links

          Activity

            People

              ethanli Ethan Li
              ethanli Ethan Li
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 20m
                  20m