Description
We have noticed a case where a topology got stuck due to the misinterpretation of a backpressure message:
At the beginning, the topology ran fine, but a downstream component experienced backpressure, so it sent a backpressure signal to its upstream component, and the upstream component paused sending data to the downstream bolt.
Then the downstream component restarted (for any reason, for example, killed by the supervisor due to a heartbeat timeout). When it came back up, it sent a backpressure message to the upstream component. However, the upstream component didn't know how to interpret the backpressure message, so it logged the error below and ignored the message:
2021-01-28 19:41:37.175 o.a.s.m.n.SaslStormClientHandler client-worker-1 [ERROR] Unexpected message from server: {worker=4c38160a-3c66-4eff-8572-2d0c493bd6c1, bpStatusId=254, bpTasks=[], nonBpTasks=[546, 790, 863]}
From then on, the downstream component receives no data from the upstream component, so it never experiences backpressure (no data is being sent to it) and therefore never sends another backpressure update to the upstream component. This results in a deadlock: the upstream component believes the downstream component is still under backpressure and keeps holding back data, while the downstream component is not under backpressure but cannot receive anything. The topology is stuck.
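To make the failure mode concrete, here is a minimal sketch of the bookkeeping the upstream side effectively does (hypothetical class and method names, not Storm's actual code): a task is only un-paused when a later status message moves it into the non-backpressure set.
{code:java}
import java.util.HashSet;
import java.util.Set;

// Simplified model of the upstream worker's view of downstream backpressure.
// Hypothetical names; the logic only illustrates why a dropped status update
// stalls the flow forever.
class UpstreamBackpressureView {
    // Downstream task ids currently believed to be under backpressure.
    private final Set<Integer> pausedTasks = new HashSet<>();

    // Invoked when a backpressure status message arrives from downstream.
    void onStatus(Set<Integer> bpTasks, Set<Integer> nonBpTasks) {
        pausedTasks.addAll(bpTasks);
        pausedTasks.removeAll(nonBpTasks);
    }

    // The upstream worker only sends tuples to tasks that are not paused.
    boolean maySendTo(int taskId) {
        return !pausedTasks.contains(taskId);
    }
}
{code}
If the status message sent by the restarted downstream worker (which lists its tasks under nonBpTasks) is dropped on the client side, onStatus is never called again, maySendTo stays false for those tasks, and the pause never clears; that is exactly the stuck state described above.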
Now let's look at the actual code:
When the connection between the downstream worker (the Netty server) and the upstream worker (the Netty client) is established, the server side invokes "channelActive()", which sends the backpressure status message to the new client; in the server's channel pipeline, "StormServerHandler" is the only handler that implements "channelActive()".
However, the client side expects the first messages from the server to be SASL authentication messages (handled by "SaslStormClientHandler"), so the client can't interpret a backpressure message at this point, hence "unexpected message".
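For illustration, below is a condensed sketch of the two handlers involved. The class and helper names (ServerHandlerSketch, SaslClientHandlerSketch, currentBackPressureStatus, isSaslMessage) are hypothetical, the real classes are StormServerHandler and SaslStormClientHandler, Storm uses the shaded Netty classes under org.apache.storm.shade.io.netty, and the real status is built in WorkerState and handed to the server via a callback; treat this as a sketch of the interaction, not the real implementation.
{code:java}
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

// Server side: channelActive() fires as soon as the TCP connection is up,
// i.e. before any SASL exchange, and pushes the current backpressure status
// to the brand-new client.
class ServerHandlerSketch extends ChannelInboundHandlerAdapter {
    @Override
    public void channelActive(ChannelHandlerContext ctx) {
        // In Storm this goes through a callback into WorkerState, which logs
        // "Sending BackPressure status to new client"; collapsed here.
        Object bpStatus = currentBackPressureStatus();
        ctx.writeAndFlush(bpStatus);
    }

    private Object currentBackPressureStatus() {
        return new Object();  // placeholder for the real BackPressureStatus
    }
}

// Client side: until the SASL handshake completes, only SASL/control
// messages are understood; anything else is logged as unexpected and
// dropped, which is what happens to the early backpressure status.
class SaslClientHandlerSketch extends ChannelInboundHandlerAdapter {
    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) {
        if (isSaslMessage(msg)) {
            // ... continue the DIGEST-MD5 handshake ...
        } else {
            System.err.println("Unexpected message from server: " + msg);
            // the message is not propagated further, so it is lost
        }
    }

    private boolean isSaslMessage(Object msg) {
        return false;  // placeholder for the real type check
    }
}
{code}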
This can be demonstrated with an example. I have a wordcount topology running. At startup, the client tries to connect to the server; once connected, it sends a "SASL_TOKEN_MESSAGE_REQUEST".
Client log:
2021-01-29 19:03:21.355 o.a.s.m.n.SaslStormClientHandler client-worker-1 [DEBUG] SASL credentials for storm topology wc is -8603731884381183101:-9091319821854384981
2021-01-29 19:03:21.359 o.a.s.m.n.Client client-worker-1 [DEBUG] successfully connected to openstorm14blue-n3.blue.ygrid.yahoo.com/10.215.73.209:6702, [id: 0x29da2e9c, L:/10.215.73.209:45870 - R:openstorm14blue-n3.blue.ygrid.yahoo.com/10.215.73.209:6702] [attempt 12]
2021-01-29 19:03:21.359 o.a.s.m.n.SaslStormClientHandler client-worker-1 [INFO] Connection established from /10.215.73.209:45870 to openstorm14blue-n3.blue.ygrid.yahoo.com/10.215.73.209:6702
...
2021-01-29 19:03:21.362 o.a.s.m.n.SaslStormClientHandler client-worker-1 [DEBUG] Creating saslNettyClient now for channel: [id: 0x29da2e9c, L:/10.215.73.209:45870 - R:openstorm14blue-n3.blue.ygrid.yahoo.com/10.215.73.209:6702]
2021-01-29 19:03:21.363 o.a.s.m.n.SaslNettyClient client-worker-1 [DEBUG] SaslNettyClient: Creating SASL DIGEST-MD5 client to authenticate to server
2021-01-29 19:03:21.368 o.a.s.m.n.SaslStormClientHandler client-worker-1 [DEBUG] Sending SASL_TOKEN_MESSAGE_REQUEST
...
2021-01-29 19:03:21.632 o.a.s.m.n.SaslStormClientHandler client-worker-1 [DEBUG] send/recv time (ms): 277
2021-01-29 19:03:21.633 o.a.s.m.n.SaslStormClientHandler client-worker-1 [ERROR] Unexpected message from server: {worker=cdf6f963-678c-45a4-91d2-e1067a9a8516, bpStatusId=1, bpTasks=[], nonBpTasks=[17, 1, 18, 3, 4, 22, 7, 8, 9, 12, 13]}
But the server sends the backpressure message first, before it handles the SASL_TOKEN_MESSAGE_REQUEST message:
Server log:
2021-01-29 19:03:21.473 o.a.s.m.n.SaslStormServerHandler Netty-server-localhost-6702-worker-1 [DEBUG] SASL credentials for storm topology wc is -8603731884381183101:-9091319821854384981
2021-01-29 19:03:21.482 o.a.s.u.Utils main [DEBUG] Using storm.yaml from resources
2021-01-29 19:03:21.490 o.a.s.d.w.WorkerState Netty-server-localhost-6702-worker-1 [INFO] Sending BackPressure status to new client. BPStatus: {worker=cdf6f963-678c-45a4-91d2-e1067a9a8516, bpStatusId=1, bpTasks=[], nonBpTasks=[17, 1, 18, 3, 4, 22, 7, 8, 9, 12, 13]}
2021-01-29 19:03:21.510 o.a.s.m.n.SaslStormClientHandler client-worker-1 [DEBUG] SASL credentials for storm topology wc is -8603731884381183101:-9091319821854384981
2021-01-29 19:03:21.572 o.a.s.s.i.n.u.Recycler Netty-server-localhost-6702-worker-1 [DEBUG] -Dio.netty.recycler.maxCapacityPerThread: 4096
2021-01-29 19:03:21.573 o.a.s.s.i.n.u.Recycler Netty-server-localhost-6702-worker-1 [DEBUG] -Dio.netty.recycler.maxSharedCapacityFactor: 2
2021-01-29 19:03:21.573 o.a.s.s.i.n.u.Recycler Netty-server-localhost-6702-worker-1 [DEBUG] -Dio.netty.recycler.linkCapacity: 16
2021-01-29 19:03:21.574 o.a.s.s.i.n.u.Recycler Netty-server-localhost-6702-worker-1 [DEBUG] -Dio.netty.recycler.ratio: 8
2021-01-29 19:03:21.575 o.a.s.v.ConfigValidation main [WARN] topology.backpressure.enable is a deprecated config please see class org.apache.storm.Config.TOPOLOGY_BACKPRESSURE_ENABLE for more information.
2021-01-29 19:03:21.593 o.a.s.s.i.n.b.AbstractByteBuf Netty-server-localhost-6702-worker-1 [DEBUG] -Dorg.apache.storm.shade.io.netty.buffer.checkAccessible: true
2021-01-29 19:03:21.594 o.a.s.s.i.n.b.AbstractByteBuf Netty-server-localhost-6702-worker-1 [DEBUG] -Dorg.apache.storm.shade.io.netty.buffer.checkBounds: true
2021-01-29 19:03:21.594 o.a.s.s.i.n.u.ResourceLeakDetectorFactory Netty-server-localhost-6702-worker-1 [DEBUG] Loaded default ResourceLeakDetector: org.apache.storm.shade.io.netty.util.ResourceLeakDetector@524a134b
....
2021-01-29 19:03:21.695 o.a.s.m.n.SaslStormServerHandler Netty-server-localhost-6702-worker-1 [DEBUG] No saslNettyServer for [id: 0x6fa65bc5, L:/10.215.73.209:6702 - R:/10.215.73.209:45870] yet; creating now, with topology token: wc
2021-01-29 19:03:21.697 o.a.s.m.n.SaslNettyServer Netty-server-localhost-6702-worker-1 [DEBUG] SaslNettyServer: Topology token is: wc with authmethod DIGEST-MD5
2021-01-29 19:03:21.698 o.a.s.m.n.SaslNettyServer Netty-server-localhost-6702-worker-1 [DEBUG] SaslDigestCallback: Creating SaslDigestCallback handler with topology token: wc
This is a bug likely introduced by STORM-2306 (https://github.com/apache/storm/pull/2502). It will happen on every topology when "storm.messaging.netty.authentication" is set to true (it is false by default).
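For reference, the affected code path is only reachable with Netty SASL authentication enabled. A minimal sketch of the triggering configuration (hypothetical class name, topology and cluster setup omitted) is:
{code:java}
import org.apache.storm.Config;

public class NettyAuthRepro {
    // Only the relevant setting is shown; submit with any topology
    // (for example the wordcount topology used in the logs above).
    public static Config conf() {
        Config conf = new Config();
        conf.put("storm.messaging.netty.authentication", true);  // false by default
        return conf;
    }
}
{code}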