I have a simple topology running on Storm 1.0.1. It consists of a KafkaSpout and several Python multilang ShellBolts. I frequently get the following exceptions.
Some more details:
1. The topology runs in ACK mode.
2. The topology has 40 workers.
3. The topology emits about 10 million tuples every 10 minutes.
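For a rough sense of the load, the figures above work out to the average rates computed below. This is only a back-of-the-envelope sketch: it assumes tuples are spread evenly over time and across workers, which is my simplification and probably not how the real traffic behaves.

```python
# Back-of-the-envelope throughput from the figures listed above.
# Assumption (not measured): an even spread over time and across workers.
TUPLES_PER_WINDOW = 10_000_000   # about 10 million tuples
WINDOW_SECONDS = 10 * 60         # emitted every 10 minutes
WORKERS = 40                     # number of workers in the topology

# Average rate for the whole topology.
tuples_per_second = TUPLES_PER_WINDOW / WINDOW_SECONDS

# Average rate each worker would see under an even split.
tuples_per_second_per_worker = tuples_per_second / WORKERS

print(f"{tuples_per_second:.0f} tuples/s overall")
print(f"{tuples_per_second_per_worker:.0f} tuples/s per worker")
```

So on average this is roughly 16,667 tuples per second overall, or about 417 per second per worker.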
Every time a subprocess heartbeat times out, the workers restart and the Python processes exit with exitCode:-1, which hurts the processing capacity and stability of the topology.
I've checked some related issues in the Storm Jira. I first found STORM-1946, which reports a bug related to this problem and says it was fixed in Storm 1.0.2. However, I got the same exception even after upgrading to Storm 1.0.2.
I then checked other related issues. Here is the history of this problem as far as I can tell.
DashengJu first reported this problem in Non-ACK mode in STORM-738.
STORM-742 discussed how to handle this problem in ACK mode, and it seems the bug was fixed in 0.10.0. I don't know whether that patch is included in the storm-1.x branch. In short, the problem still exists in the latest stable version.