[STORM-2194] ReportErrorAndDie doesn't always die - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.0, 1.0.2
Fix Version/s: 2.0.0, 1.0.4, 1.1.1, 1.2.0
Component/s: storm-core
Labels:
- pull-request-available

Description

I've been trying for some time to track down the cause of some exceptions that leave Storm workers in a zombified state. I believe I've isolated the bug to the behaviour of :report-error-and-die/reportErrorAndDie in the executor. Essentially:

     :report-error-and-die (fn [error]
                             (try
                               ((:report-error <>) error)
                               (catch Exception e
                                 (log-message "Error while reporting error to cluster, proceeding with shutdown")))
                             (if (or
                                    (exception-cause? InterruptedException error)
                                    (exception-cause? java.io.InterruptedIOException error))
                               (log-message "Got interrupted excpetion shutting thread down...")
                               ((:suicide-fn <>))))

The grouping of the if form above is slightly wrong: the call to :suicide-fn sits in the else branch, so an InterruptedException or InterruptedIOException is logged but the worker never dies. It shouldn't log OR die; it should log under that condition and ALWAYS die.

Basically, the corrected version is:

     :report-error-and-die (fn [error]
                             (try
                               ((:report-error <>) error)
                               (catch Exception e
                                 (log-message "Error while reporting error to cluster, proceeding with shutdown")))
                             (if (or
                                    (exception-cause? InterruptedException error)
                                    (exception-cause? java.io.InterruptedIOException error))
                               (log-message "Got interrupted excpetion shutting thread down..."))
                             ((:suicide-fn <>)))

After digging into the Java port of this code, it looks like a different bug was introduced while porting:

        if (Utils.exceptionCauseIsInstanceOf(InterruptedException.class, e)
                || Utils.exceptionCauseIsInstanceOf(java.io.InterruptedIOException.class, e)) {
            LOG.info("Got interrupted exception shutting thread down...");
            suicideFn.run();
        }

That is how this was initially ported, and STORM-2142 changed it to:

        if (Utils.exceptionCauseIsInstanceOf(InterruptedException.class, e)
                || Utils.exceptionCauseIsInstanceOf(java.io.InterruptedIOException.class, e)) {
            LOG.info("Got interrupted exception shutting thread down...");
        } else {
            suicideFn.run();
        }

However, I believe the correct port is as described above:

        if (Utils.exceptionCauseIsInstanceOf(InterruptedException.class, e)
                || Utils.exceptionCauseIsInstanceOf(java.io.InterruptedIOException.class, e)) {
            LOG.info("Got interrupted exception shutting thread down...");
        }
        suicideFn.run();
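
To make the difference concrete, here is a minimal, self-contained sketch comparing the STORM-2142 version with the proposed one. It is not the Storm source: the class name, causeIsInstanceOf (a hypothetical stand-in for Utils.exceptionCauseIsInstanceOf that simply walks the cause chain), and the counter used in place of the real suicide function are all assumptions made for illustration.

```java
import java.io.InterruptedIOException;
import java.util.concurrent.atomic.AtomicInteger;

public class ReportErrorAndDieSketch {

    // Hypothetical stand-in for Utils.exceptionCauseIsInstanceOf:
    // returns true if t or any exception in its cause chain is an instance of klass.
    static boolean causeIsInstanceOf(Class<? extends Throwable> klass, Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (klass.isInstance(c)) {
                return true;
            }
        }
        return false;
    }

    static boolean isInterrupt(Throwable e) {
        return causeIsInstanceOf(InterruptedException.class, e)
                || causeIsInstanceOf(InterruptedIOException.class, e);
    }

    // Behaviour after STORM-2142: an interrupt is logged, but suicideFn is skipped.
    static void logOrDie(Throwable e, Runnable suicideFn) {
        if (isInterrupt(e)) {
            System.out.println("Got interrupted exception shutting thread down...");
        } else {
            suicideFn.run();
        }
    }

    // Proposed behaviour: log on interrupt, then always run suicideFn.
    static void logAndAlwaysDie(Throwable e, Runnable suicideFn) {
        if (isInterrupt(e)) {
            System.out.println("Got interrupted exception shutting thread down...");
        }
        suicideFn.run();
    }

    public static void main(String[] args) {
        // The counter replaces the real suicide function so the sketch can finish.
        AtomicInteger deaths = new AtomicInteger();
        Runnable die = deaths::incrementAndGet;

        Throwable interrupted = new RuntimeException(new InterruptedException("interrupted"));
        Throwable other = new IllegalStateException("boom");

        logOrDie(interrupted, die);
        logOrDie(other, die);
        System.out.println("log-or-die suicides: " + deaths.get());            // 1

        deaths.set(0);
        logAndAlwaysDie(interrupted, die);
        logAndAlwaysDie(other, die);
        System.out.println("log-and-always-die suicides: " + deaths.get());    // 2
    }
}
```

With the log-or-die grouping, the wrapped InterruptedException never reaches the suicide function, which is the path that I believe leaves the worker running in a zombified state.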

I'll look into providing patches for the 1.x and 2.x branches shortly.

Attachments

scrubbed-thread-dump.txt
11/Nov/16 20:33
705 kB
Craig Hawco

Issue Links

relates to

STORM-2440 Kafka outage can lead to lockup of topology

Resolved

links to

GitHub Pull Request #1767

GitHub Pull Request #1768

GitHub Pull Request #1932

People

Assignee: Paul Poulosky

Reporter: Craig Hawco

Votes: 0

Watchers: 4

Dates

Created: 09/Nov/16 19:32

Updated: 22/Oct/18 19:26

Resolved: 08/Mar/17 21:55

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 2h 50m