Details
Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Impala 2.7.0
Description
There's a race between Open() and ReportProfile() on the report_thread_active_ member field where, if triggered, ReportProfile() will exit quickly and may never report to the coordinator.
This is a problem if the coordinator has failed: once the reporting thread has exited, the fragment instance will never detect the failure and will never cancel itself.
The code usually won't hit this, because there's a long enough period for the unsynchronised write to become visible, but I started hitting it with high regularity in my test runs.
PlanFragmentExecutor::Open()
if (!report_status_cb_.empty() && FLAGS_status_report_interval > 0) {
  unique_lock<mutex> l(report_thread_lock_);
  report_thread_.reset(
      new Thread("plan-fragment-executor", "report-profile",
          &PlanFragmentExecutor::ReportProfile, this));
  // make sure the thread started up, otherwise ReportProfile() might get into a race
  // with StopReportThread()
  report_thread_started_cv_.wait(l);
  report_thread_active_ = true; /// <<<<<< Set *after* CV fired by ReportProfile()
}
PlanFragmentExecutor::ReportProfile()
unique_lock<mutex> l(report_thread_lock_);
// <etc>
report_thread_started_cv_.notify_one();
// <etc> - this block yields lock_ and takes long enough for the write
// to report_thread_active_ to usually become visible
// VVVVVVV -- May execute before Open() sets it
while (report_thread_active_) {
  //....
}
// Exit method
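One possible way to close this window is to set the active flag before the reporting thread is created, and to wait on the condition variable with a predicate. The following is a minimal sketch of that idea using the C++ standard library rather than Impala's own Thread class; it is not necessarily the fix that was committed. The Reporter class, the started_ flag, stop_cv_ and the reports_sent_ counter are hypothetical names introduced only for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the reporting part of PlanFragmentExecutor.
class Reporter {
 public:
  void Open() {
    std::unique_lock<std::mutex> l(lock_);
    assert(!thread_.joinable());  // Open() is expected to be called once.
    // Set the flag *before* the thread exists, so the reporting loop can
    // never observe a stale 'false' on its first check.
    active_ = true;
    thread_ = std::thread(&Reporter::ReportProfile, this);
    // The predicate form tolerates spurious wakeups and a notification that
    // fires before this thread starts waiting.
    started_cv_.wait(l, [this] { return started_; });
  }

  void Stop() {
    {
      std::lock_guard<std::mutex> l(lock_);
      active_ = false;
    }
    stop_cv_.notify_one();
    if (thread_.joinable()) thread_.join();
  }

  int reports_sent() const { return reports_sent_.load(); }

 private:
  void ReportProfile() {
    std::unique_lock<std::mutex> l(lock_);
    started_ = true;
    started_cv_.notify_one();
    // active_ was written under lock_ before this thread was created, so the
    // loop body runs at least once unless Stop() has already been called.
    while (active_) {
      ++reports_sent_;  // Stand-in for sending a status report.
      stop_cv_.wait_for(l, std::chrono::milliseconds(5));
    }
  }

  std::mutex lock_;
  std::condition_variable started_cv_;
  std::condition_variable stop_cv_;
  bool started_ = false;  // Guarded by lock_.
  bool active_ = false;   // Guarded by lock_.
  std::atomic<int> reports_sent_{0};
  std::thread thread_;
};

// Opens and immediately stops a reporter; returns how many reports it sent.
int OpenThenStop() {
  Reporter r;
  r.Open();
  r.Stop();
  return r.reports_sent();
}
```

In this sketch the reporting thread cannot release lock_ until it reaches wait_for() inside the loop, so by the time Open() returns at least one report has been counted, even if Stop() is called immediately afterwards.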