Description
Occasionally an impalad in my cluster crashes. After studying the core file, I found that the crash is caused by a null pointer in the data field of ScanResponsePB.
So I made a small change to "NextBatch" in client.cc:
"
if (data_->data_in_open_) {
  // We have data from a previous scan.
  VLOG(1) << "Extracting data from scan " << ToString();
  data_->data_in_open_ = false;
  auto scan_response_data_ptr = data_->last_response_.release_data();
  if (PREDICT_FALSE(scan_response_data_ptr == nullptr)) {
    return Status::Corruption(Substitute(
        "Kudu scanner against $0 is in open status,but scan resp has no data.Scan query: $1.Remote: $2",
        data_->table_->name(),
        data_->configuration().spec().ToString(*data_->table_->schema().schema_),
        data_->ts_->ToString(),
        data_->last_response_.DebugString()));
"
I also modified the Impala side of the code:
"
if (UNLIKELY(!status.ok()))
"
After these modifications, I found the following errors in the impalad log:
"E1124 11:46:50.780480 15613 kudu-scanner.cc:422] KuduScanner::GetNextScannerBatch ERROR[Timed out: Scan RPC to 172.22.99.57:7050 timed out after 180.000s]
"
and
"E1124 11:49:24.171380 16127 kudu-scanner.cc:422] KuduScanner::GetNextScannerBatch ERROR[Timed out: Scan RPC to 172.22.99.57:7050 timed out after 164.164s: Remote error: Service unavailable: Scan request on kudu.tserver.TabletServerService from 172.22.99.57:64537 dropped due to backpressure. The service queue is full; it has 150 items.]
"
and
"E1124 11:49:24.171378 16128 kudu-scanner.cc:422] KuduScanner::GetNextScannerBatch ERROR[Timed out: Scan RPC to 172.22.99.57:7050 timed out after 121.593s: Not found: Scanner not found]"
It seems there are various reasons why the data field of ScanResponsePB can be null, but impalad has no way of knowing about them.
Maybe last_response_.has_more_results() should return false when this happens?
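
To make the failure mode easier to discuss, here is a minimal, self-contained sketch of the guard described above. It does not use the real Kudu classes: SketchStatus, SketchScanResponse and SketchScanner are hypothetical stand-ins invented for illustration, and only the null check on the released data pointer mirrors the change to NextBatch.
"
```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for kudu::Status; NOT the real class.
struct SketchStatus {
  bool ok;
  std::string message;
  static SketchStatus OK() { return {true, ""}; }
  static SketchStatus Corruption(std::string msg) {
    return {false, "Corruption: " + std::move(msg)};
  }
};

// Hypothetical stand-in for ScanResponsePB. It models only the optional
// data field, which stays null when the response carried no row data.
struct SketchScanResponse {
  std::unique_ptr<std::string> data;
  // Hands ownership to the caller; returns nullptr if data was never set.
  std::string* release_data() { return data.release(); }
};

// Hypothetical stand-in for the client-side scanner.
class SketchScanner {
 public:
  explicit SketchScanner(SketchScanResponse first_response)
      : last_response_(std::move(first_response)), data_in_open_(true) {}

  // Copies the batch left over from the open call into *out.
  // Returns an error instead of dereferencing a null data pointer.
  SketchStatus NextBatch(std::string* out) {
    assert(out != nullptr);
    if (data_in_open_) {
      data_in_open_ = false;
      std::unique_ptr<std::string> rows(last_response_.release_data());
      if (rows == nullptr) {
        // The guard: report the problem to the caller rather than crash.
        return SketchStatus::Corruption(
            "scanner is open but the scan response has no data");
      }
      *out = std::move(*rows);
      return SketchStatus::OK();
    }
    // No buffered data left in this sketch; a real scanner would issue
    // another scan RPC here.
    out->clear();
    return SketchStatus::OK();
  }

 private:
  SketchScanResponse last_response_;
  bool data_in_open_;
};
```
"
With this shape the caller (impalad in my case) can check the returned status and fail the query with a message, instead of crashing on a null pointer. The other option mentioned above, having has_more_results() report false, is not modelled in this sketch.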