[KUDU-1794] kuduScanner 's problem causing impala crash. - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Cannot Reproduce
Affects Version/s: None
Fix Version/s: NA
Component/s: client, impala
Labels: None

Description

Sometimes an impalad in my cluster crashes. After studying the core file, I found that a null pointer in the data field of ScanResponsePB causes the crash.
So I made a small modification to "NextBatch" in client.cc:
"
if (data_->data_in_open_) {
  // We have data from a previous scan.
  VLOG(1) << "Extracting data from scan " << ToString();
  data_->data_in_open_ = false;
  auto scan_response_data_ptr = data_->last_response_.release_data();
  // Added guard: return a Status instead of dereferencing a null data field.
  if (PREDICT_FALSE(scan_response_data_ptr == nullptr)) {
    return Status::Corruption(Substitute(
        "Kudu scanner against $0 is in open status, but scan resp has no data. "
        "Scan query: $1. Remote: $2. Response: $3",
        data_->table_->name(),
        data_->configuration().spec().ToString(*data_->table_->schema().schema_),
        data_->ts_->ToString(),
        data_->last_response_.DebugString()));
  }
"

I also made a modification in the Impala part of the code:
"
if (UNLIKELY(!status.ok())) {
  LOG(ERROR) << "KuduScanner::GetNextScannerBatch ERROR[" << status.ToString() << "]";
  KUDU_RETURN_IF_ERROR(status, "unable to advance kudu iterator");
}
"

After these modifications, I found these errors in impalad's log:
"E1124 11:46:50.780480 15613 kudu-scanner.cc:422] KuduScanner::GetNextScannerBatch ERROR[Timed out: Scan RPC to 172.22.99.57:7050 timed out after 180.000s]
"
and
"E1124 11:49:24.171380 16127 kudu-scanner.cc:422] KuduScanner::GetNextScannerBatch ERROR[Timed out: Scan RPC to 172.22.99.57:7050 timed out after 164.164s: Remote error: Service unavailable: Scan request on kudu.tserver.TabletServerService from 172.22.99.57:64537 dropped due to backpressure. The service queue is full; it has 150 items.]
"
and
"E1124 11:49:24.171378 16128 kudu-scanner.cc:422] KuduScanner::GetNextScannerBatch ERROR[Timed out: Scan RPC to 172.22.99.57:7050 timed out after 121.593s: Not found: Scanner not found]"

It seems that there are various reasons for the null data field in ScanResponsePB, but impalad has no way of knowing about them.
Maybe last_response_.has_more_results() should return false when this happens?
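
To make the failure mode concrete, here is a minimal, self-contained sketch of the guard described above. It does not use the real Kudu client: `Status`, `FakeScanResponse`, and `FakeScanner` are hypothetical stand-ins that only model a response whose data field may be unset while the scanner still believes it has data from the open call. The sketch assumes that releasing an unset field yields a null pointer, as protobuf `release_*()` accessors do.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for kudu::Status; only what this sketch needs.
class Status {
 public:
  static Status OK() { return Status(true, "OK"); }
  static Status Corruption(const std::string& msg) {
    return Status(false, "Corruption: " + msg);
  }
  bool ok() const { return ok_; }
  const std::string& ToString() const { return msg_; }

 private:
  Status(bool ok, std::string msg) : ok_(ok), msg_(std::move(msg)) {}
  bool ok_;
  std::string msg_;
};

// Hypothetical stand-in for ScanResponsePB: the data field is optional.
struct FakeScanResponse {
  std::unique_ptr<std::string> data;  // null when the response carried no data

  // Hands ownership to the caller; the result is null if the field was unset.
  std::unique_ptr<std::string> release_data() { return std::move(data); }
};

// Hypothetical stand-in for the scanner state touched by NextBatch.
struct FakeScanner {
  bool data_in_open = false;
  FakeScanResponse last_response;

  Status NextBatch(std::string* out) {
    assert(out != nullptr);
    if (data_in_open) {
      data_in_open = false;
      std::unique_ptr<std::string> rows = last_response.release_data();
      // The guard: report an error instead of dereferencing a null pointer.
      if (rows == nullptr) {
        return Status::Corruption(
            "scanner is in open status, but scan response has no data");
      }
      *out = std::move(*rows);
      return Status::OK();
    }
    // No buffered data; a real scanner would issue the next RPC here.
    out->clear();
    return Status::OK();
  }
};
```

With this shape, a caller that checks the returned status (as the Impala change above does) sees an error it can log and propagate, rather than crashing on the missing data.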

People

Assignee: Unassigned

Reporter: zhangsong

Votes: 0

Watchers: 4

Dates

Created: 07/Dec/16 10:26

Updated: 02/Jun/20 16:06

Resolved: 02/Jun/20 16:06