Kudu / KUDU-1794

KuduScanner problem causing Impala crash

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: None
    • Fix Version/s: NA
    • Component/s: client, impala
    • Labels: None

    Description

      The impalad processes in my cluster sometimes crash. After studying the core file, I found that the crash is caused by a null pointer: the data field of ScanResponsePB is null.
      So I made a small modification to "NextBatch" in client.cc:
      "
      if (data_->data_in_open_) {
        // We have data from a previous scan.
        VLOG(1) << "Extracting data from scan " << ToString();
        data_->data_in_open_ = false;
        auto scan_response_data_ptr = data_->last_response_.release_data();
        if (PREDICT_FALSE(scan_response_data_ptr == nullptr)) {
          // The scanner claims to hold data, but the response carries none.
          return Status::Corruption(Substitute(
              "Kudu scanner against $0 is in open status, but scan resp has no data. "
              "Scan query: $1. Remote: $2. Response: $3",
              data_->table_->name(),
              data_->configuration().spec().ToString(*data_->table_->schema().schema_),
              data_->ts_->ToString(),
              data_->last_response_.DebugString()));
        }
        // ... (rest of the block unchanged)
      "
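
      The guard above can be sketched in isolation. This is only a minimal illustration: SketchStatus, SketchScanResponse and SketchScanner are hypothetical stand-ins, not the real Kudu classes, and the row data is modelled as a plain string. It shows the idea of turning a null released pointer into an error status instead of dereferencing it.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for kudu::Status: an ok flag plus a message.
struct SketchStatus {
  bool ok;
  std::string message;
  static SketchStatus OK() { return {true, ""}; }
  static SketchStatus Corruption(std::string msg) {
    return {false, "Corruption: " + std::move(msg)};
  }
};

// Hypothetical stand-in for ScanResponsePB: the data field may be absent.
struct SketchScanResponse {
  std::unique_ptr<std::string> data;  // null when the response carries no rows
  // Like a protobuf release_*() accessor: hands over ownership, may return null.
  std::string* release_data() { return data.release(); }
};

// Hypothetical stand-in for the scanner state that NextBatch touches.
struct SketchScanner {
  bool data_in_open = false;
  SketchScanResponse last_response;

  // Moves the buffered rows into *out, or reports an error if they are missing.
  SketchStatus NextBatch(std::string* out) {
    assert(out != nullptr);
    if (data_in_open) {
      data_in_open = false;
      // Take ownership so the buffer is freed on every path.
      std::unique_ptr<std::string> rows(last_response.release_data());
      if (rows == nullptr) {
        // Report the inconsistency instead of dereferencing a null pointer.
        return SketchStatus::Corruption(
            "scanner is open but the scan response has no data");
      }
      *out = std::move(*rows);
    }
    return SketchStatus::OK();
  }
};
```

      With this shape the caller sees a failed status and can log it or abort the query, rather than the process crashing on the null pointer.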

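      I also made a small change on the Impala side, so that the underlying error is logged before it is returned:
      "
      if (UNLIKELY(!status.ok())) {
        // Log the underlying Kudu error before returning it to the caller.
        LOG(ERROR) << "KuduScanner::GetNextScannerBatch ERROR[" << status.ToString() << "]";
        KUDU_RETURN_IF_ERROR(status, "unable to advance kudu iterator");
      }
      "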

      After these modifications, I found the following errors in impalad's log:
      "E1124 11:46:50.780480 15613 kudu-scanner.cc:422] KuduScanner::GetNextScannerBatch ERROR[Timed out: Scan RPC to 172.22.99.57:7050 timed out after 180.000s]
      "
      and
      "E1124 11:49:24.171380 16127 kudu-scanner.cc:422] KuduScanner::GetNextScannerBatch ERROR[Timed out: Scan RPC to 172.22.99.57:7050 timed out after 164.164s: Remote error: Service unavailable: Scan request on kudu.tserver.TabletServerService from 172.22.99.57:64537 dropped due to backpressure. The service queue is full; it has 150 items.]
      "
      and
      "E1124 11:49:24.171378 16128 kudu-scanner.cc:422] KuduScanner::GetNextScannerBatch ERROR[Timed out: Scan RPC to 172.22.99.57:7050 timed out after 121.593s: Not found: Scanner not found]"

      It seems there are various reasons why the data field of ScanResponsePB can be null, but impalad has no way of knowing about them.
      Maybe last_response_.has_more_results() should return false when this happens?
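
      To make the question concrete, here is a rough sketch of what that rule would mean for a scan loop. SketchResponse, EffectiveHasMoreResults and Drain are hypothetical names used only for illustration; none of this is the actual Kudu client code.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for one scan response.
struct SketchResponse {
  bool has_data;          // whether the data field is present
  bool has_more_results;  // what the server reported
  std::string data;       // the rows, modelled as a string
};

// Proposed rule: a response without data never claims more results.
bool EffectiveHasMoreResults(const SketchResponse& r) {
  return r.has_data && r.has_more_results;
}

// Consumes responses the way a scan loop would, stopping at the first
// response for which the effective flag is false.
std::vector<std::string> Drain(const std::vector<SketchResponse>& responses) {
  std::vector<std::string> batches;
  for (const SketchResponse& r : responses) {
    if (r.has_data) batches.push_back(r.data);
    if (!EffectiveHasMoreResults(r)) break;
  }
  return batches;
}
```

      In this sketch the loop stops cleanly at a response with no data, but any rows that would have followed are never read. If that silent truncation is not acceptable, returning an error status to the caller is probably the safer choice.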

          People

            Assignee: Unassigned
            Reporter: bruceSz zhangsong
            Votes: 0
            Watchers: 4