Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-5275

Avoid printing Status stack trace on hot paths

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Backend
    • Labels:

      Description

      Summary from Matthew Jacobs:

      There are 3 known issues to fix ASAP:
      1) In ImpalaServer::GetExecSummary(), do not create Status() while holding the lock.
      2) Change error status in impala::DeserializeThriftMsg<>() to use Status::Expected
      3) PartitionedAggregationNode::CodegenUpdateTuple() handling for unsupported types (e.g. char) should use Status::Expected
      Then separately (less time critical) we should investigate how to improve our stack walking as these issues may come up again in the future. I filed IMPALA-5651.

      More detailed initial report for #2 above, from Mostafa:

      When reading Parquet data attempts to find the Parquet header prints a stack trace to the log even if the error was found successfully.
      Also printing the stack trace without logging which Parquet file was being read makes the log message useless.

      Status::Status(const string& error_msg)
        : msg_(new ErrorMsg(TErrorCode::GENERAL, error_msg)) {
        VLOG(1) << msg_->msg() << "\n" << GetStackTrace();
      }
      

      Vtune data showed that capturing the stack trace is very expensive and spends 130ms CPU per invocation.

      This is the expensive call stack btw 
      CPU Time
      1 of 8: 77.9% (0.106s of 0.136s)
      
      vmlinux ! KSTK_ESP - [unknown source file]
      vmlinux ! vm_is_stack + 0xa1 - [unknown source file]
      vmlinux ! show_map_vma + 0x268 - [unknown source file]
      vmlinux ! show_map + 0x20 - [unknown source file]
      vmlinux ! show_pid_map + 0x12 - [unknown source file]
      vmlinux ! seq_read + 0x24f - [unknown source file]
      vmlinux ! vfs_read + 0x9d - [unknown source file]
      vmlinux ! SyS_read + 0x7e - [unknown source file]
      vmlinux ! tracesys + 0xdc - [unknown source file]
      libpthread-2.17.so ! __read + 0x2c - [unknown source file]
      impalad ! google::OpenObjectFileContainingPcAndGetStartAddress + 0x2cf - [unknown source file]
      impalad ! google::SymbolizeAndDemangle + 0x488 - [unknown source file]
      impalad ! google::DumpStackTrace.constprop.9 + 0x6e - [unknown source file]
      impalad ! impala::GetStackTrace + 0x1e - debug-util.cc:321
      impalad ! impala::Status::Status + 0x68 - status.cc:114
      impalad ! impala::DeserializeThriftMsg<parquet::PageHeader> + 0x2ce - thrift-util.h:128
      impalad ! impala::BaseScalarColumnReader::ReadPageHeader + 0x182 - parquet-column-readers.cc:787
      impalad ! impala::BaseScalarColumnReader::ReadDataPage + 0x9b - parquet-column-readers.cc:966
      impalad ! impala::BaseScalarColumnReader::NextPage + 0x6e - parquet-column-readers.cc:1093
      impalad ! ReadValueBatch<false> + 0x196 - parquet-column-readers.cc:306
      impalad ! impala::ScalarColumnReader<long, (bool)1>::ReadNonRepeatedValueBatch + 0x60 - parquet-column-readers.cc:253
      impalad ! impala::HdfsParquetScanner::AssembleRows + 0x313 - hdfs-parquet-scanner.cc:950
      impalad ! impala::HdfsParquetScanner::GetNextInternal + 0x277 - hdfs-parquet-scanner.cc:482
      impalad ! impala::HdfsParquetScanner::ProcessSplit + 0x73 - hdfs-parquet-scanner.cc:409
      impalad ! impala::HdfsScanNode::ProcessSplit + 0x395 - hdfs-scan-node.cc:526
      impalad ! impala::HdfsScanNode::ScannerThread + 0xbb8 - hdfs-scan-node.cc:417
      impalad ! boost::function0<void>::operator() + 0x1a - function_template.hpp:767
      

      What gets printed to the log

      I0501 14:13:24.154422 21797 status.cc:119] couldn't deserialize thrift msg:
      No more data to read.
         @           0x838a89  impala::Status::Status()
         @           0xc955af  impala::DeserializeThriftMsg<>()
         @           0xc8c973  impala::BaseScalarColumnReader::ReadPageHeader()
         @           0xc8e0fc  impala::BaseScalarColumnReader::ReadDataPage()
         @           0xc8ee6f  impala::BaseScalarColumnReader::NextPage()
         @           0xc9d5a8  impala::ScalarColumnReader<>::ReadNonRepeatedValueBatch()
         @           0xc72764  impala::HdfsParquetScanner::AssembleRows()
         @           0xc77238  impala::HdfsParquetScanner::GetNextInternal()
         @           0xc71e74  impala::HdfsParquetScanner::ProcessSplit()
         @           0xc4eab6  impala::HdfsScanNode::ProcessSplit()
         @           0xc50719  impala::HdfsScanNode::ScannerThread()
         @           0xbdd739  impala::Thread::SuperviseThread()
         @           0xbde184  boost::detail::thread_data<>::run()
         @           0xe4f3ba  (unknown)
         @       0x3797407aa1  (unknown)
         @       0x37970e893d  (unknown)
      

        Issue Links

          Activity

          Hide
          mmokhtar Mostafa Mokhtar added a comment -

          This also happens in codegen code path, bumping priority.

          1 of 22: 39.5% (43.658s of 110.538s)
          
          vmlinux ! KSTK_ESP - [unknown source file]
          vmlinux ! vm_is_stack + 0xa1 - [unknown source file]
          vmlinux ! show_map_vma + 0x268 - [unknown source file]
          vmlinux ! show_map + 0x20 - [unknown source file]
          vmlinux ! show_pid_map + 0x12 - [unknown source file]
          vmlinux ! seq_read + 0x24f - [unknown source file]
          vmlinux ! vfs_read + 0x9d - [unknown source file]
          vmlinux ! SyS_read + 0x7e - [unknown source file]
          vmlinux ! tracesys + 0xdc - [unknown source file]
          libpthread-2.17.so ! __read + 0x2c - [unknown source file]
          impalad ! func@0x1b90010 + 0x2cf - [unknown source file]
          impalad ! func@0x1b90cbe + 0x1ea - [unknown source file]
          impalad ! func@0x1b8f690 + 0x6e - [unknown source file]
          impalad ! impala::GetStackTrace + 0x1e - [unknown source file]
          impalad ! impala::Status::Status + 0x68 - [unknown source file]
          impalad ! impala::PartitionedAggregationNode::CodegenUpdateTuple + 0x73b - [unknown source file]
          impalad ! impala::PartitionedAggregationNode::CodegenProcessBatch + 0x99 - [unknown source file]
          impalad ! impala::PartitionedAggregationNode::Codegen + 0x68 - [unknown source file]
          impalad ! impala::ExecNode::Codegen + 0x30 - [unknown source file]
          impalad ! impala::TopNNode::Codegen + 0x1e - [unknown source file]
          
          Show
          mmokhtar Mostafa Mokhtar added a comment - This also happens in codegen code path, bumping priority. 1 of 22: 39.5% (43.658s of 110.538s) vmlinux ! KSTK_ESP - [unknown source file] vmlinux ! vm_is_stack + 0xa1 - [unknown source file] vmlinux ! show_map_vma + 0x268 - [unknown source file] vmlinux ! show_map + 0x20 - [unknown source file] vmlinux ! show_pid_map + 0x12 - [unknown source file] vmlinux ! seq_read + 0x24f - [unknown source file] vmlinux ! vfs_read + 0x9d - [unknown source file] vmlinux ! SyS_read + 0x7e - [unknown source file] vmlinux ! tracesys + 0xdc - [unknown source file] libpthread-2.17.so ! __read + 0x2c - [unknown source file] impalad ! func@0x1b90010 + 0x2cf - [unknown source file] impalad ! func@0x1b90cbe + 0x1ea - [unknown source file] impalad ! func@0x1b8f690 + 0x6e - [unknown source file] impalad ! impala::GetStackTrace + 0x1e - [unknown source file] impalad ! impala::Status::Status + 0x68 - [unknown source file] impalad ! impala::PartitionedAggregationNode::CodegenUpdateTuple + 0x73b - [unknown source file] impalad ! impala::PartitionedAggregationNode::CodegenProcessBatch + 0x99 - [unknown source file] impalad ! impala::PartitionedAggregationNode::Codegen + 0x68 - [unknown source file] impalad ! impala::ExecNode::Codegen + 0x30 - [unknown source file] impalad ! impala::TopNNode::Codegen + 0x1e - [unknown source file]
          Hide
          mjacobs Matthew Jacobs added a comment -

          Mostafa Mokhtar I'm confused by the JIRA body text which says this prints a stack trace even if the file is found. However in this example the status isn't OK, it's "couldn't deserialize thrift msg". The code in Status which logs the stack is only called when there is some non-OK status.

          Show
          mjacobs Matthew Jacobs added a comment - Mostafa Mokhtar I'm confused by the JIRA body text which says this prints a stack trace even if the file is found. However in this example the status isn't OK, it's "couldn't deserialize thrift msg". The code in Status which logs the stack is only called when there is some non-OK status.
          Hide
          alex.behm Alexander Behm added a comment -

          Matthew Jacobs, the problem here is that in this code path the non-OK status is actually expected and does not cause the query to fail. So logging this expected non-OK status is not useful.

          Show
          alex.behm Alexander Behm added a comment - Matthew Jacobs , the problem here is that in this code path the non-OK status is actually expected and does not cause the query to fail. So logging this expected non-OK status is not useful.
          Hide
          mjacobs Matthew Jacobs added a comment -

          Alexander Behm Ah makes sense, thanks for clarifying.

          Show
          mjacobs Matthew Jacobs added a comment - Alexander Behm Ah makes sense, thanks for clarifying.
          Hide
          mmokhtar Mostafa Mokhtar added a comment -

          Matthew Jacobs
          There is a loop where BaseScalarColumnReader keeps reading till it finds the actual header size

          https://github.com/apache/incubator-impala/blob/master/be/src/exec/parquet-column-readers.cc#L806

            // We don't know the actual header size until the thrift object is deserialized.  Loop
            // until we successfully deserialize the header or exceed the maximum header size.
            uint32_t header_size;
            Status status;
            while (true) {
              header_size = buffer_size;
              status = DeserializeThriftMsg(buffer, &header_size, true, next_page_header);
              if (status.ok()) break;
          
              if (buffer_size >= FLAGS_max_page_header_size) {
                stringstream ss;
                ss << "ParquetScanner: could not read data page because page header exceeded "
                   << "maximum size of "
                   << PrettyPrinter::Print(FLAGS_max_page_header_size, TUnit::BYTES);
                status.AddDetail(ss.str());
                return status;
              }
          
              // Didn't read entire header, increase buffer size and try again
              int64_t new_buffer_size = max<int64_t>(buffer_size * 2, 1024);
              status = Status::OK();
              bool success = stream_->GetBytes(
                  new_buffer_size, &buffer, &new_buffer_size, &status, /* peek */ true);
              if (!success) {
                DCHECK(!status.ok());
                return status;
              }
              DCHECK(status.ok());
          
              // Even though we increased the allowed buffer size, the number of bytes
              // read did not change. The header is not limited by the buffer space,
              // so it must be incomplete in the file.
              if (buffer_size == new_buffer_size) {
                DCHECK_NE(new_buffer_size, 0);
                return Status(TErrorCode::PARQUET_HEADER_EOF, filename());
              }
              DCHECK_GT(new_buffer_size, buffer_size);
              buffer_size = new_buffer_size;
            }
          
          Show
          mmokhtar Mostafa Mokhtar added a comment - Matthew Jacobs There is a loop where BaseScalarColumnReader keeps reading till it finds the actual header size https://github.com/apache/incubator-impala/blob/master/be/src/exec/parquet-column-readers.cc#L806 // We don't know the actual header size until the thrift object is deserialized. Loop // until we successfully deserialize the header or exceed the maximum header size. uint32_t header_size; Status status; while ( true ) { header_size = buffer_size; status = DeserializeThriftMsg(buffer, &header_size, true , next_page_header); if (status.ok()) break ; if (buffer_size >= FLAGS_max_page_header_size) { stringstream ss; ss << "ParquetScanner: could not read data page because page header exceeded " << "maximum size of " << PrettyPrinter::Print(FLAGS_max_page_header_size, TUnit::BYTES); status.AddDetail(ss.str()); return status; } // Didn't read entire header, increase buffer size and try again int64_t new_buffer_size = max<int64_t>(buffer_size * 2, 1024); status = Status::OK(); bool success = stream_->GetBytes( new_buffer_size, &buffer, &new_buffer_size, &status, /* peek */ true ); if (!success) { DCHECK(!status.ok()); return status; } DCHECK(status.ok()); // Even though we increased the allowed buffer size, the number of bytes // read did not change. The header is not limited by the buffer space, // so it must be incomplete in the file. if (buffer_size == new_buffer_size) { DCHECK_NE(new_buffer_size, 0); return Status(TErrorCode::PARQUET_HEADER_EOF, filename()); } DCHECK_GT(new_buffer_size, buffer_size); buffer_size = new_buffer_size; }
          Hide
          mjacobs Matthew Jacobs added a comment -

          So then it seems like DeserializeThriftMsg shouldn't be returning a Status, or at least not an error status for this code path. Or do y'all have something else in mind?

          Show
          mjacobs Matthew Jacobs added a comment - So then it seems like DeserializeThriftMsg shouldn't be returning a Status, or at least not an error status for this code path. Or do y'all have something else in mind?
          Hide
          dhecht Dan Hecht added a comment -

          Use Status::Expected()?

          Show
          dhecht Dan Hecht added a comment - Use Status::Expected() ?
          Hide
          mmokhtar Mostafa Mokhtar added a comment -

          I recommend that Impala would stop trying to print out the stack using KSTK_ESP.
          When canceling queries the system becomes unresponsive due to the process wide locking involved in printing the stack

          I0623 08:23:13.546174 181030 status.cc:122] Query id 7349c3a708d4fe94:30e42f6400000000 not found.
              @           0x83ebf9  impala::Status::Status()
              @           0xa95e75  impala::ImpalaServer::GetExecSummary()
              @           0xadd569  impala::ImpalaServer::GetExecSummary()
              @           0xd5fce0  impala::ImpalaServiceProcessor::process_GetExecSummary()
              @           0xd5780c  impala::ImpalaServiceProcessor::dispatchCall()
              @           0x80f06c  apache::thrift::TDispatchProcessor::process()
              @          0x1b5945b  apache::thrift::server::TThreadPoolServer::Task::run()
              @          0x1b41019  apache::thrift::concurrency::ThreadManager::Worker::run()
              @           0x9d2d59  impala::ThriftThread::RunRunnable()
              @           0x9d37b2  boost::detail::function::void_function_obj_invoker0<>::invoke()
              @           0xbcaca2  impala::Thread::SuperviseThread()
              @           0xbcb544  boost::detail::thread_data<>::run()
              @           0xe641ca  thread_proxy
              @     0x7fa8f231ddc5  start_thread
              @     0x7fa8f204c76d  __clone
          I0623 08:23:13.549197 181030 impala-beeswax-server.cc:233] close(): query_id=7349c3a708d4fe94:30e42f6400000000
          I0623 08:23:13.554811 181030 impala-server.cc:947] UnregisterQuery(): query_id=7349c3a708d4fe94:30e42f6400000000
          I0623 08:23:13.554823 181030 impala-server.cc:1032] Cancel(): query_id=7349c3a708d4fe94:30e42f6400000000
          
          #0  0x00007fa8f23241bd in __lll_lock_wait () from /lib64/libpthread.so.0
          #1  0x00007fa8f231fd02 in _L_lock_791 () from /lib64/libpthread.so.0
          #2  0x00007fa8f231fc08 in pthread_mutex_lock () from /lib64/libpthread.so.0
          #3  0x0000000000820a48 in pthread_mutex_lock (m=0x1131b520) at /data/jenkins/workspace/impala-private-build-binaries/repos/Impala/toolchain/boost-1.57.0-p2/include/boost/thread/pthread/mutex.hpp:62
          #4  boost::mutex::lock (this=this@entry=0x1131b520) at /data/jenkins/workspace/impala-private-build-binaries/repos/Impala/toolchain/boost-1.57.0-p2/include/boost/thread/pthread/mutex.hpp:116
          #5  0x0000000000a95db0 in lock_guard (m_=..., this=<synthetic pointer>) at /data/jenkins/workspace/impala-private-build-binaries/repos/Impala/toolchain/boost-1.57.0-p2/include/boost/thread/lock_guard.hpp:38
          #6  impala::ImpalaServer::GetExecSummary (this=this@entry=0x1131b500, query_id=..., user=..., result=result@entry=0x7fa7c5ade758) at /data/jenkins/workspace/impala-private-build-binaries/repos/Impala/be/src/service/impala-server.cc:662
          #7  0x0000000000add569 in impala::ImpalaServer::GetExecSummary (this=0x1131b500, result=..., handle=...) at /data/jenkins/workspace/impala-private-build-binaries/repos/Impala/be/src/service/impala-beeswax-server.cc:387
          #8  0x0000000000d5fce0 in impala::ImpalaServiceProcessor::process_GetExecSummary (this=0x10f07d60, seqid=0, iprot=0x7fa4ba815320, oprot=0x7fa4ba8152f0, callContext=<optimized out>) at /data/jenkins/workspace/impala-private-build-binaries/repos/Impala/be/generated-sources/gen-cpp/ImpalaService.cpp:2038
          #9  0x0000000000d5780c in impala::ImpalaServiceProcessor::dispatchCall (this=0x10f07d60, iprot=0x7fa4ba815320, oprot=0x7fa4ba8152f0, fname=..., seqid=0, callContext=0x4f813800) at /data/jenkins/workspace/impala-private-build-binaries/repos/Impala/be/generated-sources/gen-cpp/ImpalaService.cpp:1675
          #10 0x000000000080f06c in apache::thrift::TDispatchProcessor::process (this=0x10f07d60, in=..., out=..., connectionContext=0x4f813800) at /data/jenkins/workspace/impala-private-build-binaries/repos/Impala/toolchain/thrift-0.9.0-p9/include/thrift/TDispatchProcessor.h:121
          #11 0x0000000001b5945b in apache::thrift::server::TThreadPoolServer::Task::run() ()No symbol table info available.
          #12 0x0000000001b41019 in apache::thrift::concurrency::ThreadManager::Worker::run() ()No symbol table info available.
          
          Show
          mmokhtar Mostafa Mokhtar added a comment - I recommend that Impala would stop trying to print out the stack using KSTK_ESP. When canceling queries the system becomes unresponsive due to the process wide locking involved in printing the stack I0623 08:23:13.546174 181030 status.cc:122] Query id 7349c3a708d4fe94:30e42f6400000000 not found. @ 0x83ebf9 impala::Status::Status() @ 0xa95e75 impala::ImpalaServer::GetExecSummary() @ 0xadd569 impala::ImpalaServer::GetExecSummary() @ 0xd5fce0 impala::ImpalaServiceProcessor::process_GetExecSummary() @ 0xd5780c impala::ImpalaServiceProcessor::dispatchCall() @ 0x80f06c apache::thrift::TDispatchProcessor::process() @ 0x1b5945b apache::thrift::server::TThreadPoolServer::Task::run() @ 0x1b41019 apache::thrift::concurrency::ThreadManager::Worker::run() @ 0x9d2d59 impala::ThriftThread::RunRunnable() @ 0x9d37b2 boost::detail::function::void_function_obj_invoker0<>::invoke() @ 0xbcaca2 impala:: Thread ::SuperviseThread() @ 0xbcb544 boost::detail::thread_data<>::run() @ 0xe641ca thread_proxy @ 0x7fa8f231ddc5 start_thread @ 0x7fa8f204c76d __clone I0623 08:23:13.549197 181030 impala-beeswax-server.cc:233] close(): query_id=7349c3a708d4fe94:30e42f6400000000 I0623 08:23:13.554811 181030 impala-server.cc:947] UnregisterQuery(): query_id=7349c3a708d4fe94:30e42f6400000000 I0623 08:23:13.554823 181030 impala-server.cc:1032] Cancel(): query_id=7349c3a708d4fe94:30e42f6400000000 #0 0x00007fa8f23241bd in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x00007fa8f231fd02 in _L_lock_791 () from /lib64/libpthread.so.0 #2 0x00007fa8f231fc08 in pthread_mutex_lock () from /lib64/libpthread.so.0 #3 0x0000000000820a48 in pthread_mutex_lock (m=0x1131b520) at /data/jenkins/workspace/impala- private -build-binaries/repos/Impala/toolchain/boost-1.57.0-p2/include/boost/thread/pthread/mutex.hpp:62 #4 boost::mutex::lock ( this = this @entry=0x1131b520) at /data/jenkins/workspace/impala- private -build-binaries/repos/Impala/toolchain/boost-1.57.0-p2/include/boost/thread/pthread/mutex.hpp:116 #5 0x0000000000a95db0 in lock_guard (m_=..., this =<synthetic pointer>) at /data/jenkins/workspace/impala- private -build-binaries/repos/Impala/toolchain/boost-1.57.0-p2/include/boost/thread/lock_guard.hpp:38 #6 impala::ImpalaServer::GetExecSummary ( this = this @entry=0x1131b500, query_id=..., user=..., result=result@entry=0x7fa7c5ade758) at /data/jenkins/workspace/impala- private -build-binaries/repos/Impala/be/src/service/impala-server.cc:662 #7 0x0000000000add569 in impala::ImpalaServer::GetExecSummary ( this =0x1131b500, result=..., handle=...) at /data/jenkins/workspace/impala- private -build-binaries/repos/Impala/be/src/service/impala-beeswax-server.cc:387 #8 0x0000000000d5fce0 in impala::ImpalaServiceProcessor::process_GetExecSummary ( this =0x10f07d60, seqid=0, iprot=0x7fa4ba815320, oprot=0x7fa4ba8152f0, callContext=<optimized out>) at /data/jenkins/workspace/impala- private -build-binaries/repos/Impala/be/generated-sources/gen-cpp/ImpalaService.cpp:2038 #9 0x0000000000d5780c in impala::ImpalaServiceProcessor::dispatchCall ( this =0x10f07d60, iprot=0x7fa4ba815320, oprot=0x7fa4ba8152f0, fname=..., seqid=0, callContext=0x4f813800) at /data/jenkins/workspace/impala- private -build-binaries/repos/Impala/be/generated-sources/gen-cpp/ImpalaService.cpp:1675 #10 0x000000000080f06c in apache::thrift::TDispatchProcessor::process ( this =0x10f07d60, in=..., out=..., connectionContext=0x4f813800) at /data/jenkins/workspace/impala- private -build-binaries/repos/Impala/toolchain/thrift-0.9.0-p9/include/thrift/TDispatchProcessor.h:121 #11 0x0000000001b5945b in apache::thrift::server::TThreadPoolServer::Task::run() ()No symbol table info available. #12 0x0000000001b41019 in apache::thrift::concurrency::ThreadManager::Worker::run() ()No symbol table info available.
          Hide
          mmokhtar Mostafa Mokhtar added a comment -

          Canceling queries on a large cluster makes the coordinator node unresponsive due to the code involved in fetching the stacks

          Samples: 3M of event 'cycles', Event count (approx.): 2037310401779
          +   93.73%     0.01%       impalad  [kernel.kallsyms]              [k] system_call_fastpath         ◆
          +   91.22%     0.07%       impalad  libpthread-2.12.so             [.] llseek                       ▒
          +   91.01%     0.02%       impalad  [kernel.kallsyms]              [k] sys_lseek                    ▒
          +   90.37%     0.01%       impalad  [kernel.kallsyms]              [k] vfs_llseek                   ▒
          +   90.36%     0.42%       impalad  [kernel.kallsyms]              [k] ext4_llseek                  ▒
          +   89.94%     0.03%       impalad  [kernel.kallsyms]              [k] generic_file_llseek_size     ▒
          +   88.11%     0.93%       impalad  [kernel.kallsyms]              [k] mutex_lock                   ▒
          +   87.16%     1.14%       impalad  [kernel.kallsyms]              [k] __mutex_lock_slowpath        ▒
          -   85.15%    85.15%       impalad  [kernel.kallsyms]              [k] osq_lock                     ▒
             - osq_lock                                                                                       ▒
                - 99.99% __mutex_lock_slowpath                                                                ▒
                     mutex_lock                                                                               ▒
                     generic_file_llseek_size                                                                 ▒
                     ext4_llseek                                                                              ▒
                     vfs_llseek                                                                               ▒
                     sys_lseek                                                                                ▒
                     system_call_fastpath                                                                     ▒
                   - llseek                                                                                   ▒
                        4.27% 0                                                                               ▒
                        0.58% _ZN6googleL14DumpStackTraceEiPFvPKcPvES2_.constprop.9                           ▒
          +    3.54%     0.00%       impalad  [unknown]                      [.] 0000000000000000             ▒
          +    2.73%     0.06%       impalad  libpthread-2.12.so             [.] 0x000000000000e82d           ▒
          +    2.55%     0.02%       impalad  [kernel.kallsyms]              [k] sys_read                     ▒
          +    2.43%     2.19%     cmf-agent  libc-2.12.so                   [.] memcpy                       ▒
          +    2.32%     1.01%       impalad  [kernel.kallsyms]              [k] mutex_unlock                 ▒
          +    2.24%     0.37%       impalad  [kernel.kallsyms]              [k] vfs_read                     ▒
          +    1.50%     0.03%       impalad  [kernel.kallsyms]              [k] do_sync_read                 ▒
          +    1.45%     0.42%       impalad  [kernel.kallsyms]              [k] generic_file_aio_read        ▒
          +    1.38%     1.38%       impalad  [kernel.kallsyms]              [k] _spin_lock                   ▒
          +    1.31%     0.19%       impalad  [kernel.kallsyms]              [k] __mutex_unlock_slowpath      ▒
          +    0.70%     0.00%          init  [kernel.kallsyms]              [k] start_secondary              ▒
          +    0.69%     0.01%          init  [kernel.kallsyms]              [k] cpu_idle                     ▒
          +    0.60%     0.06%       impalad  [kernel.kallsyms]              [k] fget_light_pos
          
          Show
          mmokhtar Mostafa Mokhtar added a comment - Canceling queries on a large cluster makes the coordinator node unresponsive due to the code involved in fetching the stacks Samples: 3M of event 'cycles', Event count (approx.): 2037310401779 + 93.73% 0.01% impalad [kernel.kallsyms] [k] system_call_fastpath ◆ + 91.22% 0.07% impalad libpthread-2.12.so [.] llseek ▒ + 91.01% 0.02% impalad [kernel.kallsyms] [k] sys_lseek ▒ + 90.37% 0.01% impalad [kernel.kallsyms] [k] vfs_llseek ▒ + 90.36% 0.42% impalad [kernel.kallsyms] [k] ext4_llseek ▒ + 89.94% 0.03% impalad [kernel.kallsyms] [k] generic_file_llseek_size ▒ + 88.11% 0.93% impalad [kernel.kallsyms] [k] mutex_lock ▒ + 87.16% 1.14% impalad [kernel.kallsyms] [k] __mutex_lock_slowpath ▒ - 85.15% 85.15% impalad [kernel.kallsyms] [k] osq_lock ▒ - osq_lock ▒ - 99.99% __mutex_lock_slowpath ▒ mutex_lock ▒ generic_file_llseek_size ▒ ext4_llseek ▒ vfs_llseek ▒ sys_lseek ▒ system_call_fastpath ▒ - llseek ▒ 4.27% 0 ▒ 0.58% _ZN6googleL14DumpStackTraceEiPFvPKcPvES2_.constprop.9 ▒ + 3.54% 0.00% impalad [unknown] [.] 0000000000000000 ▒ + 2.73% 0.06% impalad libpthread-2.12.so [.] 0x000000000000e82d ▒ + 2.55% 0.02% impalad [kernel.kallsyms] [k] sys_read ▒ + 2.43% 2.19% cmf-agent libc-2.12.so [.] memcpy ▒ + 2.32% 1.01% impalad [kernel.kallsyms] [k] mutex_unlock ▒ + 2.24% 0.37% impalad [kernel.kallsyms] [k] vfs_read ▒ + 1.50% 0.03% impalad [kernel.kallsyms] [k] do_sync_read ▒ + 1.45% 0.42% impalad [kernel.kallsyms] [k] generic_file_aio_read ▒ + 1.38% 1.38% impalad [kernel.kallsyms] [k] _spin_lock ▒ + 1.31% 0.19% impalad [kernel.kallsyms] [k] __mutex_unlock_slowpath ▒ + 0.70% 0.00% init [kernel.kallsyms] [k] start_secondary ▒ + 0.69% 0.01% init [kernel.kallsyms] [k] cpu_idle ▒ + 0.60% 0.06% impalad [kernel.kallsyms] [k] fget_light_pos
          Hide
          mjacobs Matthew Jacobs added a comment -

          Mostafa Mokhtar, the stack in your comment doesn't look like an issue of printing the stack via the Status. It looks like it's contended on this lock:

          Status ImpalaServer::GetExecSummary(const TUniqueId& query_id, const string& user,
              TExecSummary* result) {
          ...
            // Look for the query in completed query log.
            {
              lock_guard<mutex> l(query_log_lock_);             //  <-----------------  impala-server.cc:662
          ...
          

          Seems like a different issue, no?

          Show
          mjacobs Matthew Jacobs added a comment - Mostafa Mokhtar , the stack in your comment doesn't look like an issue of printing the stack via the Status. It looks like it's contended on this lock: Status ImpalaServer::GetExecSummary( const TUniqueId& query_id, const string& user, TExecSummary* result) { ... // Look for the query in completed query log. { lock_guard<mutex> l(query_log_lock_); // <----------------- impala-server.cc:662 ... Seems like a different issue, no?
          Hide
          mmokhtar Mostafa Mokhtar added a comment -

          While one thread is calling DumpStackTrace via Status() the remaining threads get blocked on that lock.
          The attached thread dumps help explain the contention.

            {
              lock_guard<mutex> l(query_log_lock_);
              QueryLogIndex::const_iterator query_record = query_log_index_.find(query_id);
              if (query_record == query_log_index_.end()) {
                stringstream ss;
                ss << "Query id " << PrintId(query_id) << " not found.";
                return Status(ss.str());
              }
          
          #0  0x00007fa8f232449d in read () from /lib64/libpthread.so.0No symbol table info available.
          #1  0x0000000001ba9020 in google::OpenObjectFileContainingPcAndGetStartAddress(unsigned long, unsigned long&, unsigned long&, char*, int) ()No symbol table info available.
          #2  0x0000000001ba9be9 in google::SymbolizeAndDemangle(void*, char*, int) ()No symbol table info available.
          #3  0x0000000001ba843f in google::DumpStackTrace(int, void (*)(char const*, void*), void*) [clone .constprop.9] ()No symbol table info available.
          #4  0x0000000000b556af in impala::GetStackTrace () at /data/jenkins/workspace/impala-private-build-binaries/repos/Impala/be/src/util/debug-util.cc:324        s = {static npos = <optimized out>, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x76d1ed18 "    @", ' ' <repeats 11 times>, "0x83ebf9  impala::Status::Status()\n    @", ' ' <repeats 11 times>, "0xa95e75  impala::ImpalaServer::GetExecSummary()\n    @", ' ' <repeats 11 times>, "0xadd569  impala::ImpalaServer::GetExecSummary()\n    @", ' ' <repeats 11 times>, "0xd"...}}
          #5  0x000000000083ebf9 in impala::Status::Status (this=0x7fa784a5c480, error_msg=...) at /data/jenkins/workspace/impala-private-build-binaries/repos/Impala/be/src/common/status.cc:122        vlocal__ = 0x28d45cc <fLI::FLAGS_v>
          #6  0x0000000000a95e75 in impala::ImpalaServer::GetExecSummary (this=this@entry=0x1131b500, query_id=..., user=..., result=result@entry=0x7fa784a5c758) at /data/jenkins/workspace/impala-private-build-binaries/repos/Impala/be/src/service/impala-server.cc:667        ss = {<std::basic_iostream<char, std::char_traits<char> >> = {<std::basic_istream<char, std::char_traits<char> >> = {<std::basic_ios<char, std::char_traits<char> >> = {<std::ios_base> = {_vptr.ios_base = 0x7fa8f2d42c68 <vtable for std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >+104>, static boolalpha = std::_S_boolalpha, static dec = std::_S_dec, static fixed = std::_S_fixed, static hex = std::_S_hex, static internal = std::_S_internal, static left = std::_S_left, static oct = std::_S_oct, static right = std::_S_right, static scientific = std::_S_scientific, static showbase = std::_S_showbase, static showpoint = std::_S_showpoint, static showpos = std::_S_showpos, static skipws = std::_S_skipws, static unitbuf = std::_S_unitbuf, static uppercase = std::_S_uppercase, static adjustfield = std::_S_adjustfield, static basefield = std::_S_basefield, static floatfield = std::_S_floatfield, static badbit = std::_S_badbit, static eofbit = std::_S_eofbit, static failbit = std::_S_failbit, static goodbit = std::_S_goodbit, static app = std::_S_app, static ate = std::_S_ate, static binary = std::_S_bin, static in = std::_S_in, static out = std::_S_out, static trunc = std::_S_trunc, static beg = std::_S_beg, static cur = std::_S_cur, static end = std::_S_end, _M_precision = 6, _M_width = 0, _M_flags = 4098, _M_exception = std::_S_goodbit, _M_streambuf_state = std::_S_goodbit, _M_callbacks = 0x0, _M_word_zero = {_M_pword = 0x0, _M_iword = 0}, _M_local_word = {{_M_pword = 0x0, _M_iword = 0}, {_M_pword = 0x0, _M_iword = 0}, {_M_pword = 0x0, _M_iword = 0}, {_M_pword = 0x0, _M_iword = 0}, {_M_pword = 0x0, _M_iword = 0}, {_M_pword = 0x0, _M_iword = 0}, {_M_pword = 0x0, _M_iword = 0}, {_M_pword = 0x0, _M_iword = 0}}, _M_word_size = 8, _M_word = 0x7fa784a5c2e8, _M_ios_locale = {static none = 0, static ctype = 1, static numeric = 2, static collate = 4, static time = 8, static monetary = 16, static messages = 32, static all = 63, _M_impl = 0x7fa8f2d5c1c0 <(anonymous namespace)::c_locale_impl>, static _S_classic = 0x7fa8f2d5c1c0 <(anonymous namespace)::c_locale_impl>, static _S_global = 0x7fa8f2d5c1c0 <(anonymous namespace)::c_locale_impl>, static _S_categories = 0x7fa8f2d3eac0 <__gnu_cxx::category_names>, static _S_once = 2}}, _M_tie = 0x0, _M_fill = 0 '\000', _M_fill_init = false, _M_streambuf = 0x7fa784a5c258, _M_ctype = 0x7fa8f2d5bd00 <(anonymous namespace)::ctype_c>, _M_num_put = 0x7fa8f2d5bc70 <(anonymous namespace)::num_put_c>, _M_num_get = 0x7fa8f2d5bc80 <(anonymous namespace)::num_get_c>}, _vptr.basic_istream = 0x7fa8f2d42c18 <vtable for std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >+24>, _M_gcount = 0}, <std::basic_ostream<char, std::char_traits<char> >> = {_vptr.basic_ostream = 0x7fa8f2d42c40 <vtable for std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >+64>}, <No data fields>}, _M_stringbuf = {<std::basic_streambuf<char, std::char_traits<char> >> = {_vptr.basic_streambuf = 0x7fa8f2d42790 <vtable for std::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >+16>, _M_in_beg = 0x7fa385438918 "Query id c4404c4ece32dd6:c5fafb8100000000 not found.7KgUF8/uYnle5/U/hEioiGhEaKhoRIqKhEqIl8gICQmJlFBJUQmRCDmkiET4f56Z3WV2Zp5hZlkRaBYXmdszzzzX7/fzvJ9nvNornT29/xaBG1ZhE4P1U7VqhXKpQrpUqVHgSqUiyFe9VIJrZSqF"..., _M_in_cur = 0x7fa385438918 "Query id c4404c4ece32dd6:c5fafb8100000000 not found.7KgUF8/uYnle5/U/hEioiGhEaKhoRIqKhEqIl8gICQmJlFBJUQmRCDmkiET4f56Z3WV2Zp5hZlkRaBYXmdszzzzX7/fzvJ9nvNornT29/xaBG1ZhE4P1U7VqhXKpQrpUqVHgSqUiyFe9VIJrZSqF"..., _M_in_end = 0x7fa385438919 "uery id c4404c4ece32dd6:c5fafb8100000000 not found.7KgUF8/uYnle5/U/hEioiGhEaKhoRIqKhEqIl8gICQmJlFBJUQmRCDmkiET4f56Z3WV2Zp5hZlkRaBYXmdszzzzX7/fzvJ9nvNornT29/xaBG1ZhE4P1U7VqhXKpQrpUqVHgSqUiyFe9VIJrZSqFx"..., _M_out_beg = 0x7fa385438918 "Query id c4404c4ece32dd6:c5fafb8100000000 not found.7KgUF8/uYnle5/U/hEioiGhEaKhoRIqKhEqIl8gICQmJlFBJUQmRCDmkiET4f56Z3WV2Zp5hZlkRaBYXmdszzzzX7/fzvJ9nvNornT29/xaBG1ZhE4P1U7VqhXKpQrpUqVHgSqUiyFe9VIJrZSqF"..., _M_out_cur = 0x7fa38543894c "7KgUF8/uYnle5/U/hEioiGhEaKhoRIqKhEqIl8gICQmJlFBJUQmRCDmkiET4f56Z3WV2Zp5hZlkRaBYXmdszzzzX7/fzvJ9nvNornT29/xaBG1ZhE4P1U7VqhXKpQrpUqVHgSqUiyFe9VIJrZSqFxPyZNM7Ny9tz9OwwXZApeCW+MNykC1kYHIqPc/dwcvIcYrvt4Tx6"..., _M_out_end = 0x7fa385438b18 "", _M_buf_locale = {static none = 0, static ctype = 1, static numeric = 2, static collate = 4, static time = 8, static monetary = 16, static messages = 32, static all = 63, _M_impl = 0x7fa8f2d5c1c0 <(anonymous namespace)::c_locale_impl>, static _S_classic = 0x7fa8f2d5c1c0 <(anonymous namespace)::c_locale_impl>, static _S_global = 0x7fa8f2d5c1c0 <(anonymous namespace)::c_locale_impl>, static _S_categories = 0x7fa8f2d3eac0 <__gnu_cxx::category_names>, static _S_once = 2}}, _M_mode = (std::_S_in | std::_S_out), _M_string = {static npos = <optimized out>, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x7fa385438918 "Query id c4404c4ece32dd6:c5fafb8100000000 not found.7KgUF8/uYnle5/U/hEioiGhEaKhoRIqKhEqIl8gICQmJlFBJUQmRCDmkiET4f56Z3WV2Zp5hZlkRaBYXmdszzzzX7/fzvJ9nvNornT29/xaBG1ZhE4P1U7VqhXKpQrpUqVHgSqUiyFe9VIJrZSqF"...}}}}        l = {m = @0x1131b520}        query_record = <optimized out>
          
          #0  0x00007fa8f23241bd in __lll_lock_wait () from /lib64/libpthread.so.0 No symbol table info available.
          #1  0x00007fa8f231fd02 in _L_lock_791 () from /lib64/libpthread.so.0 No symbol table info available.
          #2  0x00007fa8f231fc08 in pthread_mutex_lock () from /lib64/libpthread.so.0 No symbol table info available.
          #3  0x0000000000820a48 in pthread_mutex_lock (m=0x1131b520) at /data/jenkins/workspace/impala-private-build-binaries/repos/Impala/toolchain/boost-1.57.0-p2/include/boost/thread/pthread/mutex.hpp:62        ret = <optimized out>
          #4  boost::mutex::lock (this=this@entry=0x1131b520) at /data/jenkins/workspace/impala-private-build-binaries/repos/Impala/toolchain/boost-1.57.0-p2/include/boost/thread/pthread/mutex.hpp:116        res = <optimized out>
          #5  0x0000000000a95db0 in lock_guard (m_=..., this=<synthetic pointer>) at /data/jenkins/workspace/impala-private-build-binaries/repos/Impala/toolchain/boost-1.57.0-p2/include/boost/thread/lock_guard.hpp:38No locals.
          #6  impala::ImpalaServer::GetExecSummary (this=this@entry=0x1131b500, query_id=..., user=..., result=result@entry=0x7fa7c52dd758) at /data/jenkins/workspace/impala-private-build-binaries/repos/Impala/be/src/service/impala-server.cc:662        l = {m = @0x1131b520}        query_record = <optimized out>
          
          Show
          mmokhtar Mostafa Mokhtar added a comment - While one thread is calling DumpStackTrace via Status() the remaining threads get blocked on that lock. The attached thread dumps help explain the contention. { lock_guard<mutex> l(query_log_lock_); QueryLogIndex::const_iterator query_record = query_log_index_.find(query_id); if (query_record == query_log_index_.end()) { stringstream ss; ss << "Query id " << PrintId(query_id) << " not found." ; return Status(ss.str()); } #0 0x00007fa8f232449d in read () from /lib64/libpthread.so.0No symbol table info available. #1 0x0000000001ba9020 in google::OpenObjectFileContainingPcAndGetStartAddress(unsigned long , unsigned long &, unsigned long &, char *, int ) ()No symbol table info available. #2 0x0000000001ba9be9 in google::SymbolizeAndDemangle(void*, char *, int ) ()No symbol table info available. #3 0x0000000001ba843f in google::DumpStackTrace( int , void (*)( char const *, void*), void*) [clone .constprop.9] ()No symbol table info available. #4 0x0000000000b556af in impala::GetStackTrace () at /data/jenkins/workspace/impala- private -build-binaries/repos/Impala/be/src/util/debug-util.cc:324 s = { static npos = <optimized out>, _M_dataplus = {<std::allocator< char >> = {<__gnu_cxx::new_allocator< char >> = {<No data fields>}, <No data fields>}, _M_p = 0x76d1ed18 " @" , ' ' <repeats 11 times>, "0x83ebf9 impala::Status::Status()\n @" , ' ' <repeats 11 times>, "0xa95e75 impala::ImpalaServer::GetExecSummary()\n @" , ' ' <repeats 11 times>, "0xadd569 impala::ImpalaServer::GetExecSummary()\n @" , ' ' <repeats 11 times>, "0xd" ...}} #5 0x000000000083ebf9 in impala::Status::Status ( this =0x7fa784a5c480, error_msg=...) at /data/jenkins/workspace/impala- private -build-binaries/repos/Impala/be/src/common/status.cc:122 vlocal__ = 0x28d45cc <fLI::FLAGS_v> #6 0x0000000000a95e75 in impala::ImpalaServer::GetExecSummary ( this = this @entry=0x1131b500, query_id=..., user=..., result=result@entry=0x7fa784a5c758) at /data/jenkins/workspace/impala- private -build-binaries/repos/Impala/be/src/service/impala-server.cc:667 ss = {<std::basic_iostream< char , std::char_traits< char > >> = {<std::basic_istream< char , std::char_traits< char > >> = {<std::basic_ios< char , std::char_traits< char > >> = {<std::ios_base> = {_vptr.ios_base = 0x7fa8f2d42c68 <vtable for std::basic_stringstream< char , std::char_traits< char >, std::allocator< char > >+104>, static boolalpha = std::_S_boolalpha, static dec = std::_S_dec, static fixed = std::_S_fixed, static hex = std::_S_hex, static internal = std::_S_internal, static left = std::_S_left, static oct = std::_S_oct, static right = std::_S_right, static scientific = std::_S_scientific, static showbase = std::_S_showbase, static showpoint = std::_S_showpoint, static showpos = std::_S_showpos, static skipws = std::_S_skipws, static unitbuf = std::_S_unitbuf, static uppercase = std::_S_uppercase, static adjustfield = std::_S_adjustfield, static basefield = std::_S_basefield, static floatfield = std::_S_floatfield, static badbit = std::_S_badbit, static eofbit = std::_S_eofbit, static failbit = std::_S_failbit, static goodbit = std::_S_goodbit, static app = std::_S_app, static ate = std::_S_ate, static binary = std::_S_bin, static in = std::_S_in, static out = std::_S_out, static trunc = std::_S_trunc, static beg = std::_S_beg, static cur = std::_S_cur, static end = std::_S_end, _M_precision = 6, _M_width = 0, _M_flags = 4098, _M_exception = std::_S_goodbit, _M_streambuf_state = std::_S_goodbit, _M_callbacks = 0x0, _M_word_zero = {_M_pword = 0x0, _M_iword = 0}, _M_local_word = {{_M_pword = 0x0, _M_iword = 0}, {_M_pword = 0x0, _M_iword = 0}, {_M_pword = 0x0, _M_iword = 0}, {_M_pword = 0x0, _M_iword = 0}, {_M_pword = 0x0, _M_iword = 0}, {_M_pword = 0x0, _M_iword = 0}, {_M_pword = 0x0, _M_iword = 0}, {_M_pword = 0x0, _M_iword = 0}}, _M_word_size = 8, _M_word = 0x7fa784a5c2e8, _M_ios_locale = { static none = 0, static ctype = 1, static numeric = 2, static collate = 4, static time = 8, static monetary = 16, static messages = 32, static all = 63, _M_impl = 0x7fa8f2d5c1c0 <(anonymous namespace)::c_locale_impl>, static _S_classic = 0x7fa8f2d5c1c0 <(anonymous namespace)::c_locale_impl>, static _S_global = 0x7fa8f2d5c1c0 <(anonymous namespace)::c_locale_impl>, static _S_categories = 0x7fa8f2d3eac0 <__gnu_cxx::category_names>, static _S_once = 2}}, _M_tie = 0x0, _M_fill = 0 '\000', _M_fill_init = false , _M_streambuf = 0x7fa784a5c258, _M_ctype = 0x7fa8f2d5bd00 <(anonymous namespace)::ctype_c>, _M_num_put = 0x7fa8f2d5bc70 <(anonymous namespace)::num_put_c>, _M_num_get = 0x7fa8f2d5bc80 <(anonymous namespace)::num_get_c>}, _vptr.basic_istream = 0x7fa8f2d42c18 <vtable for std::basic_stringstream< char , std::char_traits< char >, std::allocator< char > >+24>, _M_gcount = 0}, <std::basic_ostream< char , std::char_traits< char > >> = {_vptr.basic_ostream = 0x7fa8f2d42c40 <vtable for std::basic_stringstream< char , std::char_traits< char >, std::allocator< char > >+64>}, <No data fields>}, _M_stringbuf = {<std::basic_streambuf< char , std::char_traits< char > >> = {_vptr.basic_streambuf = 0x7fa8f2d42790 <vtable for std::basic_stringbuf< char , std::char_traits< char >, std::allocator< char > >+16>, _M_in_beg = 0x7fa385438918 "Query id c4404c4ece32dd6:c5fafb8100000000 not found.7KgUF8/uYnle5/U/hEioiGhEaKhoRIqKhEqIl8gICQmJlFBJUQmRCDmkiET4f56Z3WV2Zp5hZlkRaBYXmdszzzzX7/fzvJ9nvNornT29/xaBG1ZhE4P1U7VqhXKpQrpUqVHgSqUiyFe9VIJrZSqF" ..., _M_in_cur = 0x7fa385438918 "Query id c4404c4ece32dd6:c5fafb8100000000 not found.7KgUF8/uYnle5/U/hEioiGhEaKhoRIqKhEqIl8gICQmJlFBJUQmRCDmkiET4f56Z3WV2Zp5hZlkRaBYXmdszzzzX7/fzvJ9nvNornT29/xaBG1ZhE4P1U7VqhXKpQrpUqVHgSqUiyFe9VIJrZSqF" ..., _M_in_end = 0x7fa385438919 "uery id c4404c4ece32dd6:c5fafb8100000000 not found.7KgUF8/uYnle5/U/hEioiGhEaKhoRIqKhEqIl8gICQmJlFBJUQmRCDmkiET4f56Z3WV2Zp5hZlkRaBYXmdszzzzX7/fzvJ9nvNornT29/xaBG1ZhE4P1U7VqhXKpQrpUqVHgSqUiyFe9VIJrZSqFx" ..., _M_out_beg = 0x7fa385438918 "Query id c4404c4ece32dd6:c5fafb8100000000 not found.7KgUF8/uYnle5/U/hEioiGhEaKhoRIqKhEqIl8gICQmJlFBJUQmRCDmkiET4f56Z3WV2Zp5hZlkRaBYXmdszzzzX7/fzvJ9nvNornT29/xaBG1ZhE4P1U7VqhXKpQrpUqVHgSqUiyFe9VIJrZSqF" ..., _M_out_cur = 0x7fa38543894c "7KgUF8/uYnle5/U/hEioiGhEaKhoRIqKhEqIl8gICQmJlFBJUQmRCDmkiET4f56Z3WV2Zp5hZlkRaBYXmdszzzzX7/fzvJ9nvNornT29/xaBG1ZhE4P1U7VqhXKpQrpUqVHgSqUiyFe9VIJrZSqFxPyZNM7Ny9tz9OwwXZApeCW+MNykC1kYHIqPc/dwcvIcYrvt4Tx6" ..., _M_out_end = 0x7fa385438b18 "", _M_buf_locale = { static none = 0, static ctype = 1, static numeric = 2, static collate = 4, static time = 8, static monetary = 16, static messages = 32, static all = 63, _M_impl = 0x7fa8f2d5c1c0 <(anonymous namespace)::c_locale_impl>, static _S_classic = 0x7fa8f2d5c1c0 <(anonymous namespace)::c_locale_impl>, static _S_global = 0x7fa8f2d5c1c0 <(anonymous namespace)::c_locale_impl>, static _S_categories = 0x7fa8f2d3eac0 <__gnu_cxx::category_names>, static _S_once = 2}}, _M_mode = (std::_S_in | std::_S_out), _M_string = { static npos = <optimized out>, _M_dataplus = {<std::allocator< char >> = {<__gnu_cxx::new_allocator< char >> = {<No data fields>}, <No data fields>}, _M_p = 0x7fa385438918 " Query id c4404c4ece32dd6:c5fafb8100000000 not found.7KgUF8/uYnle5/U/hEioiGhEaKhoRIqKhEqIl8gICQmJlFBJUQmRCDmkiET4f56Z3WV2Zp5hZlkRaBYXmdszzzzX7/fzvJ9nvNornT29/xaBG1ZhE4P1U7VqhXKpQrpUqVHgSqUiyFe9VIJrZSqF"...}}}} l = {m = @0x1131b520} query_record = <optimized out> #0 0x00007fa8f23241bd in __lll_lock_wait () from /lib64/libpthread.so.0 No symbol table info available. #1 0x00007fa8f231fd02 in _L_lock_791 () from /lib64/libpthread.so.0 No symbol table info available. #2 0x00007fa8f231fc08 in pthread_mutex_lock () from /lib64/libpthread.so.0 No symbol table info available. #3 0x0000000000820a48 in pthread_mutex_lock (m=0x1131b520) at /data/jenkins/workspace/impala- private -build-binaries/repos/Impala/toolchain/boost-1.57.0-p2/include/boost/thread/pthread/mutex.hpp:62 ret = <optimized out> #4 boost::mutex::lock ( this = this @entry=0x1131b520) at /data/jenkins/workspace/impala- private -build-binaries/repos/Impala/toolchain/boost-1.57.0-p2/include/boost/thread/pthread/mutex.hpp:116 res = <optimized out> #5 0x0000000000a95db0 in lock_guard (m_=..., this =<synthetic pointer>) at /data/jenkins/workspace/impala- private -build-binaries/repos/Impala/toolchain/boost-1.57.0-p2/include/boost/thread/lock_guard.hpp:38No locals. #6 impala::ImpalaServer::GetExecSummary ( this = this @entry=0x1131b500, query_id=..., user=..., result=result@entry=0x7fa7c52dd758) at /data/jenkins/workspace/impala- private -build-binaries/repos/Impala/be/src/service/impala-server.cc:662 l = {m = @0x1131b520} query_record = <optimized out>
          Hide
          mjacobs Matthew Jacobs added a comment -

          Thanks for explaining, Mostafa Mokhtar.

          So to summarize, there are 3 known issues to fix (ASAP):

          1) In ImpalaServer::GetExecSummary(), do not create Status() while holding the lock.
          2) Change error status in impala::DeserializeThriftMsg<>() to use Status::Expected
          3) PartitionedAggregationNode::CodegenUpdateTuple() handling for unsupported types (e.g. char) should use Status::Expected

          Then separately (less time critical) we should investigate how to improve our stack walking as these issues may come up again in the future.

          Does that seem reasonable Mostafa Mokhtar?

          Show
          mjacobs Matthew Jacobs added a comment - Thanks for explaining, Mostafa Mokhtar . So to summarize, there are 3 known issues to fix (ASAP): 1) In ImpalaServer::GetExecSummary(), do not create Status() while holding the lock. 2) Change error status in impala::DeserializeThriftMsg<>() to use Status::Expected 3) PartitionedAggregationNode::CodegenUpdateTuple() handling for unsupported types (e.g. char) should use Status::Expected Then separately (less time critical) we should investigate how to improve our stack walking as these issues may come up again in the future. Does that seem reasonable Mostafa Mokhtar ?
          Hide
          mjacobs Matthew Jacobs added a comment -

          Ok, I've updated the summary to be more clear. I filed IMPALA-5651 to track the general issue of investigating a cheaper way to print stack traces.

          Show
          mjacobs Matthew Jacobs added a comment - Ok, I've updated the summary to be more clear. I filed IMPALA-5651 to track the general issue of investigating a cheaper way to print stack traces.
          Hide
          bikramjeet.vig Bikramjeet Vig added a comment -

          Like in PartitionedAggregationNode::CodegenUpdateTuple, if printing stacktrace is a problem in codegen path, should this also be updated to use Status::Expected() ?

          Status HdfsScanner::CodegenWriteCompleteTuple(HdfsScanNodeBase* node,
              LlvmCodeGen* codegen, const vector<ScalarExpr*>& conjuncts,
              Function** write_complete_tuple_fn) {
            *write_complete_tuple_fn = NULL;
            SCOPED_TIMER(codegen->codegen_timer());
            RuntimeState* state = node->runtime_state();
          
            // TODO: Timestamp is not yet supported
            for (int i = 0; i < node->materialized_slots().size(); ++i) {
              SlotDescriptor* slot_desc = node->materialized_slots()[i];
              if (slot_desc->type().type == TYPE_TIMESTAMP) {
                return Status("Timestamp not yet supported for codegen.");
              }
              if (slot_desc->type().type == TYPE_DECIMAL) {
                return Status("Decimal not yet supported for codegen.");
              }
            }
          
          Show
          bikramjeet.vig Bikramjeet Vig added a comment - Like in PartitionedAggregationNode::CodegenUpdateTuple, if printing stacktrace is a problem in codegen path, should this also be updated to use Status::Expected() ? Status HdfsScanner::CodegenWriteCompleteTuple(HdfsScanNodeBase* node, LlvmCodeGen* codegen, const vector<ScalarExpr*>& conjuncts, Function** write_complete_tuple_fn) { *write_complete_tuple_fn = NULL; SCOPED_TIMER(codegen->codegen_timer()); RuntimeState* state = node->runtime_state(); // TODO: Timestamp is not yet supported for ( int i = 0; i < node->materialized_slots().size(); ++i) { SlotDescriptor* slot_desc = node->materialized_slots()[i]; if (slot_desc->type().type == TYPE_TIMESTAMP) { return Status( "Timestamp not yet supported for codegen." ); } if (slot_desc->type().type == TYPE_DECIMAL) { return Status( "Decimal not yet supported for codegen." ); } }
          Hide
          mjacobs Matthew Jacobs added a comment -

          Yup we can change the HdfsScanner cases as well.

          Show
          mjacobs Matthew Jacobs added a comment - Yup we can change the HdfsScanner cases as well.
          Hide
          srus@cloudera.com Silvius Rus added a comment -

          Bikramjeet Vig, Matthew Jacobs, can this issue be resolved?

          Show
          srus@cloudera.com Silvius Rus added a comment - Bikramjeet Vig , Matthew Jacobs , can this issue be resolved?

            People

            • Assignee:
              bikramjeet.vig Bikramjeet Vig
              Reporter:
              mmokhtar Mostafa Mokhtar
            • Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

              • Created:
                Updated:

                Development