IMPALA-4026: Investigate regression introduced by "IMPALA-3629: Codegen TransferScratchTuples"


      Description

      IMPALA-3629 seems to have introduced a regression in https://github.com/apache/incubator-impala/blob/master/testdata/workloads/targeted-perf/queries/primitive_filter_bigint_non_selective.test.
      Interestingly, of the six scan primitives, this is the only one that regressed.

      Scan node with change

        HDFS_SCAN_NODE (id=0):(Total: 1s173ms, non-child: 1s173ms, % non-child: 100.00%)
                ExecOption: Expr Evaluation Codegen Enabled, PARQUET Codegen Enabled, Codegen enabled: 18 out of 18
                Hdfs split stats (<volume id>:<# splits>/<split lengths>): 8:1/253.41 MB 4:2/506.80 MB 5:2/506.78 MB 7:1/253.41 MB 3:2/506.77 MB 0:2/506.77 MB 2:2/506.79 MB 1:2/506.82 MB 10:1/253.40 MB 9:2/506.79 MB 6:1/253.39 MB 
                Hdfs Read Thread Concurrency Bucket: 0:100% 1:0% 2:0% 3:0% 4:0% 5:0% 6:0% 7:0% 8:0% 9:0% 10:0% 11:0% 12:0% 13:0% 14:0% 
                File Formats: PARQUET/SNAPPY:18 
                BytesRead(500.000ms): 116.63 MB, 116.63 MB, 190.83 MB
                 - AverageHdfsReadThreadConcurrency: 0.00 
                 - AverageScannerThreadConcurrency: 9.67 
                 - BytesRead: 190.83 MB (200101412)
                 - BytesReadDataNodeCache: 0
                 - BytesReadLocal: 190.83 MB (200101412)
                 - BytesReadRemoteUnexpected: 0
                 - BytesReadShortCircuit: 190.83 MB (200101412)
                 - DecompressionTime: 1s618ms
                 - MaxCompressedTextFileLength: 0
                 - NumColumns: 1 (1)
                 - NumDisksAccessed: 11 (11)
                 - NumRowGroups: 18 (18)
                 - NumScannerThreadsStarted: 11 (11)
                 - PeakMemoryUsage: 141.30 MB (148162336)
                 - PerReadThreadRawHdfsThroughput: 1.22 GB/sec
                 - RemoteScanRanges: 0 (0)
                 - RowsRead: 124.58M (124581309)
                 - RowsReturned: 124.58M (124581309)
                 - RowsReturnedRate: 106.17 M/sec
                 - ScanRangesComplete: 18 (18)
                 - ScannerThreadsInvoluntaryContextSwitches: 1.39K (1386)
                 - ScannerThreadsTotalWallClockTime: 14s246ms
                   - MaterializeTupleTime(*): 2s804ms
                   - ScannerThreadsSysTime: 650.896ms
                   - ScannerThreadsUserTime: 5s693ms
                 - ScannerThreadsVoluntaryContextSwitches: 124.53K (124533)
                 - TotalRawHdfsReadTime(*): 152.965ms
                 - TotalReadThroughput: 127.22 MB/sec
      

      Scan node without change

       HDFS_SCAN_NODE (id=0):(Total: 559.157ms, non-child: 559.157ms, % non-child: 100.00%)
                ExecOption: Expr Evaluation Codegen Enabled, Codegen enabled: 0 out of 17
                Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1/253.39 MB 6:1/253.39 MB 10:1/253.38 MB 9:2/506.79 MB 7:2/506.79 MB 8:1/253.39 MB 3:2/506.77 MB 5:1/253.40 MB 1:3/760.22 MB 4:3/760.15 MB 
                Hdfs Read Thread Concurrency Bucket: 0:100% 1:0% 2:0% 3:0% 4:0% 5:0% 6:0% 7:0% 8:0% 9:0% 10:0% 11:0% 12:0% 13:0% 14:0% 
                File Formats: PARQUET/SNAPPY:17 
                BytesRead(500.000ms): 116.63 MB
                 - AverageHdfsReadThreadConcurrency: 0.00 
                 - AverageScannerThreadConcurrency: 11.00 
                 - BytesRead: 180.22 MB (188973810)
                 - BytesReadDataNodeCache: 0
                 - BytesReadLocal: 180.22 MB (188973810)
                 - BytesReadRemoteUnexpected: 0
                 - BytesReadShortCircuit: 180.22 MB (188973810)
                 - DecompressionTime: 1s011ms
                 - MaxCompressedTextFileLength: 0
                 - NumColumns: 1 (1)
                 - NumDisksAccessed: 10 (10)
                 - NumRowGroups: 17 (17)
                 - NumScannerThreadsStarted: 11 (11)
                 - PeakMemoryUsage: 143.67 MB (150652704)
                 - PerReadThreadRawHdfsThroughput: 1.45 GB/sec
                 - RemoteScanRanges: 0 (0)
                 - RowsRead: 117.66M (117660198)
                 - RowsReturned: 117.66M (117660198)
                 - RowsReturnedRate: 210.42 M/sec
                 - ScanRangesComplete: 17 (17)
                 - ScannerThreadsInvoluntaryContextSwitches: 746 (746)
                 - ScannerThreadsTotalWallClockTime: 7s055ms
                   - MaterializeTupleTime(*): 3s150ms
                   - ScannerThreadsSysTime: 202.962ms
                   - ScannerThreadsUserTime: 4s693ms
                 - ScannerThreadsVoluntaryContextSwitches: 48.68K (48680)
                 - TotalRawHdfsReadTime(*): 121.070ms
                 - TotalReadThroughput: 233.25 MB/sec
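
      Comparing the two profiles: the scan node total roughly doubled (559.157ms to
      1s173ms), RowsReturnedRate halved (210.42 M/sec to 106.17 M/sec), and
      ScannerThreadsVoluntaryContextSwitches grew about 2.6x (48.68K to 124.53K),
      even though bytes read and rows returned are nearly identical. The jump in
      voluntary context switches suggests the scanner threads are blocking on a
      shared resource, consistent with the queue-contention analysis in the
      comments below.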
      
      Attachments

      1. primitive_filter_bigint_non_selective.txt (237 kB), Jim Apple
      2. benchmark_report_full.txt (47 kB), Michael Ho
      3. filter_bigint_non_selective_post_1137_IMPALA-3629_3x_1.zip (2.69 MB), Mostafa Mokhtar
      4. primitive_filter_bigint_non_selective.txt (234 kB), Mostafa Mokhtar


          Activity

          Michael Ho added a comment -

          https://github.com/apache/incubator-impala/commit/705045021ed1f53cdfeabf0dcc01b76144e54855

          IMPALA-4026: Implement double-buffering for BlockingQueue

          With recent changes to improve the parquet scanner's efficiency,
          row batches are produced more quickly, leading to higher contention
          in the blocking queue shared between scanner threads and the scan
          node. The contention happens between different producers (i.e.
          the scanner threads) and also, to a lesser extent, between the
          scanner threads and the scan node.

          This change addresses the contention between the scanner threads
          and the scan node by splitting the queue into a 'get_list_' and
          a 'put_list_'. The consumers will consume from 'get_list_' until
          it's exhausted while the producers will enqueue into 'put_list_'
          until it's full. When 'get_list_' is exhausted, the consumer will
          atomically swap the 'get_list_' with 'put_list_'. This reduces
          the contention: 'get_list_' and 'put_list_' are protected by two
          different locks, so callers of BlockingGet() only contend for the
          'put_lock_' when 'put_list_' is empty.

          With this change, primitive_filter_bigint_non_selective improves
          by 33.9%, going from 1.60s to 1.06s.

          Change-Id: Ib9f4cf351455efefb0f3bb791cf9bc82d1421d54
          Reviewed-on: http://gerrit.cloudera.org:8080/4350
          Reviewed-by: Michael Ho <kwho@cloudera.com>
          Tested-by: Internal Jenkins
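
          A minimal sketch of the double-buffering scheme described above, reusing
          the 'get_list_'/'put_list_' names from the commit message; everything else
          is illustrative and not Impala's actual BlockingQueue, which also handles
          shutdown, timeouts, and memory accounting:

          #include <condition_variable>
          #include <cstddef>
          #include <deque>
          #include <mutex>
          #include <utility>

          template <typename T>
          class DoubleBufferedBlockingQueue {
           public:
            explicit DoubleBufferedBlockingQueue(std::size_t capacity)
                : capacity_(capacity) {}

            // Producers (scanner threads) touch only 'put_lock_' and 'put_list_'.
            void BlockingPut(T item) {
              std::unique_lock<std::mutex> put_lock(put_lock_);
              put_has_room_.wait(put_lock,
                                 [this] { return put_list_.size() < capacity_; });
              put_list_.push_back(std::move(item));
              put_has_data_.notify_one();  // wake a consumer waiting for batches
            }

            // Consumers drain 'get_list_' under 'get_lock_' and only take
            // 'put_lock_' when 'get_list_' is exhausted, to swap the two lists.
            T BlockingGet() {
              std::unique_lock<std::mutex> get_lock(get_lock_);
              if (get_list_.empty()) {
                std::unique_lock<std::mutex> put_lock(put_lock_);
                put_has_data_.wait(put_lock, [this] { return !put_list_.empty(); });
                get_list_.swap(put_list_);   // the "atomic swap" described above
                put_has_room_.notify_all();  // producers get a whole empty buffer
              }
              T item = std::move(get_list_.front());
              get_list_.pop_front();
              return item;
            }

           private:
            const std::size_t capacity_;
            std::mutex get_lock_;                    // protects 'get_list_'
            std::mutex put_lock_;                    // protects 'put_list_'
            std::condition_variable put_has_room_;   // space freed in 'put_list_'
            std::condition_variable put_has_data_;   // data arrived in 'put_list_'
            std::deque<T> get_list_;                 // consumers drain this
            std::deque<T> put_list_;                 // producers fill this
          };

          A BlockingGet() that still finds batches in 'get_list_' never touches
          'put_lock_' at all, so in the common case the producers and the consumer
          run on disjoint locks.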

          Michael Ho added a comment -

          With the changes in https://github.com/michaelhkw/incubator-impala/tree/IMPALA-4026, the regression in targeted-perf goes away. In fact, the result is better than the prior baseline number (1.2s); see the attached benchmark_report.txt.

          However, TPCH-Q3 and TPCH-Q4 regressed. Based on the VTune profiles, memset() in InitTuple() somehow takes a lot more time than before. Still investigating.
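
          For context, InitTuple() zeroes a tuple's memory (null-indicator bits and
          slot values) before the scanner materializes into it; a hedged sketch of
          the general pattern, not necessarily Impala's exact code:

          #include <cstring>

          struct Tuple;  // opaque row-tuple type

          // Clear 'size' bytes so null bits and slots start out zeroed; this is
          // the memset() that shows up in the VTune profiles.
          inline void InitTuple(Tuple* tuple, int size) {
            std::memset(tuple, 0, size);
          }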

          Mostafa Mokhtar added a comment -

          At this point it might make sense to look into enhancing the BlockingQueue.

          Mostafa Mokhtar added a comment -

          It appears that making the scanner code faster by codegen'ing ExecNode::EvalConjuncts causes heavy write-write as well as read-write contention on BlockingQueue<impala::RowBatch*>.

          I also compared performance before and after IMPALA-3629 with NUM_SCANNER_THREADS=1; with IMPALA-3629, performance was 40% better.

          To address this regression, BlockingQueue needs to scale better with the number of writers.
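
          For contrast, a conventional single-lock blocking queue in the spirit of
          the pre-IMPALA-4026 BlockingQueue (an illustrative sketch, not the real
          code): every BlockingPut() and BlockingGet() takes the same mutex, so the
          scanner threads contend with each other and with the scan node on every
          row batch, matching the pthread_cond_wait/pthread_cond_signal hot spots
          below.

          #include <condition_variable>
          #include <cstddef>
          #include <deque>
          #include <mutex>
          #include <utility>

          template <typename T>
          class SingleLockBlockingQueue {
           public:
            explicit SingleLockBlockingQueue(std::size_t capacity)
                : capacity_(capacity) {}

            void BlockingPut(T item) {
              std::unique_lock<std::mutex> lock(lock_);  // shared with readers
              not_full_.wait(lock, [this] { return list_.size() < capacity_; });
              list_.push_back(std::move(item));
              not_empty_.notify_one();  // wakes the consumer
            }

            T BlockingGet() {
              std::unique_lock<std::mutex> lock(lock_);  // shared with writers
              not_empty_.wait(lock, [this] { return !list_.empty(); });
              T item = std::move(list_.front());
              list_.pop_front();
              not_full_.notify_one();  // wakes one blocked producer
              return item;
            }

           private:
            const std::size_t capacity_;
            std::mutex lock_;  // the single point of contention
            std::condition_variable not_full_;
            std::condition_variable not_empty_;
            std::deque<T> list_;
          };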

          The increase in CPU time comes from these methods:

          Function / Call Stack    CPU Time   Module
          pthread_cond_wait        3.260s     libpthread.so.0
          pthread_cond_signal      3.246s     libpthread.so.0
          pthread_mutex_unlock     3.173s     impalad
          pthread_mutex_lock       1.798s     impalad
          _L_unlock_657            1.488s     libpthread.so.0

          Write-write stack

          CPU Time
          1 of 5: 98.5% (3.210s of 3.260s)
          
          libpthread.so.0 ! pthread_cond_wait - [unknown source file]
          impalad ! boost::condition_variable::wait + 0x52 - condition_variable.hpp:73
          impalad ! impala::BlockingQueue<impala::RowBatch*>::BlockingPut + 0x14a - blocking-queue.h:80
          impalad ! impala::ExecNode::RowBatchQueue::AddBatch + 0x48 - exec-node.cc:83
          impalad ! impala::HdfsParquetScanner::ProcessSplit + 0x55 - hdfs-parquet-scanner.cc:315
          impalad ! impala::HdfsScanNode::ProcessSplit + 0x375 - hdfs-scan-node.cc:1249
          impalad ! impala::HdfsScanNode::ScannerThread + 0xcad - hdfs-scan-node.cc:1109
          impalad ! boost::function0<void>::operator() + 0x1a - function_template.hpp:767
          impalad ! impala::Thread::SuperviseThread + 0x20e - thread.cc:318
          impalad ! operator()<void (*)(const std::basic_string<char>&, const std::basic_string<char>&, boost::function<void()>, impala::Promise<long int>*), boost::_bi::list0> + 0x5a - bind.hpp:457
          impalad ! boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void (void)>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void (void)>>, boost::_bi::value<impala::Promise<long>*>>>::operator() - bind_template.hpp:20
          impalad ! boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void (void)>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void (void)>>, boost::_bi::value<impala::Promise<long>*>>>>::run + 0x19 - thread.hpp:116
          impalad ! thread_proxy + 0xd9 - [unknown source file]
          libpthread.so.0 ! start_thread + 0xd0 - [unknown source file]
          libc.so.6 ! clone + 0x6c - [unknown source file]
          

          And the read-write stack

          CPU Time
          1 of 1: 100.0% (3.086s of 3.086s)
          
          libpthread.so.0 ! pthread_cond_signal - [unknown source file]
          impalad ! boost::condition_variable::notify_one + 0x17 - condition_variable.hpp:134
          impalad ! impala::BlockingQueue<impala::RowBatch*>::BlockingGet + 0x129 - blocking-queue.h:61
          impalad ! impala::ExecNode::RowBatchQueue::GetBatch + 0x4 - exec-node.cc:96
          impalad ! impala::HdfsScanNode::GetNextInternal + 0xaf - hdfs-scan-node.cc:298
          impalad ! impala::HdfsScanNode::GetNext + 0x477 - hdfs-scan-node.cc:280
          impalad ! impala::PartitionedAggregationNode::Open + 0x29d - partitioned-aggregation-node.cc:313
          impalad ! impala::PlanFragmentExecutor::OpenInternal + 0xbc - plan-fragment-executor.cc:371
          impalad ! impala::PlanFragmentExecutor::Open + 0x273 - plan-fragment-executor.cc:344
          impalad ! impala::FragmentMgr::FragmentExecState::Exec + 0x17 - fragment-exec-state.cc:58
          impalad ! impala::FragmentMgr::FragmentThread + 0x331 - fragment-mgr.cc:90
          impalad ! boost::_mfi::mf1<void, impala::FragmentMgr, impala::TUniqueId>::operator() + 0x42 - mem_fn_template.hpp:165
          impalad ! operator()<boost::_mfi::mf1<void, impala::FragmentMgr, impala::TUniqueId>, boost::_bi::list0> - bind.hpp:313
          impalad ! boost::_bi::bind_t<void, boost::_mfi::mf1<void, impala::FragmentMgr, impala::TUniqueId>, boost::_bi::list2<boost::_bi::value<impala::FragmentMgr*>, boost::_bi::value<impala::TUniqueId>>>::operator() - bind_template.hpp:20
          impalad ! boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<void, boost::_mfi::mf1<void, impala::FragmentMgr, impala::TUniqueId>, boost::_bi::list2<boost::_bi::value<impala::FragmentMgr*>, boost::_bi::value<impala::TUniqueId>>>, void>::invoke + 0x7 - function_template.hpp:153
          impalad ! boost::function0<void>::operator() + 0x1a - function_template.hpp:767
          impalad ! impala::Thread::SuperviseThread + 0x20e - thread.cc:318
          impalad ! operator()<void (*)(const std::basic_string<char>&, const std::basic_string<char>&, boost::function<void()>, impala::Promise<long int>*), boost::_bi::list0> + 0x5a - bind.hpp:457
          impalad ! boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void (void)>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void (void)>>, boost::_bi::value<impala::Promise<long>*>>>::operator() - bind_template.hpp:20
          impalad ! boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void (void)>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void (void)>>, boost::_bi::value<impala::Promise<long>*>>>>::run + 0x19 - thread.hpp:116
          impalad ! thread_proxy + 0xd9 - [unknown source file]
          libpthread.so.0 ! start_thread + 0xd0 - [unknown source file]
          libc.so.6 ! clone + 0x6c - [unknown source file]
          
          Dan Hecht added a comment -

          Mostafa Mokhtar, any update on this one? What else do you plan to do before we resolve this as either not a regression or something we need to fix?

          Mostafa Mokhtar added a comment -

          The regression didn't reproduce locally.


            People

              Assignee: Michael Ho
              Reporter: Mostafa Mokhtar