IMPALA-4923

Operators running on top of selective Hdfs scan nodes spend a lot of time calling impala::MemPool::FreeAll on empty batches

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Impala 2.6.0
    • Fix Version/s: Impala 2.9.0
    • Component/s: Backend
    • Labels:
      None

      Description

      Operators that are executed after a highly selective scan node spend a lot of time calling impala::MemPool::FreeAll on row batches with all rows filtered out.

      So even if an operator ends up processing 0 rows, it still has to free the memory allocated for the empty batches created by the HdfsScanNode.

      https://github.com/apache/incubator-impala/blob/2.7.0/be/src/runtime/row-batch.cc#L317

      Should try using Clear() and investigate the repercussions.
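
      For context, a minimal sketch of the two strategies, assuming a simplified chunk-based pool (the real code is in be/src/runtime/mem-pool.{h,cc}; SimpleMemPool and its members are illustrative, not Impala's actual implementation):

      // Simplified sketch, not the real MemPool. FreeAll() hands every chunk
      // back to the allocator; Clear() keeps the chunks so the next batch can
      // reuse the memory without going through TCMalloc again.
      #include <cstdint>
      #include <cstdlib>
      #include <vector>

      struct Chunk {
        uint8_t* data;
        size_t size;
      };

      class SimpleMemPool {
       public:
        void FreeAll() {
          for (Chunk& c : chunks_) free(c.data);  // every chunk goes back to the allocator
          chunks_.clear();
          current_chunk_ = 0;
          current_offset_ = 0;
        }

        void Clear() {
          current_chunk_ = 0;   // allocations restart at the first cached chunk
          current_offset_ = 0;  // existing chunk memory is reused, nothing is freed
        }

       private:
        std::vector<Chunk> chunks_;
        size_t current_chunk_ = 0;
        size_t current_offset_ = 0;
      };

      As the stack trace below shows, FreeAll() is reached via RowBatch::Reset() once per (mostly empty) batch, which is where the madvise/TCMalloc time goes.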

      Repro query

      select l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_comment from lineitem where l_orderkey=0 group by l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_comment order by l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_comment limit 10
      
      +---------------------+--------+----------+----------+-------+------------+-----------+---------------+---------------------------+
      | Operator            | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                    |
      +---------------------+--------+----------+----------+-------+------------+-----------+---------------+---------------------------+
      | 05:MERGING-EXCHANGE | 1      | 34.56us  | 34.56us  | 0     | 4          | 0 B       | -1 B          | UNPARTITIONED             |
      | 02:TOP-N            | 7      | 388.39us | 1.71ms   | 0     | 4          | 12.00 KB  | 986 B         |                           |
      | 04:AGGREGATE        | 7      | 3.72ms   | 9.59ms   | 0     | 4          | 2.45 MB   | 10.00 MB      | FINALIZE                  |
      | 03:EXCHANGE         | 7      | 6.88us   | 8.15us   | 0     | 4          | 0 B       | 0             |                           |
      | 01:AGGREGATE        | 7      | 8.42s    | 9.10s    | 0     | 4          | 10.14 MB  | 10.00 MB      | STREAMING                 |
      | 00:SCAN HDFS        | 7      | 34.07s   | 37.75s   | 0     | 4          | 466.98 MB | 176.00 MB     | tpch_300_parquet.lineitem |
      +---------------------+--------+----------+----------+-------+------------+-----------+---------------+---------------------------+
      
      CPU Time
      1 of 27: 74.4% (5.990s of 8.050s)
      
      libc.so.6 ! madvise - [unknown source file]
      impalad ! TCMalloc_SystemRelease + 0x79 - [unknown source file]
      impalad ! tcmalloc::PageHeap::DecommitSpan + 0x20 - [unknown source file]
      impalad ! tcmalloc::PageHeap::MergeIntoFreeList + 0x212 - [unknown source file]
      impalad ! tcmalloc::PageHeap::Delete + 0x23 - [unknown source file]
      impalad ! tcmalloc::CentralFreeList::ReleaseToSpans + 0x10f - [unknown source file]
      impalad ! tcmalloc::CentralFreeList::ReleaseListToSpans + 0x1a - [unknown source file]
      impalad ! tcmalloc::CentralFreeList::InsertRange + 0x3b - [unknown source file]
      impalad ! tcmalloc::ThreadCache::ReleaseToCentralCache + 0x103 - [unknown source file]
      impalad ! tcmalloc::ThreadCache::Scavenge + 0x3e - [unknown source file]
      impalad ! operator delete + 0x329 - [unknown source file]
      impalad ! impala::MemPool::FreeAll + 0x59 - mem-pool.cc:90
      impalad ! impala::RowBatch::Reset + 0x2c - row-batch.cc:312
      impalad ! impala::PartitionedAggregationNode::GetRowsStreaming + 0x1af - partitioned-aggregation-node.cc:588
      impalad ! impala::PartitionedAggregationNode::GetNextInternal + 0x260 - partitioned-aggregation-node.cc:451
      impalad ! impala::PartitionedAggregationNode::GetNext + 0x21 - partitioned-aggregation-node.cc:376
      impalad ! impala::PlanFragmentExecutor::ExecInternal + 0x192 - plan-fragment-executor.cc:361
      impalad ! impala::PlanFragmentExecutor::Exec + 0x17e - plan-fragment-executor.cc:339
      impalad ! impala::FragmentMgr::FragmentExecState::Exec + 0xdf - fragment-exec-state.cc:54
      impalad ! impala::FragmentMgr::FragmentThread + 0x39 - fragment-mgr.cc:86
      impalad ! boost::_mfi::mf1<void, impala::FragmentMgr, impala::TUniqueId>::operator() + 0x42 - mem_fn_template.hpp:165
      impalad ! operator()<boost::_mfi::mf1<void, impala::FragmentMgr, impala::TUniqueId>, boost::_bi::list0> - bind.hpp:313
      impalad ! boost::_bi::bind_t<void, boost::_mfi::mf1<void, impala::FragmentMgr, impala::TUniqueId>, boost::_bi::list2<boost::_bi::value<impala::FragmentMgr*>, boost::_bi::value<impala::TUniqueId>>>::operator() - bind_template.hpp:20
      impalad ! boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<void, boost::_mfi::mf1<void, impala::FragmentMgr, impala::TUniqueId>, boost::_bi::list2<boost::_bi::value<impala::FragmentMgr*>, boost::_bi::value<impala::TUniqueId>>>, void>::invoke + 0x7 - function_template.hpp:153
      impalad ! boost::function0<void>::operator() + 0x1a - function_template.hpp:767
      impalad ! impala::Thread::SuperviseThread + 0x20e - thread.cc:317
      impalad ! operator()<void (*)(const std::basic_string<char>&, const std::basic_string<char>&, boost::function<void()>, impala::Promise<long int>*), boost::_bi::list0> + 0x5a - bind.hpp:457
      impalad ! boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void (void)>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void (void)>>, boost::_bi::value<impala::Promise<long>*>>>::operator() - bind_template.hpp:20
      impalad ! boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void (void)>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void (void)>>, boost::_bi::value<impala::Promise<long>*>>>>::run + 0x19 - thread.hpp:116
      impalad ! thread_proxy + 0xd9 - [unknown source file]
      libpthread.so.0 ! start_thread + 0xd0 - [unknown source file]
      libc.so.6 ! clone + 0x6c - [unknown source file]
      


          Activity

          mmokhtar Mostafa Mokhtar added a comment -

          I tried Clear() vs. FreeAll() and the memory gets accumulated in the Agg node:

          Operator              #Hosts   Avg Time   Max Time  #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail                         
          --------------------------------------------------------------------------------------------------------------------------------
          05:MERGING-EXCHANGE        1  253.770us  253.770us      0           4          0        -1.00 B  UNPARTITIONED                  
          02:TOP-N                  14  138.455us  396.191us      0           4   12.00 KB       986.00 B                                 
          04:AGGREGATE              14    2.416ms    3.128ms      0           4    2.45 MB       10.00 MB  FINALIZE                       
          03:EXCHANGE               14    8.099us    9.408us      0           4          0              0  HASH(l_orderkey,l_partkey,l... 
          01:AGGREGATE              14    3s780ms    4s350ms      0           4   30.16 GB       10.00 MB  STREAMING                      
          00:SCAN HDFS              14   27s310ms   28s538ms      0           4  454.83 MB      176.00 MB  tpch_300_parquet.lineitem
          

          Also the timing in the agg node is still higher than usual, due to https://github.com/apache/incubator-impala/blob/master/be/src/exec/partitioned-aggregation-node.cc#L593

          CPU Time
          1 of 9: 94.5% (3.460s of 3.660s)
          
          libc.so.6 ! madvise - [unknown source file]
          impalad ! TCMalloc_SystemRelease + 0x79 - [unknown source file]
          impalad ! tcmalloc::PageHeap::DecommitSpan + 0x20 - [unknown source file]
          impalad ! tcmalloc::PageHeap::MergeIntoFreeList + 0x212 - [unknown source file]
          impalad ! tcmalloc::PageHeap::Delete + 0x23 - [unknown source file]
          impalad ! tcmalloc::CentralFreeList::ReleaseToSpans + 0x10f - [unknown source file]
          impalad ! tcmalloc::CentralFreeList::ReleaseListToSpans + 0x1a - [unknown source file]
          impalad ! tcmalloc::CentralFreeList::InsertRange + 0x3b - [unknown source file]
          impalad ! tcmalloc::ThreadCache::ReleaseToCentralCache + 0x103 - [unknown source file]
          impalad ! tcmalloc::ThreadCache::ListTooLong + 0x1b - [unknown source file]
          impalad ! operator delete + 0x3bf - [unknown source file]
          impalad ! impala::MemPool::FreeAll + 0x59 - mem-pool.cc:90
          impalad ! impala::RowBatch::~RowBatch + 0x19 - row-batch.cc:153
          impalad ! boost::checked_delete<impala::RowBatch> + 0xd - checked_delete.hpp:34
          impalad ! ~scoped_ptr + 0x4 - scoped_ptr.hpp:82
          impalad ! boost::scoped_ptr<impala::RowBatch>::reset + 0x12 - scoped_ptr.hpp:88
          impalad ! impala::PartitionedAggregationNode::GetRowsStreaming + 0x239 - partitioned-aggregation-node.cc:593
          impalad ! impala::PartitionedAggregationNode::GetNextInternal + 0x230 - partitioned-aggregation-node.cc:451
          impalad ! impala::PartitionedAggregationNode::GetNext + 0x21 - partitioned-aggregation-node.cc:376
          impalad ! impala::PlanFragmentExecutor::ExecInternal + 0x18c - plan-fragment-executor.cc:360
          impalad ! impala::PlanFragmentExecutor::Exec + 0x14c - plan-fragment-executor.cc:337
          impalad ! impala::FragmentInstanceState::Exec + 0xe7 - fragment-instance-state.cc:66
          impalad ! impala::QueryExecMgr::ExecFInstance + 0x1e - query-exec-mgr.cc:109
          impalad ! boost::function0<void>::operator() + 0x1a - function_template.hpp:767
          impalad ! impala::Thread::SuperviseThread + 0x209 - thread.cc:317
          impalad ! operator()<void (*)(const std::basic_string<char>&, const std::basic_string<char>&, boost::function<void()>, impala::Promise<long int>*), boost::_bi::list0> + 0x5a - bind.hpp:457
          impalad ! boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void (void)>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void (void)>>, boost::_bi::value<impala::Promise<long>*>>>::operator() - bind_template.hpp:20
          impalad ! boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void (void)>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void (void)>>, boost::_bi::value<impala::Promise<long>*>>>>::run + 0x19 - thread.hpp:116
          impalad ! thread_proxy + 0xd9 - [unknown source file]
          libpthread.so.0 ! start_thread + 0xd0 - [unknown source file]
          libc.so.6 ! clone + 0x6c - [unknown source file]
          
          tarmstrong Tim Armstrong added a comment -

          I think the issue is the interaction between AcquireData() and Clear() causing a lot of chunks to accumulate. We could probably adjust the behaviour of Clear() so that it cached some memory up to a threshold, but freed memory above that threshold.
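
          A rough sketch of what that could look like, extending the simplified SimpleMemPool sketch from the description (the function name and the threshold are made up for illustration):

          // Hypothetical thresholded Clear(): cache chunks for reuse up to a byte
          // limit and free everything beyond it.
          #include <cstddef>
          #include <cstdlib>
          #include <utility>
          #include <vector>

          void ClearWithThreshold(std::vector<Chunk>* chunks, size_t max_cached_bytes) {
            size_t kept_bytes = 0;
            std::vector<Chunk> retained;
            for (Chunk& c : *chunks) {
              if (kept_bytes + c.size <= max_cached_bytes) {
                kept_bytes += c.size;
                retained.push_back(c);  // keep this chunk around for the next batch
              } else {
                free(c.data);           // over the cache limit: return it to the allocator
              }
            }
            *chunks = std::move(retained);
          }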

          alex.behm Alexander Behm added a comment -

          Also, for our multi-threaded scans today, using Clear() will not help because the scanner threads populate data into a different mem pool, and when returning batches from the scan node that memory gets transferred. So having the consumer of the scan node call Clear() will only lead to accumulating more chunks (that will never be reused).
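
          To illustrate the problem: once ownership of the chunks moves into the consumer's pool, a consumer-side Clear() only keeps them alive, it never reuses them. Continuing the simplified sketch (MemPool::AcquireData() is the real method involved in the transfer, but this body is a simplification):

          // Transfer chunk ownership from the scanner-side pool to the pool backing
          // the returned row batch, roughly what happens when the scan node hands a
          // batch to its consumer.
          #include <vector>

          void AcquireData(std::vector<Chunk>* dst_chunks, std::vector<Chunk>* src_chunks) {
            // The consumer now owns chunks it never allocates from again, so calling
            // Clear() on its pool just accumulates them instead of reusing them.
            dst_chunks->insert(dst_chunks->end(), src_chunks->begin(), src_chunks->end());
            src_chunks->clear();
          }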

          tarmstrong Tim Armstrong added a comment -

          Alexander Behm ah that's true - it doesn't help for that case. So I think we're hitting the producer-consumer thread case in TCMalloc that renders the thread-local caches ineffective.

          The long-term fix would be to allocate MemPool chunks from BufferPool to avoid relying on TCMalloc. I'm not sure what the allocation pattern is like (it'd be interesting to see how large the chunks being freed are), but it might also be possible to allocate fewer MemPool chunks if we adjusted the MemPool sizing algorithm to start with a larger chunk size.
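
          For reference, a minimal sketch of the kind of sizing schedule being discussed, assuming a doubling growth policy (the constants are illustrative, not Impala's actual values):

          #include <algorithm>
          #include <cstddef>

          // Each new chunk doubles in size up to a cap. Starting larger means fewer
          // chunks per pool, so FreeAll() has fewer (but bigger) frees to do.
          size_t NextChunkSize(size_t last_chunk_size, size_t min_request) {
            constexpr size_t kInitialChunkSize = 64 * 1024;  // the tuning knob
            constexpr size_t kMaxChunkSize = 1024 * 1024;
            size_t next = last_chunk_size == 0
                              ? kInitialChunkSize
                              : std::min(last_chunk_size * 2, kMaxChunkSize);
            return std::max(next, min_request);
          }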

          tarmstrong Tim Armstrong added a comment -

          This may also be a symptom of the TCMalloc thread caches being too small. It's tweakable with TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES https://github.com/gperftools/gperftools/blob/master/docs/tcmalloc.html#L506

          tarmstrong Tim Armstrong added a comment -

          IMPALA-5304 may help a bit here by reducing the number of MemPool chunks attached to the batches.

          tarmstrong Tim Armstrong added a comment -

          I took another look at the code and added some logging to understand what was happening here. I think there may be a relatively easy fix. What is happening in a very selective scan is:

          1. Allocate a tuple buffer for 1024 rows in the scratch batch
          2. Materialize 1024 rows
          3. Filter out those 1024 rows
          4. Transfer the tuple buffer to the output batch
          5. If the output batch is over capacity, return it. Otherwise go back to the start.

          This means that the output batch can accumulate a large number of unused tuple buffers before it is sent up the tree. If we reused those unused tuple buffers instead of unconditionally transferring them, we might be able to avoid the excessive allocations (a sketch of that decision follows). I think the way the allocations accumulate in the destination batch also means the allocations and frees happen in large bursts that exceed the TCMalloc thread caches.
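
          A self-contained sketch of the reuse decision, with hypothetical names (TupleBuffer, OutputBatch and FinishScratchBatch are illustrative, not the scanner's actual API):

          #include <cstddef>
          #include <cstdint>
          #include <vector>

          struct TupleBuffer {
            uint8_t* data = nullptr;
            size_t size = 0;
          };

          struct OutputBatch {
            std::vector<TupleBuffer> attached_buffers;  // memory the consumer must later free
            int num_rows = 0;
          };

          // Returns true if the scratch buffer was handed off to the output batch and a
          // fresh one must be allocated; false if the same buffer can simply be reused.
          bool FinishScratchBatch(int surviving_rows, TupleBuffer* scratch, OutputBatch* out) {
            if (surviving_rows == 0) {
              // Nothing in the output batch points into this buffer, so reuse it for the
              // next 1024 rows instead of attaching it and later FreeAll'ing it upstream.
              return false;
            }
            out->attached_buffers.push_back(*scratch);  // the surviving tuples live here
            out->num_rows += surviving_rows;
            *scratch = TupleBuffer{};
            return true;
          }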

          tarmstrong Tim Armstrong added a comment -

          IMPALA-4923: reduce memory transfer for selective scans

          Most of the code changes are to restructure things so that the
          scratch batch's tuple buffer is stored in a separate MemPool
          from auxiliary memory such as decompression buffers. This part
          of the change does not change the behaviour of the scanner in
          itself, but allows us to recycle the tuple buffer without holding
          onto unused auxiliary memory.

          The optimisation is implemented in TryCompact(): if enough rows
          were filtered out during the copy from the scratch batch to the
          output batch, the fixed-length portions of the surviving rows
          (if any) are copied to a new, smaller, buffer, and the original,
          larger, buffer is reused for the next scratch batch.

          Previously the large buffer was always attached to the output batch,
          so a large buffer was transferred between threads for every scratch
          batch processed. In combination with the decompression buffer change
          in IMPALA-5304, this means that in many cases selective scans don't
          produce nearly as many empty or near-empty batches and do not attach
          nearly as much memory to each batch.

          Performance:
          Even on an 8 core machine I see some speedup on selective scans.
          Profiling with "perf top" also showed that time in TCMalloc
          was reduced - it went from several % of CPU time to a minimal
          amount.

          Running TPC-H on the same machine showed a ~5% overall improvement
          and no regressions. E.g. Q6 got 20-25% faster.

          I hope to do some additional cluster benchmarking on systems
          with more cores to verify that the severe performance problems
          there are fixed, but in the meantime it seems like we have enough
          evidence that it will at least improve things.

          Testing:
          Add a couple of selective scans that exercise the new code paths.

          Change-Id: I3773dc63c498e295a2c1386a15c5e69205e747ea
          Reviewed-on: http://gerrit.cloudera.org:8080/6949
          Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
          Tested-by: Impala Public Jenkins
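
          A rough sketch of the compaction step described above (hypothetical signature and names; the real TryCompact() lives in the scanner code and also has to deal with variable-length data):

          #include <cstdint>
          #include <cstdlib>
          #include <cstring>

          // Copy the fixed-length tuples of the surviving rows into a small,
          // right-sized buffer; the caller keeps the original large scratch buffer
          // for the next batch of 1024 rows.
          uint8_t* CompactSurvivors(const uint8_t* scratch_buffer, size_t tuple_size,
                                    const int* surviving_row_indices, int num_surviving) {
            uint8_t* compact = static_cast<uint8_t*>(malloc(tuple_size * num_surviving));
            for (int i = 0; i < num_surviving; ++i) {
              std::memcpy(compact + i * tuple_size,
                          scratch_buffer + surviving_row_indices[i] * tuple_size,
                          tuple_size);
            }
            return compact;  // attach this small buffer to the output batch instead
          }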


            People

            • Assignee: tarmstrong Tim Armstrong
            • Reporter: mmokhtar Mostafa Mokhtar
            • Votes: 0
            • Watchers: 6
