Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-4983

Regression in exchange operators introduced by LZ4 1.7.5 upgrade

    Details

      Description

      "IMPALA-4926: Upgrade LZ4 to 1.7.5" introduced a regression in exchange mostly due to increased CPU time in LZ4_compress_default and

      CPU Time
      1 of 33: 31.6% (6.590s of 20.840s)
      
      impalad ! LZ4_compress_fast_extState - [unknown source file]
      impalad ! LZ4_compress_fast + 0x69 - [unknown source file]
      impalad ! LZ4_compress_default + 0x31 - [unknown source file]
      impalad ! impala::Lz4Compressor::ProcessBlock + 0x2e - compress.cc:303
      impalad ! impala::RowBatch::Serialize + 0x219 - row-batch.cc:212
      impalad ! impala::RowBatch::Serialize + 0x25 - row-batch.cc:169
      impalad ! impala::DataStreamSender::SerializeBatch + 0x1c5 - data-stream-sender.cc:495
      impalad ! impala::DataStreamSender::Channel::SendCurrentBatch + 0x35 - data-stream-sender.cc:269
      impalad ! impala::DataStreamSender::Channel::AddRow + 0xec - data-stream-sender.cc:250
      impalad ! impala::DataStreamSender::Send + 0x23b - data-stream-sender.cc:459
      impalad ! impala::PlanFragmentExecutor::ExecInternal + 0x24a - plan-fragment-executor.cc:366
      impalad ! impala::PlanFragmentExecutor::Exec + 0x14c - plan-fragment-executor.cc:337
      impalad ! impala::FragmentInstanceState::Exec + 0xe7 - fragment-instance-state.cc:68
      impalad ! impala::QueryExecMgr::ExecFInstance + 0x1e - query-exec-mgr.cc:110
      impalad ! boost::function0<void>::operator() + 0x1a - function_template.hpp:767
      impalad ! impala::Thread::SuperviseThread + 0x209 - thread.cc:317
      impalad ! operator()<void (*)(const std::basic_string<char>&, const std::basic_string<char>&, boost::function<void()>, impala::Promise<long int>*), boost::_bi::list0> + 0x5a - bind.hpp:457
      impalad ! boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void (void)>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void (void)>>, boost::_bi::value<impala::Promise<long>*>>>::operator() - bind_template.hpp:20
      impalad ! boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void (void)>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void (void)>>, boost::_bi::value<impala::Promise<long>*>>>>::run + 0x19 - thread.hpp:116
      impalad ! thread_proxy + 0xd9 - [unknown source file]
      libpthread.so.0 ! start_thread + 0xd0 - [unknown source file]
      libc.so.6 ! clone + 0x6c - [unknown source file]
      
      CPU Time
      1 of 17: 35.7% (2.310s of 6.470s)
      
      impalad ! LZ4_decompress_fast - [unknown source file]
      impalad ! impala::Lz4Decompressor::ProcessBlock + 0x3a - decompress.cc:565
      impalad ! impala::RowBatch::RowBatch + 0x18d - row-batch.cc:110
      impalad ! impala::DataStreamRecvr::SenderQueue::AddBatch + 0x445 - data-stream-recvr.cc:211
      impalad ! impala::DataStreamMgr::AddData + 0x11c - data-stream-mgr.cc:189
      impalad ! impala::ImpalaServer::TransmitData + 0x140 - impala-server.cc:1109
      impalad ! impala::ImpalaInternalServiceProcessor::process_TransmitData + 0x2c0 - ImpalaInternalService.cpp:1559
      impalad ! impala::ImpalaInternalServiceProcessor::dispatchCall + 0xe8 - ImpalaInternalService.cpp:1370
      impalad ! apache::thrift::TDispatchProcessor::process + 0xab - TDispatchProcessor.h:121
      impalad ! apache::thrift::server::TAcceptQueueServer::Task::run + 0x15e - TAcceptQueueServer.cpp:76
      impalad ! impala::ThriftThread::RunRunnable + 0x68 - thrift-thread.cc:64
      impalad ! boost::_mfi::mf2<void, impala::ThriftThread, boost::shared_ptr<apache::thrift::concurrency::Runnable>, impala::Promise<unsigned long>*>::operator() + 0x3a - mem_fn_template.hpp:280
      impalad ! operator()<boost::_mfi::mf2<void, impala::ThriftThread, boost::shared_ptr<apache::thrift::concurrency::Runnable>, impala::Promise<long unsigned int>*>, boost::_bi::list0> + 0x1e - bind.hpp:392
      impalad ! boost::_bi::bind_t<void, boost::_mfi::mf2<void, impala::ThriftThread, boost::shared_ptr<apache::thrift::concurrency::Runnable>, impala::Promise<unsigned long>*>, boost::_bi::list3<boost::_bi::value<impala::ThriftThread*>, boost::_bi::value<boost::shared_ptr<apache::thrift::concurrency::Runnable>>, boost::_bi::value<impala::Promise<unsigned long>*>>>::operator() - bind_template.hpp:20
      impalad ! boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<void, boost::_mfi::mf2<void, impala::ThriftThread, boost::shared_ptr<apache::thrift::concurrency::Runnable>, impala::Promise<unsigned long>*>, boost::_bi::list3<boost::_bi::value<impala::ThriftThread*>, boost::_bi::value<boost::shared_ptr<apache::thrift::concurrency::Runnable>>, boost::_bi::value<impala::Promise<unsigned long>*>>>, void>::invoke + 0x9 - function_template.hpp:153
      impalad ! boost::function0<void>::operator() + 0x1a - function_template.hpp:767
      impalad ! impala::Thread::SuperviseThread + 0x209 - thread.cc:317
      impalad ! operator()<void (*)(const std::basic_string<char>&, const std::basic_string<char>&, boost::function<void()>, impala::Promise<long int>*), boost::_bi::list0> + 0x5a - bind.hpp:457
      impalad ! boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void (void)>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void (void)>>, boost::_bi::value<impala::Promise<long>*>>>::operator() - bind_template.hpp:20
      impalad ! boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void (void)>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void (void)>>, boost::_bi::value<impala::Promise<long>*>>>>::run + 0x19 - thread.hpp:116
      impalad ! thread_proxy + 0xd9 - [unknown source file]
      libpthread.so.0 ! start_thread + 0xd0 - [unknown source file]
      libc.so.6 ! clone + 0x6c - [unknown source file]
      
      1. regressed_primitive_exchange_broadcast.txt
        328 kB
        Mostafa Mokhtar
      2. CallStacks.csv
        20 kB
        Mostafa Mokhtar
      3. baselin_primitive_exchange_broadcast.txt
        331 kB
        Mostafa Mokhtar

        Activity

        Hide
        jbapple Jim Apple added a comment -

        Wasn't this created and fixed in the same release, thus avoiding the need for any docs mentions?

        Show
        jbapple Jim Apple added a comment - Wasn't this created and fixed in the same release, thus avoiding the need for any docs mentions?
        Hide
        jrussell John Russell added a comment -

        Mind if I add the 'regression' label to this JIRA (for filtering purposes)?

        Show
        jrussell John Russell added a comment - Mind if I add the 'regression' label to this JIRA (for filtering purposes)?
        Show
        henryr Henry Robinson added a comment - https://github.com/apache/incubator-impala/commit/674d313251eeb94c9b1bd7eaec4c641709d395ca
        Hide
        henryr Henry Robinson added a comment -

        It looks like this might be caused by LZ4 not being built with -O2 or higher. I was able to reproduce a regression on a broadcast exchange:

        SELECT /* +straight_join */ count(*)
        FROM nation
        JOIN /* +broadcast */ lineitem ON l_orderkey = n_nationkey
        WHERE l_partkey > 0
          AND l_suppkey > 0
          AND l_linenumber > 0
          AND n_regionkey < 0
          AND l_orderkey < 350000000;
        

        and saw this go away on my single-node 3-impalad cluster with an optimized LZ4 binary.

        What's interesting is that AFAICT, LZ4 was not built with optimization in previous versions either. So perhaps the unoptimized performance regressed. This thread makes it sound like gcc 4.x had a hard time optimizing recent lz4 at various levels, so perhaps that's also a factor.

        Show
        henryr Henry Robinson added a comment - It looks like this might be caused by LZ4 not being built with -O2 or higher. I was able to reproduce a regression on a broadcast exchange: SELECT /* +straight_join */ count(*) FROM nation JOIN /* +broadcast */ lineitem ON l_orderkey = n_nationkey WHERE l_partkey > 0 AND l_suppkey > 0 AND l_linenumber > 0 AND n_regionkey < 0 AND l_orderkey < 350000000; and saw this go away on my single-node 3-impalad cluster with an optimized LZ4 binary. What's interesting is that AFAICT, LZ4 was not built with optimization in previous versions either. So perhaps the unoptimized performance regressed. This thread makes it sound like gcc 4.x had a hard time optimizing recent lz4 at various levels, so perhaps that's also a factor.
        Hide
        henryr Henry Robinson added a comment -

        I mean how big is the difference in that callstack specifically? Our row batch serialization benchmark doesn't show a large difference with the previous or current versions of Lz4. So I want to see how big the difference is so I recognize it when I see it. I certainly don't see a 3x slowdown (maybe 1%), but I can experiment with the primitive tuple layout.

        Show
        henryr Henry Robinson added a comment - I mean how big is the difference in that callstack specifically? Our row batch serialization benchmark doesn't show a large difference with the previous or current versions of Lz4. So I want to see how big the difference is so I recognize it when I see it. I certainly don't see a 3x slowdown (maybe 1%), but I can experiment with the primitive tuple layout.
        Hide
        mmokhtar Mostafa Mokhtar added a comment -

        Henry Robinson
        primitive_exchange_broadcast regressed by 3x, 70s -> 215s.

        Profiles attached
        regressed_primitive_exchange_broadcast.txt baselin_primitive_exchange_broadcast.txt

        Show
        mmokhtar Mostafa Mokhtar added a comment - Henry Robinson primitive_exchange_broadcast regressed by 3x, 70s -> 215s. Profiles attached regressed_primitive_exchange_broadcast.txt baselin_primitive_exchange_broadcast.txt
        Hide
        henryr Henry Robinson added a comment -

        How big is the regression in the LZ4 compression time?

        Show
        henryr Henry Robinson added a comment - How big is the regression in the LZ4 compression time?

          People

          • Assignee:
            henryr Henry Robinson
            Reporter:
            mmokhtar Mostafa Mokhtar
          • Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development