Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-5357

Reading Kudu timestamp causes severe kernel spinning due to locking in impala::TimestampValue::UnixTimeToPtime-> __tz_convert

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Impala 2.9.0
    • Fix Version/s: Impala 2.9.0
    • Component/s: Backend
    • Labels:
    • Epic Color:
      ghx-label-6

      Description

      Query

      select l_commitdate, count(*) from lineitem_kudu_ts where l_shipdate between "1992-01-01" and "1992-03-01" group by l_commitdate limit 10
      

      Call stack

      Thread 1 (process 90064):
      #0  0x00000037970f80cf in __lll_unlock_wake_private () from /lib64/libc.so.6
      #1  0x000000379709dcfc in _L_unlock_2225 () from /lib64/libc.so.6
      #2  0x000000379709db31 in __tz_convert () from /lib64/libc.so.6
      #3  0x0000000000a4ca75 in impala::TimestampValue::UnixTimeToPtime(long) ()
      #4  0x0000000000d3aad2 in impala::KuduScanner::DecodeRowsIntoRowBatch(impala::RowBatch*, impala::Tuple**) ()
      #5  0x0000000000d3b05f in impala::KuduScanner::GetNext(impala::RowBatch*, bool*) ()
      #6  0x0000000000cfa5a3 in impala::KuduScanNode::ProcessScanToken(impala::KuduScanner*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
      #7  0x0000000000cfb024 in impala::KuduScanNode::RunScannerThread(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const*) ()
      #8  0x0000000000bed5b9 in impala::Thread::SuperviseThread(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, boost::function<void ()()>, impala::Promise<long>*) ()
      #9  0x0000000000bedf74 in boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, boost::function<void ()()>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<boost::function<void ()()> >, boost::_bi::value<impala::Promise<long>*> > > >::run() ()
      #10 0x0000000000e529ca in ?? ()
      #11 0x0000003797407aa1 in start_thread () from /lib64/libpthread.so.0
      #12 0x00000037970e893d in clone () from /lib64/libc.so.6
      

      Table DDL

       CREATE TABLE scan_primitives_tpch_3tb.lineitem_kudu_ts (                                                                                                                  
         l_shipdate STRING NOT NULL ENCODING DICT_ENCODING COMPRESSION LZ4,                                                                                                      
         l_orderkey BIGINT NOT NULL ENCODING BIT_SHUFFLE COMPRESSION LZ4,                                                                                                        
         l_linenumber BIGINT NOT NULL ENCODING BIT_SHUFFLE COMPRESSION LZ4,                                                                                                      
         l_partkey BIGINT NOT NULL ENCODING BIT_SHUFFLE COMPRESSION LZ4,                                                                                                         
         l_suppkey BIGINT NOT NULL ENCODING BIT_SHUFFLE COMPRESSION LZ4,                                                                                                         
         l_quantity DOUBLE NULL ENCODING BIT_SHUFFLE COMPRESSION LZ4,                                                                                                            
         l_extendedprice DOUBLE NULL ENCODING PLAIN_ENCODING COMPRESSION LZ4,                                                                                                    
         l_discount DOUBLE NULL ENCODING BIT_SHUFFLE COMPRESSION LZ4,                                                                                                            
         l_tax DOUBLE NULL ENCODING BIT_SHUFFLE COMPRESSION LZ4,                                                                                                                 
         l_returnflag STRING NULL ENCODING DICT_ENCODING COMPRESSION LZ4,                                                                                                        
         l_linestatus STRING NULL ENCODING DICT_ENCODING COMPRESSION LZ4,                                                                                                        
         l_commitdate TIMESTAMP NULL ENCODING BIT_SHUFFLE COMPRESSION LZ4,                                                                                                       
         l_receiptdate STRING NULL ENCODING DICT_ENCODING COMPRESSION LZ4,                                                                                                       
         l_shipinstruct STRING NULL ENCODING DICT_ENCODING COMPRESSION LZ4,                                                                                                      
         l_shipmode STRING NULL ENCODING DICT_ENCODING COMPRESSION LZ4,                                                                                                          
         l_comment STRING NULL ENCODING PLAIN_ENCODING COMPRESSION LZ4,                                                                                                          
         PRIMARY KEY (l_shipdate, l_orderkey, l_linenumber)                                                                                                                      
       )                                                                                                                                                                         
       PARTITION BY HASH (l_orderkey) PARTITIONS 140                                                                                                                             
       STORED AS KUDU                                                                                                                                                            
       TBLPROPERTIES ('numRows'='17614285211', 'kudu.master_addresses'='foo:7051,foo1:7051', 'STATS_GENERATED_VIA_STATS_TASK'='true') 
       
      1. cast string to timestamp scan-agg_profile.txt
        11 kB
        Mostafa Mokhtar
      2. Screen Shot 2017-05-23 at 6.57.12 PM.png
        162 kB
        Mostafa Mokhtar
      3. string_scan-agg_profile.txt
        11 kB
        Mostafa Mokhtar
      4. timestamp_scan-agg_profile.txt
        12 kB
        Mostafa Mokhtar

        Issue Links

          Activity

          Hide
          mjacobs Matthew Jacobs added a comment -

          Looks like this is the same underlying issue as IMPALA-3316. I'll look
          to see if there is some other way to convert a time_t to a ptime.

          Show
          mjacobs Matthew Jacobs added a comment - Looks like this is the same underlying issue as IMPALA-3316 . I'll look to see if there is some other way to convert a time_t to a ptime.
          Hide
          mjacobs Matthew Jacobs added a comment -

          This issue already exists when reading parquet files written by Hive and using the -convert_legacy_hive_parquet_utc_timestamps flag, (IMPALA-3316). I won't mark this JIRA as a duplicate because it'll be useful to track by scenario, but that path is actually going to be much more commonly hit than the Kudu TIMESTAMP path, because Kudu is new and the timestamp support for Kudu is only being added in Impala 2.9.

          It looks like the affected code was introduced by the fix for IMPALA-1676 (commit 5d652421 ) which explicitly added the calls to the glibc conversion functions that are affected. That was changed to avoid the boost conversion function ptime::from_time_t which doesn't allow dates above the year 2037. This is a known bug [1] which has recently been fixed upstream [2] though probably won't even make it into the upcoming 1.64 release (see comments in the ticket).

          I will try patching boost in the Impala toolchain and switching back to the boost::from_time_t function, but I'd be concerned there could be subtle differences in how times are converted.

          I think we should try to fix this but I don't believe this is a blocker seeing as it's not a regression. If we can't address it correctly in time, I do not think we should back out the timestamp functionality, but rather guide users that it is only ready for non
          perf sensitive scenarios. Users who care about perf for Kudu tables with timestamps should continue using BIGINTs as they are doing.

          1: https://svn.boost.org/trac/boost/ticket/4543
          2: https://github.com/boostorg/date_time/pull/35/commits/818dea52f3f0c8b4f5172df13c5fa4f57340625f

          Show
          mjacobs Matthew Jacobs added a comment - This issue already exists when reading parquet files written by Hive and using the -convert_legacy_hive_parquet_utc_timestamps flag, ( IMPALA-3316 ). I won't mark this JIRA as a duplicate because it'll be useful to track by scenario, but that path is actually going to be much more commonly hit than the Kudu TIMESTAMP path, because Kudu is new and the timestamp support for Kudu is only being added in Impala 2.9. It looks like the affected code was introduced by the fix for IMPALA-1676 (commit 5d652421 ) which explicitly added the calls to the glibc conversion functions that are affected. That was changed to avoid the boost conversion function ptime::from_time_t which doesn't allow dates above the year 2037. This is a known bug [1] which has recently been fixed upstream [2] though probably won't even make it into the upcoming 1.64 release (see comments in the ticket). I will try patching boost in the Impala toolchain and switching back to the boost::from_time_t function, but I'd be concerned there could be subtle differences in how times are converted. I think we should try to fix this but I don't believe this is a blocker seeing as it's not a regression. If we can't address it correctly in time, I do not think we should back out the timestamp functionality, but rather guide users that it is only ready for non perf sensitive scenarios. Users who care about perf for Kudu tables with timestamps should continue using BIGINTs as they are doing. 1: https://svn.boost.org/trac/boost/ticket/4543 2: https://github.com/boostorg/date_time/pull/35/commits/818dea52f3f0c8b4f5172df13c5fa4f57340625f
          Hide
          mjacobs Matthew Jacobs added a comment -

          I spent a few hours trying to get a patched build of boost to work, but I can't get it to represent the full date range that we support e.g. FromUnixTime(-17987443200) should be the year 1400, but it gives the wrong result and I can't find the source of the issue in boost. I'm leaning towards calling the patched boost option being too difficult to attempt for Impala 2.9.

          Show
          mjacobs Matthew Jacobs added a comment - I spent a few hours trying to get a patched build of boost to work, but I can't get it to represent the full date range that we support e.g. FromUnixTime(-17987443200) should be the year 1400, but it gives the wrong result and I can't find the source of the issue in boost. I'm leaning towards calling the patched boost option being too difficult to attempt for Impala 2.9.
          Hide
          mjacobs Matthew Jacobs added a comment -

          commit dad4896693c99f6a59be5c25bed2297a1ed57c5f
          Author: Matthew Jacobs <mj@cloudera.com>
          Date: Mon Jun 5 12:28:43 2017 -0700

          IMPALA-5357: Fix unixtime to UTC TimestampValue perf

          Fixes the severe perf issue when calling gmtime_r to convert
          a unix time because that libc function takes a global lock.
          Instead, use boost ptime::from_time_t when possible.
          Unfortunately only a range of input times are supported
          without overflowing the underlying time_duration struct
          (dates between 1677-2262), so those dates outside that range
          are handled with the slow path calling into gmtime_r.

          This requires using a patched build of boost which
          includes an upstream fix for the time_duration class, see:
          https://github.com/cloudera/native-toolchain/commit/6e726b4b8164d53814f75b78a681fcf6df4a887a

          Testing:

          • Ran an exhaustive test run successfully.
          • Wrote a test program to validate that all time_t values
            between 1677 and 2262 are converted to the same ptime
            when using the new code path and the old path. Doing so
            required running all night, so I'm not going to attempt to
            add this test. I think a single validation is acceptable.

          Perf impact:
          Microbenchmark of the new path (conversion using boost) and
          the old path (conversion using libc gmtime_r) resulted in
          the expected speedup from not using a global lock.

          Machine Info: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
          Function 10%ile 50%ile 90%ile 10%ile 50%ile 90%ile
          iters/ms iters/ms iters/ms (relative) (relative) (relative)
          ---------------------------------------------------------------------
          libc-1 0.192 0.196 0.2 1X 1X 1X
          boost-1 0.333 0.34 0.34 1.73X 1.73X 1.7X
          libc-2 0.112 0.115 0.116 0.584X 0.586X 0.581X
          boost-2 0.627 0.654 0.667 3.26X 3.33X 3.33X
          libc-4 0.042 0.0478 0.0515 0.218X 0.244X 0.258X
          boost-4 0.863 1.27 1.3 4.49X 6.5X 6.5X
          libc-8 0.0326 0.0328 0.0329 0.169X 0.167X 0.164X
          boost-8 0.741 0.902 1.1 3.85X 4.6X 5.5X

          Change-Id: I9d611d21310c7b4f93d8e0bc845eb85125abcac9
          Reviewed-on: http://gerrit.cloudera.org:8080/7082
          Reviewed-by: Matthew Jacobs <mj@cloudera.com>
          Tested-by: Impala Public Jenkins

          Show
          mjacobs Matthew Jacobs added a comment - commit dad4896693c99f6a59be5c25bed2297a1ed57c5f Author: Matthew Jacobs <mj@cloudera.com> Date: Mon Jun 5 12:28:43 2017 -0700 IMPALA-5357 : Fix unixtime to UTC TimestampValue perf Fixes the severe perf issue when calling gmtime_r to convert a unix time because that libc function takes a global lock. Instead, use boost ptime::from_time_t when possible. Unfortunately only a range of input times are supported without overflowing the underlying time_duration struct (dates between 1677-2262), so those dates outside that range are handled with the slow path calling into gmtime_r. This requires using a patched build of boost which includes an upstream fix for the time_duration class, see: https://github.com/cloudera/native-toolchain/commit/6e726b4b8164d53814f75b78a681fcf6df4a887a Testing: Ran an exhaustive test run successfully. Wrote a test program to validate that all time_t values between 1677 and 2262 are converted to the same ptime when using the new code path and the old path. Doing so required running all night, so I'm not going to attempt to add this test. I think a single validation is acceptable. Perf impact: Microbenchmark of the new path (conversion using boost) and the old path (conversion using libc gmtime_r) resulted in the expected speedup from not using a global lock. Machine Info: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz Function 10%ile 50%ile 90%ile 10%ile 50%ile 90%ile iters/ms iters/ms iters/ms (relative) (relative) (relative) --------------------------------------------------------------------- libc-1 0.192 0.196 0.2 1X 1X 1X boost-1 0.333 0.34 0.34 1.73X 1.73X 1.7X libc-2 0.112 0.115 0.116 0.584X 0.586X 0.581X boost-2 0.627 0.654 0.667 3.26X 3.33X 3.33X libc-4 0.042 0.0478 0.0515 0.218X 0.244X 0.258X boost-4 0.863 1.27 1.3 4.49X 6.5X 6.5X libc-8 0.0326 0.0328 0.0329 0.169X 0.167X 0.164X boost-8 0.741 0.902 1.1 3.85X 4.6X 5.5X Change-Id: I9d611d21310c7b4f93d8e0bc845eb85125abcac9 Reviewed-on: http://gerrit.cloudera.org:8080/7082 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Impala Public Jenkins
          Hide
          mjacobs Matthew Jacobs added a comment -

          I validated that on Mostafa Mokhtar's original cluster, the original query which was taking ~52sec (see timestamp_scan-agg_profile.txt) now runs in 6sec, so almost 8x improvement over the time seen in the attached profile. The global lock bottleneck is clearly gone, which was the point of this JIRA. We should continue to evaluate future TIMESTAMP perf improvements: a similar query on a string column took about 4sec, so TIMESTAMP still lags in performance, almost certainly due to the use of boost, but we haven't done targeted profiling/careful analysis.

          Show
          mjacobs Matthew Jacobs added a comment - I validated that on Mostafa Mokhtar 's original cluster, the original query which was taking ~52sec (see timestamp_scan-agg_profile.txt) now runs in 6sec, so almost 8x improvement over the time seen in the attached profile. The global lock bottleneck is clearly gone, which was the point of this JIRA. We should continue to evaluate future TIMESTAMP perf improvements: a similar query on a string column took about 4sec, so TIMESTAMP still lags in performance, almost certainly due to the use of boost, but we haven't done targeted profiling/careful analysis.

            People

            • Assignee:
              mjacobs Matthew Jacobs
              Reporter:
              mmokhtar Mostafa Mokhtar
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development