Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-4055

Investigate and fix to_date() slowness

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Impala 2.6.0, Impala 2.7.0, Impala 2.8.0
    • Fix Version/s: Impala 2.9.0
    • Component/s: Backend
    • Labels:

      Description

      Looks like to_date() pays a steep penalty when converting timestamps.

      +-----------------------------------------------------------------------------------------+
      | version()                                                                               |
      +-----------------------------------------------------------------------------------------+
      | impalad version 2.6.0-cdh5.8.0 RELEASE (build 5464d1750381b40a7e7163b12b09f11b891b4de3) |
      | Built on Thu, 16 Jun 2016 12:43:48 PST                                                  |
      +-----------------------------------------------------------------------------------------+
      
      -- single column timestamp parquet table of 100,000,000 rows
      
      select 
        l_shipdate,
        count(*)
      from ts1
      group by 1;
      
      Fetched 2526 row(s) in 11.25s
      
      select 
        trunc(l_shipdate,'DD'),
        count(*)
      from ts1
      group by 1;
      
      Fetched 2526 row(s) in 10.74s
      
      select 
        to_date(l_shipdate),
        count(*)
      from ts1
      group by 1;
      
      Fetched 2526 row(s) in 102.36s  <<< ~10x slower
      

        Issue Links

          Activity

          Hide
          alex.behm Alexander Behm added a comment -

          commit 6154a695b738c654fd8f094a5e6d31b1018ff687
          Author: Alex Behm <alex.behm@cloudera.com>
          Date: Tue Jan 24 00:03:14 2017 -0800

          IMPALA-4055: Speed up to_date() with custom implementation.

          Simple implementation of to_date() that avoids calling
          into boost for a speedup of 10x.

          Perf:
          I generated a synthetic Parquet table with 26437248 rows
          and a single timestamp column. I tested the response time
          of the following query before and after this change.

          set mt_dop=1;
          select count from to_date_test
          where to_date(ts) = '2017-10-23';

          Before: 38.1s
          After: 3.4s

          Testing: I locally ran expr-test.cc and expr_test.py.

          Change-Id: I5713b3e0c27b739aae597a6911cf3b2ddd01f822
          Reviewed-on: http://gerrit.cloudera.org:8080/5791
          Reviewed-by: Alex Behm <alex.behm@cloudera.com>
          Tested-by: Impala Public Jenkins

          Show
          alex.behm Alexander Behm added a comment - commit 6154a695b738c654fd8f094a5e6d31b1018ff687 Author: Alex Behm <alex.behm@cloudera.com> Date: Tue Jan 24 00:03:14 2017 -0800 IMPALA-4055 : Speed up to_date() with custom implementation. Simple implementation of to_date() that avoids calling into boost for a speedup of 10x. Perf: I generated a synthetic Parquet table with 26437248 rows and a single timestamp column. I tested the response time of the following query before and after this change. set mt_dop=1; select count from to_date_test where to_date(ts) = '2017-10-23'; Before: 38.1s After: 3.4s Testing: I locally ran expr-test.cc and expr_test.py. Change-Id: I5713b3e0c27b739aae597a6911cf3b2ddd01f822 Reviewed-on: http://gerrit.cloudera.org:8080/5791 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins
          Hide
          krusty_impala_b34c Christian Grassi added a comment -

          We have exactly the same issue. Our BI uses to_date() and it slows down the query by an order of magnitude. Is there any progress?
          I think we should rise priority as it affect a lot of usecases.

          Show
          krusty_impala_b34c Christian Grassi added a comment - We have exactly the same issue. Our BI uses to_date() and it slows down the query by an order of magnitude. Is there any progress? I think we should rise priority as it affect a lot of usecases.
          Hide
          grahn Greg Rahn added a comment -

          Using this for a UDF (and there may be better/faster ways), it looks like it 10x faster that the current implementation (assuming from_unixtime and to_date share common code).

          StringVal IntDate2StringDate(FunctionContext* context, const IntVal& arg1)
          {
              if (arg1.is_null) return StringVal::null();
          
              const time_t now = (time_t) 86400 * arg1.val;
              struct tm    ts;
              char         buf[80];
              gmtime_r(&now, &ts);
              std::strftime(buf, sizeof(buf), "%Y-%m-%d", &ts);
          
              std::string date_str = std::string(buf);
              StringVal result(context, date_str.size());
              memcpy(result.ptr, date_str.c_str(), date_str.size());
              return result;
          }
          
          -- 500,000,000 rows
          describe li2
          +---------------------+--------+---------+
          | name                | type   | comment |
          +---------------------+--------+---------+
          | l_commitdate_string | string |         |
          | l_commitdate_int32  | int    |         |
          | l_linenumber        | bigint |         |
          +---------------------+--------+---------+
          
          select * from li2 limit 5
          +---------------------+--------------------+--------------+
          | l_commitdate_string | l_commitdate_int32 | l_linenumber |
          +---------------------+--------------------+--------------+
          | 1992-05-30          | 8185               | 24651866     |
          | 1992-04-08          | 8133               | 21668240     |
          | 1992-04-03          | 8128               | 1077017      |
          | 1992-04-13          | 8138               | 5854227      |
          | 1992-06-07          | 8193               | 21622888     |
          +---------------------+--------------------+--------------+
          
          select l_commitdate_int32, sum(l_linenumber) from li2 group by 1 order by 1;
          Fetched 2466 row(s) in 11.89s
          
          select l_commitdate_string, sum(l_linenumber) from li2 group by 1 order by 1;
          Fetched 2466 row(s) in 23.25s
          
          select int2date(l_commitdate_int32) as l_commitdate_date, sum(l_linenumber) from li2 group by 1 order by 1;
          Fetched 2466 row(s) in 117.39s
          
          select from_unixtime(l_commitdate_int32 * 86400), sum(l_linenumber) from li2 group by 1 order by 1;
          Fetched 2466 row(s) in 1167.77s
          
          Show
          grahn Greg Rahn added a comment - Using this for a UDF (and there may be better/faster ways), it looks like it 10x faster that the current implementation (assuming from_unixtime and to_date share common code). StringVal IntDate2StringDate(FunctionContext* context, const IntVal& arg1) { if (arg1.is_null) return StringVal:: null (); const time_t now = (time_t) 86400 * arg1.val; struct tm ts; char buf[80]; gmtime_r(&now, &ts); std::strftime(buf, sizeof(buf), "%Y-%m-%d" , &ts); std::string date_str = std::string(buf); StringVal result(context, date_str.size()); memcpy(result.ptr, date_str.c_str(), date_str.size()); return result; } -- 500,000,000 rows describe li2 +---------------------+--------+---------+ | name | type | comment | +---------------------+--------+---------+ | l_commitdate_string | string | | | l_commitdate_int32 | int | | | l_linenumber | bigint | | +---------------------+--------+---------+ select * from li2 limit 5 +---------------------+--------------------+--------------+ | l_commitdate_string | l_commitdate_int32 | l_linenumber | +---------------------+--------------------+--------------+ | 1992-05-30 | 8185 | 24651866 | | 1992-04-08 | 8133 | 21668240 | | 1992-04-03 | 8128 | 1077017 | | 1992-04-13 | 8138 | 5854227 | | 1992-06-07 | 8193 | 21622888 | +---------------------+--------------------+--------------+ select l_commitdate_int32, sum(l_linenumber) from li2 group by 1 order by 1; Fetched 2466 row(s) in 11.89s select l_commitdate_string, sum(l_linenumber) from li2 group by 1 order by 1; Fetched 2466 row(s) in 23.25s select int2date(l_commitdate_int32) as l_commitdate_date, sum(l_linenumber) from li2 group by 1 order by 1; Fetched 2466 row(s) in 117.39s select from_unixtime(l_commitdate_int32 * 86400), sum(l_linenumber) from li2 group by 1 order by 1; Fetched 2466 row(s) in 1167.77s
          Hide
          shant_impala_2f6e Shant Hovsepian added a comment -

          From our own testing std:locale() is extremely expensive. Checking the locale for every function invocation doesn't make a lot of sense and depending on gcc/boost versions std::locale() may hold a global mutex or a refcount an atomic variable.

          Show
          shant_impala_2f6e Shant Hovsepian added a comment - From our own testing std:locale() is extremely expensive. Checking the locale for every function invocation doesn't make a lot of sense and depending on gcc/boost versions std::locale() may hold a global mutex or a refcount an atomic variable.
          Hide
          jbapple Jim Apple added a comment -

          Looks to me like Boost's to_iso_extended_string which calls to_date which calls ymd_to_string which calls imbue and creates a basic_ostringstream:

                std::basic_ostringstream<charT> ss;
          
                // Temporarily switch to classic locale to prevent possible formatting                      
                // of year with comma or other character (for example 2,008).                               
                ss.imbue(std::locale::classic());
                ss << ymd.year;
                ss.imbue(std::locale());
          

          Maybe we could just use strftime?

          Show
          jbapple Jim Apple added a comment - Looks to me like Boost's to_iso_extended_string which calls to_date which calls ymd_to_string which calls imbue and creates a basic_ostringstream : std::basic_ostringstream<charT> ss; // Temporarily switch to classic locale to prevent possible formatting // of year with comma or other character (for example 2,008). ss.imbue(std::locale::classic()); ss << ymd.year; ss.imbue(std::locale()); Maybe we could just use strftime ?
          Hide
          mmokhtar Mostafa Mokhtar added a comment -

          Hot functions.

          Function / Call Stack CPU Time Module Function (Full) Source File Start Address
          std::basic_ios<char, std::char_traits<char>>::imbue 7.000s libstdc++.so.6 std::basic_ios<char, std::char_traits<char>>::imbue(std::locale const&) 0x81840
          std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char>>::basic_ostringstream 5.940s libstdc++.so.6 std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char>>::basic_ostringstream(std::_Ios_Openmode) 0xa1340
          std::ostream::_M_insert<unsigned long> 2.000s libstdc++.so.6 std::ostream& std::ostream::_M_insert<unsigned long>(unsigned long) 0x9c1d0
          std::locale::~locale 1.180s libstdc++.so.6 std::locale::~locale(void) 0x74120
          __pread64 0.537s libpthread.so.0 __pread64 0x357f20f100
          impala::ScalarFnCall::EvaluateChildren 0.450s impalad impala::ScalarFnCall::EvaluateChildren(impala::ExprContext*, impala::TupleRow*, std::vector<impala_udf::AnyVal*, std::allocator<impala_udf::AnyVal*>>*) 0x7e5050
          memcpy 0.431s libc.so.6 memcpy 0x357ee89800
          impala::ExprContext::GetValue 0.330s impalad impala::ExprContext::GetValue(impala::Expr*, impala::TupleRow*) 0x7a8710
          std::locale::locale 0.270s libstdc++.so.6 std::locale::locale(void) 0x75b40
          std::__ostream_insert<char, std::char_traits<char>> 0.250s libstdc++.so.6 std::basic_ostream<char, std::char_traits<char>>& std::__ostream_insert<char, std::char_traits<char>>(std::basic_ostream<char, std::char_traits<char>>&, char const*, long) 0x9b9a0
          impala::TimestampParser::ParseDateTime 0.240s impalad impala::TimestampParser::ParseDateTime(char const*, int, impala::DateTimeFormatContext const&, impala::DateTimeParseResult*) 0x963d40
          impala::TimestampParser::Parse 0.210s impalad impala::TimestampParser::Parse(char const*, int, boost::gregorian::date*, boost::posix_time::time_duration*) 0x968a40
          impala::HdfsRCFileScanner::ProcessRange 0.200s impalad impala::HdfsRCFileScanner::ProcessRange(void) 0xb5d990
          std::basic_stringbuf<char, std::char_traits<char>, std::allocator<char>>::str 0.170s libstdc++.so.6 std::basic_stringbuf<char, std::char_traits<char>, std::allocator<char>>::str(void) const 0xa0470
          boost::date_time::ymd_formatter<boost::date_time::year_month_day_base<boost::gregorian::greg_year, boost::gregorian::greg_month, boost::gregorian::greg_day>, boost::date_time::iso_extended_format<char>, char>::ymd_to_string 0.150s impalad boost::date_time::ymd_formatter<boost::date_time::year_month_day_base<boost::gregorian::greg_year, boost::gregorian::greg_month, boost::gregorian::greg_day>, boost::date_time::iso_extended_format<char>, char>::ymd_to_string(boost::date_time::year_month_day_base<boost::gregorian::greg_year, boost::gregorian::greg_month, boost::gregorian::greg_day>) 0x7ca660
          impala::FunctionContextImpl::AllocateLocal 0.140s impalad impala::FunctionContextImpl::AllocateLocal(int) 0xa888d0
          impala::TimestampParser::Parse 0.130s impalad impala::TimestampParser::Parse(char const*, int, impala::DateTimeFormatContext const&, boost::gregorian::date*, boost::posix_time::time_duration*) 0x9682a0
          impala::TimestampFunctions::ToDate 0.120s impalad impala::TimestampFunctions::ToDate(impala_udf::FunctionContext*, impala_udf::TimestampVal const&) 0x7c3650
          operator delete 0.110s impalad operator delete(void*) 0x16d8340
          tc_new 0.090s impalad tc_new 0x16d7a30
          Show
          mmokhtar Mostafa Mokhtar added a comment - Hot functions. Function / Call Stack CPU Time Module Function (Full) Source File Start Address std::basic_ios<char, std::char_traits<char>>::imbue 7.000s libstdc++.so.6 std::basic_ios<char, std::char_traits<char>>::imbue(std::locale const&) 0x81840 std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char>>::basic_ostringstream 5.940s libstdc++.so.6 std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char>>::basic_ostringstream(std::_Ios_Openmode) 0xa1340 std::ostream::_M_insert<unsigned long> 2.000s libstdc++.so.6 std::ostream& std::ostream::_M_insert<unsigned long>(unsigned long) 0x9c1d0 std::locale::~locale 1.180s libstdc++.so.6 std::locale::~locale(void) 0x74120 __pread64 0.537s libpthread.so.0 __pread64 0x357f20f100 impala::ScalarFnCall::EvaluateChildren 0.450s impalad impala::ScalarFnCall::EvaluateChildren(impala::ExprContext*, impala::TupleRow*, std::vector<impala_udf::AnyVal*, std::allocator<impala_udf::AnyVal*>>*) 0x7e5050 memcpy 0.431s libc.so.6 memcpy 0x357ee89800 impala::ExprContext::GetValue 0.330s impalad impala::ExprContext::GetValue(impala::Expr*, impala::TupleRow*) 0x7a8710 std::locale::locale 0.270s libstdc++.so.6 std::locale::locale(void) 0x75b40 std::__ostream_insert<char, std::char_traits<char>> 0.250s libstdc++.so.6 std::basic_ostream<char, std::char_traits<char>>& std::__ostream_insert<char, std::char_traits<char>>(std::basic_ostream<char, std::char_traits<char>>&, char const*, long) 0x9b9a0 impala::TimestampParser::ParseDateTime 0.240s impalad impala::TimestampParser::ParseDateTime(char const*, int, impala::DateTimeFormatContext const&, impala::DateTimeParseResult*) 0x963d40 impala::TimestampParser::Parse 0.210s impalad impala::TimestampParser::Parse(char const*, int, boost::gregorian::date*, boost::posix_time::time_duration*) 0x968a40 impala::HdfsRCFileScanner::ProcessRange 0.200s impalad impala::HdfsRCFileScanner::ProcessRange(void) 0xb5d990 std::basic_stringbuf<char, std::char_traits<char>, std::allocator<char>>::str 0.170s libstdc++.so.6 std::basic_stringbuf<char, std::char_traits<char>, std::allocator<char>>::str(void) const 0xa0470 boost::date_time::ymd_formatter<boost::date_time::year_month_day_base<boost::gregorian::greg_year, boost::gregorian::greg_month, boost::gregorian::greg_day>, boost::date_time::iso_extended_format<char>, char>::ymd_to_string 0.150s impalad boost::date_time::ymd_formatter<boost::date_time::year_month_day_base<boost::gregorian::greg_year, boost::gregorian::greg_month, boost::gregorian::greg_day>, boost::date_time::iso_extended_format<char>, char>::ymd_to_string(boost::date_time::year_month_day_base<boost::gregorian::greg_year, boost::gregorian::greg_month, boost::gregorian::greg_day>) 0x7ca660 impala::FunctionContextImpl::AllocateLocal 0.140s impalad impala::FunctionContextImpl::AllocateLocal(int) 0xa888d0 impala::TimestampParser::Parse 0.130s impalad impala::TimestampParser::Parse(char const*, int, impala::DateTimeFormatContext const&, boost::gregorian::date*, boost::posix_time::time_duration*) 0x9682a0 impala::TimestampFunctions::ToDate 0.120s impalad impala::TimestampFunctions::ToDate(impala_udf::FunctionContext*, impala_udf::TimestampVal const&) 0x7c3650 operator delete 0.110s impalad operator delete(void*) 0x16d8340 tc_new 0.090s impalad tc_new 0x16d7a30

            People

            • Assignee:
              alex.behm Alexander Behm
              Reporter:
              grahn Greg Rahn
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development