IMPALA-3316

convert_legacy_hive_parquet_utc_timestamps=true makes reading parquet tables 30x slower

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: impala 2.3
    • Fix Version/s: None
    • Component/s: Backend
    • Labels:
      None
    • Environment:
      CDH 5.5.2/ Impala 2.3
      Parquet table with a timestamp column
      Secure cluster
      convert_legacy_hive_parquet_utc_timestamps=true
      Timestamp column is not being filtered on

      Description

      Enabling convert_legacy_hive_parquet_utc_timestamps=true
      makes simple queries that don't even filter on a timestamp attribute perform really poorly.

      Parquet table.
      Impala 2.3 / CDH 5.5.2.

      convert_legacy_hive_parquet_utc_timestamps=true makes the following simple query ~30x slower (1.1 minutes -> over 30 minutes).

      select * from parquet_table_with_a_timestamp_attribute where bigint_attribute=1000771658169

      Notice I did not even filter on a timestamp attribute.

      Ran multiple tests with and without the convert_legacy_hive_parquet_utc_timestamps=true impalad flag present.

      Also, from https://issues.cloudera.org/browse/IMPALA-1658

      Casey Ching added a comment - 15/Jun/15 5:12 PM
      Btw, a perf test showed enabling this flag was 10x slower.

      Attachments

      1. screenshot-1.png
        9 kB
        Ruslan Dautkhanov
      2. screenshot-2.png
        19 kB
        Boris Tyukin


          Activity

          boristyukin Boris Tyukin added a comment -

          casey Ruslan Dautkhanov I did not realize how bad the impact of this would be on our system. I just tested this on our dev 6-node cluster, running 2 different queries. Both queries join 4 or 5 tables, and one of the tables has 6B rows. Results are below. I just cannot believe how bad this is and why this JIRA has not gotten much attention from Cloudera. In fact, Cloudera's documentation states:

          • Although -convert_legacy_hive_parquet_utc_timestamps is turned off by default to avoid performance overhead, Cloudera recommends turning it on when processing TIMESTAMP columns in Parquet files written by Hive, to avoid unexpected behavior.

          We will be submitting a ticket for sure as we have an enterprise CDH subscription. I have not seen anything impacting performance so badly in my entire IT career from such a harmless setting, which is recommended by the vendor and neglected for more than a year. The double strike is that the issue happens only with Parquet files, a format backed by the same company and recommended for use with both Hive and Impala.

          Thanks Ruslan Dautkhanov for the workaround with using String instead of Timestamp - looks like it works great and date functions are still working as expected, granted I only spent a few minutes testing.

          Tagar Ruslan Dautkhanov added a comment -

          Boris Tyukin, we store dates in `yyyy-MM-dd` format. Then Impala converts them to date/timestamp internally.
          Search for `yyyy-mm-dd` in https://www.cloudera.com/documentation/enterprise/latest/topics/impala_datetime_functions.html
          We still use this as a workaround.
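
          As a rough sketch (the table and column names below are made up for illustration, not from our schema), date arithmetic works fine on such strings; per the linked documentation page, the explicit casts can often even be dropped for strings in 'yyyy-MM-dd' format:

          -- event_date is stored as a STRING in 'yyyy-MM-dd' format, so no Parquet
          -- TIMESTAMP is read and the UTC-to-local conversion never runs.
          select event_date,
                 datediff(now(), cast(event_date as timestamp)) as days_ago,
                 year(cast(event_date as timestamp)) as event_year
          from parquet_table_with_string_dates
          limit 3;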

          boristyukin Boris Tyukin added a comment -

          Ruslan Dautkhanov, did you find a good workaround? You mentioned converting timestamps to strings - how did it work for you?

          tagar_impala_e3b3 Ruslan Dautkhanov added a comment -

          Glad you found it useful.

          1)
          Hmm.. I do have those files in RHEL6 - edit: I just noticed that I don't have a PST file either.. weird, there are EST and MST files though:

          $ ll | egrep "[PMCE][SD]T"
          -rw-r--r-- 1 root root 2294 Oct  1  2015 CST6CDT
          -rw-r--r-- 1 root root  118 Oct  1  2015 EST
          -rw-r--r-- 1 root root 2294 Oct  1  2015 EST5EDT
          -rw-r--r-- 1 root root  118 Oct  1  2015 MST
          -rw-r--r-- 1 root root 2294 Oct  1  2015 MST7MDT
          -rw-r--r-- 1 root root 2294 Oct  1  2015 PST8PDT
          $ pwd
          /usr/share/zoneinfo

          I've submitted https://github.com/google/cctz/issues/22 to use zone.tab instead, which should have the complete set of timezones.
          For local-timezone-only translations, wouldn't it be easier to just use /etc/localtime, which normally points to a file (or is a copy of a file) in /usr/share/zoneinfo?

          2)
          Good catch, casey. I sent a note to the author and submitted https://github.com/google/cctz/issues/23.

          caseyc casey added a comment -

          Nice find! I took a quick look and it seems pretty nice. Two things I noticed:

          1) Timezone abbreviation support seems strange. For example, I modified "example3"

          diff --git a/examples/example3.cc b/examples/example3.cc
          index 216acca..9734996 100644
          --- a/examples/example3.cc
          +++ b/examples/example3.cc
          @@ -20,7 +20,7 @@
          
           int main() {
             cctz::time_zone lax;
          -  load_time_zone("America/Los_Angeles", &lax);
          +  load_time_zone("PST", &lax);
          

          Then when I run it, there is an error:

          $ ./example3
          /usr/share/zoneinfo/PST: No such file or directory
          ...
          

          That file really doesn't exist on my system. We'll have to think about that use case more.

          2) There is a lock in loading time zones at https://github.com/google/cctz/blob/master/src/time_zone_impl.cc#L64 . That should be much easier to work around though. Maybe we can load the zone at startup or something like that.

          I suspect we'll be able to use this though, thanks for sharing.

          tagar_impala_e3b3 Ruslan Dautkhanov added a comment -

          Google's C++ library for time translations has solved this problem. See https://github.com/google/cctz

          In particular, for a cctz equivalent to localtime, use the cctz::BreakTime() function. For example, see https://github.com/google/cctz/blob/master/examples/example3.cc
          People on Stack Overflow say it doesn't have the global lock problem.

          tagar_impala_e3b3 Ruslan Dautkhanov added a comment -

          Facebook posted a $100 bounty to solve the localtime_r lock issue:
          https://www.bountysource.com/issues/1326487-localtime_r-etc-holds-lock-via-__tz_convert

          tagar_impala_e3b3 Ruslan Dautkhanov added a comment -

          casey, I am glad you were able to find the root cause.
          If this is a global lock, then the performance degradation will depend on how many cores/threads you have. In our case we have 48 "CPUs"/threads on some of the servers in the cluster:

          $ lscpu | head -9 | tail -6
          CPU(s): 48
          On-line CPU(s) list: 0-47
          Thread(s) per core: 2
          Core(s) per socket: 12
          Socket(s): 2
          NUMA node(s): 2

          This may explain the difference between your 11x and our ~30x slowdown?

          Thank you for providing an example of what you meant by trunc() as a workaround.
          Unfortunately, it won't work for us as we can't rewrite a third-party application.
          We started migrating data to a string datatype in YYYY-MM-DD format, as Impala implicitly converts dates
          in this format to timestamp for date functions etc.
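
          A rough sketch of that migration (table and column names are illustrative; to_date() returns the date part as a 'yyyy-MM-dd' string):

          -- one-time rewrite: the conversion cost is paid once here,
          -- and the new table carries no TIMESTAMP column at all
          create table parquet_table_string_dates stored as parquet as
          select bigint_attribute,
                 to_date(timestamp_col_4) as date_str
          from parquet_table_with_a_timestamp_attribute;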

          caseyc casey added a comment -

          So it turns out that localtime_r calls a libc function __tz_convert that takes a global lock.

          (lock being taken) https://sourceware.org/git/?p=glibc.git;a=blob;f=time/tzset.c;h=f65116ce24577db3aca8d3d368c162ac71a8cd13;hb=HEAD#l616

          (impala calling localtime_r) https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/runtime/timestamp-value.cc#L79

          We should look into providing the same functionality without the locking.

          I did a local experiment with a 10B row table with the same schema and saw an 11x slowdown on our perf cluster with 24 cores. I think the original 5x number was from my desktop which has 8 cores and I was also using a smaller data set for sure.

          Ruslan Dautkhanov, are you sure the truncation method won't work? I'm not sure what you mean about daylight saving time and the large range of values. For example, if you are in Pacific time, all your values should have an hour of either +7 or +8 depending on daylight saving; after truncation the values should be correct.

          Query: select timestamp_col_4 from table_1 limit 3
          +---------------------+
          | timestamp_col_4     |
          +---------------------+
          | 2011-08-18 07:00:00 |
          | 1996-08-18 07:00:00 |
          | 1996-03-20 08:00:00 |
          +---------------------+
          Fetched 3 row(s) in 0.26s
          
          Query: select trunc(timestamp_col_4, "DD") from table_1 limit 3
          +------------------------------+
          | trunc(timestamp_col_4, 'dd') |
          +------------------------------+
          | 2024-12-30 00:00:00          |
          | 1992-11-28 00:00:00          |
          | 2007-09-15 00:00:00          |
          +------------------------------+
          Fetched 3 row(s) in 0.25s
          

          I checked the truncation overhead and it should be very minimal.

          tagar_impala_e3b3 Ruslan Dautkhanov added a comment -

          casey, although I don't think it would work, as the offset is different depending on whether it's daylight saving time or not, and we have data spanning 20+ years.
          Thanks for looking into the performance issue.

          caseyc casey added a comment -

          Since your timestamps use the date part only, I think there is a faster workaround. You can add a fixed interval to the date to compensate for the maximum offset, then truncate the hours. Depending on where you are, you might be able to just truncate the value.

          Query: select trunc(now(), "DD")
          +---------------------+
          | trunc(now(), 'dd')  |
          +---------------------+
          | 2016-04-08 00:00:00 |
          +---------------------+
          Fetched 1 row(s) in 0.14s
          
          Query: select trunc(now() - interval 3 hours, "DD")
          +---------------------------------------+
          | trunc(now() - interval 3 hours, 'dd') |
          +---------------------------------------+
          | 2016-04-08 00:00:00                   |
          +---------------------------------------+
          Fetched 1 row(s) in 0.03s
          

          A view could be created so you don't have to type this every time you want to use the table.
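
          A minimal sketch of such a view, reusing the table/column names from the example above (whether an interval adjustment is needed depends on your timezone offset; for timezones west of UTC where only dates are stored, plain truncation already gives the right local date):

          create view table_1_dates as
          select trunc(timestamp_col_4, 'DD') as timestamp_col_4_day
          from table_1;

          -- queries then read the view instead of repeating the expression
          select timestamp_col_4_day from table_1_dates limit 3;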

          I'm still going to look into the performance issue though.

          tagar_impala_e3b3 Ruslan Dautkhanov added a comment -

          Notice that for 11 billion rows, there are only 16k distinct timestamp values.
          That's because we store only date parts (time is always 00:00:00), and there are not that many distinct days/dates.
          Just an idea: you could call TimestampValue::UtcToLocal() only once for each distinct value and cache the result in an in-memory map
          (it's not that much memory for 16k values).
          The number of TimestampValue::UtcToLocal() calls would drop by a factor of ~680k (11 billion / 16k).
          Since column-level stats are available, you could have thresholds like number_of_rows/number_of_distinct_timestamp_values > 100 (for example)
          and/or number_of_distinct_timestamp_values < 128k before using the in-memory map.
          Also, that map can be reused between different timestamp columns if there is more than one of them.
          It's a very common scenario where only the date part is stored in a timestamp column.
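
          A quick way to check whether a particular table would meet such thresholds, using the existing stats statements (table name taken from the description):

          compute stats parquet_table_with_a_timestamp_attribute;
          -- #Rows comes from the table stats, #Distinct Values from the column stats
          show table stats parquet_table_with_a_timestamp_attribute;
          show column stats parquet_table_with_a_timestamp_attribute;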

          tagar_impala_e3b3 Ruslan Dautkhanov added a comment -

          This is a rather narrow table, but 11 billion rows. See the attached "show column stats".
          There is just one timestamp column.

          caseyc casey added a comment -

          I'd have to investigate. Ruslan Dautkhanov, would you be able to provide the data? If not, maybe the schema with a rough idea of the data distribution?

          dhecht Dan Hecht added a comment -

          casey, over in IMPALA-2125 you said the overhead was 10x and reduced by that commit to 5x. Why is it 30x in this case?

          As for why the overhead is still seen even though the timestamp column is not being filtered on: that is because of IMPALA-2017.

          caseyc casey added a comment -

          Yes, that is it. Out of curiosity, how many timestamp columns does your table have? Since your query returns the values, the conversion needs to be done even though no filtering is done on a timestamp column.
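
          A simple way to confirm this on your side (a sketch reusing the query from the description): leave the timestamp column out of the select list; with no TIMESTAMP values materialized, the conversion should not run and the slowdown should disappear.

          select bigint_attribute
          from parquet_table_with_a_timestamp_attribute
          where bigint_attribute = 1000771658169;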

          tagar_impala_e3b3 Ruslan Dautkhanov added a comment -

          TimestampValue::UtcToLocal is probably the culprit of the slowness.


            People

            • Assignee:
              attilaj Attila Jeges
            • Reporter:
              tagar_impala_e3b3 Ruslan Dautkhanov