Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version: 5.0-alpha
- Component: None
Description
When pushing down a query, KE calculates the size of the data involved in order to set the number of Spark partitions. If there are too many files on HDFS, this calculation can take a long time to complete.
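As a rough illustration of why the data size matters, a pushdown engine can derive the shuffle-partition count from the total size of the files a query touches. The class name, method, and the 64 MB-per-partition heuristic below are illustrative assumptions, not Kylin's actual implementation:

```java
// Hypothetical sketch: derive a Spark shuffle-partition count from the
// total input size, scaled by a configurable multiple.
public class ShufflePartitionEstimator {
    // Assumed heuristic: aim for roughly 64 MB of input per partition.
    static final long BYTES_PER_PARTITION = 64L * 1024 * 1024;

    // Returns at least `multiple` partitions, growing with the input size.
    static int estimatePartitions(long totalBytes, int multiple) {
        long base = Math.max(1, totalBytes / BYTES_PER_PARTITION);
        return (int) Math.min(Integer.MAX_VALUE, base * multiple);
    }

    public static void main(String[] args) {
        // 640 MB of input with a multiple of 3 -> 10 * 3 = 30 partitions.
        System.out.println(estimatePartitions(640L * 1024 * 1024, 3));
    }
}
```

The expensive part is obtaining totalBytes, which requires listing and stat-ing every file on HDFS; that is the step the changes below bound.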
To improve this situation, the following changes will be made:
- Use a limited thread pool to calculate the data size.
- Add a timeout for the calculation, so that a slow calculation stops blocking the query as soon as possible.
- Add two new properties:
  - kylin.query.pushdown.auto-set-shuffle-partitions-multiple=3: the multiple applied when setting the Spark partition number, 3 by default.
  - kylin.query.pushdown.auto-set-shuffle-partitions-timeout=30: the maximum time allowed for calculating the data size used to adjust the Spark partition number, 30 seconds by default.
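The combination of a bounded pool and a timeout can be sketched as below. The pool size, the fallback value, and the summing logic are illustrative assumptions, not Kylin's exact code:

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch: compute per-file sizes on a limited thread pool and give up
// after a timeout, so the query is never blocked indefinitely.
public class DataSizeCalculator {
    // Limited pool: at most 4 concurrent size lookups (assumed value).
    private static final ExecutorService POOL = Executors.newFixedThreadPool(4);

    // Sums the sizes returned by the per-file tasks. Returns -1 if the
    // calculation times out or fails, signalling the caller to fall back
    // to the default Spark partition number.
    static long totalSize(List<Callable<Long>> fileSizeTasks, long timeoutSeconds) {
        try {
            // invokeAll cancels any task still running when the timeout expires.
            List<Future<Long>> futures =
                    POOL.invokeAll(fileSizeTasks, timeoutSeconds, TimeUnit.SECONDS);
            long total = 0;
            for (Future<Long> f : futures) {
                total += f.get(); // throws CancellationException if timed out
            }
            return total;
        } catch (InterruptedException | ExecutionException | CancellationException e) {
            return -1; // timed out or failed: use the default partition number
        }
    }

    public static void main(String[] args) {
        List<Callable<Long>> tasks = Arrays.asList(() -> 100L, () -> 200L);
        System.out.println(totalSize(tasks, 30)); // prints 300
        POOL.shutdown();
    }
}
```

Returning a sentinel on timeout rather than throwing keeps the query path simple: the caller falls back to the default partition number instead of failing the query.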
After these changes, the query can be expected to complete within a bounded duration.