SPARK-36786

SPIP: Improving the compile time performance, by improving a couple of rules, from 24 hrs to under 8 minutes



    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Versions: 2.4.1, 3.1.2
    • Fix Versions: None
    • Components: SQL


      Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.

      The aim is to improve the compile-time performance of a query which, in WorkDay's use case, takes more than 24 hours (and eventually fails), to under 8 minutes.

      To explain the problem, I will first provide some context.

      The query plan in our production system is huge, with nested CASE WHEN expressions (the level of nesting can exceed 8), where each CASE WHEN can sometimes have more than 1000 branches.

      The plan could look like

        Filter 1
      Now the optimizer has a batch of rules, intended to run at most 100 times.

      Also note that the batch will continue to run until one of two conditions is satisfied,

      i.e. either numIter == 100 or inputPlan == outputPlan (idempotency is achieved).
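      The fixed-point loop described above can be sketched as follows. This is an illustrative toy model, not Catalyst itself: `Plan` and `Rule` are stand-ins for `LogicalPlan` and `Rule[LogicalPlan]`, and the names (`FixedPointSketch`, `dropOneNegative`, `dropAllNegatives`) are hypothetical. It shows why a rule that makes one small change per pass (like pushing one filter per iteration) needs many iterations to reach idempotency, while a single-pass variant converges in two.

```scala
// Toy sketch of Catalyst's FixedPoint batch execution (not the real API).
object FixedPointSketch {
  type Plan = List[Int]     // stand-in for a query plan
  type Rule = Plan => Plan  // stand-in for a Rule[LogicalPlan]

  // Run the batch until the plan stops changing or maxIter is hit
  // (Catalyst's default strategy is FixedPoint(100)).
  def executeBatch(rules: Seq[Rule], plan: Plan, maxIter: Int = 100): (Plan, Int) = {
    var current = plan
    var iter = 0
    var changed = true
    while (changed && iter < maxIter) {
      val next = rules.foldLeft(current)((p, r) => r(p))
      changed = next != current // idempotency check: inputPlan == outputPlan
      current = next
      iter += 1
    }
    (current, iter)
  }

  // One-at-a-time rule: removes a single negative per pass,
  // analogous to pushing one filter per iteration.
  val dropOneNegative: Rule = p => {
    val i = p.indexWhere(_ < 0)
    if (i >= 0) p.patch(i, Nil, 1) else p
  }

  // Single-pass rule: handles every negative in one traversal.
  val dropAllNegatives: Rule = p => p.filter(_ >= 0)
}
```

      With three "filters" to push, the one-at-a-time rule needs four iterations (three changes plus one confirming pass), while the single-pass variant needs only two; with 50 it would be 51 versus 2.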

      One of the early rules is PushDownPredicate, followed by CollapseProject.


      The first issue is the PushDownPredicate rule.

      It picks one filter at a time and pushes it to the lowest possible level (I understand that in 3.1 it pushes through joins, while in 2.4 it stops at a join); in either case it picks one filter at a time, starting from the top, in each iteration.

      The above is no longer entirely true in the 3.1 release, as the rule now combines filters, so it does push all the encountered filters in a single pass. But it still materializes the filter on each push by re-aliasing.

      So if there are, say, 50 Projects interspersed with Filters, idempotency is guaranteed not to be achieved until around 49 iterations. Moreover, CollapseProject will also modify the tree on each iteration, as a filter gets removed from within a Project.

      Moreover, on each movement of a filter through the Project tree, the filter is re-aliased using a transformUp rule. transformUp is very expensive compared to transformDown. As the filter keeps getting pushed down, its size increases.
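      The re-aliasing cost can be sketched with a toy expression tree (the types and names here, `Expr`, `Realias.substitute`, are illustrative stand-ins, not Catalyst classes). When a Filter is pushed below a Project, every attribute in the filter condition that the Project aliases must be replaced by the aliased expression, which requires visiting every node of the condition; and the condition grows on every push, so each subsequent traversal is more expensive.

```scala
// Toy expression tree standing in for Catalyst's Expression hierarchy.
sealed trait Expr
case class Attr(name: String) extends Expr
case class Add(l: Expr, r: Expr) extends Expr
case class Gt(l: Expr, r: Expr) extends Expr
case class Lit(v: Int) extends Expr

object Realias {
  // Replace aliased attributes with their defining expressions,
  // visiting every node (analogous to a transformUp pass).
  def substitute(e: Expr, aliases: Map[String, Expr]): Expr = e match {
    case Attr(n)   => aliases.getOrElse(n, e)
    case Add(l, r) => Add(substitute(l, aliases), substitute(r, aliases))
    case Gt(l, r)  => Gt(substitute(l, aliases), substitute(r, aliases))
    case Lit(_)    => e
  }

  // Node count of an expression tree.
  def size(e: Expr): Int = e match {
    case Add(l, r) => 1 + size(l) + size(r)
    case Gt(l, r)  => 1 + size(l) + size(r)
    case _         => 1
  }
}
```

      Pushing the filter `a > 5` through a Project with `a := b + 1` and then one with `b := c + 1` grows the condition from 3 nodes to 5 to 7; with deeply nested CASE WHEN conditions and dozens of Projects, this growth is what makes the repeated traversals so expensive.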

      To optimize this rule, two things are needed:

      1. Instead of pushing one filter at a time, collect all the filters as we traverse the tree in that same iteration.
      2. Do not re-alias the filters on each push. Collect the sequence of Projects the filter has passed through, and when the filters have reached their resting place, do the re-aliasing by processing the collected Projects in a down-to-up manner.

      This will result in achieving idempotency in a couple of iterations. 
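      The deferred re-aliasing idea can be sketched in the same toy spirit (this is my reading of the proposal, not the actual PR; `Cond`, `Aliases`, and the function names are hypothetical). A condition is modeled as a bag of attribute references, and a Project as a map from an alias to the references it expands to. The eager variant rewrites the (growing) condition at every Project; the deferred variant composes the small alias maps first and touches the condition only once, at the filter's resting place.

```scala
// Toy comparison of eager vs. deferred re-aliasing (not Catalyst code).
object DeferredRealias {
  type Cond = Vector[String]                 // filter condition as attribute refs
  type Aliases = Map[String, Vector[String]] // alias -> refs it expands to

  // Eager: rewrite the growing condition at every Project it passes.
  // Returns the final condition and the number of refs visited.
  def pushEager(cond: Cond, projects: List[Aliases]): (Cond, Int) =
    projects.foldLeft((cond, 0)) { case ((c, visited), aliases) =>
      (c.flatMap(a => aliases.getOrElse(a, Vector(a))), visited + c.size)
    }

  // Deferred (proposed): compose the alias maps without touching the
  // condition, then rewrite the original condition once at the end.
  def pushDeferred(cond: Cond, projects: List[Aliases]): (Cond, Int) = {
    var visited = 0
    val composed = projects.reduceLeftOption { (acc, m) =>
      acc.map { case (k, refs) =>
        visited += refs.size
        k -> refs.flatMap(a => m.getOrElse(a, Vector(a)))
      }
    }.getOrElse(Map.empty)
    visited += cond.size
    (cond.flatMap(a => composed.getOrElse(a, Vector(a))), visited)
  }
}
```

      For a condition referencing `a` twice pushed through `a := (b, b)`, `b := (c, c)`, `c := (d, d)`, both variants produce the same final condition, but the eager version visits 14 refs against 8 for the deferred one; the gap widens with condition size and Project depth, since the eager cost is paid on the full condition at every level.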

      How reducing the number of iterations helps performance

      There are many rules, such as NullPropagation, OptimizeIn, and SimplifyConditionals (there are around 6 more such rules), which traverse the tree using transformUp; they run unnecessarily in each iteration, even when the expressions in an operator have not changed since the previous runs.

      I have a separate proposal, which I will share later, on how to avoid running the above rules unnecessarily when it can be guaranteed that the expressions in an operator are not going to mutate.

      The above has been identified as the cause of our huge compilation time.

      Q2. What problem is this proposal NOT designed to solve?

      It is not going to change any runtime profile.

      Q3. How is it done today, and what are the limits of current practice?

      As mentioned above, currently PushDownPredicate pushes one filter at a time, and at each Project it materializes the re-aliased filter. This results in a large number of iterations to achieve idempotency. The immediate materialization of the filter after each Project pass also causes unnecessary traversals of the filter's expression tree, using the expensive transformUp, and that expression tree is bound to keep growing as the filter is pushed down.

      Q4. What is new in your approach and why do you think it will be successful?

      In the new approach we push all the filters down in a single pass, and do not materialize a filter as it passes through a Project. Instead, we keep collecting the Projects in sequential order and materialize the final filter once its final position is reached (above a join in the case of 2.4, or above the base relation, etc.).

      This approach, when coupled with the logic for identifying those Project operators whose expressions will not mutate (which I will share later), so that rules like NullPropagation, OptimizeIn, and SimplifyConditionals are applied only in the first pass on the expressions of that Project operator, has reduced the compilation time of the offending queries from 24 hours or more to under 8 minutes.


      Q5. Who cares? If you are successful, what difference will it make?

      For my company, WorkDay, it will fix the currently failing plans, which hit OOM and compilation times running into 24 hours or so. I have a PR for this locally and will publish it in some time.


      Q6. What are the risks?

      The risk in the change to PushDownPredicate is very low.
      The follow-up proposal, identifying Project operators whose expressions will be immutable so that the above set of rules runs only once, involves relatively complex logic, but with extra test coverage it should be safe.

      Q7. How long will it take?

      The basic changes are already in place. Tests will take time: around 10-15 days.

      Q8. What are the mid-term and final “exams” to check for success?

      All tests should pass.
      The perf benefit should justify the changes.




            Assignee: Unassigned
            Reporter: ashahid7 (Asif)