Some queries are very slow to compile. For example, the following query
takes around 120 seconds to compile in Hive 1.1 when
and Hive is not in test mode.
All of the tables above have a single partition column, and all of them are empty. If the tables are not empty, compilation is reportedly so slow that Hive appears to hang.
In Hive 2.0, compilation is much faster: explain takes 6.6 seconds. That is still a long time. One of the problems that slows predicate pushdown (ppd) is that the list in pushdownPreds can grow very large, which makes extractPushdownPreds perform badly:
While running the query above, at the following breakpoint preds has a size of 12051, and most entries in the list are the same expression: GenericUDFOPEqual(Column[hdp_databaseid], Const int 102), GenericUDFOPEqual(Column[hdp_databaseid], Const int 102), GenericUDFOPEqual(Column[hdp_databaseid], Const int 102), GenericUDFOPEqual(Column[hdp_databaseid], Const int 102), ....
The following code in extractPushdownPreds clones all the nodes in preds and then walks them. Hive 2.0 is faster because
HIVE-11652 (and other JIRAs) make startWalking much faster, but we still clone thousands of nodes that hold the same expression. Should we store so many identical predicates in the list, or is one copy enough?
Should we change the following methods in java/org/apache/hadoop/hive/ql/ppd/ExprWalkerInfo.java
public void addFinalCandidate(String alias, ExprNodeDesc expr)
public void addPushDowns(String alias, List<ExprNodeDesc> pushDowns)
to add an expr only if it is not already in the pushdown list for that alias?
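
To make the proposal concrete, here is a minimal standalone sketch of the de-duplication idea. It is not the actual Hive code: Expr, EqualsConst and PushDownDedup are hypothetical stand-ins for ExprNodeDesc and ExprWalkerInfo, and the sketch assumes that two expressions can be compared structurally through an isSame-style method (ExprNodeDesc appears to offer isSame for this, but that should be verified against the Hive version being patched).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for ExprWalkerInfo; only the de-duplication logic is shown.
class PushDownDedup {

    // Hypothetical stand-in for ExprNodeDesc: structural comparison only.
    interface Expr {
        boolean isSame(Object other);
    }

    // Hypothetical expression of the form "column = constant".
    static final class EqualsConst implements Expr {
        final String column;
        final int value;

        EqualsConst(String column, int value) {
            this.column = column;
            this.value = value;
        }

        @Override
        public boolean isSame(Object other) {
            if (!(other instanceof EqualsConst)) {
                return false;
            }
            EqualsConst o = (EqualsConst) other;
            return column.equals(o.column) && value == o.value;
        }
    }

    // alias -> predicates to push down, kept free of structural duplicates
    private final Map<String, List<Expr>> pushdownPreds = new HashMap<>();

    // Add a single predicate unless an identical one is already stored for the alias.
    void addFinalCandidate(String alias, Expr expr) {
        List<Expr> existing = pushdownPreds.computeIfAbsent(alias, k -> new ArrayList<>());
        for (Expr e : existing) {
            if (e.isSame(expr)) {
                return; // identical predicate already present, skip it
            }
        }
        existing.add(expr);
    }

    // Add a batch of predicates, skipping the ones that are already stored.
    void addPushDowns(String alias, List<Expr> pushDowns) {
        for (Expr expr : pushDowns) {
            addFinalCandidate(alias, expr);
        }
    }

    List<Expr> getPushDowns(String alias) {
        return pushdownPreds.getOrDefault(alias, new ArrayList<>());
    }

    public static void main(String[] args) {
        PushDownDedup info = new PushDownDedup();
        List<Expr> preds = new ArrayList<>();
        for (int i = 0; i < 12051; i++) {
            preds.add(new EqualsConst("hdp_databaseid", 102));
        }
        info.addPushDowns("t1", preds);
        System.out.println(info.getPushDowns("t1").size()); // prints 1
    }
}
```

The linear scan costs one comparison per distinct predicate already stored, so it stays cheap as long as the number of distinct predicates per alias is small, which is the case in the list shown above. With such a check in place, extractPushdownPreds would clone and walk one node per distinct predicate instead of thousands of copies.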