I want to add complete cost-based optimization to Hive, but when I started the work I found it very difficult to do within the current Hive optimization framework. In the current Hive code, all optimizations are applied after the operator DAG has been generated. I think this is an awful design, and it is driving me mad. Take the map-side optimization as an example: it scans the whole operator DAG, tries to find the operators that can be replaced by map-side operations, and then replaces them. The code is terrible! It runs to about 1000 lines, and it implements only the map-side optimizations!
In my opinion, optimization shouldn't be a separate step; different optimizations should be applied at the appropriate time. For example, join reordering should be done when we parse the input query, so that we can generate Map-Reduce operators or only a Map operator for each join according to the cost estimate. In the same process we can merge joins and aggregations, and we should push down predicates at the proper time and build a proper data structure for them, to ensure that the cost-estimation module can fetch the predicates of each base table when estimating JOIN cost. How concise and graceful the code would be if we did the optimization this way! But now, in order to comply with Hive's Optimizer framework, I have to write lots of ugly code with an amazing amount of redundancy, and the code is very difficult to debug. There is now a patch for a cost-based JOIN reorder and merge optimizer called YSMART, and I have glanced at it. It takes 6000+ lines of code and is difficult to read, and its optimization is incomplete.
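To make the idea concrete, here is a minimal sketch in Python. It is not Hive code. The `Table` class, the `plan_joins` function, the map-join threshold, and the uniform join selectivity are all hypothetical names and numbers chosen for illustration. It assumes predicates have already been pushed down to each base table and summarized as a selectivity, so the cost estimator can read the filtered size directly while it orders the joins and picks a map-side or reduce-side join for each step.

```python
from dataclasses import dataclass

# Hypothetical constants, not Hive settings.
MAP_JOIN_THRESHOLD = 100_000   # max rows assumed to fit in a mapper's memory
JOIN_SELECTIVITY = 0.001       # assumed uniform selectivity of every join


@dataclass(frozen=True)
class Table:
    name: str
    rows: int           # raw row count from table statistics
    selectivity: float  # fraction of rows kept by the pushed-down predicates

    @property
    def filtered_rows(self) -> float:
        # Size of the table after its own predicates are applied.
        return self.rows * self.selectivity


def plan_joins(tables):
    """Greedy left-deep ordering: join the smallest filtered inputs first.

    Returns (plan, total_cost), where plan is a list of
    (left_name, right_name, strategy) tuples.
    """
    remaining = sorted(tables, key=lambda t: t.filtered_rows)
    first = remaining.pop(0)
    left_name, left_rows = first.name, first.filtered_rows
    plan, total_cost = [], 0.0

    for right in remaining:
        right_rows = right.filtered_rows
        if min(left_rows, right_rows) <= MAP_JOIN_THRESHOLD:
            # Small side is broadcast; only the big side is scanned.
            strategy = "map-join"
            step_cost = max(left_rows, right_rows)
        else:
            # Both sides must be shuffled to the reducers.
            strategy = "reduce-join"
            step_cost = left_rows + right_rows
        plan.append((left_name, right.name, strategy))
        total_cost += step_cost
        # The join result becomes the left input of the next step.
        left_name = f"({left_name} JOIN {right.name})"
        left_rows = left_rows * right_rows * JOIN_SELECTIVITY

    return plan, total_cost


tables = [
    Table("lineitem", 60_000_000, 1.0),
    Table("orders", 10_000_000, 0.5),
    Table("customers", 40_000, 0.25),
    Table("nation", 25, 1.0),
]
plan, cost = plan_joins(tables)
for left, right, strategy in plan:
    print(f"{left} JOIN {right}: {strategy}")
```

The point of the sketch is only that ordering, strategy selection, and predicate-aware cost estimation can live in one pass over the query, instead of being recovered later by rewriting an already generated operator DAG. A real optimizer would need per-join selectivities and a less naive search than this greedy ordering.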
The optimizer architecture of Hive is terrible. What can I do now?
|Labels||architecture optimizer ysmart||architecture optimizer|