Details
-
Improvement
-
Status: In Progress
-
Major
-
Resolution: Unresolved
-
None
-
None
Description
HiveServer spends a lot of time serializing configuration objects. We can skip putting hadoop and tez config xml files in payload assuming that the configs are the same on both HS and Task side. This depends on Tez to load local xml configs when creating config objects https://issues.apache.org/jira/browse/TEZ-4137
Ideally we should be able to skip hive-site.xml too. However, if we skip hive-site.xml at that stage, then we make wrong choices at tez dag build stage due to missing configs.
In the ideal version of this, we should not be both looking up configs and putting new configs from and to the same config object at DAG and Vertex build phases. Instead we should be looking up from a HS2's HiveConf object and writing to a brand new JobConf for each vertex. That way we would not have any unnecessary item in the jobconf for any vertex. However Dag and Vertex build stages (TezTask#build) and a lot of other components called from there treat a single config object both the source of HS2 side config and the target JobConf that they are putting vertex level options into. It is very hard to separate these concerns now.
With this patch, we are reducing the size of JobConf (per vertex) by ~65%. It should improve the transmit latency. However, most significant gains are at CPU time while compressing job configs as the config objects are much smaller now.
Attachments
Attachments
Issue Links
- depends upon
-
TEZ-4137 Input/Output/Processor should merge payload to local conf
- Resolved
- links to