Details
- Improvement
- Status: Patch Available
- Major
- Resolution: Unresolved
- 3.0.0, 4.0.0
- None
- None
Description
The Hive org.apache.hadoop.hive.ql.exec.Utilities.java file has taken on a life of its own. We should consider separating out the various components into their own classes. For this ticket, I propose separating out the getInputSummary functionality into its own class.
There are several issues with the current implementation:
- It is synchronized. Only one query can get file input summary at a time. For a query which deals with a large data set with a large number of files, this can block other queries for a long period of time. This is especially painful when most queries use a small data set, but a large data set is submitted on occasion.
- For each query, time is spent setting up and tearing down a thread pool.
- It uses deprecated code
I propose breaking it out into its own class and creating a single thread pool that all queries pull from. This way, the bottleneck will be the number of available threads, not a single query. If a big query is running and a small query is also submitted, the smaller query will still be able to proceed.
With regard to setup and teardown: if a query uses 15 threads to perform this summary action and then finishes, it tears down those threads, and the next query may immediately create 15 new threads for processing. With a single shared pool, the threads are created once and reused, so no time is spent on repeated setup and teardown.
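The proposal above can be sketched roughly as follows. This is only an illustration of the shared-pool idea, not the actual Hive code or the attached patch: the class name `SharedInputSummaryPool`, the method `totalLength`, and the pool size of 15 (taken from the example above) are all assumptions, and each `Callable<Long>` stands in for whatever per-path work computes a content summary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical class; the name and API are assumptions for illustration only.
final class SharedInputSummaryPool {

  // Assumed pool size, mirroring the "15 threads" example in the description.
  private static final int POOL_SIZE = 15;

  // One pool for the whole process: created once, reused by every query.
  // Daemon threads so the idle pool does not keep the JVM alive.
  private static final ExecutorService POOL =
      Executors.newFixedThreadPool(POOL_SIZE, r -> {
        Thread t = new Thread(r, "input-summary-worker");
        t.setDaemon(true);
        return t;
      });

  private SharedInputSummaryPool() {
  }

  // Deliberately NOT synchronized: several queries may call this at once.
  // Their per-path tasks share the pool's queue, so a small query only waits
  // for free threads, not for a large query to finish completely.
  static long totalLength(List<Callable<Long>> perPathTasks)
      throws InterruptedException, ExecutionException {
    List<Future<Long>> futures = new ArrayList<>(perPathTasks.size());
    for (Callable<Long> task : perPathTasks) {
      futures.add(POOL.submit(task));
    }
    long total = 0L;
    for (Future<Long> future : futures) {
      total += future.get(); // blocks until this path's summary is ready
    }
    return total;
  }
}
```

Note that a plain FIFO queue still lets a large query place many tasks ahead of a small one; the gain is that the small query no longer waits on a global lock and no pool is created or destroyed per query.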
Attachments
Issue Links
- relates to
  - HIVE-21071 Improve getInputSummary (Closed)
- supersedes
  - HIVE-20395 Parallelize files move in the ql.metadata.Hive#replaceFiles (Resolved)