Details
- Bug
- Status: Closed
- Minor
- Resolution: Fixed
- 0.12.0
- None
- Ubuntu LXC 12.10
Avoid creating new HiveConf() within the FetchOperator loop
Description
While looking at log files for SMB joins in Hive, I noticed that the actual join operator accounted for only a small fraction of the time spent. Most of the time was spent parsing configuration files.
To confirm, I put log lines in the HiveConf constructor and eventually made the following edit to the code:
--- ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java
+++ ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java
@@ -648,8 +648,7 @@ public ObjectInspector getOutputObjectInspector() throws HiveException {
    * @return list of file status entries
    */
   private FileStatus[] listStatusUnderPath(FileSystem fs, Path p) throws IOException {
-    HiveConf hiveConf = new HiveConf(job, FetchOperator.class);
-    boolean recursive = hiveConf.getBoolVar(HiveConf.ConfVars.HADOOPMAPREDINPUTDIRRECURSIVE);
+    boolean recursive = false;
     if (!recursive) {
       return fs.listStatus(p);
     }
I then re-ran my query to compare timings.
| | Before | After |
|---|---|---|
| Cumulative CPU | 731.07 sec | 386.0 sec |
| Total time | 347.66 sec | 218.855 sec |
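To put the table in relative terms, the reduction can be worked out as (before - after) / before. The sketch below does only that arithmetic on the four numbers reported above. The class and method names are made up for illustration and are not part of Hive.

```java
import java.util.Locale;

public class TimingReduction {
    // Relative reduction in percent: 100 * (before - after) / before.
    static double percentReduction(double before, double after) {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) {
        // Values copied from the Before/After table in this report.
        double cpu = percentReduction(731.07, 386.0);
        double wall = percentReduction(347.66, 218.855);
        System.out.println(String.format(Locale.ROOT, "Cumulative CPU reduction: %.1f%%", cpu));
        System.out.println(String.format(Locale.ROOT, "Total time reduction: %.1f%%", wall));
    }
}
```

This works out to roughly a 47% drop in cumulative CPU and a 37% drop in total time for this one run.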
The query used was:
INSERT OVERWRITE LOCAL DIRECTORY
'/grid/0/smb/'
select inv_item_sk
from
inventory inv
join store_sales ss on (ss.ss_item_sk = inv.inv_item_sk)
limit 100000
;
The query ran on a scale=2 TPC-DS data set, where both store_sales and inventory are bucketed into 4 buckets, with store_sales split into 7 partitions and inventory into 261 partitions.
78% of all CPU time was spent within new HiveConf(). The YourKit profiler runs are attached.
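The experiment above hardcodes the flag to measure the cost. A less invasive option would be to read the flag once and reuse it on every call. The sketch below shows that general pattern only and is not the committed patch. `ExpensiveConf` is a hypothetical stand-in for a configuration object that is costly to construct, `listPath` is a placeholder, and the path count of 1000 is arbitrary.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ConfHoistSketch {
    // Hypothetical stand-in for a configuration object that is costly to build.
    static final class ExpensiveConf {
        static final AtomicInteger CONSTRUCTIONS = new AtomicInteger();
        private final boolean recursive;

        ExpensiveConf(boolean recursive) {
            // A real configuration class might parse resource files here.
            CONSTRUCTIONS.incrementAndGet();
            this.recursive = recursive;
        }

        boolean isRecursive() {
            return recursive;
        }
    }

    // Placeholder for the per-path work; the sketch only counts constructions.
    static void listPath(int pathIndex, boolean recursive) {
    }

    // Pattern from the report: a new configuration object for every path listed.
    static int constructionsPerCall(int paths) {
        ExpensiveConf.CONSTRUCTIONS.set(0);
        for (int i = 0; i < paths; i++) {
            boolean recursive = new ExpensiveConf(false).isRecursive();
            listPath(i, recursive);
        }
        return ExpensiveConf.CONSTRUCTIONS.get();
    }

    // Hoisted variant: build the configuration once, reuse the flag in the loop.
    static int constructionsHoisted(int paths) {
        ExpensiveConf.CONSTRUCTIONS.set(0);
        final boolean recursive = new ExpensiveConf(false).isRecursive();
        for (int i = 0; i < paths; i++) {
            listPath(i, recursive);
        }
        return ExpensiveConf.CONSTRUCTIONS.get();
    }

    public static void main(String[] args) {
        System.out.println("per call: " + constructionsPerCall(1000));
        System.out.println("hoisted:  " + constructionsHoisted(1000));
    }
}
```

With the per-call variant the construction count grows with the number of paths listed. The hoisted variant constructs the object once regardless of how many paths there are.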
Attachments
Issue Links
- relates to HADOOP-9570 Configuration.addResource() should only parse the new resource (Patch Available)