Details
Type: Improvement
Status: Resolved
Priority: Trivial
Resolution: Fixed
-
2.1.0
-
None
Description
Setting spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 can speed up HadoopMapReduceCommitProtocol#commitJob when a job writes many output files.
For example, it saved about 11 minutes on the following query, which wrote 216869 output files:
CREATE TABLE tmp.spark_20107 AS
SELECT
  category_id,
  product_id,
  track_id,
  concat(substr(ds, 3, 2), substr(ds, 6, 2), substr(ds, 9, 2)) shortDate,
  CASE
    WHEN actiontype = '0' THEN 'browse'
    WHEN actiontype = '1' THEN 'fav'
    WHEN actiontype = '2' THEN 'cart'
    WHEN actiontype = '3' THEN 'order'
    ELSE 'invalid actio'
  END AS type
FROM tmp.user_action
WHERE ds > date_sub('2017-01-23', 730)
  AND actiontype IN ('0','1','2','3');
$ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l
216870
We should add this option to configuration.md.
This option is supported by Cloudera's Hadoop 2.6.0-cdh5.4.0 and later (see cloudera/hadoop-common@1c12361 and cloudera/hadoop-common@16b2de2) and by Apache Hadoop 2.7.0 and later.
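As a rough sketch of how the option described above might be applied, the snippet below appends it to spark-defaults.conf. The file location is an assumption (it falls back to ./conf when SPARK_CONF_DIR is unset), and the Hadoop version in use must support the setting.

```shell
# Assumed location of the Spark defaults file; adjust for your installation.
SPARK_CONF_FILE="${SPARK_CONF_DIR:-./conf}/spark-defaults.conf"

# Create the conf directory if it does not exist yet.
mkdir -p "$(dirname "$SPARK_CONF_FILE")"

# spark.hadoop.* properties are passed through to the Hadoop configuration,
# so this selects version 2 of the FileOutputCommitter algorithm.
echo "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2" >> "$SPARK_CONF_FILE"
```

For a single job, the same property can instead be passed on the command line with `--conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2`.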