[SPARK-20107] Add spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version option to configuration.md - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Trivial
Resolution: Fixed
Affects Version/s: 2.1.0
Fix Version/s: 2.2.0
Component/s: Documentation
Labels: None

Description

Setting spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 can speed up HadoopMapReduceCommitProtocol#commitJob when a job writes many output files.

For example, it saved 11 minutes on the following job, which writes 216869 output files:

CREATE TABLE tmp.spark_20107 AS SELECT
  category_id,
  product_id,
  track_id,
  concat(
    substr(ds, 3, 2),
    substr(ds, 6, 2),
    substr(ds, 9, 2)
  ) shortDate,
  CASE
    WHEN actiontype = '0' THEN 'browse'
    WHEN actiontype = '1' THEN 'fav'
    WHEN actiontype = '2' THEN 'cart'
    WHEN actiontype = '3' THEN 'order'
    ELSE 'invalid actio'
  END AS type
FROM
  tmp.user_action
WHERE
  ds > date_sub('2017-01-23', 730)
AND actiontype IN ('0','1','2','3');

$ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l
216870

We should add this option to configuration.md.
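
As an illustration of how the documented option might be applied, here is a minimal sketch. It assumes the stock conf/spark-defaults.conf file is in use; Spark copies properties prefixed with spark.hadoop. into the Hadoop configuration that the committer reads. The file location and comment lines are illustrative and are not taken from the pull request.

```
# conf/spark-defaults.conf (assumed location; adjust for your deployment)
# Select version 2 of the FileOutputCommitter commit algorithm.
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version  2
```

The same property can also be set per job, for example by passing --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 to spark-submit.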

This option is supported by Cloudera's Hadoop 2.6.0-cdh5.4.0 and later (see cloudera/hadoop-common@1c12361 and cloudera/hadoop-common@16b2de2) and by Apache Hadoop 2.7.0 and later.

Issue Links

links to

[Github] Pull Request #17442 (wangyum)

Speed up FileOutputCommitter#commitJob for many output files

Update FileOutputCommitter.FILEOUTPUTCOMMITTER_ALGORITHM_VERSION_DEFAULT to match mapred-default.xml

People

Assignee: Yuming Wang

Reporter: Yuming Wang

Votes: 0

Watchers: 7

Dates

Created: 27/Mar/17 13:00

Updated: 04/Jul/17 17:40

Resolved: 30/Mar/17 09:40