Spark / SPARK-20107

Add spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version option to configuration.md

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 2.1.0
    • Fix Version/s: 2.2.0
    • Component/s: Documentation
    • Labels: None

    Description

      Setting spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 can speed up HadoopMapReduceCommitProtocol#commitJob when a job writes many output files.

      For example, it saved 11 minutes on the following query, which produces 216869 output files:

      CREATE TABLE tmp.spark_20107 AS SELECT
        category_id,
        product_id,
        track_id,
        concat(
          substr(ds, 3, 2),
          substr(ds, 6, 2),
          substr(ds, 9, 2)
        ) shortDate,
        CASE
          WHEN actiontype = '0' THEN 'browse'
          WHEN actiontype = '1' THEN 'fav'
          WHEN actiontype = '2' THEN 'cart'
          WHEN actiontype = '3' THEN 'order'
          ELSE 'invalid actio'
        END AS type
      FROM
        tmp.user_action
      WHERE
        ds > date_sub('2017-01-23', 730)
      AND actiontype IN ('0','1','2','3');
      
      $ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l
      216870
      

      We should add this option to configuration.md.

      This improvement is supported by Cloudera's Hadoop 2.6.0-cdh5.4.0 and later (see cloudera/hadoop-common@1c12361 and cloudera/hadoop-common@16b2de2) and by Apache Hadoop 2.7.0 and later.
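
      As a minimal sketch of how the option described above might be set, assuming the standard conf/spark-defaults.conf file is in use: the spark.hadoop. prefix causes Spark to copy the property into the Hadoop configuration used by the job. The same key can also be passed for a single job with spark-submit --conf. The comments are illustrative and are not the wording proposed for configuration.md.

      ```
      # conf/spark-defaults.conf
      # Use version 2 of the FileOutputCommitter algorithm.
      # Assumption: the cluster runs a Hadoop version that supports it.
      spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version  2
      ```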

          People

            yumwang Yuming Wang