SPARK-22233

filter out empty InputSplit in HadoopRDD

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.3.0
    • Component/s: Spark Core
    • Labels:
      None
    • Environment:

      Spark version: 2.2
      master: yarn
      deploy-mode: cluster

      Description

      Sometimes Hive creates an empty table that is backed by many empty files. Spark uses the InputFormat stored in the Hive Metastore, which does not combine the empty files, so Spark generates many tasks just to handle these empty files.
      Hive uses CombineHiveInputFormat (hive.input.format) by default.
      As a result, Spark spends far more resources than Hive in this case.

      Two suggestions:
      1. Add a configuration to filter out empty InputSplits in HadoopRDD.
      2. Add a configuration that lets the user customize the InputFormat class in HadoopTableReader.
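
      The first suggestion can be illustrated with a minimal, self-contained sketch. It is not the actual Spark change: `Split` is a hypothetical stand-in for a Hadoop InputSplit (only its length matters here), and `selectSplits` and the `ignoreEmpty` flag are illustrative names for a configuration-gated filter that would run before partitions are created.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class EmptySplitFilter {
    // Hypothetical stand-in for a Hadoop InputSplit: a file path and a byte length.
    static final class Split {
        final String path;
        final long length;

        Split(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    // When ignoreEmpty is true, drop zero-length splits so no task is planned for them.
    // When it is false, return the splits unchanged (the behavior described in the issue).
    static List<Split> selectSplits(List<Split> all, boolean ignoreEmpty) {
        if (!ignoreEmpty) {
            return all;
        }
        return all.stream()
                  .filter(s -> s.length > 0)
                  .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Split> splits = Arrays.asList(
            new Split("part-00000", 0L),
            new Split("part-00001", 128L),
            new Split("part-00002", 0L));

        System.out.println("tasks without filter: " + selectSplits(splits, false).size());
        System.out.println("tasks with filter:    " + selectSplits(splits, true).size());
    }
}
```

      Gating the filter behind a flag keeps the existing behavior as the default, so jobs that rely on one partition per input file are not affected unless the user opts in.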

            People

            • Assignee:
              liutang123 Lijia Liu
              Reporter:
              liutang123 Lijia Liu
