HIVE-74

Hive can use CombineFileInputFormat when the input consists of many small files


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.5.0
    • Component/s: Query Processor
    • Labels: None
    • Hadoop Flags: Reviewed
    • Release Note: HIVE-74. Hive can use CombineFileInputFormat for when the input has many small files (Namit Jain via rmurthy)

    Description

      In some cases the input to a Hive job consists of thousands of small files, and a separate mapper is spawned for each file. Most of the overhead of spawning all these mappers could be avoided if Hive used CombineFileInputFormat, which was introduced in HADOOP-4565.

      Options to control this behavior:

      hive.input.format: org.apache.hadoop.hive.ql.io.CombineHiveInputFormat (the default, also used if the value is empty) or org.apache.hadoop.hive.ql.io.HiveInputFormat
      mapred.min.split.size.per.node: the minimum number of bytes required to create a node-local split; otherwise the data is combined at the rack level. Default: 0
      mapred.min.split.size.per.rack: the minimum number of bytes required to create a rack-local split; otherwise the data is combined at the global level. Default: 0
      mapred.max.split.size: the maximum size of each split. A split can exceed this value, because accumulation stops *after* the limit is reached rather than before.
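
The note on mapred.max.split.size is easier to see with a small example. The following is only a minimal sketch of "stop accumulating after reaching the limit", not the actual CombineFileInputFormat code. The class name, the method name, and the choice to treat a limit of 0 as "no limit" are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not Hadoop or Hive code.
class SplitAccumulationSketch {

    // Groups file sizes (in bytes) into combined splits. A split is closed
    // only AFTER its running total has reached maxSplitSize, so the closed
    // split may be larger than the limit.
    // Assumption: maxSplitSize == 0 means "no limit".
    static List<Long> combine(long[] fileSizes, long maxSplitSize) {
        List<Long> splitSizes = new ArrayList<>();
        long current = 0;
        for (long size : fileSizes) {
            current += size;
            // The check runs after the file has been added, hence the overshoot.
            if (maxSplitSize > 0 && current >= maxSplitSize) {
                splitSizes.add(current);
                current = 0;
            }
        }
        if (current > 0) {
            splitSizes.add(current); // leftover files form a final, smaller split
        }
        return splitSizes;
    }

    public static void main(String[] args) {
        long[] files = {40, 40, 40, 40, 40};
        // With a limit of 100, the first split closes at 120 bytes.
        System.out.println(combine(files, 100)); // prints [120, 80]
    }
}
```

With five 40-byte files and a limit of 100, the first split ends up holding 120 bytes, which is the overshoot the option description warns about.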
      

      The three size values above must be in non-descending order: mapred.min.split.size.per.node <= mapred.min.split.size.per.rack <= mapred.max.split.size.
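
The ordering constraint can be checked before a job is submitted. The sketch below assumes the three values have already been read as byte counts; the class and method names are hypothetical, and it does not model how Hive or Hadoop treat unset or zero values.

```java
// Hypothetical illustration only; this is not Hadoop or Hive code.
class SplitSizeCheckSketch {

    // Returns true when the three settings are in non-descending order:
    // per-node minimum <= per-rack minimum <= maximum split size.
    static boolean isNonDescending(long minPerNode, long minPerRack, long maxSplit) {
        return minPerNode <= minPerRack && minPerRack <= maxSplit;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 64 MB per node, 128 MB per rack, 256 MB maximum: valid ordering.
        System.out.println(isNonDescending(64 * mb, 128 * mb, 256 * mb)); // prints true
        // Per-node minimum larger than the per-rack minimum: invalid ordering.
        System.out.println(isNonDescending(128 * mb, 64 * mb, 256 * mb)); // prints false
    }
}
```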

      Attachments

        1. hiveCombineSplit2.patch (15 kB, Dhruba Borthakur)
        2. hiveCombineSplit.patch (14 kB, Dhruba Borthakur)
        3. hiveCombineSplit.patch (15 kB, Dhruba Borthakur)
        4. hive.74.2.patch (37 kB, Namit Jain)
        5. hive.74.1.patch (36 kB, Namit Jain)

          People

            rajesh.balamohan Rajesh Balamohan
            dhruba Dhruba Borthakur
            Votes: 0
            Watchers: 5
