HIVE-439

merge small files after a map-only job

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.3.0
    • Fix Version/s: 0.4.0
    • Component/s: Query Processor
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      HIVE-439. Merge small files after a map-only job. (Namit Jain via zshao)

      Description

      In some cases the input to a Hive job consists of thousands of small files, and a separate mapper is launched for each file. Most of the overhead of spawning all these mappers can be avoided if the small files are combined into fewer, larger files.

      The problem can also be addressed by having a mapper span multiple blocks as in:

      https://issues.apache.org/jira/browse/HIVE-74

      But it also makes sense in Hive to merge files whenever possible.

      <property>
        <name>hive.merge.mapfiles</name>
        <value>true</value>
        <description>Merge small files at the end of the job</description>
      </property>
      
      <property>
        <name>hive.merge.size.per.task</name>
        <value>256000000</value>
        <description>Size of merged files at the end of the job</description>
      </property>
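
      As a rough illustration of the effect of hive.merge.size.per.task, the sketch below greedily packs small files into groups of at most that many bytes and compares the file counts before and after. It is not Hive's actual merge logic, which this issue does not describe; the function name plan_merge and the example file sizes are hypothetical.

```python
from typing import List


def plan_merge(file_sizes: List[int], size_per_task: int = 256_000_000) -> List[List[int]]:
    """Greedily group file sizes so that each group totals at most size_per_task bytes.

    Hypothetical helper for illustration only; a file larger than the
    limit is placed in a group of its own.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_total = 0
    for size in file_sizes:
        # Close the current group if adding this file would exceed the limit.
        if current and current_total + size > size_per_task:
            groups.append(current)
            current, current_total = [], 0
        current.append(size)
        current_total += size
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    # Assumed workload: 5000 files of 1,000,000 bytes each.
    small_files = [1_000_000] * 5000
    merged = plan_merge(small_files)
    # With one mapper per file, a later job would need 5000 mappers
    # before merging and 20 after.
    print(len(small_files), "->", len(merged))  # prints "5000 -> 20"
```

      Under these assumptions, each merged file holds up to 256 of the small files, so a later job reading the merged output starts far fewer mappers.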
      
      1. hive.439.5.patch
        1.45 MB
        Namit Jain
      2. hive.439.4.patch
        1.45 MB
        Namit Jain
      3. hive.439.3.patch
        1.45 MB
        Namit Jain
      4. hive.439.2.patch
        1.45 MB
        Namit Jain
      5. hive.439.1.patch
        1.50 MB
        Namit Jain

        Activity

        Namit Jain created issue -
        Namit Jain made changes -
        Field Original Value New Value
        Attachment hive.439.1.patch [ 12411000 ]
        Namit Jain made changes -
        Status Open [ 1 ] -> Patch Available [ 10002 ]
        Namit Jain made changes -
        Attachment hive.439.2.patch [ 12411027 ]
        Namit Jain made changes -
        Attachment hive.439.3.patch [ 12411112 ]
        Namit Jain made changes -
        Attachment hive.439.4.patch [ 12411358 ]
        Zheng Shao made changes -
        Affects Version/s 0.3.0 [ 12313637 ]
        Summary merge small files whenever possible -> merge small files after a map-only job
        Fix Version/s 0.4.0 [ 12313714 ]
        Affects Version/s 0.3.1 [ 12313845 ]
        Namit Jain made changes -
        Attachment hive.439.5.patch [ 12411442 ]
        Zheng Shao made changes -
        Status Patch Available [ 10002 ] -> Resolved [ 5 ]
        Hadoop Flags [Reviewed]
        Release Note HIVE-439. Merge small files after a map-only job. (Namit Jain via zshao)
        Resolution Fixed [ 1 ]
        Zheng Shao made changes -
        Description (prose unchanged; the following configuration block was appended)

        {code}
        <property>
          <name>hive.merge.mapfiles</name>
          <value>true</value>
          <description>Merge small files at the end of the job</description>
        </property>

        <property>
          <name>hive.merge.size.per.mapper</name>
          <value>1000000000</value>
          <description>Size of merged files at the end of the job</description>
        </property>
        {code}
        Zheng Shao made changes -
        Description (prose unchanged; the second property was renamed from hive.merge.size.per.mapper to hive.merge.size.per.task and its value changed from 1000000000 to 256000000)

        {code}
        <property>
          <name>hive.merge.mapfiles</name>
          <value>true</value>
          <description>Merge small files at the end of the job</description>
        </property>

        <property>
          <name>hive.merge.size.per.task</name>
          <value>256000000</value>
          <description>Size of merged files at the end of the job</description>
        </property>
        {code}
        Carl Steinbach made changes -
        Affects Version/s 0.3.1 [ 12313845 ]
        Carl Steinbach made changes -
        Status Resolved [ 5 ] -> Closed [ 6 ]

          People

          • Assignee:
            Namit Jain
            Reporter:
            Namit Jain
          • Votes:
            0
            Watchers:
            3
