  Hive / HIVE-6134

Merging small files based on file size only works for CTAS queries

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.8.0, 0.10.0, 0.11.0, 0.12.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      According to the documentation, if we set hive.merge.mapfiles to true, Hive will launch an additional MR job to merge the small output files at the end of a map-only job when the average output file size is smaller than hive.merge.smallfiles.avgsize. Similarly, by setting hive.merge.mapredfiles to true, Hive will merge the output files of a map-reduce job.
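
      The documented trigger can be sketched roughly as follows. This is a minimal illustration, not Hive's implementation: the class SmallFileMergeCheck, its method names, and the threshold value used in the usage note are hypothetical, and the only behavior taken from the description above is "merge when the flag is on and the average output file size is below hive.merge.smallfiles.avgsize".

```java
import java.util.List;

// Hypothetical illustration of the documented merge trigger; not Hive code.
final class SmallFileMergeCheck {
    private SmallFileMergeCheck() {}

    /** Average size in bytes of the given output files; 0 for an empty list. */
    static long averageSize(List<Long> fileSizes) {
        if (fileSizes.isEmpty()) {
            return 0L;
        }
        long total = 0L;
        for (long size : fileSizes) {
            total += size;
        }
        return total / fileSizes.size();
    }

    /**
     * @param mergeEnabled     stands in for hive.merge.mapfiles or hive.merge.mapredfiles
     * @param avgSizeThreshold stands in for hive.merge.smallfiles.avgsize, in bytes
     * @param fileSizes        sizes in bytes of the job's output files
     * @return true if an extra merge job would be launched under the documented rule
     */
    static boolean shouldMerge(boolean mergeEnabled, long avgSizeThreshold, List<Long> fileSizes) {
        return mergeEnabled
                && !fileSizes.isEmpty()
                && averageSize(fileSizes) < avgSizeThreshold;
    }
}
```

      For example, with an assumed threshold of 16,000,000 bytes, two output files of 1 MB and 2 MB would qualify for merging, while a single 200 MB file would not.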

      I expected this to hold for all MR queries, but I observe it only for CTAS queries. In GenMRFileSink1.java, HIVEMERGEMAPFILES and HIVEMERGEMAPREDFILES are only consulted if ((ctx.getMvTask() != null) && (!ctx.getMvTask().isEmpty())). So, for a regular SELECT query, which has no move tasks, these properties are never used.
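
      The gating described above can be modeled with a small sketch. It is not the code in GenMRFileSink1.java: the class MergeGateSketch and its methods are hypothetical, and plain strings stand in for Hive's move task objects. It only mirrors the reported condition, namely that the merge flags are looked at when the move task list is non-null and non-empty.

```java
import java.util.List;

// Hypothetical model of the reported behavior; not the actual Hive source.
final class MergeGateSketch {
    private MergeGateSketch() {}

    /** Mirrors (ctx.getMvTask() != null) && (!ctx.getMvTask().isEmpty()). */
    static boolean mergeFlagsConsulted(List<String> moveTasks) {
        return moveTasks != null && !moveTasks.isEmpty();
    }

    /**
     * @param moveTasks        stand-in for the move tasks attached to the query plan
     * @param isMapOnly        true for a map-only job, false for a map-reduce job
     * @param mergeMapFiles    stand-in for hive.merge.mapfiles
     * @param mergeMapRedFiles stand-in for hive.merge.mapredfiles
     * @return true if a merge would even be considered for this query
     */
    static boolean mergeConsidered(List<String> moveTasks, boolean isMapOnly,
                                   boolean mergeMapFiles, boolean mergeMapRedFiles) {
        if (!mergeFlagsConsulted(moveTasks)) {
            // Regular SELECT: no move tasks, so the flags are never read.
            return false;
        }
        return isMapOnly ? mergeMapFiles : mergeMapRedFiles;
    }
}
```

      Under this model, a plain SELECT (empty or null move task list) never reaches the flag check, regardless of how the two properties are set, which matches the behavior reported here.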

      Is my understanding correct, and if so, what is the reasoning behind not supporting this for regular SELECT queries? It seems to me that it should be supported for them as well. One scenario where this hits us hard is when users try to download a query result in HUE, and HUE times out because there are thousands of output files. The workaround is to re-run the query as CTAS, but that is a significant time sink.

          People

            Assignee: Unassigned
            Reporter: Eric Chu (ericchu30)
            Votes: 0
            Watchers: 5
