
HIVE-629: concat files needed for map-reduce jobs also

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.4.0
    • Fix Version/s: 0.4.0
    • Component/s: Query Processor
    • Labels: None
    • Hadoop Flags: Reviewed
    • Release Note: HIVE-629. Concat files for map-reduce jobs. (Namit Jain via zshao)

    Description

      Currently, Hive concatenates output files only if the job under consideration is a map-only job.

      I have received requests from some users who want this behavior for map-reduce jobs as well. It may not be a good idea to turn this on by default, but we should provide an option so that concatenation can happen even for map-reduce jobs.
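
      A minimal sketch of the behavior being requested, assuming the parameter names that come up later in this thread (hive.merge.mapfiles for the existing map-only behavior, hive.merge.mapredfiles for the new option, hive.merge.size.per.task for the size knob); the exact names and defaults in the committed patch may differ:

        -- Existing behavior: merge small output files of map-only jobs.
        SET hive.merge.mapfiles=true;
        -- Requested option (assumed name): also merge after map-reduce jobs.
        -- Disabled by default, per the discussion below.
        SET hive.merge.mapredfiles=true;
        -- Approximate target size for each merge task, in bytes.
        SET hive.merge.size.per.task=256000000;

        -- If the files written by this query are small, an extra merge job
        -- would be appended to concatenate them.
        INSERT OVERWRITE TABLE dst
        SELECT key, value FROM src;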

    Attachments

      1. hive.629.2.patch (10 kB, Namit Jain)
      2. hive.629.1.patch (10 kB, Namit Jain)

    Activity

        Zheng Shao added a comment -

        Committed. Thanks Namit!

        Zheng Shao added a comment -

        That makes sense.

        If the user is running a map-reduce job, we had better not add such a task, because the partitioning will be completely changed.

        In the future, we can do further optimization to enable this for map-reduce jobs, but only if the partition key and sort key are still available at the FileSink.

        Will test and commit.
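
        For illustration, a hypothetical query where appending a merge job would be harmful: its output is clustered by key, and a merge job that redistributes rows (via random(), per the answers below) would destroy that clustering.

          -- Hypothetical example: output files are clustered by key.
          -- A merge job that redistributes rows randomly would break this
          -- property, which is why merging stays off for map-reduce jobs.
          INSERT OVERWRITE TABLE dst
          SELECT key, value FROM src
          CLUSTER BY key;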

        Namit Jain added a comment -

        I will change the name of the parameter and regenerate the patch.

        Namit Jain added a comment -

        1. The number of reducers is determined from "hive.merge.size.per.task" (see the sketch after this list).
        2. We use random().
        3. I am not sure about that - by default, it should be disabled for map-reduce jobs. It should be turned on only in very specific cases, where the user knows exactly what he is doing and can set the size appropriately.
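
        As a back-of-the-envelope illustration of answer 1, assuming the reducer count comes from dividing total output size by hive.merge.size.per.task (the exact formula in the patch is not quoted in this thread):

          -- Illustrative arithmetic only, not taken from the patch:
          --   small-file output of the job: ~1,000,000,000 bytes (1 GB)
          --   hive.merge.size.per.task:       256,000,000 bytes (256 MB)
          --   reducers for the merge job:   ceil(1e9 / 256e6) = 4
          SET hive.merge.size.per.task=256000000;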

        Zheng Shao added a comment -

        Three more questions:

        1. How do we determine the number of reducers of the merge job? Is that based on "hive.exec.reducers.bytes.per.reducer"?
        2. How do we create the additional map-reduce job? Do we copy the cluster key or distribution key in the last job? If so, what if the keys are not available after the reducer?
        3. For the default value, do we want to enable both (map and map-reduce) but set the threshold to 64MB, or smaller like 16MB? That way most users won't see a change at all, but people who are producing extremely small files (the people who want this feature) will see their files concatenated.

        Zheng Shao added a comment -

        The size configuration is still "hive.merge.size.per.mapper".
        Do you want to change it to "hive.merge.size.per.task" now, since not many people are using it yet?

        Zheng Shao added a comment -

        Agreed. If we turn it on for map-reduce jobs, the threshold should be much lower - maybe 128MB per file or even 64MB per file.

        This ensures it only happens in rare cases (so it does not affect the performance of most queries).
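
        A hedged sketch of the conservative setting floated here, using the parameter name discussed above (the 64MB value is only the figure from this comment, not a committed default):

          -- Illustrative only: turn merging on for map-reduce jobs with a
          -- much lower size threshold, as suggested in this comment.
          SET hive.merge.mapredfiles=true;        -- assumed parameter name
          SET hive.merge.size.per.task=64000000;  -- 64 MB per file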


    People

    • Assignee: Namit Jain
    • Reporter: Namit Jain
    • Votes: 0
    • Watchers: 0

    Dates

    • Created:
    • Updated:
    • Resolved:
