Hive / HIVE-15580

Eliminate unbounded memory usage for orderBy and groupBy in Hive on Spark

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.3.0
    • Component/s: Spark
    • Labels: None

    Description

      Currently, orderBy (sortBy) and groupBy in Hive on Spark use unbounded memory. For orderBy, Hive accumulates each key group in an ArrayList (described in HIVE-15527). For groupBy, Hive uses Spark's groupByKey operator, which cannot spill to disk within a key group. Thus, for a large key group, memory usage is also unbounded.

      Eliminating the unbounded memory usage is likely to impact performance; we will profile and optimize afterwards. We could also make the change configurable.
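
      The general idea of bounding memory within a key group can be sketched as follows. This is a minimal illustration, not code from the attached patches. It assumes rows arrive already sorted by key (for example from a sort-based shuffle such as Spark's repartitionAndSortWithinPartitions), so each group can be consumed as a stream instead of being collected into a list first. The class name SortedGroupSum and the use of a per-key sum as the aggregate are hypothetical choices made only for the example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Objects;

/**
 * Hypothetical helper (not part of Hive or Spark): sums values per key over an
 * input that is already sorted by key. Only one look-ahead row and one running
 * sum are held at a time, so memory does not grow with the size of a key group.
 */
final class SortedGroupSum implements Iterator<Map.Entry<String, Long>> {
    private final Iterator<Map.Entry<String, Long>> sorted; // assumption: sorted by key
    private Map.Entry<String, Long> pending;                // first row of the next group, or null

    SortedGroupSum(Iterator<Map.Entry<String, Long>> sorted) {
        this.sorted = sorted;
        this.pending = sorted.hasNext() ? sorted.next() : null;
    }

    @Override
    public boolean hasNext() {
        return pending != null;
    }

    @Override
    public Map.Entry<String, Long> next() {
        if (pending == null) {
            throw new NoSuchElementException();
        }
        String key = pending.getKey();
        long sum = pending.getValue();
        pending = null;
        // Consume rows until the key changes; the first row of the next
        // group is kept as the look-ahead for the following call.
        while (sorted.hasNext()) {
            Map.Entry<String, Long> row = sorted.next();
            if (!Objects.equals(row.getKey(), key)) {
                pending = row;
                break;
            }
            sum += row.getValue();
        }
        return new SimpleEntry<>(key, sum);
    }
}
```

      Wrapping a key-sorted iterator in SortedGroupSum and draining it yields one (key, sum) entry per group. By contrast, an approach that first gathers every value of a group into an ArrayList needs memory proportional to the largest group.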

      Attachments

        1. HIVE-15580.patch
          5 kB
          Xuefu Zhang
        2. HIVE-15580.1.patch
          8 kB
          Xuefu Zhang
        3. HIVE-15580.1.patch
          8 kB
          Xuefu Zhang
        4. HIVE-15580.2.patch
          15 kB
          Xuefu Zhang
        5. HIVE-15580.2.patch
          15 kB
          Xuefu Zhang
        6. HIVE-15580.3.patch
          21 kB
          Xuefu Zhang
        7. HIVE-15580.4.patch
          21 kB
          Xuefu Zhang
        8. HIVE-15580.5.patch
          37 kB
          Xuefu Zhang

            People

              Assignee: Xuefu Zhang (xuefuz)
              Reporter: Xuefu Zhang (xuefuz)
              Votes: 0
              Watchers: 8