  1. Hive
  2. HIVE-15580

Eliminate unbounded memory usage for orderBy and groupBy in Hive on Spark

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.3.0
    • Component/s: Spark
    • Labels:
      None

      Description

      Currently, orderBy (sortBy) and groupBy in Hive on Spark use unbounded memory. For orderBy, Hive accumulates key groups using ArrayList (described in HIVE-15527). For groupBy, Hive currently uses Spark's groupByKey operator, which cannot spill to disk within a key group. Thus, for a large key group, memory usage is also unbounded.

      Eliminating the unbounded memory usage will likely cost some performance. We will profile and optimize afterwards. We could also make this change configurable.
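
To illustrate the memory concern described above, here is a minimal sketch in plain Java (no Spark dependency). It is not the code from the attached patches. The class name StreamingGroups and the method sumByKey are hypothetical, and the sketch assumes the input rows already arrive sorted by key, for example from a shuffle that sorts within partitions. Instead of collecting every value of a key group into a list, it walks the sorted stream and keeps only one running total at a time, so memory use does not grow with the size of a key group.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical illustration only; not taken from the HIVE-15580 patches.
public class StreamingGroups {

    /**
     * Lazily sums values per key from a stream that is ALREADY sorted by key
     * (an assumption of this sketch). Only the current key, its running total,
     * and one look-ahead row are held in memory.
     */
    static Iterator<Map.Entry<String, Long>> sumByKey(
            final Iterator<Map.Entry<String, Long>> sorted) {
        return new Iterator<Map.Entry<String, Long>>() {
            // One-row look-ahead; null means the input is exhausted.
            private Map.Entry<String, Long> pending =
                    sorted.hasNext() ? sorted.next() : null;

            @Override
            public boolean hasNext() {
                return pending != null;
            }

            @Override
            public Map.Entry<String, Long> next() {
                if (pending == null) {
                    throw new NoSuchElementException();
                }
                String key = pending.getKey();
                long total = 0;
                // Consume rows until the key changes; nothing is buffered.
                while (pending != null && pending.getKey().equals(key)) {
                    total += pending.getValue();
                    pending = sorted.hasNext() ? sorted.next() : null;
                }
                return new SimpleEntry<>(key, total);
            }
        };
    }
}
```

The trade-off is that the input must be sorted by key first, which is one plausible source of the performance impact mentioned above.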

        Attachments

        1. HIVE-15580.1.patch
          8 kB
          Xuefu Zhang
        2. HIVE-15580.1.patch
          8 kB
          Xuefu Zhang
        3. HIVE-15580.2.patch
          15 kB
          Xuefu Zhang
        4. HIVE-15580.2.patch
          15 kB
          Xuefu Zhang
        5. HIVE-15580.3.patch
          21 kB
          Xuefu Zhang
        6. HIVE-15580.4.patch
          21 kB
          Xuefu Zhang
        7. HIVE-15580.5.patch
          37 kB
          Xuefu Zhang
        8. HIVE-15580.patch
          5 kB
          Xuefu Zhang

              People

              • Assignee:
                xuefuz Xuefu Zhang
                Reporter:
                xuefuz Xuefu Zhang
              • Votes:
                0
                Watchers:
                8

                Dates

                • Created:
                  Updated:
                  Resolved: