  1. Hive
  2. HIVE-2144

reduce workload generated by JDBCStatsPublisher

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      In JDBCStatsPublisher, we first run a SELECT query to check whether the specific ID was already inserted by another task (most likely a speculative or previously failed task). Depending on whether the ID is present, an INSERT or UPDATE query is then issued. So there are basically two queries per row inserted into the intermediate stats table. This workload could be halved if we always insert (duplicate IDs are very rare) and use a different SQL query in the aggregation phase to deduplicate IDs (e.g., using GROUP BY and max()). The benefit is that, even though the aggregation query is more expensive, it runs only once per query, whereas the publishing cost is paid for every row.
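
      The proposal can be illustrated with a minimal sketch. It is not the code from the attached patches: it uses Python's sqlite3 module as a stand-in for the JDBC-backed stats store, and the table name stats_tmp, the columns id and row_count, and the "job1/taskN" ID format are all hypothetical. Publishing issues a single INSERT with no preceding SELECT, and the aggregation query collapses duplicate IDs with GROUP BY and MAX() before summing.

```python
import sqlite3


def publish(conn, task_id, row_count):
    # One INSERT per row, with no SELECT beforehand to check for the ID.
    # Duplicates (e.g. from a speculative task) are tolerated at this stage.
    conn.execute(
        "INSERT INTO stats_tmp (id, row_count) VALUES (?, ?)",
        (task_id, row_count),
    )


def aggregate(conn, prefix):
    # Deduplicate in the aggregation phase: keep MAX(row_count) per ID,
    # then sum over the distinct IDs that share the given prefix.
    cur = conn.execute(
        "SELECT COALESCE(SUM(m), 0) FROM ("
        " SELECT MAX(row_count) AS m FROM stats_tmp"
        " WHERE id LIKE ? GROUP BY id)",
        (prefix + "%",),
    )
    return cur.fetchone()[0]


def demo():
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: no unique constraint on id, so blind inserts succeed.
    conn.execute("CREATE TABLE stats_tmp (id TEXT, row_count INTEGER)")
    publish(conn, "job1/task0", 100)
    publish(conn, "job1/task1", 50)
    publish(conn, "job1/task0", 100)  # duplicate from a speculative attempt
    return aggregate(conn, "job1/")
```

      In this sketch a plain SUM over the table would count the duplicated task twice, while the grouped query counts each ID once. The extra grouping work is paid a single time at aggregation rather than as an additional SELECT for every published row.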

        Attachments

        1. HIVE-2144.2.patch
          21 kB
          Tomasz Nykiel
        2. HIVE-2144.1.patch
          21 kB
          Tomasz Nykiel
        3. HIVE-2144.patch
          15 kB
          Tomasz Nykiel

            People

            • Assignee:
              tnykiel Tomasz Nykiel
              Reporter:
              nzhang Ning Zhang
