[HIVE-2144] reduce workload generated by JDBCStatsPublisher - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.8.0
Component/s: None
Labels: None

Hadoop Flags: Reviewed

Description

In JDBCStatsPublisher, we first issue a SELECT query to check whether the specific ID was already inserted by another task (most likely a speculative or previously failed task). Depending on whether the ID is present, an INSERT or UPDATE query is issued. So there are basically two queries per row inserted into the intermediate stats table. This workload could be halved if we insert the row anyway (it is very rare that IDs are duplicated) and use a different SQL query in the aggregation phase to dedup the IDs (e.g., using group-by and max()). The benefit is that even though the aggregation query is more expensive, it is only run once per query.
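The proposal can be sketched as follows. This is a minimal illustration, not the code from the attached patches: it uses Python's built-in sqlite3 module in place of the JDBC-backed stats database, and the table name `partition_stats_demo`, its columns, and the ID format are all hypothetical. Each task publishes with a single blind INSERT, and duplicates are collapsed only once, at aggregation time, with GROUP BY and MAX().

```python
import sqlite3


def publish(conn, stat_id, row_count):
    """Blind insert: one statement per row, with no SELECT beforehand."""
    conn.execute(
        "INSERT INTO partition_stats_demo (id, row_count) VALUES (?, ?)",
        (stat_id, row_count),
    )


def aggregate(conn, id_prefix):
    """Collapse duplicate IDs with GROUP BY + MAX(), then sum the result."""
    cur = conn.execute(
        "SELECT COALESCE(SUM(deduped), 0) FROM ("
        "  SELECT MAX(row_count) AS deduped"
        "  FROM partition_stats_demo"
        "  WHERE id LIKE ?"
        "  GROUP BY id"
        ")",
        (id_prefix + "%",),
    )
    return cur.fetchone()[0]


def demo():
    conn = sqlite3.connect(":memory:")
    # No uniqueness constraint on id (hypothetical schema): duplicates are
    # allowed on purpose and resolved later by the aggregation query.
    conn.execute(
        "CREATE TABLE partition_stats_demo (id TEXT, row_count INTEGER)"
    )
    publish(conn, "job1/task0", 100)
    publish(conn, "job1/task1", 250)
    # A speculative or retried attempt republishes the same ID.
    publish(conn, "job1/task1", 250)
    return aggregate(conn, "job1/")


if __name__ == "__main__":
    print(demo())
```

Here the duplicated `job1/task1` row is counted once, so the aggregate is 350 rather than 600. The publish path costs one statement per row, and the extra grouping work is paid a single time per query.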

Attachments

HIVE-2144.patch (15 kB), 19/May/11 22:11, Tomasz Nykiel
HIVE-2144.2.patch (21 kB), 23/May/11 22:08, Tomasz Nykiel
HIVE-2144.1.patch (21 kB), 21/May/11 01:52, Tomasz Nykiel

People

Assignee: Tomasz Nykiel

Reporter: Ning Zhang

Votes: 0

Watchers: 1

Dates

Created: 03/May/11 00:17

Updated: 16/Dec/11 23:55

Resolved: 24/May/11 19:18