HIVE-3997: Use distributed cache to cache/localize dimension table & filter it in map task setup


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem

    Description

      The Hive clients are not always co-located with the Hadoop/HDFS cluster.

      This means that dimension table filtering, when done on the client side, becomes very slow. Moreover, the conversion of the small tables into hashtables has to be redone every single time a query is run with different filters on the big table.

      That entire hashtable then has to be shipped as part of the job, which involves even more HDFS writes from the far client side.
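
      For context, a minimal sketch (in plain Hadoop Java, not Hive's actual code) of the client-side flow described above; the scratch path and the population step are hypothetical stand-ins:

      {code:java}
      import java.io.ObjectOutputStream;
      import java.util.HashMap;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      // Hypothetical illustration of the flow this issue wants to avoid: the
      // client rebuilds the hashtable and pushes it to HDFS for every query.
      public class ClientSideHashtableUpload {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);

          // Build the join hashtable on the (possibly remote) client.
          HashMap<String, String> joinTable = new HashMap<String, String>();
          joinTable.put("dimKey", "dimRow");  // ... populate from filtered dimension rows ...

          // Serialize it into the job's scratch dir: a full HDFS write from
          // the far client side, repeated on each run.
          Path dest = new Path("/tmp/hive-scratch/joinTable.ser");  // hypothetical path
          ObjectOutputStream out = new ObjectOutputStream(fs.create(dest));
          out.writeObject(joinTable);
          out.close();
        }
      }
      {code}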

      Using the distributed cache also has the advantage that the localized files can be kept between jobs, instead of firing off an HDFS read for every query.
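
      A minimal sketch of the proposed direction, using the Hadoop DistributedCache API; the file layout, the tab-separated column split, and the passesFilter predicate are stand-ins for illustration, not Hive's actual map-join code:

      {code:java}
      import java.io.BufferedReader;
      import java.io.FileReader;
      import java.io.IOException;
      import java.net.URI;
      import java.util.HashMap;

      import org.apache.hadoop.filecache.DistributedCache;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;

      public class CachedDimensionJoin {

        public static class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
          private final HashMap<String, String> dim = new HashMap<String, String>();

          @Override
          protected void setup(Context context) throws IOException {
            // The dimension file was localized once per node by the distributed
            // cache; filter it here, in map task setup, not on the client.
            Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
            String line;
            while ((line = in.readLine()) != null) {
              String[] cols = line.split("\t");
              if (passesFilter(cols)) {          // the query's per-run predicate
                dim.put(cols[0], cols[1]);
              }
            }
            in.close();
          }

          private boolean passesFilter(String[] cols) {
            return true;  // placeholder for the actual dimension filter
          }

          @Override
          protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
            String[] cols = value.toString().split("\t");
            String match = dim.get(cols[0]);
            if (match != null) {                 // hash-join probe against the big table
              context.write(new Text(cols[0]), new Text(cols[1] + "\t" + match));
            }
          }
        }

        public static void configure(Job job) throws Exception {
          // Register the dimension file once; later jobs reuse the node-local
          // copy instead of firing off a fresh HDFS read per query.
          DistributedCache.addCacheFile(new URI("/warehouse/dim_table/part-00000"),
              job.getConfiguration());
        }
      }
      {code}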

      Moving the operator pipeline for the hash generation into the map task itself does have a few cons.

      The map task might OOM due to this change, and it will take longer to recover, since the failure only surfaces once all the map attempts fail, instead of being caught conditionally on the client. The client has no idea how much memory the hashtable needs and has to rely on the disk sizes (compressed sizes, perhaps) to determine whether it needs to fall back to a reduce-join instead.
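
      The disk-size guess the client is reduced to might look like the following; the 10x expansion factor and the 50% heap headroom are invented numbers for illustration, not Hive's actual thresholds:

      {code:java}
      // Hypothetical size-based heuristic for choosing the join strategy.
      public class MapJoinDecision {
        // On-disk compressed bytes typically expand considerably when
        // deserialized into a Java hashtable; 10x is a guess, not a measurement.
        private static final double EXPANSION_FACTOR = 10.0;

        /** Decide map-join vs. reduce-join from the on-disk (compressed) size alone. */
        public static boolean canMapJoin(long compressedBytesOnDisk, long mapTaskHeapBytes) {
          double estimatedHashtableBytes = compressedBytesOnDisk * EXPANSION_FACTOR;
          // Leave headroom for the rest of the map task's working set.
          return estimatedHashtableBytes < 0.5 * mapTaskHeapBytes;
        }

        public static void main(String[] args) {
          long dimSize = 64L << 20;    // 64 MB compressed on disk
          long heap = 1024L << 20;     // 1 GB map task heap
          System.out.println(canMapJoin(dimSize, heap) ? "map-join" : "reduce-join");
        }
      }
      {code}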


          People

            Assignee: gopalv Gopal Vijayaraghavan
            Reporter: gopalv Gopal Vijayaraghavan
            Votes: 0
            Watchers: 2
