[HIVE-20270] Don't serialize hashCode for groupByKey - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Spark
Labels: None
Target Version/s: 4.0.0

Description

Similar to ~~HIVE-20032~~, but for groupByKey. The tricky part with groupByKey is that the hashCode must be preserved until the key has been partitioned (via the HashPartitioner); after that, it no longer needs to be preserved. Spark's groupByKey operator does require a hashCode, because it puts everything in a map, but it can use a different hashCode than the one specified in HiveKey. The hashCode in HiveKey matters only for determining which partition the key is assigned to.
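To make the idea concrete, here is a minimal sketch in plain Java. It does not use Hive or Spark classes: `SketchKey` and all of its methods are hypothetical stand-ins for a HiveKey-like key, and the non-negative modulo in `partitionFor` is only an approximation of what a hash partitioner does. The key carries an explicit hash up to the partitioning step, drops it when written to the shuffle, and gets a new hash derived from its bytes when read back, which is enough for map-based grouping.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a HiveKey-like key; NOT Hive's actual class.
final class SketchKey {
    private final byte[] bytes;
    private final int hash;

    SketchKey(byte[] bytes, int hash) {
        this.bytes = bytes.clone();
        this.hash = hash;
    }

    @Override
    public int hashCode() {
        return hash;
    }

    // Equality depends only on the key bytes, never on the stored hash.
    @Override
    public boolean equals(Object o) {
        return o instanceof SketchKey && Arrays.equals(bytes, ((SketchKey) o).bytes);
    }

    // Map side: pick a partition from the explicit hash (non-negative modulo).
    static int partitionFor(SketchKey key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Shuffle write: emit only the key bytes; the explicit hash is dropped.
    byte[] serializeWithoutHash() {
        return bytes.clone();
    }

    // Shuffle read: rebuild the key with a hash recomputed from its bytes.
    static SketchKey deserialize(byte[] payload) {
        return new SketchKey(payload, Arrays.hashCode(payload));
    }

    // Reduce side: group deserialized keys in a map and count occurrences.
    static Map<SketchKey, Integer> countByKey(List<byte[]> payloads) {
        Map<SketchKey, Integer> counts = new HashMap<>();
        for (byte[] payload : payloads) {
            counts.merge(deserialize(payload), 1, Integer::sum);
        }
        return counts;
    }
}
```

In this sketch every key on the reduce side goes through `deserialize`, so all keys in the map use the same recomputed hash and grouping stays consistent. The recomputation in `deserialize` is the extra CPU cost the ticket refers to.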

The drawback is that computing the hashCode for each HiveKey might require more CPU resources, so we should profile it to confirm the cost.

Attachments

Activity

There are no comments yet on this issue.

People

Assignee: Sahil Takiar

Reporter: Sahil Takiar

Votes: 0

Watchers: 2

Dates

Created: 30/Jul/18 14:58

Updated: 30/Jul/18 14:58