[HIVEMALL-138] Implement to_top_k_ordered_map - ASF JIRA

XML

Word

Printable

JSON

Details

Type: New Feature
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.5.0
Labels:
None

Description

As an alternative "each_top_k" functionality, let us implement "to_top_k_ordered_map(int k, int key, int value)" UDAF. Compared to the CLUSTER BY + "each_top_k" option, UDAF enables us to utilize mapper-side aggregation.

According to Makoto Yui:

A problem is that multiple to_top_k_ordered_map UDAFs is concurrently executed and memory consumption is not reduced.

to_top_k_ordered_map will become O(|article_id|*k) (or, O(|article_id|*k/reducers*combiner_effect_ratio) per a reducer) space complexity while each_top_k is O(k) (or O(k/reducers) per a reducer) space complexity in an operator. each_top_k internally uses priority queue (not sorting), assuming the given inputs are sorted by a group key using CLUSTER BY. Shuffle involves a scalable external sort and memory space complexity can be avoided.

Attachments

Issue Links

links to

GitHub Pull Request #108

GitHub Pull Request #114

Activity

People

Assignee:: Takuya Kitazawa

Reporter:: Takuya Kitazawa

Votes:: 0 Vote for this issue

Watchers:: 1 Start watching this issue

Dates

Created:: 03/Aug/17 02:50

Updated:: 09/Nov/17 03:33

Resolved:: 13/Sep/17 12:14

Agile

View on Board

Implement to_top_k_ordered_map