Uploaded image for project: 'Hivemall'
  1. Hivemall
  2. HIVEMALL-138

Implement to_top_k_ordered_map

Details

    • New Feature
    • Status: Closed
    • Minor
    • Resolution: Fixed
    • None
    • 0.5.0
    • None

    Description

      As an alternative "each_top_k" functionality, let us implement "to_top_k_ordered_map(int k, int key, int value)" UDAF. Compared to the CLUSTER BY + "each_top_k" option, UDAF enables us to utilize mapper-side aggregation.

      According to Makoto Yui:

      A problem is that multiple to_top_k_ordered_map UDAFs is concurrently executed and memory consumption is not reduced.

      to_top_k_ordered_map will become O(|article_id|*k) (or, O(|article_id|*k/reducers*combiner_effect_ratio) per a reducer) space complexity while each_top_k is O(k) (or O(k/reducers) per a reducer) space complexity in an operator. each_top_k internally uses priority queue (not sorting), assuming the given inputs are sorted by a group key using CLUSTER BY. Shuffle involves a scalable external sort and memory space complexity can be avoided.

      Attachments

        Activity

          People

            takuti Takuya Kitazawa
            takuti Takuya Kitazawa
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment