Uploaded image for project: 'Hivemall'
  1. Hivemall
  2. HIVEMALL-138

Implement to_top_k_ordered_map

    XMLWordPrintableJSON

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.5.0
    • Labels:
      None

      Description

      As an alternative "each_top_k" functionality, let us implement "to_top_k_ordered_map(int k, int key, int value)" UDAF. Compared to the CLUSTER BY + "each_top_k" option, UDAF enables us to utilize mapper-side aggregation.

      According to Makoto Yui:

      A problem is that multiple to_top_k_ordered_map UDAFs is concurrently executed and memory consumption is not reduced.

      to_top_k_ordered_map will become O(|article_id|*k) (or, O(|article_id|*k/reducers*combiner_effect_ratio) per a reducer) space complexity while each_top_k is O(k) (or O(k/reducers) per a reducer) space complexity in an operator. each_top_k internally uses priority queue (not sorting), assuming the given inputs are sorted by a group key using CLUSTER BY. Shuffle involves a scalable external sort and memory space complexity can be avoided.

        Attachments

          Activity

            People

            • Assignee:
              takuti Takuya Kitazawa
              Reporter:
              takuti Takuya Kitazawa
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: