Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.2.0
    • Fix Version/s: 0.2.0
    • Component/s: None
    • Labels: None

    Description

      Frequently, users are interested in the top results (especially the top K rows). This can be implemented efficiently in a Pig/MapReduce setting to deliver results quickly with low network bandwidth and memory usage.

      The key point is to prune data on the map side and keep only the small set of rows that meet the top criteria. We can do this in an Algebraic function (combiner) that outputs multiple values, so only a small data set leaves each mapper node.
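The map-side pruning idea can be illustrated with a minimal Python sketch. This is not the Pig implementation or the attached patches; the function names `top_k_partial` and `top_k_merge`, the sample rows, and the scoring key are all hypothetical and chosen only for illustration. Each simulated mapper keeps at most K rows, and the reduce step merges those small partial results.

```python
import heapq

def top_k_partial(rows, k, key):
    """Combiner-style step (hypothetical name): keep only the k largest rows
    of a single map-side partition."""
    return heapq.nlargest(k, rows, key=key)

def top_k_merge(partials, k, key):
    """Reduce-style step (hypothetical name): merge the per-mapper partial
    results and keep the global top k."""
    merged = (row for partial in partials for row in partial)
    return heapq.nlargest(k, merged, key=key)

# Hypothetical data: (name, score) rows split across three simulated mappers.
partitions = [
    [("a", 5), ("b", 9), ("c", 1)],
    [("d", 7), ("e", 3)],
    [("f", 8), ("g", 2), ("h", 6)],
]

def score(row):
    return row[1]

# Each mapper emits at most k rows, so little data crosses the network.
partials = [top_k_partial(p, 2, score) for p in partitions]
result = top_k_merge(partials, 2, score)
print(result)
```

Because any row in the global top K must also be in the top K of its own partition, pruning on the map side does not change the final answer; it only reduces the data shipped to the reducer.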

      The same idea is applicable to solve variants of this problem:

      • An Algebraic Function for 'Top K Rows'
      • An Algebraic Function for 'Top K' values ('Top Rank K' and 'Top Dense Rank K')
      • TOP K ORDER BY.

      In other words, the implementation is similar to combiners for aggregate functions, but instead of one value we get multiple values.
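The issue does not define 'Top Rank K' and 'Top Dense Rank K' precisely. The sketch below assumes they follow the usual SQL `RANK` and `DENSE_RANK` semantics (ties share a rank; `RANK` leaves gaps after ties, `DENSE_RANK` does not). The function names and sample values are hypothetical. It also checks that, under this assumption, both variants can be applied per partition first and then again on the merged partial results.

```python
def top_rank_k(values, k):
    """Values whose rank is at most k, where rank = 1 + number of strictly
    larger values (assumed SQL RANK semantics: 1, 1, 3, ...)."""
    ordered = sorted(values, reverse=True)
    out = []
    rank = 0
    for i, v in enumerate(ordered):
        if i == 0 or v != ordered[i - 1]:
            rank = i + 1  # a new distinct value starts at its position
        if rank > k:
            break
        out.append(v)
    return out

def top_dense_rank_k(values, k):
    """Values that fall within the k largest distinct values
    (assumed SQL DENSE_RANK semantics: 1, 1, 2, ...)."""
    keep = set(sorted(set(values), reverse=True)[:k])
    return sorted((v for v in values if v in keep), reverse=True)

# Hypothetical values split across two simulated mappers.
partitions = [[10, 8, 7], [10, 9, 8]]
all_values = [v for p in partitions for v in p]

# Apply each function per partition, then once more on the merged partials.
merged_rank = top_rank_k(
    [v for p in partitions for v in top_rank_k(p, 2)], 2)
merged_dense = top_dense_rank_k(
    [v for p in partitions for v in top_dense_rank_k(p, 2)], 2)

print(merged_rank)
print(merged_dense)
```

Note that with many ties the per-mapper output of these variants is not strictly bounded by K rows, so a real implementation might carry (value, count) pairs instead of repeated values.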

      I will add a sample implementation for Top K Rows and possibly TOP K ORDER BY to clarify details.

      Attachments

        1. limit1.patch
          20 kB
          Daniel Dai
        2. limit2.patch
          29 kB
          Daniel Dai
        3. limit3.patch
          64 kB
          Daniel Dai

            People

              daijy Daniel Dai
              amirhyoussefi Amir Youssefi
              Votes: 2
              Watchers: 4