Hivemall / HIVEMALL-76

[SPARK] each_top_k behavior on Spark is wrong

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.5.0
    • Labels: None

      Description

      I found that each_top_k on Spark behaves slightly differently from the Hive version in its ranking scheme:
      https://github.com/apache/incubator-hivemall/blob/master/core/src/main/java/hivemall/tools/EachTopKUDTF.java#L198

      The Hive version assigns ranks in the manner of dense_rank, but the Spark version does not:
      https://github.com/apache/incubator-hivemall/blob/72d6a629f972abc2f38c63d20fe5c978618f8bf8/spark/spark-2.0/src/main/scala/org/apache/spark/sql/catalyst/expressions/EachTopK.scala#L101

      It would be better to return the same rank when the compared scores are equal.
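
      To illustrate the requested behavior, here is a minimal sketch of dense ranking within each group. It is plain Python rather than Hivemall or Spark code. The function name each_top_k_dense and the (group, score, value) row layout are assumptions made for illustration only. The sketch also assumes that k limits the number of rows emitted per group, not the number of distinct ranks.

```python
from itertools import groupby
from operator import itemgetter


def each_top_k_dense(rows, k):
    """Return (group, rank, score, value) for the top-k rows of each group.

    Hypothetical helper for illustration; rows are (group, score, value) tuples.
    Ranks are dense: rows with equal scores share a rank, and the next
    distinct score gets the next consecutive rank.
    """
    out = []
    # groupby needs its input ordered by the grouping key.
    ordered = sorted(rows, key=itemgetter(0))
    for group, members in groupby(ordered, key=itemgetter(0)):
        # Highest scores first; keep at most k rows per group.
        top = sorted(members, key=itemgetter(1), reverse=True)[:k]
        rank = 0
        last_score = None
        for _, score, value in top:
            # Advance the rank only when the score changes.
            if score != last_score:
                rank += 1
                last_score = score
            out.append((group, rank, score, value))
    return out


rows = [("a", 0.9, "x"), ("a", 0.9, "y"), ("a", 0.5, "z"), ("b", 0.3, "w")]
for row in each_top_k_dense(rows, 3):
    print(row)
```

      In this sketch the two rows of group "a" that share the score 0.9 both receive rank 1, and the row with score 0.5 receives rank 2.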

              People

              • Assignee: Takeshi Yamamuro (maropu)
              • Reporter: Makoto Yui (myui)
              • Votes: 0
              • Watchers: 2
