Hivemall / HIVEMALL-44

Support Top-K joins for DataFrame/Spark

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.5.0
    • Labels:

      Description

      In Hivemall, `each_top_k` is useful in many practical use cases. In some cases, however, we need to join tables and then compute the Top-K entries. This can already be done with a regular join followed by `each_top_k`, but there is room for improvement: the Top-K entries can be computed while the join is being processed. This optimization avoids a substantial amount of I/O for joins.

      An example query is as follows:

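      // Assumes a SparkSession named `spark` and Hivemall's Spark module on the classpath
      import org.apache.spark.sql.functions._          // sqrt, pow, lit
      import org.apache.spark.sql.hive.HivemallOps._   // assumed home of join_top_k
      import spark.implicits._                         // toDF

      val inputDf = Seq(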
        ("user1", 1, 0.3, 0.5),
        ("user2", 2, 0.1, 0.1),
        ("user3", 3, 0.8, 0.0),
        ("user4", 1, 0.9, 0.9),
        ("user5", 3, 0.7, 0.2),
        ("user6", 1, 0.5, 0.4),
        ("user7", 2, 0.6, 0.8)
      ).toDF("userId", "group", "x", "y")
      
      val masterDf = Seq(
        (1, "pos1-1", 0.5, 0.1),
        (1, "pos1-2", 0.0, 0.0),
        (1, "pos1-3", 0.3, 0.3),
        (2, "pos2-3", 0.1, 0.3),
        (2, "pos2-3", 0.8, 0.8),
        (3, "pos3-1", 0.1, 0.7),
        (3, "pos3-1", 0.7, 0.1),
        (3, "pos3-1", 0.9, 0.0),
        (3, "pos3-1", 0.1, 0.3)
      ).toDF("group", "position", "x", "y")
      
      // Compute top-1 rows for each group
      val distance = sqrt(
        pow(inputDf("x") - masterDf("x"), lit(2.0)) +
        pow(inputDf("y") - masterDf("y"), lit(2.0))
      )
      
      val top1Df = inputDf.join_top_k(
        lit(1), masterDf, inputDf("group") === masterDf("group"),
        distance.as("score")
      )
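
The difference between the two strategies described above can be sketched with plain Scala collections. This is only an illustration under stated assumptions, not Hivemall's implementation: the `Input` and `Master` case classes, the `TopKJoinSketch` object, and both function names are hypothetical, `userId` is assumed to be unique, `k` is assumed to be at least 1, and higher scores are assumed to rank first. The baseline materializes every joined pair before selecting, while the fused version keeps at most `k` candidates per input row as it scans the matches.

```scala
import scala.collection.mutable

object TopKJoinSketch {
  // Hypothetical row types mirroring the example's columns; not part of Hivemall.
  final case class Input(userId: String, group: Int, x: Double, y: Double)
  final case class Master(group: Int, position: String, x: Double, y: Double)

  // Euclidean distance, used as the score (as in the example query).
  def score(in: Input, m: Master): Double =
    math.sqrt(math.pow(in.x - m.x, 2.0) + math.pow(in.y - m.y, 2.0))

  // Baseline: materialize every joined pair, then keep the k highest scores per input row.
  def joinThenTopK(k: Int, inputs: Seq[Input], masters: Seq[Master]): Map[String, List[(Double, String)]] = {
    val joined = for {
      in <- inputs
      m <- masters if m.group == in.group
    } yield (in.userId, score(in, m), m.position)
    joined.groupBy(_._1).map { case (userId, rows) =>
      userId -> rows.map(r => (r._2, r._3)).toList.sortWith(_._1 > _._1).take(k)
    }
  }

  // Fused: scan the matches of each input row once, retaining at most k candidates.
  def topKJoin(k: Int, inputs: Seq[Input], masters: Seq[Master]): Map[String, List[(Double, String)]] = {
    val byGroup = masters.groupBy(_.group)
    // Reversed comparison turns Scala's max-heap into a min-heap on score,
    // so `head` is always the weakest retained candidate.
    val minFirst = new Ordering[(Double, String)] {
      def compare(a: (Double, String), b: (Double, String)): Int =
        java.lang.Double.compare(b._1, a._1)
    }
    inputs.flatMap { in =>
      val heap = mutable.PriorityQueue.empty[(Double, String)](minFirst)
      for (m <- byGroup.getOrElse(in.group, Seq.empty[Master])) {
        val s = score(in, m)
        if (heap.size < k) heap.enqueue((s, m.position))
        else if (heap.nonEmpty && s > heap.head._1) {
          heap.dequeue() // evict the weakest candidate
          heap.enqueue((s, m.position))
        }
      }
      // Inner-join semantics: input rows without a match produce no output.
      if (heap.isEmpty) None
      else Some(in.userId -> heap.toList.sortWith(_._1 > _._1))
    }.toMap
  }
}
```

Both functions return the same scores when there are no ties, but the fused version never holds more than `k` joined rows per input row, which is where the I/O saving would come from in a distributed setting.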
      

              People

              • Assignee: Takeshi Yamamuro (maropu)
              • Reporter: Takeshi Yamamuro (maropu)
