Description
In Hivemall, `each_top_k` is useful for practical use cases. In some cases, however, we need to join tables and then compute the top-K entries. This query can be computed with a regular join followed by `each_top_k`, but there is room for improvement: we can compute the top-K entries while processing the join. This optimization avoids a substantial amount of I/O for the join.
An example query is as follows:
val inputDf = Seq(
  ("user1", 1, 0.3, 0.5),
  ("user2", 2, 0.1, 0.1),
  ("user3", 3, 0.8, 0.0),
  ("user4", 1, 0.9, 0.9),
  ("user5", 3, 0.7, 0.2),
  ("user6", 1, 0.5, 0.4),
  ("user7", 2, 0.6, 0.8)
).toDF("userId", "group", "x", "y")

val masterDf = Seq(
  (1, "pos1-1", 0.5, 0.1),
  (1, "pos1-2", 0.0, 0.0),
  (1, "pos1-3", 0.3, 0.3),
  (2, "pos2-3", 0.1, 0.3),
  (2, "pos2-3", 0.8, 0.8),
  (3, "pos3-1", 0.1, 0.7),
  (3, "pos3-1", 0.7, 0.1),
  (3, "pos3-1", 0.9, 0.0),
  (3, "pos3-1", 0.1, 0.3)
).toDF("group", "position", "x", "y")

// Compute top-1 rows for each group
val distance = sqrt(
  pow(inputDf("x") - masterDf("x"), lit(2.0)) +
  pow(inputDf("y") - masterDf("y"), lit(2.0))
)

val top1Df = inputDf.join_top_k(
  lit(1),
  masterDf,
  inputDf("group") === masterDf("group"),
  distance.as("score")
)
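To illustrate why fusing the two steps saves work, here is a minimal sketch in plain Scala collections, without Spark or Hivemall. It is not the proposed implementation. The names `Input`, `Master`, `Scored`, `joinThenTopK`, and `joinTopK` are hypothetical, and the sketch assumes that "top" means the highest score and that the top-K is taken per left-hand row, which this description does not spell out. The baseline materializes every joined pair before ranking, while the fused version keeps at most k candidates per input row during the probe.

```scala
import scala.collection.mutable

// Hypothetical row types for illustration; not part of Hivemall.
final case class Input(userId: String, group: Int, x: Double, y: Double)
final case class Master(group: Int, position: String, x: Double, y: Double)
final case class Scored(userId: String, position: String, score: Double)

object JoinTopKSketch {
  // Euclidean distance, the same score as in the example query.
  def distance(i: Input, m: Master): Double =
    math.sqrt(math.pow(i.x - m.x, 2.0) + math.pow(i.y - m.y, 2.0))

  // Baseline: materialize every joined pair, then keep the k highest scores per user.
  def joinThenTopK(inputs: Seq[Input], masters: Seq[Master], k: Int): Seq[Scored] = {
    val joined = for {
      i <- inputs
      m <- masters if m.group == i.group
    } yield Scored(i.userId, m.position, distance(i, m))
    joined.groupBy(_.userId).values
      .flatMap(_.sortWith(_.score > _.score).take(k))
      .toSeq
  }

  // Fused: keep a heap of at most k candidates per input row while probing.
  def joinTopK(inputs: Seq[Input], masters: Seq[Master], k: Int): Seq[Scored] = {
    // Build side of a hash join, keyed by the join column.
    val byGroup = masters.groupBy(_.group)
    // Ordering under which the lowest score has the highest priority,
    // so the heap's head is always the weakest candidate kept so far.
    val weakestFirst = Ordering.fromLessThan[Scored](_.score > _.score)
    inputs.flatMap { i =>
      val heap = mutable.PriorityQueue.empty[Scored](weakestFirst)
      for (m <- byGroup.getOrElse(i.group, Seq.empty)) {
        val s = distance(i, m)
        if (heap.size < k) {
          heap.enqueue(Scored(i.userId, m.position, s))
        } else if (heap.nonEmpty && s > heap.head.score) {
          heap.dequeue() // evict the weakest candidate
          heap.enqueue(Scored(i.userId, m.position, s))
        }
      }
      heap.toList // at most k rows per input row, in no particular order
    }
  }
}
```

With this shape, the number of rows emitted by the join is bounded by k times the number of input rows rather than by the full join size, which is where the I/O saving would come from. When scores tie, the two functions may keep different rows with the same score.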