Details
Description
Semantics of Druid topN query with limit and granularity is not equivalent to input SQL. In particular, limit is applied on each granularity value, not on the overall query.
Currently, the following query will be transformed into a topN query:
SELECT i_brand_id, floor_day(`__time`), max(ss_quantity), sum(ss_wholesale_cost) as s FROM store_sales_sold_time_subset GROUP BY i_brand_id, floor_day(`__time`) ORDER BY s DESC LIMIT 10;
Previous query outputs at most 10 rows. In turn, the equivalent SQL query for a Druid topN query should be expressed as:
SELECT rs.i_brand_id, rs.d, rs.m, rs.s FROM ( SELECT i_brand_id, floor_day(`__time`) as d, max(ss_quantity) as m, sum(ss_wholesale_cost) as s, ROW_NUMBER() OVER (PARTITION BY floor_day(`__time`) ORDER BY sum(ss_wholesale_cost) DESC ) AS rownum FROM store_sales_sold_time_subset GROUP BY i_brand_id, floor_day(`__time`) ) rs WHERE rownum <= 10;
Issue Links
 relates to

CALCITE1591 Druid adapter: Use "groupBy" query with extractionFn for time dimension
 Open

HIVE15636 Hive/Druid integration: wrong semantics of topN query limit with granularity
 Resolved
Julian Hyde, could you doublecheck the alternative SQL formulation? Thanks