Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-15636

Hive/Druid integration: wrong semantics of topN query limit with granularity

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 3.0.0
    • Component/s: Druid integration
    • Labels:
      None
    • Target Version/s:

      Description

      Semantics of Druid topN query with limit and granularity is not equivalent to input SQL. In particular, limit is applied on each granularity value, not on the overall query.

      Currently, the following query will be transformed into a topN query:

      SELECT i_brand_id, floor_day(`__time`), max(ss_quantity), sum(ss_wholesale_cost) as s
      FROM store_sales_sold_time_subset
      GROUP BY i_brand_id, floor_day(`__time`)
      ORDER BY s DESC
      LIMIT 10;
      OK
      Plan optimized by CBO.
      
      Stage-0
        Fetch Operator
          limit:-1
          Stage-1
            Map 1 vectorized
            File Output Operator [FS_4]
              Select Operator [SEL_3] (rows=15888 width=0)
                Output:["_col0","_col1","_col2","_col3"]
                TableScan [TS_0] (rows=15888 width=0)
                  tpcds_druid_10@store_sales_sold_time_subset,store_sales_sold_time_subset,Tbl:PARTIAL,Col:NONE,Output:["i_brand_id","__time","$f2","$f3"],properties:{"druid.query.json":"{\"queryType\":\"topN\",\"dataSource\":\"druid_tpcds_ss_sold_time_subset\",\"granularity\":\"DAY\",\"dimension\":\"i_brand_id\",\"metric\":\"$f3\",\"aggregations\":[{\"type\":\"longMax\",\"name\":\"$f2\",\"fieldName\":\"ss_quantity\"},{\"type\":\"doubleSum\",\"name\":\"$f3\",\"fieldName\":\"ss_wholesale_cost\"}],\"intervals\":[\"1900-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z\"],\"threshold\":10}","druid.query.type":"topN"}
      

      It outputs 300 rows, 10 per day. In turn, the equivalent SQL query for a Druid topN query should be expressed as:

      SELECT rs.i_brand_id, rs.d, rs.m, rs.s
      FROM (
          SELECT i_brand_id, floor_day(`__time`) as d, max(ss_quantity) as m, sum(ss_wholesale_cost) as s,
                 ROW_NUMBER() OVER (PARTITION BY floor_day(`__time`) ORDER BY sum(ss_wholesale_cost) DESC ) AS rownum
          FROM store_sales_sold_time_subset
          GROUP BY i_brand_id, floor_day(`__time`)
      ) rs
      WHERE rownum <= 10;
      

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                julianhyde Julian Hyde
                Reporter:
                jcamachorodriguez Jesus Camacho Rodriguez
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: