[CALCITE-1656] Improve cost function in DruidQuery to encourage early column pruning

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.12.0
    • Component/s: druid
    • Labels:

      Description

      Consider the query below:

      select "countryName", floor("time" to DAY), cast(count(*) as integer) as c
      from "wiki"
      where floor("time" to DAY) >= '1997-01-01 00:00:00'
        and floor("time" to DAY) < '1997-09-01 00:00:00'
      group by "countryName", floor("time" to DAY)
      order by c limit 5
      

      The resulting Druid query:

      {
        "queryType": "select",
        "dataSource": "wikiticker",
        "descending": false,
        "intervals": [
          "1900-01-09T00:00:00.000/2992-01-10T00:00:00.000"
        ],
        "dimensions": [
          "channel",
          "cityName",
          "comment",
          "countryIsoCode",
          "countryName",
          "isAnonymous",
          "isMinor",
          "isNew",
          "isRobot",
          "isUnpatrolled",
          "metroCode",
          "namespace",
          "page",
          "regionIsoCode",
          "regionName",
          "user"
        ],
        "metrics": [
          "count",
          "added",
          "deleted",
          "delta",
          "user_unique"
        ],
        "granularity": "all",
        "pagingSpec": {
          "threshold": 16384,
          "fromNext": true
        },
        "context": {
          "druid.query.fetch": false
        }
      }
      

      Note that the above Druid query has extra dimensions which are not required.

        Activity

        nishantbangarwa Nishant Bangarwa added a comment -

        Working on this; will add a patch with a test case soon.

        julianhyde Julian Hyde added a comment -

        That's great - but please make it a PR rather than a patch.

        nishantbangarwa Nishant Bangarwa added a comment -

        Jesus Camacho Rodriguez, Julian Hyde: please have a look at https://github.com/apache/calcite/pull/382

        julianhyde Julian Hyde added a comment -

        Nishant Bangarwa, A couple of review comments:

        • Please explain in the test case what you consider to be a "good" or "bad" plan. Otherwise future maintainers will break it.
        • The comment "Cost of Select > GroupBy > Timeseries > TopN" doesn't agree with the numbers (.1, .08, .04, .06).
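
        For illustration only, the multipliers quoted above could be encoded as explicit per-query-type constants; read numerically, they rank the query types as Select (.1) > GroupBy (.08) > TopN (.06) > Timeseries (.04), which is why they disagree with the comment. This is a hypothetical sketch, not the code in the PR:

        // Hypothetical per-query-type cost multipliers, using the values quoted in the
        // review comment above. Numerically they rank Select > GroupBy > TopN > Timeseries.
        enum DruidQueryTypeCost {
          SELECT(0.1), GROUP_BY(0.08), TOP_N(0.06), TIMESERIES(0.04);

          final double multiplier;

          DruidQueryTypeCost(double multiplier) {
            this.multiplier = multiplier;
          }
        }
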
        nishantbangarwa Nishant Bangarwa added a comment -

        Julian Hyde: Fixed the above comments. One last pending discussion is about including the number of fields to be queried as part of the cost function of DruidQuery.
        I had a discussion with Jesus Camacho Rodriguez and it seems that DruidQuery may not be the best place to adjust cost based on reading more or fewer columns; it should ideally be part of a TableScan instead. But the existing cost model is built around the number of rows, i.e. row cardinality, and not around the number of columns that need to be scanned. IMO the number of columns being scanned is an important measure for columnar databases, and we should maybe also account for it in TableScan, or have a new ColumnarTableScan that does. Any thoughts on this?

        I think it may be OK to have the number of fields as part of DruidQuery's cost for now, until we improve our cost model to also include the number of columns being scanned.
        Do you agree?
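
        A minimal sketch of the idea under discussion: scale a DruidQuery's self-cost by the fraction of the table's columns it actually fetches, so that narrower queries are preferred. This is illustrative only; the field and method references are assumptions, not the code in the PR:

        // Illustrative only: make a DruidQuery cheaper when it fetches fewer of the
        // table's columns, so the planner prefers plans that prune columns early.
        @Override public RelOptCost computeSelfCost(RelOptPlanner planner, RelMetadataQuery mq) {
          double rowCount = mq.getRowCount(this);
          double projectedFields = getRowType().getFieldCount();
          double tableFields = table.getRowType().getFieldCount();  // 'table' assumed to be the scanned RelOptTable
          return planner.getCostFactory()
              .makeCost(rowCount, 0, 0)
              .multiplyBy(projectedFields / tableFields);
        }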

        julianhyde Julian Hyde added a comment -

        If I/O is the major concern then Jesus Camacho Rodriguez is right, we should focus on table scans. But there is also general processing cost. I have no objection to having a small component of the cost of a project or even a filter be due to the number of columns. Then if we hit a situation where, all else being equal, we have a choice between a 10 column project and a 100 column project, or indeed between a 10 column DruidQuery and a 100 column DruidQuery, then we'd choose the narrower one.
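
        A sketch of that idea for a generic Project, with a hypothetical weighting constant (not actual Calcite code): give the column count a small cpu component so that, all else being equal, a 10-column project beats a 100-column project.

        // Hypothetical: add a small per-column cpu term to a Project's cost so the
        // narrower of two otherwise-equal projects wins.
        private static final double COLUMN_COST_FACTOR = 0.01;  // assumed constant

        RelOptCost projectCost(RelOptPlanner planner, RelMetadataQuery mq, Project project) {
          double rowCount = mq.getRowCount(project);
          double columnCount = project.getRowType().getFieldCount();
          double cpu = rowCount * columnCount * COLUMN_COST_FACTOR;
          return planner.getCostFactory().makeCost(rowCount, cpu, 0);
        }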

        julianhyde Julian Hyde added a comment -

        Nishant Bangarwa, by the way, I'd like to look into the test that you assert is broken, but my VirtualBox installation is currently broken and therefore I can't run the DruidAdapterIT tests. Working on it...

        nishantbangarwa Nishant Bangarwa added a comment -

        Julian Hyde: Also, it seems strange that the VolcanoPlanner does not take the cpu or io components of the cost into account here and only considers rowCount - https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/plan/volcano/VolcanoCost.java#L100-L100

        Can you provide some context here?
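
        For context, the comparison being referenced is roughly of this shape (a simplified paraphrase, not a verbatim copy of VolcanoCost): ordering is effectively decided by rowCount alone, so the cpu and io components never break ties between plans.

        // Simplified paraphrase of the kind of comparison discussed above.
        public boolean isLe(RelOptCost other) {
          VolcanoCost that = (VolcanoCost) other;
          return this == that || this.rowCount <= that.rowCount;
        }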

        julianhyde Julian Hyde added a comment -

        Simple is better.

        Totally ordered costs are much easier to handle than partially ordered costs.

        julianhyde Julian Hyde added a comment -

        +1

        I have staged this change onto my https://github.com/julianhyde/calcite/tree/1656-druid-prune branch, and added a linear interpolation function to make the cost function better behaved for small or large numbers of fields.

        On this branch I have also fixed the plan change due to CALCITE-1601.

        I will commit to master after Jesus Camacho Rodriguez has fixed DruidAdapterIT breakage caused by CALCITE-1661.
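
        A sketch of what such a linear interpolation could look like (hypothetical helper; the name and signature are assumptions, not necessarily what is on the branch): map the field count onto a bounded multiplier, clamping at both ends so very small or very large field counts keep the cost well behaved.

        // Hypothetical helper: linearly interpolate x in [minX, maxX] onto [minY, maxY],
        // clamping at the ends so extreme field counts do not produce degenerate costs.
        static double linear(int x, int minX, int maxX, double minY, double maxY) {
          if (x <= minX) {
            return minY;
          }
          if (x >= maxX) {
            return maxY;
          }
          return minY + (maxY - minY) * (x - minX) / (maxX - minX);
        }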

        julianhyde Julian Hyde added a comment -

        Fixed in http://git-wip-us.apache.org/repos/asf/calcite/commit/c90fddfb. Thanks for the PR, Nishant Bangarwa!

        julianhyde Julian Hyde added a comment -

        Resolved in release 1.12.0 (2017-03-24).


          People

          • Assignee: Nishant Bangarwa
          • Reporter: Nishant Bangarwa
          • Votes: 0
          • Watchers: 4
