Kylin / KYLIN-2363

Prune cuboids by capping number of dimensions

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      The scenario is this:

      I have 20+ dimensions. However, a query will use at most 5 of them, so any cuboid that contains more than 5 dimensions (except the base cuboid) is useless.

      I think we can add a cube-level configuration that limits the maximum number of dimensions a cuboid may include.

      What's more, we could configure which levels (numbers of dimensions) need to be calculated. In the scenario above, we would calculate only levels 1, 2, 3, 4 and 5, and skip every level above 5.
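
To give a sense of the savings in the 20-dimension scenario, the following is a minimal sketch that counts cuboids with and without a cap. It assumes a plain cube with no aggregation groups, joint dimensions or hierarchies (all of which would reduce the real numbers), and the function name `cuboid_counts` is hypothetical, not part of Kylin.

```python
from math import comb


def cuboid_counts(num_dims: int, max_dims: int) -> tuple[int, int]:
    """Return (uncapped, capped) cuboid counts, ignoring the empty cuboid.

    Assumption: every dimension combination is a candidate cuboid,
    i.e. no aggregation groups or other pruning rules are in effect.
    """
    uncapped = 2 ** num_dims - 1
    # Keep every level from 1 up to max_dims dimensions.
    capped = sum(comb(num_dims, k) for k in range(1, max_dims + 1))
    if max_dims < num_dims:
        capped += 1  # the base cuboid is kept regardless of the cap
    return uncapped, capped


uncapped, capped = cuboid_counts(num_dims=20, max_dims=5)
print(f"uncapped: {uncapped}, capped at 5 dimensions: {capped}")
```

Under these assumptions, a cap of 5 on 20 dimensions keeps 21,700 of the 1,048,575 possible cuboids.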

        Activity

        yimingliu Billy Liu added a comment -

        kylin.cube.aggrgroup.max-combination is used to limit the dimension combination.

        feng_xiao_yu fengYu added a comment - Reporter

        I know that configuration and what it means. In my scenario, for example, I have 6 dimensions, A/B/C/D/E/F, and any query needs at most 2 dimensions (in WHERE and GROUP BY). So I need to calculate cuboids like AB/AC/AD/AE/AF/... and A/B/C/D/E/F, and skip ABC/ABD/ACD/... and ABCD/ABCE/...

        So I need a configuration that specifies the maximum number of dimensions a calculated cuboid may contain.
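
The A/B/C/D/E/F example can be sketched as a small enumeration. This only illustrates the requested rule, not Kylin's cuboid scheduler; `capped_cuboids` is a hypothetical helper, and the base cuboid is retained on the assumption that it is always built.

```python
from itertools import combinations


def capped_cuboids(dims, max_dims):
    """List the cuboids to build when each may hold at most max_dims dimensions.

    The base cuboid (all dimensions) is appended even if it exceeds the cap.
    """
    dims = list(dims)
    kept = [
        "".join(combo)
        for level in range(1, max_dims + 1)
        for combo in combinations(dims, level)
    ]
    base = "".join(dims)
    if base not in kept:
        kept.append(base)
    return kept


cuboids = capped_cuboids("ABCDEF", max_dims=2)
print(len(cuboids))  # 6 one-dimension + 15 two-dimension + 1 base cuboid
print(cuboids)
```

With a cap of 2, ABC, ABD, ABCD and the like never appear in the list, which is the pruning asked for here.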

        Chuanlei Ni Chuanlei Ni added a comment -

        Maybe a possible optimization is to specify the range of dimension counts that users care about. For example, we will only query with [0, 5] dimensions in the GROUP BY and WHERE clauses.

        Shaofengshi Shaofeng SHI added a comment -

        This is a good idea; it can help prune many "big" cuboids. With this feature, users can apply rules in the front end, for example, allowing a user to pick at most 5 of the 20 dimensions in a query.

        feng_xiao_yu fengYu added a comment - Reporter

        Yes, that is what we want to do. I will work on it later.

        xiefan46 XIE FAN added a comment -

        I agree with this idea. And I think we can extend this idea a little: allow users to select exactly which cuboids they need in a visual way. Users can choose what cuboids they need in the front-end and only these cuboids will be materialized. For example, users can choose to calculate all the 1-D and 2-D cuboids and part of the 3-D, 4-D cuboids and exclude the other.

        feng_xiao_yu fengYu added a comment - Reporter

        Yes, setting a range or enumerating all the levels to be calculated is a more user-friendly solution.

        xwhfcenter Copperfield added a comment -

        And maybe we could calculate from the top?
        As I understand it, Kylin builds all cuboids from 6 dimensions down to 1 dimension, ABCDEF -> A/B/C/D/E/F.
        We could build only down to 4 or 3 dimensions (ABCD/ABCE/...);
        then, at query time, we can use these existing results to calculate the result we want. It may be quicker than going from the bottom up?

        yimingliu Billy Liu added a comment -

        Thanks fengYu, this is very useful.

        Chuanlei Ni Chuanlei Ni added a comment - - edited

        Suppose we only have the 4-D cuboids but need to query 3-D information.
        Obviously, Kylin cannot get the result from HBase directly.
        But can Kylin calculate the aggregation from the 4-D cuboid on the fly?

        Is this possible in the current Kylin architecture, which stores the cube in HBase and does some post-processing via coprocessors?

        @Kylin experts

        liyang.gmt8@gmail.com liyang added a comment -

        Chuanlei Ni, sure, Kylin will fall back to a 4-D cuboid if the requested 3-D cuboid is not available.
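
The fallback can be pictured with a small sketch: pick a materialized cuboid that covers the query's dimensions, then aggregate away the extra ones at query time. This is an illustration under stated assumptions, not Kylin's actual cuboid-selection logic. The helpers `pick_cuboid` and `roll_up` are hypothetical, "fewest dimensions wins" is an assumed selection rule, and the roll-up only holds for additive measures such as SUM.

```python
from collections import defaultdict
from itertools import combinations


def pick_cuboid(query_dims, materialized):
    """Return the smallest materialized cuboid containing all query dimensions."""
    wanted = frozenset(query_dims)
    candidates = [c for c in materialized if wanted <= c]
    if not candidates:
        raise LookupError("no materialized cuboid covers the query")
    # Assumed rule: fewest dimensions first, ties broken alphabetically.
    return min(candidates, key=lambda c: (len(c), sorted(c)))


def roll_up(rows, cuboid_dims, query_dims):
    """Aggregate (key, measure) rows of a cuboid down to the query dimensions."""
    positions = [cuboid_dims.index(d) for d in query_dims]
    totals = defaultdict(int)
    for key, measure in rows:
        totals[tuple(key[i] for i in positions)] += measure
    return dict(totals)


# Only the 4-D cuboids and the base cuboid of ABCDEF are materialized.
materialized = [frozenset(c) for c in combinations("ABCDEF", 4)]
materialized.append(frozenset("ABCDEF"))

chosen = "".join(sorted(pick_cuboid("ABC", materialized)))
rows = [  # hypothetical rows of the chosen 4-D cuboid with a SUM measure
    (("a1", "b1", "c1", "d1"), 10),
    (("a1", "b1", "c1", "d2"), 5),
    (("a2", "b1", "c1", "d1"), 7),
]
print(chosen, roll_up(rows, chosen, "ABC"))
```

The extra aggregation step is the query-time cost of not materializing the 3-D cuboid.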

        liyang.gmt8@gmail.com liyang added a comment -

        Copperfield, the base cuboid (the one with the most dimensions) is always calculated. Keeping a few levels of cuboids below the base is possible; however, those cuboids tend to be the biggest ones.

        R0ger Roger Shi added a comment -

        fengYu, you said you were working on this issue. How is it going?

        feng_xiao_yu fengYu added a comment - Reporter

        Roger Shi, sorry for the delay. I am waiting for the release of Kylin 2.0 and want to add this feature on top of it. I think it will be released this week, and then I will do this work.

        R0ger Roger Shi added a comment -

        Hi, I have uploaded a design draft. Please let me know if anything is not clear. Comments are more than welcome.

        liyang.gmt8@gmail.com liyang added a comment -

        Think this is done. Right? Roger Shi

        yanghaogn Yang Hao added a comment -

        Roger Shi, I have met the same problem. Some data is useless; we want another parameter to filter out data that has a null dimension. Can you consider it?

        yanghaogn Yang Hao added a comment -

        If no one else is going to solve the problem, I would like to solve it by not saving the data in the "Convert Cuboid Data to HFile" step. If the max dimension N has been set, then a row key with more than N columns will be filtered out. How about this solution? Roger Shi liyang fengYu

        Shaofengshi Shaofeng SHI added a comment -

        Filtering at the "Convert Cuboid Data to HFile" step is too late, as those cuboids have already been calculated.

        yanghaogn Yang Hao added a comment -

        Shaofeng SHI Yes, it's the simplest way, but it's time-consuming and space-consuming. The best way is to filter the data in the cuboid generation step.

        liyang.gmt8@gmail.com liyang added a comment -

        The work has mostly been done. There is a "dim_cap" field in the aggregation group which does what this JIRA wants. I'm not sure how it is reflected in the GUI, however.

        commit e0f30d100b7afc73326538d4d8a57b973f57013b
        Author: lidongsjtu <lidong@apache.org>
        Date: Thu May 25 21:27:39 2017 +0800

        KYLIN-2363 minor update for cuboid api

        commit a1ccf02e297c3b655b707880aa27c9049f4b1b8b
        Author: Roger Shi <rogershijicheng@hotmail.com>
        Date: Thu May 25 19:22:15 2017 +0800

        KYLIN-2363 capping number of dimensions


          People

          • Assignee:
            R0ger Roger Shi
            Reporter:
            feng_xiao_yu fengYu
            Request participants:
            None
          • Votes:
            1
            Watchers:
            11
