# Prune cuboids by capping number of dimensions

## Details

• Type: Improvement
• Status: Open
• Priority: Major
• Resolution: Unresolved
• Affects Version/s: None
• Fix Version/s: None
• Component/s: None
• Labels: None

## Description

The scenario is this:

I have 20+ dimensions, but any single query uses at most 5 of them, so every cuboid that contains more than 5 dimensions (except the base cuboid) is useless.

I think we can add a cube-level configuration that limits the maximum number of dimensions a cuboid may include.

What's more, we could configure which levels (number of dimensions) need to be calculated. In the scenario above, we would calculate only levels 1, 2, 3, 4, and 5 and skip every level above 5.
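To make the potential saving concrete, here is a minimal sketch that counts cuboids with and without a cap. It assumes a plain cube with exactly 20 independent dimensions and no aggregation groups, hierarchies, mandatory or joint dimensions; `cuboid_count` is a hypothetical helper, not a Kylin function.

```python
from math import comb


def cuboid_count(total_dims: int, max_dims: int) -> int:
    """Count cuboids with 1..max_dims dimensions, plus the base cuboid.

    Hypothetical helper: assumes every dimension combination is a valid
    cuboid (no aggregation-group rules applied).
    """
    capped = sum(comb(total_dims, k) for k in range(1, max_dims + 1))
    # The base cuboid (all dimensions) is always built; count it once
    # unless the cap already covers it.
    base = 1 if max_dims < total_dims else 0
    return capped + base


full = 2 ** 20 - 1              # every non-empty combination of 20 dimensions
capped = cuboid_count(20, 5)    # levels 1..5 plus the base cuboid

print(f"uncapped: {full}, capped at 5: {capped}")
```

Under these assumptions the cap leaves 21,700 cuboids out of 1,048,575, which is why skipping the higher levels matters.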

## Attachments

1. Dimension Capping.md
3 kB
Roger Shi

## Activity

Billy Liu added a comment -

kylin.cube.aggrgroup.max-combination is used to limit the dimension combination.

fengYu added a comment - Reporter

I know that configuration and what it means. In my scenario, for example, I have 6 dimensions, A/B/C/D/E/F, and any query needs at most 2 of them (in WHERE and GROUP BY). So I need to calculate cuboids like AB/AC/AD/AE/AF/... and A/B/C/D/E/F, and skip ABC/ABD/ACD/... and ABCD/ABCE...

So I need a configuration that specifies the maximum number of dimensions a cuboid may contain in order to be calculated.
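The A/B/C/D/E/F example can be sketched as a short enumeration. This is only an illustration of the requested behaviour, not Kylin's cuboid scheduler; `capped_cuboids` is a hypothetical name, and the base cuboid is kept because it is always calculated.

```python
from itertools import combinations


def capped_cuboids(dims, max_dims):
    """Return the base cuboid plus every cuboid with at most max_dims dimensions.

    Hypothetical helper for illustration only.
    """
    kept = [tuple(dims)]  # the base cuboid is always kept
    # Walk from the highest allowed level down to single dimensions.
    for k in range(min(max_dims, len(dims) - 1), 0, -1):
        kept.extend(combinations(dims, k))
    return kept


cuboids = capped_cuboids("ABCDEF", 2)
for cuboid in cuboids:
    print("".join(cuboid))
```

With a cap of 2, the list holds ABCDEF, the fifteen 2-D cuboids, and the six 1-D cuboids; ABC, ABCD, and the other intermediate levels are skipped.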

Chuanlei Ni added a comment -

A possible optimization is to let users specify the range of dimension counts they care about. For example, we only query with [0, 5] dimensions in the GROUP BY and WHERE clauses.

Shaofeng SHI added a comment -

This is a good idea; it can help prune many "big" cuboids. With this feature, users can apply rules in the front-end, for example allowing a user to pick at most 5 of the 20 dimensions in a query.

fengYu added a comment - Reporter

Yes, that is what we want to do. I will work on it later.

XIE FAN added a comment -

I agree with this idea, and I think we can extend it a little: allow users to select exactly which cuboids they need in a visual way. Users choose the cuboids they need in the front-end, and only those cuboids are materialized. For example, users can choose to calculate all the 1-D and 2-D cuboids plus some of the 3-D and 4-D cuboids, and exclude the rest.

fengYu added a comment - Reporter

Yes, setting a range or enumerating all the levels to be calculated is a more user-friendly solution.

Copperfield added a comment -

What if we calculate from the top instead?
As I understand it, Kylin builds all cuboids from 6 dimensions down to 1 dimension, ABCDEF -> A/B/C/D/E/F.
We could build only down to 4 or 3 dimensions, ABCD/ABCE...,
then at query time use these existing results to calculate the result we want. Might that be quicker than going from the bottom up?

Billy Liu added a comment -

Thanks fengYu, this is very useful.

Chuanlei Ni added a comment - - edited

Suppose we only have the 4-D cuboids but need to query 3-D information.
Obviously, Kylin cannot get the result from HBase directly.
But can Kylin calculate the aggregation from the 4-D cuboid on the fly?

Is this possible in the current Kylin architecture, which stores the cube in HBase and does some post-processing via coprocessor?

@Kylin experts

liyang added a comment -

Chuanlei Ni, sure, Kylin will fall back to a 4-D cuboid if the requested 3-D cuboid is not available.
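The fallback described here amounts to re-aggregating a larger cuboid at query time. The sketch below illustrates that idea only; it is not Kylin's query engine or coprocessor code. It assumes a cuboid is a mapping from dimension-value tuples to an additive measure such as SUM, and `roll_up` and the sample rows are hypothetical.

```python
from collections import defaultdict


def roll_up(cuboid_dims, rows, query_dims):
    """Aggregate a larger cuboid down to the dimensions a query needs.

    cuboid_dims: dimension names of the stored cuboid, in key order
    rows:        {dimension-value tuple: additive measure (e.g. SUM)}
    query_dims:  subset of cuboid_dims requested by the query
    """
    positions = [cuboid_dims.index(d) for d in query_dims]
    result = defaultdict(int)
    for key, measure in rows.items():
        # Drop the dimensions the query does not use and merge the rows.
        result[tuple(key[i] for i in positions)] += measure
    return dict(result)


# A tiny 4-D cuboid over (A, B, C, D) with a SUM measure.
stored_dims = ["A", "B", "C", "D"]
stored_rows = {
    ("a1", "b1", "c1", "d1"): 10,
    ("a1", "b1", "c1", "d2"): 5,
    ("a1", "b2", "c1", "d1"): 7,
}

# The 3-D cuboid (A, B, C) was never built, so derive it on the fly.
answer = roll_up(stored_dims, stored_rows, ["A", "B", "C"])
print(answer)
```

The derived result is correct for additive measures; the trade-off is that the query scans the larger cuboid's rows instead of a precomputed smaller one.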

liyang added a comment -

Copperfield, the base cuboid (the one with the most dimensions) is always calculated. Keeping only a few levels of cuboids below the base is possible; however, those cuboids tend to be the biggest ones.

Roger Shi added a comment -

fengYu, you said you were working on this issue, how is it going?

fengYu added a comment - Reporter

Roger Shi, sorry for the delay. I am waiting for the release of Kylin 2.0 and want to add this feature on top of it. I think it will be released this week, and then I will do this job.

Roger Shi added a comment -

Hi, I have uploaded a design draft. Please let me know if anything is not clear. Comments are more than welcome.

liyang added a comment -

I think this is done. Right, Roger Shi?

Yang Hao added a comment -

Roger Shi, I have met the same problem. Some of the data is useless; we want another parameter to filter out data that has a null dimension. Can you consider it?

Yang Hao added a comment -

If no one is working on this, I would like to solve it by not saving the data in the "Convert Cuboid Data to HFile" step. If a max dimension count N is set, any row key with more than N columns is filtered out. How about this solution? Roger Shi liyang fengYu

Shaofeng SHI added a comment -

Filtering at the "Convert Cuboid Data to HFile" step is too late, as those cuboids have already been calculated.

Yang Hao added a comment -

Shaofeng SHI, yes, it's the simplest way, but it costs both time and space. The best way is to filter the data in the cuboid generation step.

liyang added a comment -

The work has been mostly done. There is a "dim_cap" field in the aggregation group which does what this JIRA wants. I'm not sure how it is reflected in the GUI, however.

commit e0f30d100b7afc73326538d4d8a57b973f57013b
Author: lidongsjtu <lidong@apache.org>
Date: Thu May 25 21:27:39 2017 +0800

KYLIN-2363 minor update for cuboid api

commit a1ccf02e297c3b655b707880aa27c9049f4b1b8b
Author: Roger Shi <rogershijicheng@hotmail.com>
Date: Thu May 25 19:22:15 2017 +0800

KYLIN-2363 capping number of dimensions
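
As a rough illustration of the "dim_cap" field mentioned in this comment, the sketch below builds one aggregation group as a Python dictionary and prints it as JSON. The comment only says the field lives in the aggregation group; placing it under `select_rule` next to `hierarchy_dims`, `mandatory_dims`, and `joint_dims`, as well as the dimension names and the value 2, are assumptions made for this example and should be checked against the cube descriptor of your Kylin version.

```python
import json

# Hypothetical aggregation group for the six-dimension example in this thread.
# Assumption: "dim_cap" sits inside "select_rule"; verify against your Kylin version.
aggregation_group = {
    "includes": ["A", "B", "C", "D", "E", "F"],
    "select_rule": {
        "hierarchy_dims": [],
        "mandatory_dims": [],
        "joint_dims": [],
        "dim_cap": 2,  # assumed meaning: at most 2 dimensions per cuboid
    },
}

print(json.dumps(aggregation_group, indent=2))
```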


## People

• Assignee: Roger Shi
• Reporter: fengYu
• Request participants: None