Our users occasionally want to change an existing cube, for example by adding, renaming, or removing a dimension. Some of these changes require modifying the cube's source Hive table. When a user changes the table schema and reloads its metadata in Kylin, several issues can arise, depending on what was changed.
I ran some schema change tests on 1.5.3. The results after reloading the table are listed below:
||type of changes||fact table||lookup table||
|minor|both query and build still work|query can fail or return wrong answer|
|major|fail to load related cube|fail to load related cube|
Minor changes are those that don't affect columns used in cubes, such as inserting or appending a new column, or removing or changing an unused column.
Major changes are the opposite, such as removing, renaming, or changing the type of a used column.
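The minor/major distinction above can be sketched as a simple comparison of the old and new column lists. This is only an illustration, not Kylin code: the Column class, the classify_change function, and the used_names parameter (the names of columns referenced by any cube) are all hypothetical. A rename shows up as the old name going missing, so it is treated like a removal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for one column of a Hive table."""
    name: str
    datatype: str


def classify_change(old_cols, new_cols, used_names):
    """Return 'major' if a column used by a cube was removed, renamed,
    or changed type; otherwise return 'minor'.

    An unchanged schema is also reported as 'minor' in this sketch.
    """
    new_by_name = {col.name: col for col in new_cols}
    for col in old_cols:
        if col.name not in used_names:
            # Changes to unused columns never break a cube.
            continue
        replacement = new_by_name.get(col.name)
        if replacement is None or replacement.datatype != col.datatype:
            return "major"
    return "minor"
```

For example, appending a column or dropping an unused one yields 'minor', while dropping or retyping a column listed in used_names yields 'major'.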
As the results above show, reloading a changed table is problematic in certain cases.
KYLIN-1536 reports a similar problem.
So what can we do to support this kind of iterative development process (load -> define cube -> build -> reload -> change cube -> rebuild)?
My first thought is to simply detect and prohibit reloading a table that is in use. The user should be able to see which cube is preventing the reload, and could then drop and recreate the cube after reloading. However, defining a cube is not an easy task (consider editing 100 measures). Forcing users to recreate their cubes over and over again will certainly not make them happy.
A better idea is to allow a cube to remain editable even if it is broken because some columns changed after reloading. A broken cube can't be built or queried; it can only be edited or dropped. In fact, there is a cube status called RealizationStatusEnum.DESCBROKEN in the code, but it was never used. We should take advantage of it.
An enabled cube shouldn't allow schema changes; otherwise an unintentional reload could make it unavailable. Similarly, a disabled but unpurged cube shouldn't allow schema changes, since it still has data in it.
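The proposed rules can be sketched as follows. This is a rough model under stated assumptions, not the actual Kylin implementation: CubeStatus, Cube, blocking_cubes, and is_allowed are hypothetical names, READY is assumed to stand for an enabled cube, and segment_count > 0 is assumed to mean the cube has not been purged.

```python
from dataclasses import dataclass
from enum import Enum


class CubeStatus(Enum):
    """Hypothetical model of cube states; only DESCBROKEN is named in this issue."""
    READY = "READY"            # assumed to represent an enabled cube
    DISABLED = "DISABLED"
    DESCBROKEN = "DESCBROKEN"


@dataclass
class Cube:
    name: str
    status: CubeStatus
    segment_count: int         # assumption: 0 means the cube has been purged


def blocking_cubes(cubes):
    """Names of cubes that should prevent a table reload:
    enabled cubes, and any cube that still holds data."""
    return [
        cube.name
        for cube in cubes
        if cube.status is CubeStatus.READY or cube.segment_count > 0
    ]


# A broken cube can only be edited or dropped.
BROKEN_ALLOWED = frozenset({"edit", "drop"})


def is_allowed(cube, operation):
    """Restrict a DESCBROKEN cube to edit/drop; other states are not modeled here."""
    if cube.status is CubeStatus.DESCBROKEN:
        return operation in BROKEN_ALLOWED
    return True
```

Returning the names of the blocking cubes, rather than a plain yes/no, lets the user see which cubes need to be disabled and purged before the reload can proceed.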