[KYLIN-998] Finish the hive intermediate table clean up job in org.apache.kylin.job.hadoop.cube.StorageCleanupJob - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: v0.7.1, v0.7.2
Fix Version/s: v1.1
Component/s: Storage - HBase
Labels:
None

Description

Currently, the last step of a Kylin cube build job, named "Garbage Collection", removes the intermediate data in HDFS/HBase/Hive. But if the job stops unexpectedly (for example because of a problem in the Hadoop cluster, a bad cube design, or being discarded by the user), that data is left undeleted.

In such cases, we can run "hbase org.apache.hadoop.util.RunJar $KYLIN_HOME/lib/kylin-job-0.8.1-incubating-SNAPSHOT.jar org.apache.kylin.job.hadoop.cube.StorageCleanupJob --delete true" to remove the data. But the method "cleanUnusedIntermediateHiveTable" is unfinished.

My first patch finishes the method; it removes unused Hive tables whose names begin with "kylin_intermediate_".

My second patch adds some methods to enable deleting unused data by UUIDs given on the command line or stored in a file.

I don't know whether the second patch is useful to you; it is used on our Kylin server to remove data after a cube is deleted.
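
The name-based selection described for the first patch can be sketched as follows. This is only an illustration, not the code in the attached patches: the class `IntermediateTableFilter` and the method `selectCandidates` are hypothetical names, and the list of table names is assumed to have been fetched from Hive already. Only the prefix "kylin_intermediate_" comes from the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Kylin: picks cleanup candidates by name.
class IntermediateTableFilter {
    // Prefix quoted in the issue description.
    static final String PREFIX = "kylin_intermediate_";

    // Returns the tables whose names begin with the intermediate-table prefix.
    // The comparison is done on the lower-cased name.
    static List<String> selectCandidates(List<String> hiveTables) {
        List<String> candidates = new ArrayList<>();
        for (String table : hiveTables) {
            if (table.toLowerCase(Locale.ROOT).startsWith(PREFIX)) {
                candidates.add(table);
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        // Example table names are made up for illustration.
        List<String> tables = Arrays.asList("kylin_intermediate_cube_a", "sales_fact", "kylin_sales");
        System.out.println(selectCandidates(tables));
    }
}
```

A name match alone only yields candidates; as the review comments below point out, each candidate still has to be checked against active jobs and against the owning Kylin installation before it is dropped.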

Attachments


- KYLIN-998-0.7-staging.patch (4 kB), 06/Sep/15 06:39, Shao Feng Shi
- KYLIN-998-0.7-staging-v3.patch (6 kB), 07/Sep/15 08:45, nichunen
- KYLIN-998-0.8.patch (3 kB), 06/Sep/15 06:39, Shao Feng Shi
- KYLIN-998-0.8-v3.patch (6 kB), 07/Sep/15 08:45, nichunen
- KYLIN-998-UUIDS.patch (19 kB), 07/Sep/15 09:03, nichunen

Activity


nichunen added a comment - 06/Sep/15 05:18

Upload the first patch for 0.7-staging and 0.8.


Shao Feng Shi added a comment - 06/Sep/15 06:36

+1 Looks good; it checks and excludes the intermediate tables of active jobs.


Shao Feng Shi added a comment - 06/Sep/15 06:42

I renamed these two patch files (adding the branch name to each) and attached them here for online review. Thanks for your contribution, Chun En!


Shao Feng Shi added a comment - 06/Sep/15 07:12

-1 I just thought of a case where this patch will cause an issue. In our deployment, we have multiple Kylin installations in one Hadoop cluster. In this case, the patch cannot differentiate whether an intermediate table was created by the current installation or by another Kylin, so it may delete a table that was created by another Kylin; if that table is in use, the job in that Kylin will fail.

So it needs to check whether the intermediate table belongs to the current Kylin instance: if yes, continue; otherwise, skip it. If you look at the code that picks up the useless HBase tables, you will see that it checks this through a tag added to the table's meta info. For the Hive intermediate table, this check can be performed by getting the table's storage location and seeing whether it is under the "hdfsWorkingDirectory" of the current Kylin (KylinConfig.getHdfsWorkingDirectory), as different installations will have different HDFS locations.
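
The location test suggested in this comment could look roughly like the sketch below. It assumes the table's storage location and the working directory have already been obtained as strings (from Hive and from KylinConfig.getHdfsWorkingDirectory respectively); the class `WorkingDirCheck` and its methods are hypothetical names. The sketch compares only the path part and ignores scheme and namenode, which is a simplification.

```java
import java.net.URI;

// Hypothetical helper, not part of Kylin: decides whether a Hive table's
// storage location lies under a given HDFS working directory.
class WorkingDirCheck {

    // Reduces a location such as "hdfs://nn:8020/kylin/prod/" to "/kylin/prod".
    // Assumes the location is a valid URI (no unescaped spaces).
    static String normalize(String location) {
        String path = URI.create(location).normalize().getPath();
        while (path.length() > 1 && path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        return path;
    }

    static boolean isUnderWorkingDir(String tableLocation, String hdfsWorkingDir) {
        String table = normalize(tableLocation);
        String base = normalize(hdfsWorkingDir);
        if (base.equals("/")) {
            return true; // everything is under the root directory
        }
        // Require a path-separator boundary so that "/kylin/prod2" is not
        // mistaken for a child of "/kylin/prod".
        return table.equals(base) || table.startsWith(base + "/");
    }

    public static void main(String[] args) {
        // Example locations are made up for illustration.
        String workingDir = "/kylin/prod";
        System.out.println(isUnderWorkingDir("hdfs://nn:8020/kylin/prod/kylin-job/table_x", workingDir));
        System.out.println(isUnderWorkingDir("hdfs://nn:8020/kylin/prod2/kylin-job/table_y", workingDir));
    }
}
```

The separator-boundary check matters in the multi-installation case described above, where sibling installations may use working directories that share a common string prefix.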


nichunen added a comment - 06/Sep/15 08:32

Thanks Shaofeng, I see the problem in my code. I'll modify it and resubmit the patches.


nichunen added a comment - 07/Sep/15 05:46

Added a check for whether the Hive table belongs to the current Kylin instance.


Shao Feng Shi added a comment - 07/Sep/15 06:42

Hi Chunen, it seems the patch does not check "delete == true", which means that even if the user specifies "--delete false", the tables will be dropped from Hive. To be consistent with the other methods, could you please add that check and generate a new patch? Apart from this, I didn't see any other issue; I will merge once the new patch is uploaded. Thanks for your time!
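
The requested behaviour, where nothing is dropped unless the delete flag is true, can be sketched like this. The class `HiveCleanupPlan` and the method `plan` are hypothetical names, and the sketch only builds the lines that would be run or reported; it does not talk to Hive and is not the code from the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Kylin: turns a list of unused tables into
// either DROP statements (delete == true) or dry-run report lines.
class HiveCleanupPlan {

    static List<String> plan(List<String> unusedTables, boolean delete) {
        List<String> lines = new ArrayList<>();
        for (String table : unusedTables) {
            if (delete) {
                // Only emitted when the user passed "--delete true".
                lines.add("DROP TABLE IF EXISTS " + table + ";");
            } else {
                // With "--delete false" the table is only reported.
                lines.add("-- dry run, would drop: " + table);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // The table name is made up for illustration.
        List<String> unused = Arrays.asList("kylin_intermediate_cube_a");
        plan(unused, false).forEach(System.out::println);
        plan(unused, true).forEach(System.out::println);
    }
}
```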


nichunen added a comment - 07/Sep/15 08:45

Add "delete==true" check.
Hi Shaofeng, please check my new patches. Thanks.


nichunen added a comment - 07/Sep/15 09:03

This is a patch that adds some methods to enable deleting unused data by UUIDs given on the command line or stored in a file. I made this change when we were using Kylin 0.6, because Kylin 0.6 didn't have the "Garbage Collection" step during building.

Shaofeng, if this patch is still useful, please give me some advice; I'll change my code and submit a new patch for it, plus one for 0.8.


Shao Feng Shi added a comment - 07/Sep/15 14:01

Hi ChunEn, I merged your latest patch and made a small update on top of it. I found that you already parse the job UUID from the Hive table name, so if we check whether allJobs contains that UUID, we know whether the table belongs to the current deployment; there is no need to check whether the corresponding HDFS location exists. You can see the change in commit 277e1524f5be92ba03447ee33010f10b8de5ca75.

Regarding KYLIN-998-UUIDS.patch: as you know, since 1.0 Kylin has had the GC step to drop tables automatically, and for the exceptional cases the offline batch cleanup is good enough, so dropping tables by job UUID is less valuable. I will hold this patch, and I suggest you upgrade your 0.6 deployments to 0.7.2 or above to get bug fixes and enhancements.

BTW, your patch is also applied in 0.8 (just renamed to 2.x-staging). Again, thanks for your contribution!
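
The UUID-based ownership check mentioned in this comment might be sketched as below. The exact naming scheme is an assumption here: the sketch supposes that an intermediate table name ends with the 36-character job UUID, with "-" written as "_". The class `JobUuidCheck` and its methods are hypothetical names; the referenced commit is the authority on what the merged code actually does.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Kylin: decides ownership of an
// intermediate table by looking up its job UUID among the known jobs.
class JobUuidCheck {
    // Length of a textual UUID such as 3f2b7c1e-8a4d-4c6b-9e1f-0a1b2c3d4e5f.
    static final int UUID_LENGTH = 36;

    // ASSUMPTION: the table name ends with the job UUID, "-" replaced by "_".
    static String uuidFromTableName(String tableName) {
        if (tableName.length() < UUID_LENGTH) {
            return null; // too short to carry a UUID suffix
        }
        String suffix = tableName.substring(tableName.length() - UUID_LENGTH);
        return suffix.replace('_', '-');
    }

    // A table belongs to this deployment if its job UUID is among this
    // deployment's jobs (the "allJobs" collection named in the comment).
    static boolean belongsToThisDeployment(String tableName, Set<String> allJobs) {
        String uuid = uuidFromTableName(tableName);
        return uuid != null && allJobs.contains(uuid);
    }

    public static void main(String[] args) {
        // Job id and table name are made up for illustration.
        Set<String> allJobs = new HashSet<>(Arrays.asList("3f2b7c1e-8a4d-4c6b-9e1f-0a1b2c3d4e5f"));
        String table = "kylin_intermediate_cube_a_3f2b7c1e_8a4d_4c6b_9e1f_0a1b2c3d4e5f";
        System.out.println(belongsToThisDeployment(table, allJobs));
    }
}
```

Compared with the storage-location test, this lookup needs no call to HDFS or to the Hive metastore beyond listing table names, which matches the reasoning given in the comment.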


nichunen added a comment - 08/Sep/15 02:17

Patch code merged to 1.x-staging.


Shao Feng Shi added a comment - 26/Oct/15 03:25

Resolved in release 1.1-incubating (2015-10-25)


People

Assignee: nichunen

Reporter: nichunen

Votes: 0

Watchers: 2

Dates

Created: 05/Sep/15 13:31

Updated: 17/Oct/18 01:18

Resolved: 17/Oct/18 01:18