Kylin / KYLIN-998

Finish the hive intermediate table clean up job in org.apache.kylin.job.hadoop.cube.StorageCleanupJob

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: v0.7.1, v0.7.2
    • Fix Version/s: v1.1
    • Component/s: Storage - HBase
    • Labels: None

    Description

      Kylin's last cube-building job step, named “Garbage Collection”, removes the intermediate data in HDFS/HBase/Hive. But if the job stops unexpectedly (for example, because of a problem in the Hadoop cluster, a bad cube design, or being discarded by the user), that data is left undeleted.

      In such cases, we can run "hbase org.apache.hadoop.util.RunJar $KYLIN_HOME/lib/kylin-job-0.8.1-incubating-SNAPSHOT.jar org.apache.kylin.job.hadoop.cube.StorageCleanupJob --delete true" to remove the data. But the method "cleanUnusedIntermediateHiveTable" is unfinished.

      My first patch finishes the method: it removes unused Hive tables whose names begin with "kylin_intermediate_".

      My second patch adds methods that enable deleting unused data by UUIDs given on the command line or stored in a file.

      I don't know whether the second patch is useful to you; we use it on our Kylin server to remove data after a cube is deleted.

      Attachments

        1. KYLIN-998-0.7-staging.patch
          4 kB
          Shao Feng Shi
        2. KYLIN-998-0.7-staging-v3.patch
          6 kB
          nichunen
        3. KYLIN-998-0.8.patch
          3 kB
          Shao Feng Shi
        4. KYLIN-998-0.8-v3.patch
          6 kB
          nichunen
        5. KYLIN-998-UUIDS.patch
          19 kB
          nichunen

        Activity

          nichunen nichunen added a comment -

          Upload the first patch for 0.7-staging and 0.8.

          shaofengshi Shao Feng Shi added a comment -

          +1 Looks good. It checks for and excludes the intermediate tables of active jobs.

          shaofengshi Shao Feng Shi added a comment -

          I renamed these two patch files (adding the branch name to each) and attached them here for online review. Thanks for your contribution, Chun En!

          shaofengshi Shao Feng Shi added a comment -

          -1 I just thought of a case where this patch will cause an issue. In our deployment, we have multiple Kylin installations in one Hadoop cluster. In this case, the patch can't differentiate whether an intermediate table was created by the current installation or by another Kylin, so it may delete a table that was created by another Kylin; if that table is in use, it will cause the job in that Kylin to fail.

          So it needs to check whether the intermediate table belongs to the current Kylin instance: if yes, continue; otherwise skip it. If you look at the code that picks up the useless HBase tables, you will see it checks this via a tag added to the table metainfo. For the Hive intermediate table, this check can be performed by getting the table's storage location and seeing whether it is under the "hdfsWorkingDirectory" of the current Kylin (KylinConfig.getHdfsWorkingDirectory), as different installations will have different HDFS locations.
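
The location check suggested above could look roughly like the following. This is only a sketch and not the code from any of the attached patches: the class and method names are hypothetical, and it assumes the table location and the working directory have already been fetched as strings (from the Hive table metadata and from KylinConfig.getHdfsWorkingDirectory respectively).

```java
import java.net.URI;

// Hypothetical helper, for illustration only; not part of Kylin.
final class WorkingDirCheck {

    // Reduce "hdfs://host:port/a/b/" or "/a/b/" to the bare path "/a/b".
    static String normalize(String location) {
        String path = URI.create(location).getPath();
        if (path == null || path.isEmpty()) {
            throw new IllegalArgumentException("No path in location: " + location);
        }
        while (path.length() > 1 && path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        return path;
    }

    // True if the table's storage location is the working directory itself
    // or lies somewhere below it.
    static boolean isUnderWorkingDir(String tableLocation, String workingDir) {
        String table = normalize(tableLocation);
        String dir = normalize(workingDir);
        if (dir.equals("/")) {
            return true;
        }
        // Compare with a trailing "/" so that "/kylin2/..." is not
        // mistaken for a child of "/kylin".
        return table.equals(dir) || table.startsWith(dir + "/");
    }
}
```

The trailing-slash comparison matters when several installations use sibling directories: with a working directory of "/kylin", a table stored under "/kylin2/..." is not treated as belonging to this instance.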

          nichunen nichunen added a comment -

          Thanks Shaofeng, I see the problem in my code. I'll modify it and resubmit the patches.

          nichunen nichunen added a comment -

          Check whether the Hive table belongs to the current Kylin instance.

          shaofengshi Shao Feng Shi added a comment -

          Hi Chunen, it seems the patch doesn't check "delete == true", which means that even if the user specifies "--delete false", the tables will be dropped from Hive. To be consistent with the other methods, could you please add that check and generate a new patch? Apart from this, I didn't see any other issue; I will merge once the new patch is uploaded. Thanks for your time!
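
For illustration, the guard being asked for amounts to a dry-run mode. The sketch below is not the patch code: the class name, method name, and the dropTable callback are hypothetical stand-ins for whatever actually issues the Hive DROP TABLE.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, for illustration only; not part of Kylin.
final class DryRunCleanup {

    // Drops each unused table only when delete is true; otherwise it
    // just reports what would be dropped. Returns the tables dropped.
    static List<String> cleanup(List<String> unusedTables,
                                boolean delete,
                                Consumer<String> dropTable) {
        List<String> dropped = new ArrayList<>();
        for (String table : unusedTables) {
            if (delete) {
                dropTable.accept(table); // assumed to run the actual drop
                dropped.add(table);
            } else {
                System.out.println("Dry run, would drop: " + table);
            }
        }
        return dropped;
    }
}
```

With "--delete false" the callback is never invoked, so nothing is removed from Hive and the run only lists the candidates.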

          nichunen nichunen added a comment -

          Add "delete==true" check.
          Hi Shaofeng, please check my new patches. Thanks.

          nichunen nichunen added a comment -

          This is a patch that adds methods to enable deleting unused data by UUIDs given on the command line or stored in a file. I made this change to remove data when we used Kylin 0.6, because Kylin 0.6 didn't have the "Garbage Collection" step during building.

          Shaofeng, if this patch is still useful, please give me some advice; I'll change my code and submit a new patch of it, plus one for 0.8.

          shaofengshi Shao Feng Shi added a comment -

          Hi ChunEn, I merged your latest patch and made a small update on top of it. I found you already parse the job UUID from the Hive table name, so if we check whether allJobs contains the UUID, we know whether the table belongs to the current deployment; there is no need to check whether the corresponding HDFS path exists. You can see the change in commit 277e1524f5be92ba03447ee33010f10b8de5ca75.
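
A rough sketch of that UUID-based selection follows; the actual change is in the commit cited above, and this is not a copy of it. The class and method names are hypothetical, and the table-name layout is an assumption: the sketch treats the last 36 characters of the name as the job UUID with "-" written as "_". It also excludes tables of active jobs, as noted earlier in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper, for illustration only; not part of Kylin.
final class IntermediateTableSelector {
    static final String PREFIX = "kylin_intermediate_";
    static final int UUID_LENGTH = 36; // 8-4-4-4-12 hex digits plus separators

    // ASSUMPTION: the table name ends with the job UUID, with '-' replaced by '_'.
    static Optional<String> jobUuid(String tableName) {
        if (!tableName.startsWith(PREFIX)
                || tableName.length() < PREFIX.length() + UUID_LENGTH) {
            return Optional.empty();
        }
        String tail = tableName.substring(tableName.length() - UUID_LENGTH);
        return Optional.of(tail.replace('_', '-'));
    }

    // A table is a drop candidate when its job UUID is known to this
    // deployment (in allJobs) and the job is not currently active.
    static List<String> dropCandidates(List<String> hiveTables,
                                       Set<String> allJobs,
                                       Set<String> activeJobs) {
        List<String> result = new ArrayList<>();
        for (String table : hiveTables) {
            Optional<String> uuid = jobUuid(table);
            if (uuid.isPresent()
                    && allJobs.contains(uuid.get())
                    && !activeJobs.contains(uuid.get())) {
                result.add(table);
            }
        }
        return result;
    }
}
```

Tables whose UUID is unknown to this deployment are skipped rather than dropped, which is what protects another Kylin installation sharing the same Hive.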

          Regarding KYLIN-998-UUIDS.patch: as you know, since 1.0 Kylin has the GC step to drop tables automatically, and for the exceptional cases the offline batch cleanup is good enough, so dropping tables by job UUID is less valuable. I will therefore hold this patch, and I suggest you upgrade your 0.6 deployments to 0.7.2 or above to get bug fixes and enhancements.

          BTW, your patch is also applied in 0.8 (just renamed to 2.x-staging). Again, thanks for your contribution!

          nichunen nichunen added a comment -

          Patch code merged to 1.x-staging.

          shaofengshi Shao Feng Shi added a comment -

          Resolved in release 1.1-incubating (2015-10-25)


          People

            Assignee: nichunen
            Reporter: nichunen
