Oozie
  1. Oozie
  2. OOZIE-1783

Sharelib purging only occurs at Oozie startup

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: trunk
    • Fix Version/s: 4.1.0
    • Component/s: None
    • Labels:
      None

      Description

      The purgeLibs method only gets called on startup of the first Oozie server. This means that if you update the sharelib without restarting Oozie, then Oozie will never clean up the old sharelibs and launcherlibs until you restart the server. In fact, with Oozie HA, it will never clean up unless you restart all of the servers at the same time (in other words, a rolling restart or just restarting one Oozie will never cause purgeLibs to get called).

      This should be moved into a background thread (i.e. scheduled with SchedulerService) to run periodically. Once a day is probably good enough, but we could make it configurable if needed.

      1. OOZIE-1783.patch
        31 kB
        Robert Kanter
      2. OOZIE-1783.patch
        30 kB
        Robert Kanter

        Activity

        Robert Kanter created issue -
        Hide
        Purshotam Shah added a comment -

        How about making purgeShareLib a part of PurgeService?

        Show
        Purshotam Shah added a comment - How about making purgeShareLib a part of PurgeService?
        Hide
        Robert Kanter added a comment -

        It may be a good idea to put that there; my only concern would be if that adds a lot of unnecessary cross-dependency between the PurgeService and the ShareLibService. But it's probably a good idea for whoever works on this JIRA to investigate moving it.

        Show
        Robert Kanter added a comment - It may be a good idea to put that there; my only concern would be if that adds a lot of unnecessary cross-dependency between the PurgeService and the ShareLibService. But it's probably a good idea for whoever works on this JIRA to investigate moving it.
        Hide
        Robert Kanter added a comment -

        We should also make sure that there are appropriate error messages and error codes for when there's any problems with the purging.

        Show
        Robert Kanter added a comment - We should also make sure that there are appropriate error messages and error codes for when there's any problems with the purging.
        Robert Kanter made changes -
        Field Original Value New Value
        Assignee Robert Kanter [ rkanter ]
        Hide
        Robert Kanter added a comment -

        The patch looks a lot longer than it really is because I found a bunch of cases in TestShareLibService where we did services.init() and later services.destroy() but didn't use a try-finally block so Services could have leaked when there were test failures. The patch fixes these, which show up as a bunch of changes but I didn't really change anything here.

        Other than that, the patch basically just takes the existing call to purgeLibs, puts it into a Runnable, and schedules that to run at oozie.service.ShareLibService.purge.interval, which defaults to 1 day. I also updated TestShareLibService.testPurgeJar to wait for the purge runnable to run.

        Show
        Robert Kanter added a comment - The patch looks a lot longer than it really is because I found a bunch of cases in TestShareLibService where we did services.init() and later services.destroy() but didn't use a try-finally block so Services could have leaked when there were test failures. The patch fixes these, which show up as a bunch of changes but I didn't really change anything here. Other than that, the patch basically just takes the existing call to purgeLibs , puts it into a Runnable, and schedules that to run at oozie.service.ShareLibService.purge.interval , which defaults to 1 day. I also updated TestShareLibService.testPurgeJar to wait for the purge runnable to run.
        Robert Kanter made changes -
        Attachment OOZIE-1783.patch [ 12645112 ]
        Robert Kanter made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hide
        Hadoop QA added a comment -

        Testing JIRA OOZIE-1783

        Cleaning local git workspace

        ----------------------------

        +1 PATCH_APPLIES
        +1 CLEAN
        +1 RAW_PATCH_ANALYSIS
        . +1 the patch does not introduce any @author tags
        . +1 the patch does not introduce any tabs
        . +1 the patch does not introduce any trailing spaces
        . +1 the patch does not introduce any line longer than 132
        . +1 the patch does adds/modifies 1 testcase(s)
        +1 RAT
        . +1 the patch does not seem to introduce new RAT warnings
        +1 JAVADOC
        . +1 the patch does not seem to introduce new Javadoc warnings
        +1 COMPILE
        . +1 HEAD compiles
        . +1 patch compiles
        . +1 the patch does not seem to introduce new javac warnings
        +1 BACKWARDS_COMPATIBILITY
        . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
        . +1 the patch does not modify JPA files
        -1 TESTS
        . Tests run: 1446
        . Tests failed: 3
        . Tests errors: 3

        . The patch failed the following testcases:

        . testCoordinatorActionCommandsSubmitAndStart(org.apache.oozie.sla.TestSLAEventGeneration)
        . testBundleEngineKill(org.apache.oozie.servlet.TestV1JobServletBundleEngine)
        . testBundleEngineResume(org.apache.oozie.servlet.TestV1JobServletBundleEngine)

        +1 DISTRO
        . +1 distro tarball builds with the patch

        ----------------------------
        -1 Overall result, please check the reported -1(s)

        The full output of the test-patch run is available at

        . https://builds.apache.org/job/oozie-trunk-precommit-build/1243/

        Show
        Hadoop QA added a comment - Testing JIRA OOZIE-1783 Cleaning local git workspace ---------------------------- +1 PATCH_APPLIES +1 CLEAN +1 RAW_PATCH_ANALYSIS . +1 the patch does not introduce any @author tags . +1 the patch does not introduce any tabs . +1 the patch does not introduce any trailing spaces . +1 the patch does not introduce any line longer than 132 . +1 the patch does adds/modifies 1 testcase(s) +1 RAT . +1 the patch does not seem to introduce new RAT warnings +1 JAVADOC . +1 the patch does not seem to introduce new Javadoc warnings +1 COMPILE . +1 HEAD compiles . +1 patch compiles . +1 the patch does not seem to introduce new javac warnings +1 BACKWARDS_COMPATIBILITY . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations . +1 the patch does not modify JPA files -1 TESTS . Tests run: 1446 . Tests failed: 3 . Tests errors: 3 . The patch failed the following testcases: . testCoordinatorActionCommandsSubmitAndStart(org.apache.oozie.sla.TestSLAEventGeneration) . testBundleEngineKill(org.apache.oozie.servlet.TestV1JobServletBundleEngine) . testBundleEngineResume(org.apache.oozie.servlet.TestV1JobServletBundleEngine) +1 DISTRO . +1 distro tarball builds with the patch ---------------------------- -1 Overall result, please check the reported -1(s) The full output of the test-patch run is available at . https://builds.apache.org/job/oozie-trunk-precommit-build/1243/
        Hide
        Rohini Palaniswamy added a comment -

        +1

        Show
        Rohini Palaniswamy added a comment - +1
        Hide
        Robert Kanter added a comment -

        New patch fixes related test failure in TestOozieShareLibCLI; I had to add the SchedulerService to the list of Services it was starting. Also, TestOozieShareLibCLI was never destroying the Services, so I added that.

        Show
        Robert Kanter added a comment - New patch fixes related test failure in TestOozieShareLibCLI; I had to add the SchedulerService to the list of Services it was starting. Also, TestOozieShareLibCLI was never destroying the Services, so I added that.
        Robert Kanter made changes -
        Attachment OOZIE-1783.patch [ 12645271 ]
        Hide
        Purshotam Shah added a comment -

        There might be other issue with sharelib purging in HA.

        Current purging logic
        ------
        Keep all share-lib between current timestamp and 7days old + 1 latest sharelib older than 7 days.
        https://issues.apache.org/jira/browse/OOZIE-1761
        -------

        Each server should maintain their own sharelib, otherwise sharelib use by one server might be purged.
        Ex. two servers are running and both using different sharelib older then 7 days, one used sharelib will be purged.

        At Yahoo, each server in HA has their own sharelib ( by setting different path for oozie.service.WorkflowAppService.system.libpath).
        For this approach we need to take out isFirstServer logic. So that each server can purge their own sharelib.

                            if (Services.get().get(JobsConcurrencyService.class).isFirstServer()) {
                                final Date current = Calendar.getInstance(TimeZone.getTimeZone("GMT")).getTime();
                                purgeLibs(fs, LAUNCHER_PREFIX, current);
                                purgeLibs(fs, SHARED_LIB_PREFIX, current);
                            }
        

        Other solution, sharelib service itself suffix hostname to path and works on path which has hostname ( for all operations retrieving sharelib, update and purging).
        In this case also we need to take away isFirstServer logic.

        Show
        Purshotam Shah added a comment - There might be other issue with sharelib purging in HA. Current purging logic ------ Keep all share-lib between current timestamp and 7days old + 1 latest sharelib older than 7 days. https://issues.apache.org/jira/browse/OOZIE-1761 ------- Each server should maintain their own sharelib, otherwise sharelib use by one server might be purged. Ex. two servers are running and both using different sharelib older then 7 days, one used sharelib will be purged. At Yahoo, each server in HA has their own sharelib ( by setting different path for oozie.service.WorkflowAppService.system.libpath). For this approach we need to take out isFirstServer logic. So that each server can purge their own sharelib. if (Services.get().get(JobsConcurrencyService.class).isFirstServer()) { final Date current = Calendar.getInstance(TimeZone.getTimeZone( "GMT" )).getTime(); purgeLibs(fs, LAUNCHER_PREFIX, current); purgeLibs(fs, SHARED_LIB_PREFIX, current); } Other solution, sharelib service itself suffix hostname to path and works on path which has hostname ( for all operations retrieving sharelib, update and purging). In this case also we need to take away isFirstServer logic.
        Hide
        Robert Kanter added a comment -

        Why would you want each server to have their own sharelib? They should have identical sharelibs or else if you submit to different Oozie servers you will have different jars...

        Show
        Robert Kanter added a comment - Why would you want each server to have their own sharelib? They should have identical sharelibs or else if you submit to different Oozie servers you will have different jars...
        Hide
        Rohini Palaniswamy added a comment -

        Purshotam Shah,
        Good point. But cannot expect everyone to have two different sharelibs. Our case is different as we don't have sharelibs in a directory but pick from a metafile and use the sharelib directory only for shipping launcher libs. In other cases they upload only to hdfs once and share it across all servers. There was also a jira to keep the old sharelib behavior where the directory has no timestamps. So I think we need to take all these cases into account and do the fix.

        My thought is we should keep the isFirstServer() check and have a different sub directory with hostname below oozie.service.WorkflowAppService.system.libpath directory (like your suggesstion) under which launcher libs will be shipped (if ship launcher is true). sharelib will continue to reside in the top-level of libpath directory. purgeLibs(fs, LAUNCHER_PREFIX, current); can then be done outside the isFirstServer() check.

        Show
        Rohini Palaniswamy added a comment - Purshotam Shah , Good point. But cannot expect everyone to have two different sharelibs. Our case is different as we don't have sharelibs in a directory but pick from a metafile and use the sharelib directory only for shipping launcher libs. In other cases they upload only to hdfs once and share it across all servers. There was also a jira to keep the old sharelib behavior where the directory has no timestamps. So I think we need to take all these cases into account and do the fix. My thought is we should keep the isFirstServer() check and have a different sub directory with hostname below oozie.service.WorkflowAppService.system.libpath directory (like your suggesstion) under which launcher libs will be shipped (if ship launcher is true). sharelib will continue to reside in the top-level of libpath directory. purgeLibs(fs, LAUNCHER_PREFIX, current); can then be done outside the isFirstServer() check.
        Hide
        Purshotam Shah added a comment -

        User can still use new sharelib behavior ( sharelib with timestamps).

        I was talking more in terms of HA.
        Not only for launchersharelib, even different timestamp of sharelib can be used by different host.

        This is not common scenario, it might happens. One Example, sharelib update didn't get propagated to one host (bcz of env issue). Each host might submit job with different sharelib jars for short interval. If purging happens at that time, then it will purge older sharelib, which is being used and all submitted will start failing.

        If we are not in favor of having different sharelib path for each host, them somehow we need to have a flag saying that which sharelib is used by each host. Having a new path in ZK and adding timestamp of sharelib and launcherlib being used by all host, might help.

        Show
        Purshotam Shah added a comment - User can still use new sharelib behavior ( sharelib with timestamps). I was talking more in terms of HA. Not only for launchersharelib, even different timestamp of sharelib can be used by different host. This is not common scenario, it might happens. One Example, sharelib update didn't get propagated to one host (bcz of env issue). Each host might submit job with different sharelib jars for short interval. If purging happens at that time, then it will purge older sharelib, which is being used and all submitted will start failing. If we are not in favor of having different sharelib path for each host, them somehow we need to have a flag saying that which sharelib is used by each host. Having a new path in ZK and adding timestamp of sharelib and launcherlib being used by all host, might help.
        Hide
        Purshotam Shah added a comment -

        Rohini Palaniswamy
        Yes, we need to move purgeLibs(fs, LAUNCHER_PREFIX, current) out of isFirstServer().

        Purging sharelib in HA can be slightly complicated, we can create a different JIRA for that.

        Show
        Purshotam Shah added a comment - Rohini Palaniswamy Yes, we need to move purgeLibs(fs, LAUNCHER_PREFIX, current) out of isFirstServer(). Purging sharelib in HA can be slightly complicated, we can create a different JIRA for that.
        Hide
        Robert Kanter added a comment -

        So then do we want to keep this patch as-is (to fix only purging at startup) and open a new JIRA to improve it for HA?

        Show
        Robert Kanter added a comment - So then do we want to keep this patch as-is (to fix only purging at startup) and open a new JIRA to improve it for HA?
        Hide
        Rohini Palaniswamy added a comment -

        I am fine with that.

        Show
        Rohini Palaniswamy added a comment - I am fine with that.
        Hide
        Hadoop QA added a comment -

        Testing JIRA OOZIE-1783

        Cleaning local git workspace

        ----------------------------

        +1 PATCH_APPLIES
        +1 CLEAN
        +1 RAW_PATCH_ANALYSIS
        . +1 the patch does not introduce any @author tags
        . +1 the patch does not introduce any tabs
        . +1 the patch does not introduce any trailing spaces
        . +1 the patch does not introduce any line longer than 132
        . +1 the patch does adds/modifies 2 testcase(s)
        +1 RAT
        . +1 the patch does not seem to introduce new RAT warnings
        +1 JAVADOC
        . +1 the patch does not seem to introduce new Javadoc warnings
        -1 COMPILE
        . +1 HEAD compiles
        . +1 patch compiles
        . -1 the patch seems to introduce 1 new javac warning(s)
        +1 BACKWARDS_COMPATIBILITY
        . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
        . +1 the patch does not modify JPA files
        -1 TESTS
        . Tests run: 1447
        . Tests failed: 2
        . Tests errors: 2

        . The patch failed the following testcases:

        . testSingleRecord(org.apache.oozie.servlet.TestBulkMonitorWebServiceAPI)
        . testCoordinatorActionCommandsSubmitAndStart(org.apache.oozie.sla.TestSLAEventGeneration)

        +1 DISTRO
        . +1 distro tarball builds with the patch

        ----------------------------
        -1 Overall result, please check the reported -1(s)

        The full output of the test-patch run is available at

        . https://builds.apache.org/job/oozie-trunk-precommit-build/1248/

        Show
        Hadoop QA added a comment - Testing JIRA OOZIE-1783 Cleaning local git workspace ---------------------------- +1 PATCH_APPLIES +1 CLEAN +1 RAW_PATCH_ANALYSIS . +1 the patch does not introduce any @author tags . +1 the patch does not introduce any tabs . +1 the patch does not introduce any trailing spaces . +1 the patch does not introduce any line longer than 132 . +1 the patch does adds/modifies 2 testcase(s) +1 RAT . +1 the patch does not seem to introduce new RAT warnings +1 JAVADOC . +1 the patch does not seem to introduce new Javadoc warnings -1 COMPILE . +1 HEAD compiles . +1 patch compiles . -1 the patch seems to introduce 1 new javac warning(s) +1 BACKWARDS_COMPATIBILITY . +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations . +1 the patch does not modify JPA files -1 TESTS . Tests run: 1447 . Tests failed: 2 . Tests errors: 2 . The patch failed the following testcases: . testSingleRecord(org.apache.oozie.servlet.TestBulkMonitorWebServiceAPI) . testCoordinatorActionCommandsSubmitAndStart(org.apache.oozie.sla.TestSLAEventGeneration) +1 DISTRO . +1 distro tarball builds with the patch ---------------------------- -1 Overall result, please check the reported -1(s) The full output of the test-patch run is available at . https://builds.apache.org/job/oozie-trunk-precommit-build/1248/
        Hide
        Purshotam Shah added a comment -

        Create different JIRA for sharelib purging for HA.
        https://issues.apache.org/jira/browse/OOZIE-1854 (Sharelib purging for HA)

        Show
        Purshotam Shah added a comment - Create different JIRA for sharelib purging for HA. https://issues.apache.org/jira/browse/OOZIE-1854 (Sharelib purging for HA)
        Hide
        Robert Kanter added a comment -

        Thanks for your input Puru and Rohini.
        Committed to trunk!

        Show
        Robert Kanter added a comment - Thanks for your input Puru and Rohini. Committed to trunk!
        Robert Kanter made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Fix Version/s trunk [ 12323240 ]
        Resolution Fixed [ 1 ]
        Rohini Palaniswamy made changes -
        Fix Version/s 4.1.0 [ 12324960 ]
        Fix Version/s trunk [ 12323240 ]
        Bowen Zhang made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Robert Kanter
            Reporter:
            Robert Kanter
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development