  Hadoop YARN / YARN-117

Enhance YARN service model


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.4-alpha
    • Fix Version/s: 2.1.0-beta
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      Having played with the YARN service model, there are some issues I've identified based on past work and initial use.

      This is an umbrella JIRA to cover those issues, with solutions pushed out to separate JIRAs.

      The state model prevents the stopped state from being entered if the service could not be started successfully.

      In the current lifecycle you cannot stop a service unless it was successfully started, but

      • init() may acquire resources that need to be explicitly released
      • if the start() operation fails partway through, the stop() operation may be needed to release resources.

      Fix: make stop() a valid state transition from all states and require implementations to be able to stop safely without requiring all fields to be non-null.

      Before anyone points out that the stop() operations assume all fields are valid and will NPE if called before start(): MAPREDUCE-3431 shows that this problem arises today, and MAPREDUCE-3502 is a fix for it. That fix is independent of the rest of the issues in this document, but it will help make stop() executable from all states other than "stopped".

      MAPREDUCE-3502 is too big a patch and needs to be broken down for easier review and uptake; this can be done with issues linked to this one.
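
      To make the first fix concrete, here is a minimal sketch of a service whose stop() is safe to call from any state; the MyService class and its fields are invented for illustration and are not part of YARN.

{code:java}
import java.io.IOException;
import java.net.ServerSocket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical service used only to illustrate a stop() that is safe to
// call from any state; MyService and its fields are not part of YARN.
public class MyService {

  private ExecutorService workers;   // created in init()
  private ServerSocket listener;     // created in start()

  public void init() {
    workers = Executors.newFixedThreadPool(4);
  }

  public void start() throws IOException {
    // may fail partway through, leaving only some resources allocated
    listener = new ServerSocket(0);
  }

  // Callable whether or not init()/start() ever ran: every field is
  // checked for null before being released, so there is no NPE.
  public void stop() {
    if (listener != null) {
      try {
        listener.close();
      } catch (IOException ignored) {
        // best-effort cleanup during shutdown
      }
    }
    if (workers != null) {
      workers.shutdownNow();
    }
  }
}
{code}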

      AbstractService doesn't prevent duplicate state change requests.

      The ensureState() checks that verify whether a state transition is allowed from the current state are performed in the base AbstractService class, yet subclasses tend to call them after their own init(), start() & stop() operations. This means that these operations can be performed out of order, and even if the outcome of the call is an exception, all actions performed by the subclasses will already have taken place. MAPREDUCE-3877 demonstrates this.

      This is a tricky one to address. In HADOOP-3128 I used a base class instead of an interface and made the init(), start() & stop() methods final. These methods would do the checks, and then invoke protected inner methods, innerStart(), innerStop(), etc. It should be possible to retrofit the same behaviour to everything that extends AbstractService, something that must be done before the class is considered stable (because once the lifecycle methods are declared final, all subclasses outside the source tree will need fixing by their respective developers).
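
      A simplified, hypothetical rendering of that pattern is sketched below; it is not the real AbstractService code, just the shape of final lifecycle methods delegating to protected inner methods.

{code:java}
// Hypothetical, simplified sketch of the "final lifecycle method plus
// protected inner method" pattern; not the actual AbstractService class.
public abstract class LifecycleBase {

  public enum State { NOTINITED, INITED, STARTED, STOPPED }

  private State state = State.NOTINITED;

  public final void init() {
    ensureState(State.NOTINITED);   // check runs before any subclass work
    innerInit();
    state = State.INITED;
  }

  public final void start() {
    ensureState(State.INITED);
    innerStart();
    state = State.STARTED;
  }

  public final void stop() {
    if (state == State.STOPPED) {
      return;                       // duplicate stop() becomes a no-op
    }
    innerStop();
    state = State.STOPPED;
  }

  public State getState() {
    return state;
  }

  private void ensureState(State expected) {
    if (state != expected) {
      throw new IllegalStateException(
          "Expected state " + expected + " but was " + state);
    }
  }

  // Subclasses override these instead of the public lifecycle methods,
  // so the state checks above can never be bypassed or reordered.
  protected void innerInit() {}
  protected void innerStart() {}
  protected void innerStop() {}
}
{code}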

      AbstractService state change doesn't defend against race conditions.

      There are no concurrency locks on the state transitions. Whatever fix is added for wrong-state calls should also prevent re-entrancy, such as stop() being called from two threads.
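
      As an illustration only, one way to guard stop() against concurrent callers is a compare-and-set flag; the class below is invented for this sketch and is not the actual AbstractService fix.

{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal sketch of one way to make stop() re-entrancy safe.
public class StoppableService {

  private final AtomicBoolean stopped = new AtomicBoolean(false);

  public void stop() {
    // compareAndSet guarantees that only the first caller runs the shutdown
    // logic, even if two threads invoke stop() at the same time.
    if (!stopped.compareAndSet(false, true)) {
      return;
    }
    // ... release resources exactly once ...
  }
}
{code}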

      Static methods to choreograph lifecycle operations

      Helper methods to move things through the lifecycle: init() followed by start() is common, and stop-the-service-if-it-is-not-null is another. Static methods can execute these sequences, and even call stop() if init() raises an exception. They could go into a class ServiceOps in the same package, be used by services that wrap other services, and help manage more robust shutdowns.
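
      A hedged sketch of what such a ServiceOps class might look like, written against a minimal stand-in Service interface rather than the real YARN one:

{code:java}
// Hypothetical sketch of the proposed ServiceOps helpers; the names follow
// the description above, not a shipped API.
public final class ServiceOps {

  /** Minimal stand-in for the YARN service lifecycle. */
  public interface Service {
    void init();
    void start();
    void stop();
  }

  private ServiceOps() {}

  /** init() then start(), stopping the service if either step fails. */
  public static void deploy(Service service) {
    try {
      service.init();
      service.start();
    } catch (RuntimeException e) {
      stopQuietly(service);   // best-effort cleanup of a half-started service
      throw e;
    }
  }

  /** The common "stop if not null, swallow failures" idiom. */
  public static void stopQuietly(Service service) {
    if (service == null) {
      return;
    }
    try {
      service.stop();
    } catch (RuntimeException ignored) {
      // ignored: shutdown should proceed even if one service fails to stop
    }
  }
}
{code}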

      State transition failures are something that registered service listeners may wish to be informed of.

      When a state transition fails, a RuntimeException can be thrown, and the service listeners are not informed because the notification point isn't reached. They may wish to know about this, especially for management and diagnostics.

      Fix: extend ServiceStateChangeListener with a callback such as stateChangeFailed(Service service, Service.State targetedState, RuntimeException e) that is invoked from the (final) state change methods in the AbstractService class (once they delegate to their inner innerStart() and innerStop() methods); make it a no-op in the existing implementations of the interface.
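
      Sketched below is one possible shape for that extension; the Service and State types here are simplified stand-ins, not the shipped YARN interfaces.

{code:java}
// Hedged sketch of the proposed listener extension.
public interface ServiceStateChangeListener {

  enum State { NOTINITED, INITED, STARTED, STOPPED }

  /** Minimal service handle passed to callbacks. */
  interface Service {
    String getName();
    State getServiceState();
  }

  /** Existing notification: the service has entered a new state. */
  void stateChanged(Service service);

  /**
   * Proposed notification: a transition towards targetedState failed with e,
   * invoked from the (final) state change methods. Existing implementations
   * would add an empty, no-op body.
   */
  void stateChangeFailed(Service service, State targetedState,
      RuntimeException e);
}
{code}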

      Service listener failures not handled

      Is a listener failure an error or not? Log-and-ignore may not be what is desired.

      Proposed: during stop(), any exception raised by a listener is caught and discarded to increase the likelihood of a clean shutdown; do not add try-catch clauses to the other state changes.
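
      A minimal sketch of that behaviour, using an invented notifier class rather than the actual AbstractService code:

{code:java}
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustration only: listener failures are swallowed on the stop
// notification and nowhere else.
public class ListenerNotifier {

  public interface Listener {
    void stateChanged(String serviceName);
  }

  private final List<Listener> listeners = new CopyOnWriteArrayList<>();

  public void register(Listener l) {
    listeners.add(l);
  }

  /** init/start notifications: listener exceptions propagate to the caller. */
  public void notifyStateChanged(String serviceName) {
    for (Listener l : listeners) {
      l.stateChanged(serviceName);
    }
  }

  /** stop notification: failures are caught so shutdown keeps going. */
  public void notifyStopped(String serviceName) {
    for (Listener l : listeners) {
      try {
        l.stateChanged(serviceName);
      } catch (RuntimeException ignored) {
        // discarded to maximise the chance of a clean shutdown
      }
    }
  }
}
{code}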

      Support static listeners for all AbstractServices

      Add support to AbstractService that allows callers to register listeners for all instances. The existing listener interface could be used. This allows management tools to hook into the events.

      The static listeners would be invoked for all state changes except creation (base class shouldn't be handing out references to itself at this point).

      These static events could all be async, pushed through a shared ConcurrentLinkedQueue; failures logged at warn and the rest of the listeners invoked.
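
      A hypothetical sketch of such class-wide listeners with asynchronous delivery through a shared queue; the names and shapes here are invented for illustration.

{code:java}
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CopyOnWriteArrayList;

// Invented sketch of "static" listeners registered for all service instances.
public final class GlobalServiceListeners {

  public interface Listener {
    void stateChanged(String serviceName, String newState);
  }

  private static final List<Listener> LISTENERS = new CopyOnWriteArrayList<>();
  private static final ConcurrentLinkedQueue<Runnable> EVENTS =
      new ConcurrentLinkedQueue<>();

  private GlobalServiceListeners() {}

  public static void register(Listener l) {
    LISTENERS.add(l);
  }

  /** Called by every service instance on a state change; never blocks it. */
  public static void enqueueStateChange(final String name, final String state) {
    EVENTS.add(new Runnable() {
      @Override
      public void run() {
        for (Listener l : LISTENERS) {
          try {
            l.stateChanged(name, state);
          } catch (RuntimeException e) {
            // a real implementation would log at warn; println keeps the
            // sketch self-contained, and the remaining listeners still run
            System.err.println("WARN: listener failed: " + e);
          }
        }
      }
    });
  }

  /** Drained by a single background thread (not shown) that runs each event. */
  public static Runnable poll() {
    return EVENTS.poll();
  }
}
{code}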

      Add some example listeners for management/diagnostics

      • events to the commons log, for humans (a sketch follows this list).
      • events for machines, hooked up to the JSON logger.
      • for testing: something that can be told to fail.
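
      As an example of the first item, a listener that logs state changes through commons-logging might look like the sketch below; it reuses the invented Listener interface from the previous sketch.

{code:java}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// Illustrative only: a human-oriented listener that writes each state
// change to the commons log.
public class LoggingStateListener implements GlobalServiceListeners.Listener {

  private static final Log LOG =
      LogFactory.getLog(LoggingStateListener.class);

  @Override
  public void stateChanged(String serviceName, String newState) {
    LOG.info("Service " + serviceName + " entered state " + newState);
  }
}
{code}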

      Services should support signal interruptibility

      The services would benefit from a way of shutting them down on a kill signal; this can be done via a runtime hook. It should not be automatic though, as composite services will get into a very complex state during shutdown. Better to provide a hook that lets you register/unregister services to terminate, and have the relevant main() entry points tell their root services to register themselves.
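
      A hedged sketch of such a hook: a single JVM shutdown hook with explicit register/unregister, written against a minimal Service stand-in (the ServiceShutdownHook class is invented for this example).

{code:java}
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical kill-signal hook; services opt in explicitly rather than
// being terminated automatically.
public final class ServiceShutdownHook {

  public interface Service {
    void stop();
  }

  private static final List<Service> SERVICES = new CopyOnWriteArrayList<>();

  static {
    // One runtime hook for the whole JVM.
    Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
      @Override
      public void run() {
        for (Service s : SERVICES) {
          try {
            s.stop();
          } catch (RuntimeException ignored) {
            // keep stopping the remaining registered services
          }
        }
      }
    }));
  }

  private ServiceShutdownHook() {}

  /** A main() entry point registers its root service here. */
  public static void register(Service service) {
    SERVICES.add(service);
  }

  public static void unregister(Service service) {
    SERVICES.remove(service);
  }
}
{code}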

      Attachments

        1. YARN-117.4.patch
          185 kB
          Steve Loughran
        2. YARN-117.5.patch
          186 kB
          Steve Loughran
        3. YARN-117.6.patch
          247 kB
          Steve Loughran
        4. YARN-117.patch
          198 kB
          Steve Loughran
        5. YARN-117-007.patch
          304 kB
          Steve Loughran
        6. YARN-117-008.patch
          273 kB
          Steve Loughran
        7. YARN-117-009.patch
          298 kB
          Steve Loughran
        8. YARN-117-010.patch
          232 kB
          Steve Loughran
        9. YARN-117-011.patch
          221 kB
          Steve Loughran
        10. YARN-117-012.patch
          216 kB
          Steve Loughran
        11. YARN-117-013.patch
          217 kB
          Steve Loughran
        12. YARN-117-014.patch
          216 kB
          Steve Loughran
        13. YARN-117-015.patch
          222 kB
          Steve Loughran
        14. YARN-117-016.patch
          222 kB
          Steve Loughran
        15. YARN-117-018.patch
          221 kB
          Steve Loughran
        16. YARN-117-019.patch
          282 kB
          Steve Loughran
        17. YARN-117-020.patch
          223 kB
          Steve Loughran
        18. YARN-117-021.patch
          223 kB
          Steve Loughran
        19. YARN-117-022.patch
          293 kB
          Steve Loughran
        20. YARN-117-023.patch
          284 kB
          Vinod Kumar Vavilapalli
        21. YARN-117-024.patch
          286 kB
          Vinod Kumar Vavilapalli
        22. YARN-117-025.patch
          286 kB
          Vinod Kumar Vavilapalli
        23. YARN-117-2.patch
          175 kB
          Steve Loughran
        24. YARN-117-3.patch
          176 kB
          Steve Loughran



          People

            Assignee: Steve Loughran (stevel@apache.org)
            Reporter: Steve Loughran (stevel@apache.org)
            Votes: 1
            Watchers: 22


              Time Tracking

                Original Estimate: 1h
                Remaining Estimate: 1h
                Time Spent: Not Specified
