Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 0.9.0
    • Component/s: None
    • Labels:
      None

      Description

      Before implementing a scheduler, I think we should do some refactoring at first. There are 2 interfaces and 4 classes related to resource management, they are WorkerResourceManager , YarnTajoResourceManager, TajoWorkerResourceManager reside in TajoMaster, and ResourceAllocator, YarnResourceAllocator, TajoResourceAllocator reside in QueryMasters.

      WorkerResourceManager actually plays 3 roles

      1. Choose or start a QueryMaster for a query, and management it
      2. allocate resource for query tasks / task runners (only for standalone mode)
      3. Handle worker's heartbeat (only for standalone mode)

      If the scheduler is a decentralized one, like sparrow, we can allocate resource for a QueryMaster as the same way for a TaskRunner. So 1. and 2. can use the same interface, but called by 2 different caller. 3. is different from the others, we can create another service, let's say HeartbeatService to deal with worker's heartbeats.

      Any suggestion?

      1. TAJO-602.patch
        192 kB
        Hyunsik Choi

        Activity

        Hide
        hyunsik Hyunsik Choi added a comment -

        +1 for your idea.

        I totally agree with your suggestion. Regardless of whether an allocating resource is for QueryMaster or TaskRunner, we should use the same interface.

        Show
        hyunsik Hyunsik Choi added a comment - +1 for your idea. I totally agree with your suggestion. Regardless of whether an allocating resource is for QueryMaster or TaskRunner, we should use the same interface.
        Hide
        jihoonson Jihoon Son added a comment -

        +1 for this idea.
        It will make our architecture more clear.

        Show
        jihoonson Jihoon Son added a comment - +1 for this idea. It will make our architecture more clear.
        Hide
        hyunsik Hyunsik Choi added a comment -

        Could I take this issue?

        Show
        hyunsik Hyunsik Choi added a comment - Could I take this issue?
        Hide
        coderplay Min Zhou added a comment -

        Sure. Thanks, Hyunsik!

        Show
        coderplay Min Zhou added a comment - Sure. Thanks, Hyunsik!
        Hide
        hyunsik Hyunsik Choi added a comment -

        Thank you Min for assigning

        Show
        hyunsik Hyunsik Choi added a comment - Thank you Min for assigning
        Hide
        hyunsik Hyunsik Choi added a comment -

        I would like to share this progress with you guys.

        I'm working on it. I'm considering to change worker's resources and state to be implemented in yarn.state.StateMachine. I think that this approach is nice to handle complex states.

        Show
        hyunsik Hyunsik Choi added a comment - I would like to share this progress with you guys. I'm working on it. I'm considering to change worker's resources and state to be implemented in yarn.state.StateMachine. I think that this approach is nice to handle complex states.
        Hide
        coderplay Min Zhou added a comment -

        Seems over the scope of this ticket, meanwhile has some relationship with my current work TAJO-603. Could you file a another jira for that purpose?

        Show
        coderplay Min Zhou added a comment - Seems over the scope of this ticket, meanwhile has some relationship with my current work TAJO-603 . Could you file a another jira for that purpose?
        Hide
        hyunsik Hyunsik Choi added a comment -

        Min,

        You are right. Actually, I want to improve code readability and maintainability for those parts. If so, you guys can start your works without facing messy codes. I'll firstly change them with heartbeat services in another Jira issue. Thank you for your comment.

        Show
        hyunsik Hyunsik Choi added a comment - Min, You are right. Actually, I want to improve code readability and maintainability for those parts. If so, you guys can start your works without facing messy codes. I'll firstly change them with heartbeat services in another Jira issue. Thank you for your comment.
        Hide
        coderplay Min Zhou added a comment -

        I guess my work on TAJO-603 will conflict with your next step.

        Show
        coderplay Min Zhou added a comment - I guess my work on TAJO-603 will conflict with your next step.
        Hide
        hyunsik Hyunsik Choi added a comment -

        I won't change a wide range of code. The scope of my work is the same to this issue (TAJO-602). I just divided WorkerResourceManager into multiple parts. I don't touch QueryUnitAttempt and SubQuery.

        Show
        hyunsik Hyunsik Choi added a comment - I won't change a wide range of code. The scope of my work is the same to this issue ( TAJO-602 ). I just divided WorkerResourceManager into multiple parts. I don't touch QueryUnitAttempt and SubQuery.
        Hide
        hyunsik Hyunsik Choi added a comment - - edited

        Created a review request against branch master in reviewboard
        https://reviews.apache.org/r/18496/.

        Show
        hyunsik Hyunsik Choi added a comment - - edited Created a review request against branch master in reviewboard https://reviews.apache.org/r/18496/ .
        Hide
        hyunsik Hyunsik Choi added a comment -

        I've attached the patch for this issue. Above all, I'm very sorry for doing the job to separate the state part from TajoWorkerResourceManager.
        I tried to separate the job into another jira issue, but the job was very overlapped in this issue (TAJO-602). This is because TAJO-602 is for separating heartbeat service from WorkerResourceManager and the state machine is mostly for heartbeat service. Instead, I made much effort to narrow the scope of changes and keep existing logic as possible; although I found some parts which should be improved. This patch mostly changes WorkerResourceManager and TajoWorkerResourceManager. So, I believe tha this patch mostly does not affect Min's work (TAJO-603)

        During this work, I found some issues to be improved. Later, I'll create additional issues for them.

        In detail, this patch does as follows:

        • Separate the heartbeat service from TajoWorkerResourceManager into TajoResourceTracker class
        • Separate the worker information from WorkerResource into Worker class
        • Separate the ping expired node checker into WorkerLivelinessMonitor which extends AbstractLivelinessMonitor
        • Separate the resource heartbeat from TajoHeartbeat into NodeHeartbeat
          • Add one more heartbeat protocol and its service for that
        • Add TajoRMContext which contains active or inactive worker list managed by TajoWorkerResourceManager.
        • Extract the part to choose and start QueryMaster from WorkerResourceManager into QueryInProgress.
          • I still keep allocateQueryMaster method in order to avoid radical API changes.
          • But, allocateQueryMaster is changed to internally use allocateWorkerResources to allocate resources.
          • In other words, unlike before, TajoWorkerResourceManager manages resources for both QueryMaster and worker in one resource pool management.
        • Separate the heartbeat report service from Worker into WorkerHeartbeatService

        The unit tests are passed sucessfully. I've tested membership management of active/inactive workers, resource heartbeat, and some queries in a local cluster.

        In addition, I think that I don't have good naming sense. If you suggest better names for newly added classes, I'll appreciate it.

        Thanks!

        Show
        hyunsik Hyunsik Choi added a comment - I've attached the patch for this issue. Above all, I'm very sorry for doing the job to separate the state part from TajoWorkerResourceManager. I tried to separate the job into another jira issue, but the job was very overlapped in this issue ( TAJO-602 ). This is because TAJO-602 is for separating heartbeat service from WorkerResourceManager and the state machine is mostly for heartbeat service. Instead, I made much effort to narrow the scope of changes and keep existing logic as possible; although I found some parts which should be improved. This patch mostly changes WorkerResourceManager and TajoWorkerResourceManager. So, I believe tha this patch mostly does not affect Min's work ( TAJO-603 ) During this work, I found some issues to be improved. Later, I'll create additional issues for them. In detail, this patch does as follows: Separate the heartbeat service from TajoWorkerResourceManager into TajoResourceTracker class Separate the worker information from WorkerResource into Worker class Separate the ping expired node checker into WorkerLivelinessMonitor which extends AbstractLivelinessMonitor Separate the resource heartbeat from TajoHeartbeat into NodeHeartbeat Add one more heartbeat protocol and its service for that Add TajoRMContext which contains active or inactive worker list managed by TajoWorkerResourceManager. Extract the part to choose and start QueryMaster from WorkerResourceManager into QueryInProgress. I still keep allocateQueryMaster method in order to avoid radical API changes. But, allocateQueryMaster is changed to internally use allocateWorkerResources to allocate resources. In other words, unlike before, TajoWorkerResourceManager manages resources for both QueryMaster and worker in one resource pool management. Separate the heartbeat report service from Worker into WorkerHeartbeatService The unit tests are passed sucessfully. I've tested membership management of active/inactive workers, resource heartbeat, and some queries in a local cluster. In addition, I think that I don't have good naming sense. If you suggest better names for newly added classes, I'll appreciate it. Thanks!
        Hide
        tajoqa Tajo QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12631109/TAJO-602.patch
        against master revision 56fbd99.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 2 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The applied patch does not increase the total number of javadoc warnings.

        +1 checkstyle. The patch generated 0 code style errors.

        -1 findbugs. The patch appears to introduce 183 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in tajo-catalog/tajo-catalog-server tajo-common tajo-core/tajo-core-backend tajo-rpc.

        Test results: https://builds.apache.org/job/PreCommit-TAJO-Build/161//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-TAJO-Build/161//artifact/incubator-tajo/patchprocess/newPatchFindbugsWarningstajo-core-backend.html
        Console output: https://builds.apache.org/job/PreCommit-TAJO-Build/161//console

        This message is automatically generated.

        Show
        tajoqa Tajo QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12631109/TAJO-602.patch against master revision 56fbd99. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 2 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 javadoc. The applied patch does not increase the total number of javadoc warnings. +1 checkstyle. The patch generated 0 code style errors. -1 findbugs. The patch appears to introduce 183 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in tajo-catalog/tajo-catalog-server tajo-common tajo-core/tajo-core-backend tajo-rpc. Test results: https://builds.apache.org/job/PreCommit-TAJO-Build/161//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-TAJO-Build/161//artifact/incubator-tajo/patchprocess/newPatchFindbugsWarningstajo-core-backend.html Console output: https://builds.apache.org/job/PreCommit-TAJO-Build/161//console This message is automatically generated.
        Hide
        hudson Hudson added a comment -

        FAILURE: Integrated in Tajo-master-build #93 (See https://builds.apache.org/job/Tajo-master-build/93/)
        TAJO-602: WorkerResourceManager should be broke down into 3 parts. (hyunsik) (hyunsik: https://git-wip-us.apache.org/repos/asf/?p=incubator-tajo.git&a=commit&h=18aaa68e5ea8acb8edb7e2486d02758ab0478127)

        • tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/TajoWorkerContainer.java
        • tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/WorkerResourceManager.java
        • tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/querymaster/QueryInfo.java
        • tajo-core/tajo-core-backend/pom.xml
        • tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/TajoWorkerResourceManager.java
        • tajo-core/tajo-core-backend/src/test/java/org/apache/tajo/master/TestTajoResourceManager.java
        • tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/querymaster/QueryMaster.java
        • tajo-catalog/tajo-catalog-server/src/main/java/org/apache/tajo/catalog/CatalogServer.java
        • tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/TajoMasterClientService.java
        • tajo-common/src/main/java/org/apache/tajo/conf/TajoConf.java
        • tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/worker/TajoWorker.java
        • CHANGES.txt
        • tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/WorkerResource.java
        • tajo-rpc/src/main/java/org/apache/tajo/rpc/NettyServerBase.java
        • tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/TajoMasterService.java
        • tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/YarnTajoResourceManager.java
        • tajo-core/tajo-core-backend/src/main/resources/webapps/admin/query.jsp
        • tajo-core/tajo-core-backend/src/main/resources/webapps/admin/cluster.jsp
        • tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/querymaster/QueryInProgress.java
        • tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/worker/TajoResourceAllocator.java
        • tajo-core/tajo-core-backend/src/main/proto/TajoMasterProtocol.proto
        • tajo-core/tajo-core-backend/src/test/java/org/apache/tajo/TajoTestingCluster.java
        • tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/querymaster/QueryJobManager.java
        • tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/metrics/WorkerResourceMetricsGaugeSet.java
        • tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/WorkerStatus.java
        • tajo-core/tajo-core-backend/src/main/resources/webapps/admin/index.jsp
        • tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/TajoContainerProxy.java
        • tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/TajoMaster.java
        Show
        hudson Hudson added a comment - FAILURE: Integrated in Tajo-master-build #93 (See https://builds.apache.org/job/Tajo-master-build/93/ ) TAJO-602 : WorkerResourceManager should be broke down into 3 parts. (hyunsik) (hyunsik: https://git-wip-us.apache.org/repos/asf/?p=incubator-tajo.git&a=commit&h=18aaa68e5ea8acb8edb7e2486d02758ab0478127 ) tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/TajoWorkerContainer.java tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/WorkerResourceManager.java tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/querymaster/QueryInfo.java tajo-core/tajo-core-backend/pom.xml tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/TajoWorkerResourceManager.java tajo-core/tajo-core-backend/src/test/java/org/apache/tajo/master/TestTajoResourceManager.java tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/querymaster/QueryMaster.java tajo-catalog/tajo-catalog-server/src/main/java/org/apache/tajo/catalog/CatalogServer.java tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/TajoMasterClientService.java tajo-common/src/main/java/org/apache/tajo/conf/TajoConf.java tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/worker/TajoWorker.java CHANGES.txt tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/WorkerResource.java tajo-rpc/src/main/java/org/apache/tajo/rpc/NettyServerBase.java tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/TajoMasterService.java tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/YarnTajoResourceManager.java tajo-core/tajo-core-backend/src/main/resources/webapps/admin/query.jsp tajo-core/tajo-core-backend/src/main/resources/webapps/admin/cluster.jsp tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/querymaster/QueryInProgress.java tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/worker/TajoResourceAllocator.java tajo-core/tajo-core-backend/src/main/proto/TajoMasterProtocol.proto tajo-core/tajo-core-backend/src/test/java/org/apache/tajo/TajoTestingCluster.java tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/querymaster/QueryJobManager.java tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/metrics/WorkerResourceMetricsGaugeSet.java tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/WorkerStatus.java tajo-core/tajo-core-backend/src/main/resources/webapps/admin/index.jsp tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/TajoContainerProxy.java tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/TajoMaster.java
        Hide
        hyunsik Hyunsik Choi added a comment -

        This issue got +1 on RB. I committed it to master branch.

        Show
        hyunsik Hyunsik Choi added a comment - This issue got +1 on RB. I committed it to master branch.
        Hide
        hudson Hudson added a comment -

        SUCCESS: Integrated in Tajo-master-build #94 (See https://builds.apache.org/job/Tajo-master-build/94/)
        TAJO-602: WorkerResourceManager should be broke down into 3 parts. (missed files) (hyunsik: https://git-wip-us.apache.org/repos/asf/?p=incubator-tajo.git&a=commit&h=f5945424ed67fe18c67b15218c4176f592b145af)

        • tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/WorkerState.java
        • tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/worker/WorkerHeartbeatService.java
        • tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/WorkerEvent.java
        • tajo-core/tajo-core-backend/src/main/proto/ResourceTrackerProtocol.proto
        • tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/WorkerLivelinessMonitor.java
        • tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/WorkerEventType.java
        • tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/TajoRMContext.java
        • tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/WorkerStatusEvent.java
        • tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/TajoResourceTracker.java
        • tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/Worker.java
        • tajo-core/tajo-core-backend/src/test/java/org/apache/tajo/master/rm/TestTajoResourceManager.java
        • tajo-common/src/main/java/org/apache/tajo/util/ProtoBufUtil.java
        • tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/WorkerReconnectEvent.java
        Show
        hudson Hudson added a comment - SUCCESS: Integrated in Tajo-master-build #94 (See https://builds.apache.org/job/Tajo-master-build/94/ ) TAJO-602 : WorkerResourceManager should be broke down into 3 parts. (missed files) (hyunsik: https://git-wip-us.apache.org/repos/asf/?p=incubator-tajo.git&a=commit&h=f5945424ed67fe18c67b15218c4176f592b145af ) tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/WorkerState.java tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/worker/WorkerHeartbeatService.java tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/WorkerEvent.java tajo-core/tajo-core-backend/src/main/proto/ResourceTrackerProtocol.proto tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/WorkerLivelinessMonitor.java tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/WorkerEventType.java tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/TajoRMContext.java tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/WorkerStatusEvent.java tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/TajoResourceTracker.java tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/Worker.java tajo-core/tajo-core-backend/src/test/java/org/apache/tajo/master/rm/TestTajoResourceManager.java tajo-common/src/main/java/org/apache/tajo/util/ProtoBufUtil.java tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/WorkerReconnectEvent.java

          People

          • Assignee:
            hyunsik Hyunsik Choi
            Reporter:
            coderplay Min Zhou
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development