Tajo / TAJO-540

(Umbrella) Implement Tajo Query Scheduler

    Details

    • Type: New Feature
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

Currently, there is no Tajo query scheduler, so all queries launched simultaneously compete for the cluster resources managed by TajoResourceManager.

In this issue, we will investigate, design, and implement a Tajo query scheduler. This is an umbrella issue for that work; we will create subtasks for the individual pieces.

        Issue Links

        There are no Sub-Tasks for this issue.

          Activity

          blrunner Jaehwa Jung added a comment -

          +1 for the idea.

          sirpkt Keuntae Park added a comment -

          +1 for the idea

I think this is an essential feature to support concurrent access to the Tajo cluster by multiple users.

          charsyam DaeMyung Kang added a comment -

          +1 for the idea.
I think this is a very cool feature.

          jihoonson Jihoon Son added a comment -

          +1 for this issue.
          It is mandatory for multi-query processing environments.

          coderplay Min Zhou added a comment -

As discussed with Hyunsik before, I suggest using Sparrow. The Sparrow scheduler is one of the best choices among low-latency schedulers of this kind.

          References:
          http://www.cs.berkeley.edu/~matei/papers/2013/sosp_sparrow.pdf
          https://github.com/radlab/sparrow

          coderplay Min Zhou added a comment - - edited

          Ok, I got time to write a more detailed plan for this ticket.

Historically, the first scheduler in the Hadoop ecosystem was the JobTracker in MapReduce. JobTracker actually plays two roles in a MapReduce cluster: one is resource management, and the other is job task scheduling. Because JobTracker plays both roles, its job response time and scalability are not good. This kind of issue also hit the ancestor of MapReduce, Google, which later started a project named Borg with one of its goals being to address this problem. Borg became the cluster resource management scheduler at Google, and according to their paper its current version is named Omega. (See https://medium.com/large-scale-data-processing/a7a81f278e6f )

Later, this kind of resource scheduler appeared in our field of vision: Mesos and Hadoop YARN. The difference between the two is that Mesos supports gang scheduling while YARN supports incremental scheduling. Both divide cluster scheduling into two layers. The higher layer is resource management, which is their responsibility: they control the resources for each application/framework/job. Meanwhile, the JobTracker's other role, job task scheduling, is pushed down into a lower layer: each application/framework/job's master coordinates the tasks for that application/framework/job.

From our benchmarking, a job with 10 sleep-zero-ms tasks in Hadoop 1.0 cost about 20 seconds because of JobTracker's scheduling, and Hadoop YARN takes about the same amount of time. What we need here is not a scheduler like MRAppMaster; it's a low-latency scheduler. From Jeff Dean's paper ( http://cacm.acm.org/magazines/2013/2/160173-the-tail-at-scale/abstract ), we learn that Google is always ahead of us: they developed a so-called tied-request technique to meet low-latency requirements. Please see the tied-request section in http://static.googleusercontent.com/media/research.google.com/en//people/jeff/MIT_BigData_Sep2012.pdf if you can't download the ACM paper.

What we need here is essentially Google's tied-request style of scheduling. Fortunately, we have a good candidate, Sparrow, which was actually the scheduler of the first (C++) version of Impala and will be plugged into Spark.
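As a rough illustration of the tied-request idea mentioned above, here is a toy sketch (the class name is hypothetical, and this is neither Tajo nor Google code): two copies of a request run on a small thread pool, the first copy to complete wins, and the remaining copy is cancelled.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Toy model of "tied requests": the same request is sent to two servers,
// the first result wins, and the losing copy is cancelled.
public class TiedRequestSketch {
    public static String firstOf(Callable<String> a, Callable<String> b)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // invokeAny returns the first successfully completed result
            // and cancels the tasks that have not finished yet.
            return pool.invokeAny(List.of(a, b));
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The real technique also shares state between the two server queues so one copy can be revoked the moment the other starts executing; the sketch only captures the first-result-wins-and-cancel behavior.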

I'd like to port Sparrow into Tajo, but before that I think we need to discuss a few things first, because the structure will be radically changed.

To be continued.

          jihoonson Jihoon Son added a comment -

          Thanks, Min.
          It's really interesting. I'll investigate, too.

          sirpkt Keuntae Park added a comment -

          +1 for the idea,
          and your comment is really useful, Min.
          I'll check the papers you suggested.

coderplay Min Zhou added a comment - - edited

For more information, I'd like to add some references.

Mesos:
http://www.wired.com/wiredenterprise/2013/03/google-borg-twitter-mesos/2/
http://www.cs.berkeley.edu/~matei/papers/2011/nsdi_mesos.pdf
http://mesos.berkeley.edu/mesos_tech_report.pdf

Yarn:
https://issues.apache.org/jira/browse/MAPREDUCE-279
https://issues.apache.org/jira/secure/attachment/12486023/MapReduce_NextGen_Architecture.pdf

Sparrow on Spark:
https://github.com/kayousterhout/spark/tree/sparrow/core/src/main/scala/spark/scheduler/sparrow
          coderplay Min Zhou added a comment - - edited

Continuing my previous two comments: Sparrow improves on the "power of two choices" algorithm in two respects: 1) queue length alone can't accurately measure the real completion time of a task, and 2) the concurrent scheduling problem. You can check the Sparrow paper for the details.

As I mentioned, if we leverage a low-latency scheduler in an interactive or real-time system, we need to radically change the current design of Tajo's scheduling.

Firstly, the way we use YARN is quite different from Spark and Impala. Our resource requests are issued by Tajo workers, one container for one task/queryunit attempt. Spark and Impala, in contrast, use YARN as a higher-layer scheduler for resource management, and use a Sparrow(-like) scheduler as their own internal lower-layer scheduler for low latency. YARN is used to allocate the resources for a whole Spark/Impala cluster, not for a task. For example, suppose a Spark cluster has 1 master and 10 slaves, the master needs 10GB of memory, and each slave needs 20GB. YARN allocates a 10GB container for the master daemon and a 20GB container for each slave daemon. Because those daemons are long-lived processes, those resources are occupied by the Spark cluster for a long time; YARN revokes them only when a slave is decommissioned from the cluster. Here is my thought on the Tajo query scheduler: we can use YARN as the higher-layer resource manager, allocating CPU/memory resources to the Tajo master/QueryMaster/worker daemons, while a Sparrow-like scheduler coordinates queries over those resources in a lower layer.

Secondly, using Sparrow directly is not appropriate, for three reasons: 1) Sparrow needs to start a scheduler daemon on each machine, which is not convenient to operate; 2) Sparrow supports multitenancy, in other words it has user authentication, which Tajo doesn't support yet; 3) Sparrow currently can't kill a job. But the algorithm behind Sparrow is quite suitable for Tajo.
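For a feel of the algorithm being discussed, here is a toy sketch of Sparrow-style batch sampling, the generalization of the power of two choices: for a job of m tasks, probe d*m random workers and place the tasks on the m probed workers with the shortest queues. All names are hypothetical; this is not Sparrow's actual code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Illustrative sketch of batch sampling; not Tajo or Sparrow code.
public class BatchSamplingSketch {
    public static int[] schedule(int[] queueLengths, int tasks, int probeRatio, long seed) {
        // sample d*m distinct workers to probe (capped at the cluster size)
        List<Integer> candidates = new ArrayList<>();
        for (int i = 0; i < queueLengths.length; i++) candidates.add(i);
        Collections.shuffle(candidates, new Random(seed));
        int probes = Math.min(tasks * probeRatio, queueLengths.length);
        List<Integer> probed = new ArrayList<>(candidates.subList(0, probes));

        // place the tasks on the probed workers with the shortest queues
        probed.sort(Comparator.comparingInt(w -> queueLengths[w]));
        int[] placement = new int[tasks];
        for (int t = 0; t < tasks; t++) placement[t] = probed.get(t);
        return placement;
    }
}
```

Sparrow refines this further with late binding (workers reply only when ready to run the task), which addresses the queue-length inaccuracy noted above.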

          hyunsik Hyunsik Choi added a comment -

          Min,

          Thank you for sharing informative materials, history of resource manager and scheduler, and your ideas.
There are many design considerations, and I need more time to digest them. I'll leave some comments soon.

          coderplay Min Zhou added a comment -

          I'd like to write more about the code implementation.

Currently, Tajo has two scheduling paths. One is standalone mode, where queries are scheduled by Tajo itself; the other is YARN mode, where query resources are allocated by a YARN cluster.

Regarding the standalone mode: when TajoMaster receives a query, it first uses TajoWorkerResourceManager to choose an idle QueryMaster for the query. The query is then sent to that QueryMaster, which breaks it into several execution blocks, each consisting of several tasks. The TajoResourceAllocator residing in that QueryMaster sends an RPC call, TajoMasterProtocol.allocateWorkerResources(), to TajoMaster, which then uses TajoWorkerResourceManager again to allocate workers for the execution block. YARN mode is quite similar.

What we want to change is the standalone mode, right?
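The flow above could be caricatured as follows. These are hypothetical, drastically simplified stand-ins for the real Tajo classes, and the allocateWorkerResources() RPC is modeled as a direct method call.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Toy stand-ins; the real TajoMaster / TajoWorkerResourceManager /
// QueryMaster classes are far richer than this sketch.
public class StandaloneFlowSketch {
    static class WorkerResourceManager {
        private final Queue<String> idleQueryMasters =
            new ArrayDeque<>(List.of("qm-1", "qm-2"));

        String chooseIdleQueryMaster() { return idleQueryMasters.poll(); }

        List<String> allocateWorkers(int tasks) {
            // toy model: one worker slot per task in the execution block
            return IntStream.range(0, tasks)
                .mapToObj(i -> "worker-" + i)
                .collect(Collectors.toList());
        }
    }

    static List<String> runQuery(WorkerResourceManager rm, int executionBlockTasks) {
        // 1) TajoMaster picks an idle QueryMaster for the query
        String qm = rm.chooseIdleQueryMaster();
        if (qm == null) throw new IllegalStateException("no idle QueryMaster");
        // 2) the QueryMaster's resource allocator calls back to TajoMaster
        //    (in reality an RPC, here a direct call)
        return rm.allocateWorkers(executionBlockTasks);
    }
}
```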

          jihoonson Jihoon Son added a comment -

I also think so. Actually, in my opinion, the current implementation of YARN mode is not useful because its scheduling latency is too high. I think we should adopt the way that Spark and Impala use YARN.

          coderplay Min Zhou added a comment - - edited

Go ahead. From my deeper investigation, we can keep the YARN path; it only needs some refactoring to keep the same interface as standalone-mode scheduling.

Currently, standalone-mode scheduling is essentially centralized FIFO scheduling: if the previous query occupies all of the workers' slots, the succeeding query is blocked. We have two choices. The first is to change the FIFO strategy to another one, like fair share, but this Hadoop-JobTracker-like scheduling can't achieve very low latency or good scalability. The second is porting Sparrow into Tajo.

If we want to port Sparrow, we need to do one thing in advance. Because Sparrow is a decentralized algorithm, typically every node has a scheduling service deployed, and those schedulers need to know every node's status. Impala has a Statestore daemon to offer this kind of service;
see the Impala Statestore section at http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_concepts.html
This is also called service discovery. Facebook Presto has such a component as well;
see https://github.com/facebook/presto/blob/master/presto-main/src/main/java/com/facebook/presto/metadata/DiscoveryNodeManager.java

For the long term, we need to add a service discovery component, not only for scheduling but also for high availability. Fortunately, we needn't build service discovery from scratch; there are a lot of open source projects for this, and one of the most famous is ZooKeeper.
see http://www.javacodegeeks.com/2013/11/coordination-and-service-discovery-with-apache-zookeeper.html
A better library built on top of ZooKeeper: https://github.com/Netflix/curator/wiki/Service-Discovery
see http://blog.palominolabs.com/2012/08/14/using-netflix-curator-for-service-discovery/

For the short term, I think TajoMaster already holds the status of all workers. Each worker can fetch all workers' addresses through an RPC to TajoMaster. If we have such information on the worker side, we can embed a Sparrow-like scheduler as an optional service in the worker.
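As a minimal sketch of what such a statestore/discovery service offers, here is a hypothetical in-memory registry: workers register themselves and any node can look up the live membership. A real deployment would sit on ZooKeeper/Curator or on the TajoMaster RPC mentioned above; all names are illustrative.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy in-memory service registry; not a real discovery implementation.
public class DiscoverySketch {
    private final Map<String, String> liveWorkers = new ConcurrentHashMap<>();

    // a worker announces itself (in ZooKeeper this would be an ephemeral node)
    public void register(String workerId, String address) {
        liveWorkers.put(workerId, address);
    }

    public void deregister(String workerId) {
        liveWorkers.remove(workerId);
    }

    // any scheduler can ask for the current live membership
    public Set<String> liveWorkerIds() {
        return liveWorkers.keySet();
    }

    public String addressOf(String workerId) {
        return liveWorkers.get(workerId);
    }
}
```

The point of using ZooKeeper-backed ephemeral registrations instead of a map like this is that a crashed worker disappears from the membership automatically, which is what high availability requires.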

          coderplay Min Zhou added a comment -

Here is the document describing how Impala integrates with YARN: http://cloudera.github.io/llama/

          coderplay Min Zhou added a comment -

          Jihoon Son

It seems TajoWorkerResourceManager.WorkerResourceAllocationThread.chooseWorkers() tries to solve the problem of which kind of resource, memory or disk, is more critical to a container's requirements, right?

Do you know how YARN solves it? Here is a paper whose approach is used in all of YARN's schedulers:
          http://www.cs.berkeley.edu/~matei/papers/2011/nsdi_drf.pdf
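For intuition, the core of Dominant Resource Fairness (DRF) from the linked paper can be sketched in a few lines: a user's dominant share is the largest fraction, over resource types, of the cluster that the user currently holds, and DRF repeatedly grants the next task to the user with the smallest dominant share. Names are illustrative; this is not YARN's implementation.

```java
// Illustrative DRF sketch, not production scheduler code.
public class DrfSketch {
    // dominant share = max over resources of (allocated / total)
    public static double dominantShare(double[] allocated, double[] total) {
        double max = 0.0;
        for (int r = 0; r < total.length; r++) {
            max = Math.max(max, allocated[r] / total[r]);
        }
        return max;
    }

    // index of the user whose dominant share is currently smallest
    public static int nextUser(double[][] allocations, double[] total) {
        int best = 0;
        double bestShare = Double.MAX_VALUE;
        for (int u = 0; u < allocations.length; u++) {
            double s = dominantShare(allocations[u], total);
            if (s < bestShare) { bestShare = s; best = u; }
        }
        return best;
    }
}
```

For example, in a <9 CPU, 18 GB> cluster, a user holding <2 CPU, 8 GB> has dominant share 8/18 (memory-dominant) while one holding <6 CPU, 2 GB> has 6/9 (CPU-dominant), so DRF serves the first user next.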

          jihoonson Jihoon Son added a comment -

Min, I really appreciate you sharing the various articles and papers.
As for your question, you are right: TajoWorkerResourceManager manages and schedules the worker resources.

Actually, I don't have a sufficient background in query scheduling, so I'm still investigating to understand it deeply. I'll read the DRF paper, too.

          Many thanks,
          Jihoon

          coderplay Min Zhou added a comment -

          Here is a comment from Mafish on TAJO-611

          Hi There,
          I've done some basic investigation on the discovery service. Min give a very detailed discussion about the resource managements on the comment section of Tajo-540. That's very useful for me. Thanks Min. Now I have some questions to discuss.
          What's the current resource management mechanism in Tajo and what are related classes? Based on your previous discussion, it seems Tajo uses Yarn, but not at CPU/Memory level. Do we need a resource management with more granularity? It seem this question is more related to Tajo-540.

YARN supports CPU/memory-style resource management. Tajo right now can't accurately calculate the exact CPU/memory resources a task should ask for; instead it requests a rough amount from YARN. Nevertheless, Tajo at least supports resource management in both YARN and standalone modes, and we can improve it as our statistics become richer.

          jihoonson Jihoon Son added a comment -

          Hi, Min Zhou.

Apologies for the late reply. It took so long because I don't have much background in query scheduling.
Actually, I'm still investigating more research on query scheduling, but I've just read the Sparrow paper.
I'm very impressed by Sparrow and agree that we should adopt it for low-latency scheduling.
Please go on with this issue, and share your progress occasionally.

          Thanks,
          Jihoon

          coderplay Min Zhou added a comment -

          Hi Jihoon,

Sorry that I haven't updated for so long. I will reply soon.

          Min

          hyunsik Hyunsik Choi added a comment -

I'm sorry too for being slow to comment. I'll comment soon.

          hyunsik Hyunsik Choi added a comment -

          Hi Min Zhou,

Thank you for sharing the nice articles. I'm very sorry for the late response; I've been preparing the 0.8.0 release and busy with my day job. Also, for the past month I've been investigating the background material in this area that you listed.

I'd like to discuss some of the questions and suggestions that you raised. First of all, I agree with your suggestion to port Sparrow to Tajo. We need the query scheduler to support multiple users and multiple running queries well, and as you mentioned, Sparrow fits these requirements while remaining low latency.

Also, you were concerned about the way we use YARN, and you proposed two approaches. One of them is to use YARN as the higher-layer resource manager and a Sparrow-like scheduler as the lower-layer scheduler. +1 for this suggestion.

One of the problems you raised is that this approach will cause radical change. In particular, there are two types of schedulers, and the many dependent modules make this work very hard.

So, I suggest commenting out TajoYarnResourceManager and all the scheduler code related to YARN. Then we can focus only on the Sparrow-like scheduler for the standalone mode; later, we can restore the YARN scheduler. I think this approach will make our work go faster.

          What do you think about my proposal? After we get some agreement, we can discuss several stages for this work. I'm looking forward to your response.

          hsaputra Henry Saputra added a comment -

If no one objects, I would like to add a subtask to this ticket for integration with YARN as the resource manager.

          hyunsik Hyunsik Choi added a comment -

          Hi Henry Saputra,

Sounds nice. I think it would be better to create another top-level JIRA issue for YARN integration. In my opinion, the integration is related to this issue, but it should not be a subtask of it.

          Regards,
          Hyunsik

          hsaputra Henry Saputra added a comment -

Ah OK, thanks Hyunsik, will do.

          coderplay Min Zhou added a comment -

          Hyunsik Choi

Sure. Since no one uses that YARN mode, commenting it out is an agile way forward. In fact, I was getting lost in that YARN-related code.
Nice!

          Min

          hyunsik Hyunsik Choi added a comment - - edited

          Hi Min,

After TAJO-752, I'd like to comment out the YARN-related code in a subtask.

In addition to TAJO-603, I'd like to actively participate in this work. Also, I propose making another branch for this work and its subtasks. In that branch, we can work freely without worrying about broken or incomplete features while we do it.

Just a question: will you reuse some of the existing Sparrow source code, or implement it from scratch?

Also, I think that this issue may include the following:

          • Refactoring Worker containers
            • Implementing Task Queues/node monitor in Worker
          • Refactoring RPC APIs among QueryMaster, TajoMaster, and Workers
          • Implementing Sparrow-like scheduler in QueryMaster

In particular, Worker and the RPC APIs are deeply entangled with other parts, and their code is somewhat messy. If it's OK with you, I'd like to start by refactoring Worker and the RPC APIs so that you can do this work more easily. What do you think?

          Best regards,
          Hyunsik

          coderplay Min Zhou added a comment -

          I think implementing it from scratch is more reasonable because Tajo has its own specific requirements, for example resource-aware scheduling, disk volume id awareness, etc.

          It would be great if you do this, thank you very much! I really appreciate it.

          Best regards,
          Min
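
          The "disk volume id aware" requirement mentioned above could be sketched roughly as follows: among the worker volumes that already host a task's data block, prefer the least-loaded one. This is only an illustration; the names and the load metric are hypothetical, not Tajo APIs.

          ```java
          import java.util.Map;
          import java.util.Set;

          // Illustrative sketch of disk-volume-aware placement: prefer the
          // candidate volume that already hosts the task's data block and has
          // the fewest running tasks. All names here are hypothetical.
          public class VolumeAwareSketch {
              // volumesHoldingBlock: "worker:volumeId" keys for volumes holding the block.
              // runningTasksPerVolume: current task count per volume.
              static String pickVolume(Set<String> volumesHoldingBlock,
                                       Map<String, Integer> runningTasksPerVolume) {
                  String best = null;
                  int bestLoad = Integer.MAX_VALUE;
                  for (String v : volumesHoldingBlock) {
                      int load = runningTasksPerVolume.getOrDefault(v, 0);
                      if (load < bestLoad) {
                          bestLoad = load;
                          best = v;
                      }
                  }
                  // null means the block is on no known volume; a real scheduler
                  // would fall back to a remote read in that case.
                  return best;
              }
          }
          ```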

          hyunsik Hyunsik Choi added a comment -

          I got your point. Let's go ahead. I'll start the removal of the Yarn part. Thanks!

          Best regards,
          Hyunsik

          jihoonson Jihoon Son added a comment -

          Sorry.
          The assignment was my mistake.

          jhkim Jinho Kim added a comment -

          Guys, could you share the progress on this issue? If not, I will start on the TajoResourceManager,
          because we need physical memory management.

          hyunsik Hyunsik Choi added a comment -

          Hi Jinho Kim,

          Probably there is no progress. I already know that you built a fair scheduler and deployed it to production clusters at a company. I think we can continue this work from yours.

          But it is a huge amount of work. Before starting the implementation, we need enough discussion and need to plan the details.

          jhkim Jinho Kim added a comment -

          Hi Hyunsik Choi,
          Actually, I have no detailed plan yet.
          I did implement a trunk-based fair scheduler, and it does not scale out on a big cluster.
          In my opinion, we must change the ResourceManager.

          Here is my simple plan:
          1. Add a ResourceManager in TajoWorker (considering physical memory and CPU)
          2. Add a WorkerMonitor and TaskSchedulers in TajoWorker (Sparrow-like)
          3. Add a TaskExecutor in TajoWorker
          4. Add a TaskLoadBalancer in QueryMaster (probes workers and throttles tasks)
          5. Add a FrontendQueryScheduler in TajoMaster

          I will attach a sequence diagram.
          If you have any ideas, please feel free to share your opinion.
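
          Step 4 of the plan above (the QueryMaster probing workers) is essentially Sparrow's batch sampling: to place m tasks, probe d·m randomly chosen workers and assign the tasks to the least-loaded ones. A minimal sketch under those assumptions; the class name and the queue-length load metric are illustrative, not Tajo code.

          ```java
          import java.util.ArrayList;
          import java.util.Collections;
          import java.util.Comparator;
          import java.util.List;
          import java.util.Random;

          // Sketch of Sparrow-style batch sampling for task placement.
          public class BatchSamplingSketch {
              static final int PROBE_RATIO = 2;  // "d" in the Sparrow paper
              static final Random RAND = new Random();

              // workerQueueLengths[i] = queued tasks on worker i.
              // Assumes numTasks <= number of workers.
              static List<Integer> placeTasks(int numTasks, int[] workerQueueLengths) {
                  int numProbes = Math.min(workerQueueLengths.length, PROBE_RATIO * numTasks);
                  // Sample workers to probe, without replacement.
                  List<Integer> candidates = new ArrayList<>();
                  for (int i = 0; i < workerQueueLengths.length; i++) {
                      candidates.add(i);
                  }
                  Collections.shuffle(candidates, RAND);
                  List<Integer> probed = candidates.subList(0, numProbes);
                  // Assign tasks to the probed workers with the shortest queues.
                  probed.sort(Comparator.comparingInt(w -> workerQueueLengths[w]));
                  return new ArrayList<>(probed.subList(0, numTasks));
              }
          }
          ```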

          jihoonson Jihoon Son added a comment -

          Jinho Kim, thanks for sharing your plan.
          However, IMHO, we need to start by drawing a big picture of this issue.
          The big picture may contain your goals, challenges, problem-solving approaches, etc.
          Do you plan to implement a Sparrow-like query scheduler?

          In addition, it would be great if you could share some background materials as Min Zhou did.

          Thanks,
          Jihoon

          jhkim Jinho Kim added a comment -

          Jihoon Son
          What do you mean by the big picture? Do you mean 'fault tolerance', 'speculative execution', and so on?
          I briefly described my simple plan. I think we should discuss the skeleton of the scheduler in TAJO-540.

          I've just referred to the Sparrow paper and source code, because @coderplay already described the materials well:
          http://www.cs.berkeley.edu/~matei/papers/2013/sosp_sparrow.pdf
          https://github.com/radlab/sparrow

          jihoonson Jihoon Son added a comment -

          Jinho Kim, thanks for your reply.
          I meant that we need to talk about implementation goals, problems that may come up during the implementation, and solutions to those problems.
          If you want to implement a Sparrow-like query scheduler, the implementation goals will be low latency and load balancing. (It would be even better if an SLA scheme were supported.)
          Since you took over this issue from Min Zhou after quite a long time had passed, I wondered about your intention.

          Anyway, you seem to be implementing a Sparrow-like query scheduler.
          Is it just a port of Sparrow to Tajo?
          Even if it is, it would help us understand your implementation if you shared your problem-solving approach at a logical level, instead of at the implementation level.

          jihoonson Jihoon Son added a comment -

          I had an offline discussion with Jinho Kim.
          I misunderstood the simple plan described above as his implementation plan, but it was not.
          So, I'll wait for his sequence diagram describing the overall behavior of his suggestion.
          Thanks for the valuable discussion, Jinho Kim.

          hyunsik Hyunsik Choi added a comment - - edited

          Jinho Kim,

          Sounds nice. I'm looking forward to seeing your sequence diagram. We can start the discussion from your initial work.

          The main goal is to support multi-tenancy and capacity scheduling for queries. I forgot to mention it in the description; I'll rewrite the issue description.

          jhkim Jinho Kim added a comment -

          Thanks Jihoon Son, Hyunsik Choi.
          Actually, this issue is huge, so I will focus on task scheduling for now.

          hyunsik Hyunsik Choi added a comment -

          Task scheduling and query scheduling are mostly related to each other. Don't worry about it. I can help with this work, and I'd like to participate in it. Let's do it together.

          jhkim Jinho Kim added a comment - - edited

          Hello guys

          Recently I've been looking for a capacity scheduler solution on the worker node, but it is very difficult and poses many challenges for me.
          Jihoon suggested the Omega scheduler to me. It is very interesting and will help us implement our scheduler:
          http://research.google.com/pubs/pub41684.html

          As a first step, I suggest borrowing the Yarn scheduler queue into Tajo. It will make it easy to add a multi-tenant query scheduler.
          What do you think about borrowing the Yarn scheduler queue for the initial implementation?
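
          For reference, the Yarn capacity scheduler expresses its queue hierarchy in capacity-scheduler.xml, so a borrowed Tajo version might start from the same shape. The queue names and percentages below are purely illustrative:

          ```xml
          <!-- Illustrative capacity-scheduler.xml-style layout; the "adhoc" and
               "etl" queues and their capacities are hypothetical examples. -->
          <configuration>
            <property>
              <name>yarn.scheduler.capacity.root.queues</name>
              <value>adhoc,etl</value>
            </property>
            <property>
              <name>yarn.scheduler.capacity.root.adhoc.capacity</name>
              <value>60</value>
            </property>
            <property>
              <name>yarn.scheduler.capacity.root.etl.capacity</name>
              <value>40</value>
            </property>
          </configuration>
          ```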

          jihoonson Jihoon Son added a comment -

          Jinho Kim

          It looks good to start with providing a multi-tenant query scheduler by borrowing the Yarn scheduler.
          However, as you know, the Yarn scheduler has some problems such as high scheduling latency.
          Do you have any good ideas?

          jhkim Jinho Kim added a comment -

          Jihoon Son

          I think that refactoring the current resource manager is a good approach. This model shows reasonably low latency (< 100 ms).
          But we must fix the resource leak problem we've found.

          hyunsik Hyunsik Choi added a comment - - edited

          Basically, our query scheduler's objective is different from Yarn's. In my opinion, we just need to refer to the multi-queue fair scheduler policy.
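
          A multi-queue fair policy essentially boils down to picking the queue that is furthest below its weighted fair share. A minimal sketch, where the queues, weights, and the "running queries" usage metric are illustrative assumptions rather than Tajo code:

          ```java
          import java.util.Map;

          // Sketch of a multi-queue fair scheduling policy: the next query is
          // taken from the queue with the lowest usage-to-weight ratio.
          public class FairQueuePolicySketch {
              static String pickNextQueue(Map<String, Integer> runningPerQueue,
                                          Map<String, Double> weightPerQueue) {
                  String best = null;
                  double bestRatio = Double.MAX_VALUE;
                  for (Map.Entry<String, Double> e : weightPerQueue.entrySet()) {
                      // Usage normalized by the queue's weight; lower means the
                      // queue is further below its fair share.
                      double ratio = runningPerQueue.getOrDefault(e.getKey(), 0) / e.getValue();
                      if (ratio < bestRatio) {
                          bestRatio = ratio;
                          best = e.getKey();
                      }
                  }
                  return best;
              }
          }
          ```

          For example, with equal weights, a queue running 1 query is picked over a queue running 4, so idle tenants catch up to their share.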

          jihoonson Jihoon Son added a comment -

          I got your point.
          Please go ahead.

          jhkim Jinho Kim added a comment -

          Thank you for the comments.
          Hyunsik Choi, you're right. I just need to refer to the scheduler policy.
          First, I will start a simulation of the resource manager with the scheduler policies (FIFO and fair).

          hyunsik Hyunsik Choi added a comment - - edited

          In general, a query (or job) scheduler aims at maximum resource utilization. For multi-tenancy, we also need to consider fairness across multiple users (or queries). However, maximum resource utilization and fairness usually conflict with each other. To mitigate this problem, many schedulers seem to use a preemption approach.

          In this respect, our resource and scheduler system has the following problems:

          • A query exclusively holds its initially allocated resources until it completes or fails.
          • There is no mechanism to deallocate resources during query processing.
          • Preemption is also not allowed.

          To achieve multi-tenancy, we should change our resource circulation. In particular, resource allocation must be fine-grained instead of per-query.

          So, I'll create a JIRA issue to change the resource circulation. In my opinion, we have to do that issue first. If we achieve it, implementing a multi-tenant scheduler will be much easier than it is now. It would be a good starting point for this issue.
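
          The fine-grained circulation described above could be sketched as a shared pool that tasks borrow from and return to, instead of a query holding its full allocation until it finishes. The memory-only pool and all names below are illustrative assumptions, not Tajo code.

          ```java
          import java.util.concurrent.Semaphore;

          // Sketch of per-task (fine-grained) resource circulation backed by a
          // counting semaphore, where each permit represents 1 MB of memory.
          public class ResourcePoolSketch {
              private final Semaphore memoryMb;

              ResourcePoolSketch(int totalMemoryMb) {
                  memoryMb = new Semaphore(totalMemoryMb);
              }

              // Returns true if the task's memory could be reserved without blocking.
              boolean tryAllocate(int taskMemoryMb) {
                  return memoryMb.tryAcquire(taskMemoryMb);
              }

              // Called when a task finishes or is preempted; frees memory for
              // other queries immediately instead of at end of query.
              void release(int taskMemoryMb) {
                  memoryMb.release(taskMemoryMb);
              }

              int availableMb() {
                  return memoryMb.availablePermits();
              }
          }
          ```

          Because resources flow back per task, a preempted or finished task frees capacity for other tenants right away, which is what makes fairness enforcement practical.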

          hyunsik Hyunsik Choi added a comment -

          I created the above issue at TAJO-1397.

          hyunsik Hyunsik Choi added a comment - - edited

          I changed the Sparrow-like scheduler to 'WISH' rather than a subtask because we have not decided to use it yet.

          jhkim Jinho Kim added a comment -

          Hyunsik Choi
          This is a good approach for multi-tenancy.


            People

            • Assignee: Unassigned
            • Reporter: hyunsik Hyunsik Choi
            • Votes: 0
            • Watchers: 11

              Dates

              • Created:
              • Updated:

                Development