Hadoop Common / HADOOP-9640

RPC Congestion Control with FairCallQueue

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • 2.2.0, 3.0.0-alpha1
    • None
    • None
    • Enable optional RPC-level priority to combat congestion and make request latencies more consistent.

    Description

      For an easy-to-read summary see: http://www.ebaytechblog.com/2014/08/21/quality-of-service-in-hadoop/

      Several production Hadoop cluster incidents occurred where the Namenode was overloaded and failed to respond.

      We can improve quality of service for users during Namenode peak loads by replacing the FIFO call queue with a Fair Call Queue. (This plan supersedes rpc-congestion-control-draft-plan.)

      Excerpted from the communication of one incident, “The map task of a user was creating huge number of small files in the user directory. Due to the heavy load on NN, the JT also was unable to communicate with NN...The cluster became responsive only once the job was killed.”

      Excerpted from the communication of another incident, “Namenode was overloaded by GetBlockLocation requests (Correction: should be getFileInfo requests. the job had a bug that called getFileInfo for a nonexistent file in an endless loop). All other requests to namenode were also affected by this and hence all jobs slowed down. Cluster almost came to a grinding halt…Eventually killed jobtracker to kill all jobs that are running.”

      Excerpted from HDFS-945, “We've seen defective applications cause havoc on the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories (60k files) etc.”
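
      To illustrate the idea of replacing the FIFO call queue with a Fair Call Queue, here is a minimal sketch in Python. It is not the code in the attached patches: the class name FairCallQueueSketch, the share thresholds, and the round-robin weights are all hypothetical, chosen only to show how demoting heavy callers keeps light callers responsive.

```python
from collections import Counter, deque


class FairCallQueueSketch:
    """Toy fair call queue: heavy callers are demoted to lower-priority sub-queues."""

    def __init__(self, thresholds=(0.25, 0.5), weights=(4, 2, 1)):
        # Hypothetical tuning knobs: each threshold a caller's share of past
        # calls reaches pushes that caller one priority level down.
        assert len(weights) == len(thresholds) + 1
        self.thresholds = thresholds
        self.weights = weights      # calls served per visit to each level
        self.queues = [deque() for _ in weights]
        self.calls_by_user = Counter()
        self.total_calls = 0
        self.level = 0              # sub-queue currently being served
        self.served = 0             # calls served from it during this visit

    def put(self, user, call):
        # Share of all calls seen so far that came from this user.
        share = self.calls_by_user[user] / max(self.total_calls, 1)
        level = sum(share >= t for t in self.thresholds)
        self.queues[level].append((user, call))
        self.calls_by_user[user] += 1
        self.total_calls += 1

    def take(self):
        # Weighted round-robin: serve up to weights[level] calls from a level,
        # then move on. The extra iteration lets us return to the starting
        # level when it is the only non-empty one.
        for _ in range(len(self.queues) + 1):
            queue = self.queues[self.level]
            if queue and self.served < self.weights[self.level]:
                self.served += 1
                return queue.popleft()
            self.level = (self.level + 1) % len(self.queues)
            self.served = 0
        return None  # every sub-queue is empty


q = FairCallQueueSketch()
for _ in range(8):
    q.put("bulk_job", "create")        # one user floods the queue
q.put("alice", "getFileInfo")          # a light user arrives last
order = [q.take() for _ in range(9)]
print(order.index(("alice", "getFileInfo")))  # 0-based position in service order
```

      In this toy run the light user's call is served second rather than ninth, as it would be under FIFO, while the round-robin weights ensure the heavy user's calls are still served eventually.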

      Attachments

        1. faircallqueue.patch
          41 kB
          Chris Li
        2. faircallqueue2.patch
          73 kB
          Chris Li
        3. faircallqueue3.patch
          73 kB
          Chris Li
        4. faircallqueue4.patch
          74 kB
          Chris Li
        5. faircallqueue5.patch
          73 kB
          Chris Li
        6. faircallqueue6.patch
          74 kB
          Chris Li
        7. faircallqueue7_with_runtime_swapping.patch
          134 kB
          Chris Li
        8. FairCallQueue-PerformanceOnCluster.pdf
          694 kB
          Chris Li
        9. MinorityMajorityPerformance.pdf
          72 kB
          Chris Li
        10. NN-denial-of-service-updated-plan.pdf
          2.76 MB
          Chris Li
        11. rpc-congestion-control-draft-plan.pdf
          488 kB
          Xiaobo Peng

            People

              chrilisf Chris Li
              teledriver Xiaobo Peng
              Votes: 5
              Watchers: 91