Hadoop Common / HADOOP-9640

RPC Congestion Control with FairCallQueue

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • 2.2.0, 3.0.0-alpha1
    • None
    • None
    • Enable optional RPC-level priority to combat congestion and make request latencies more consistent.

    Description

      For an easy-to-read summary see: http://www.ebaytechblog.com/2014/08/21/quality-of-service-in-hadoop/

      In several production Hadoop cluster incidents, the NameNode was overloaded and failed to respond.

      We can improve quality of service for users during NameNode peak loads by replacing the FIFO call queue with a Fair Call Queue. (This plan supersedes rpc-congestion-control-draft-plan.)

      Excerpted from the communication of one incident, “The map task of a user was creating huge number of small files in the user directory. Due to the heavy load on NN, the JT also was unable to communicate with NN...The cluster became responsive only once the job was killed.”

      Excerpted from the communication of another incident, “Namenode was overloaded by GetBlockLocation requests (Correction: should be getFileInfo requests. the job had a bug that called getFileInfo for a nonexistent file in an endless loop). All other requests to namenode were also affected by this and hence all jobs slowed down. Cluster almost came to a grinding halt…Eventually killed jobtracker to kill all jobs that are running.”

      Excerpted from HDFS-945, “We've seen defective applications cause havoc on the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories (60k files) etc.”
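
      To illustrate the idea of replacing a FIFO call queue with a fair one, here is a minimal sketch. It is not the code from the attached patches: the class name FairCallQueueSketch, the per-user call counter, the callsPerLevel threshold, and the weights are all assumptions made for illustration. Calls from users who have issued many requests are placed in a lower-priority sub-queue, and a weighted round-robin reader drains the sub-queues so that lower levels are slowed down but never starved.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Hypothetical sketch of a fair call queue; not the class from the patches. */
class FairCallQueueSketch {
    private final List<Queue<String>> levels = new ArrayList<>();
    private final int[] weights;      // calls served per turn at each level (assumed, all > 0)
    private final int callsPerLevel;  // assumed threshold: calls a user may make before demotion
    private final Map<String, Integer> callCounts = new HashMap<>();
    private int current = 0;          // level currently being served
    private int servedInTurn = 0;     // calls taken from `current` during this turn

    FairCallQueueSketch(int[] weights, int callsPerLevel) {
        this.weights = weights.clone();
        this.callsPerLevel = callsPerLevel;
        for (int i = 0; i < weights.length; i++) {
            levels.add(new ArrayDeque<>());
        }
    }

    /** Scheduling rule (assumed): the more calls a user has made, the lower the level. */
    int priorityFor(String user) {
        int count = callCounts.merge(user, 1, Integer::sum);
        return Math.min(levels.size() - 1, (count - 1) / callsPerLevel);
    }

    /** Enqueue a call into the sub-queue chosen for its user. */
    void put(String user, String call) {
        levels.get(priorityFor(user)).add(call);
    }

    /** Weighted round-robin over the levels; returns null when every level is empty. */
    String poll() {
        for (int tried = 0; tried <= levels.size(); tried++) {
            Queue<String> q = levels.get(current);
            if (!q.isEmpty() && servedInTurn < weights[current]) {
                servedInTurn++;
                return q.poll();
            }
            // This level is empty or has used up its turn: move on to the next one.
            current = (current + 1) % levels.size();
            servedInTurn = 0;
        }
        return null;
    }
}
```

      With weights {2, 1} and callsPerLevel 2, a user who floods the queue has only the first two calls placed at the top level; the rest wait at the lower level, so a later call from a light user is served ahead of most of that backlog. A real implementation would also need thread safety, bounded capacity, and some way to age out old call counts, none of which this sketch attempts.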

      Attachments

        1. rpc-congestion-control-draft-plan.pdf
          488 kB
          Xiaobo Peng
        2. faircallqueue.patch
          41 kB
          Chris Li
        3. NN-denial-of-service-updated-plan.pdf
          2.76 MB
          Chris Li
        4. MinorityMajorityPerformance.pdf
          72 kB
          Chris Li
        5. faircallqueue2.patch
          73 kB
          Chris Li
        6. faircallqueue3.patch
          73 kB
          Chris Li
        7. faircallqueue4.patch
          74 kB
          Chris Li
        8. faircallqueue5.patch
          73 kB
          Chris Li
        9. faircallqueue6.patch
          74 kB
          Chris Li
        10. faircallqueue7_with_runtime_swapping.patch
          134 kB
          Chris Li
        11. FairCallQueue-PerformanceOnCluster.pdf
          694 kB
          Chris Li

          People

            chrilisf Chris Li
            teledriver Xiaobo Peng
            Votes: 5
            Watchers: 94
