  1. Hadoop Common
  2. HADOOP-16284

KMS Cache Miss Storm


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.6.0
    • Fix Version/s: None
    • Component/s: kms
    • Labels: None
    • Environment: CDH 5.13.1, Kerberized, Cloudera Keytrustee Server

    Description

      We recently stumbled upon a performance issue with KMS, where it occasionally returned a "No content to map" error (this cluster ran an old version that doesn't have HADOOP-14841) and jobs crashed. We bumped the number of KMSes from 2 to 4, and the situation got even worse.

      Later, we realized this cluster had a few hundred encryption zones and a few hundred encryption keys. This is pretty unusual because most of the deployments known to us have at most a dozen keys. So in terms of number of keys, this cluster is 1-2 orders of magnitude larger than any other.

      The high number of encryption keys increases the likelihood of key cache misses in KMS. In Cloudera's setup, each cache miss forces KMS to sync with its backend, the Cloudera Keytrustee Server. On top of that, the larger number of KMSes amplifies the latency, effectively causing a cache miss storm.
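To make the effect concrete, here is a rough toy model, not the actual KMS code. It assumes each KMS instance keeps its own independent key cache with a fixed expiry, that requests are spread uniformly across instances and keys, and that every miss costs exactly one backend sync. The function name, request rate, and TTL are all made up for illustration. It only counts syncs and says nothing about latency itself.

```python
import random


def count_backend_syncs(num_keys, num_kms, num_requests=20000,
                        request_interval_s=0.1, cache_ttl_s=60.0, seed=0):
    """Toy model: count cache misses (backend syncs) across all KMS instances."""
    rng = random.Random(seed)
    # One independent cache per KMS instance: key id -> expiry time (assumption).
    caches = [dict() for _ in range(num_kms)]
    misses = 0
    for i in range(num_requests):
        now = i * request_interval_s
        kms = rng.randrange(num_kms)   # assumed: uniform load balancing
        key = rng.randrange(num_keys)  # assumed: uniform key popularity
        expiry = caches[kms].get(key)
        if expiry is None or expiry <= now:
            # Miss: this instance has to sync with the backend and re-cache.
            misses += 1
            caches[kms][key] = now + cache_ttl_s
    return misses


if __name__ == "__main__":
    # A dozen keys vs. a few hundred keys, and 2 vs. 4 KMS instances.
    for num_keys, num_kms in [(12, 2), (300, 2), (300, 4)]:
        misses = count_backend_syncs(num_keys, num_kms)
        print(f"keys={num_keys:4d} kms={num_kms} backend syncs={misses}")
```

Under these assumptions, the same request load produces far more backend syncs with a few hundred keys than with a dozen, and adding KMS instances raises the count further because each instance sees fewer requests per key and its cached entries expire before they are reused.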

      We were able to reproduce this issue with KMS-o-meter (HDFS-14312) - I will come up with a better name later surely - and discovered a scalability bug in CKTS. The fix was verified again with the tool.

      Filing this bug so the community is aware of this issue. I don't have a solution for now in KMS. But we want to address this scalability problem in the near future because we are seeing use cases that require thousands of encryption keys.


      On a side note, 4 KMSes don't work well without HADOOP-14445 (and subsequent fixes). A MapReduce job acquires at most 3 KMS delegation tokens, so in cases such as distcp it would fail to reach the 4th KMS on the remote cluster. I imagine similar issues exist for other execution engines, but I didn't test them.

      Attachments

        1. 4 kms, no KTS patch.png
          114 kB
          Wei-Chiu Chuang


          People

            Assignee: Unassigned
            Reporter: Wei-Chiu Chuang (weichiu)
