  1. Ambari
  2. AMBARI-14708

LDAP Requests Via nslcd Take Too Long In Some Organizations


Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 2.2.1
    • None
    • None

    Description

      When restarting a large cluster where LDAP is used indirectly through
      nslcd, the LDAP servers are put under heavy load. This is most evident in
      organizations whose LDAP directories are large to begin with. The nslcd
      debug output shows repeated full group enumerations like this one:

      connection from pid=12345 uid=0 gid=0
      nslcd_group_all()
      myldap_search(base="cn=groups,cn=accounts,dc=corp,dc=local",
      filter="(objectClass=posixGroup)")
      ldap_result(): end of results

      It turns out that the processes issuing these requests are the before-ANY hook script, which runs whenever a service is started. This is the one I ran locally to reproduce the query pattern:

      /usr/bin/python2.6 /var/lib/ambari-agent/cache/stacks/HDP/2.0.6/hooks/before-ANY/scripts/hook.py ANY /var/lib/ambari-agent/data/command-5950.json /var/lib/ambari-agent/cache/stacks/HDP/2.0.6/hooks/before-ANY /var/lib/ambari-agent/data/structured-out-5950.json INFO /var/lib/ambari-agent/data/tmp

      I tracked the issue down to this function in resource_management/core/providers/accounts.py:

      @property
      def user_groups(self):
        return [g.gr_name for g in grp.getgrall() if self.resource.username in g.gr_mem]

      This property is referenced at least twice for each user, and every call to grp.getgrall() forces a complete enumeration of groups.

      On a cluster with many nodes, each restarting many processes, many of these full enumeration searches run at the same time. In an enterprise with a large directory this gets very expensive, especially since this type of call is not cached by nscd.

      I'm aware that the idiom used here to get a user's groups is common in Python, but it is quite inefficient. Commands like id and groups have more efficient ways of discovering this, and I'm not aware of an equivalent in Python. One option is to shell out to groups instead:

      @property
      def user_groups(self):
        ret = []
        (rc, output) = shell.checked_call(['groups', self.resource.username], sudo=True)
        if rc == 0:
          ret.extend(output.split(':')[1].lstrip().split())
        return ret

      This converts the full LDAP scan for groups into more efficient queries
      targeted to the user. The lookups done by the groups command are also 100%
      cacheable. Since it's a checked call, the `rc == 0` check is probably not
      needed.
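
      For reference, newer Python versions do offer a user-targeted lookup. The
      sketch below assumes Python 3.3 or later on a Unix host, so it would not
      run under the python2.6 interpreter shown above. The helper name
      groups_for_user is hypothetical. Whether the underlying query is actually
      cheaper depends on how the NSS backend (nslcd here) implements it.

```python
import grp
import os
import pwd


def groups_for_user(username):
    """Return the names of the groups `username` belongs to (hypothetical helper)."""
    # Look up the user's primary GID; raises KeyError for an unknown user.
    primary_gid = pwd.getpwnam(username).pw_gid
    names = []
    # os.getgrouplist (Python 3.3+, Unix only) returns the GIDs for one user,
    # so no call to grp.getgrall() is needed.
    for gid in os.getgrouplist(username, primary_gid):
        try:
            names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            # GID without a matching group entry; skip it.
            continue
    return names
```

      Unlike the groups-based version, this returns the primary group as well,
      so callers that only want supplementary groups would need to filter it.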

      An unfortunate effect of how usermod and friends work is that they always
      invalidate the nscd cache after they run. This means that Ambari could
      still be a lot more efficient when LDAP is in play by being pickier about
      when it runs commands like useradd/usermod/groupadd/groupmod.
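
      One way to be pickier is to compare the current and desired state first
      and skip the command when nothing would change. This is only a sketch:
      usermod_command is a hypothetical helper, not part of Ambari, and it
      covers only the supplementary group list set by usermod -G.

```python
def usermod_command(username, current_groups, desired_groups):
    """Return a usermod argv list, or None when no change is needed."""
    current = set(current_groups)
    desired = set(desired_groups)
    if current == desired:
        # Nothing to change, so usermod is not run and the nscd cache
        # is left intact.
        return None
    # usermod -G replaces the user's supplementary group list.
    return ["usermod", "-G", ",".join(sorted(desired)), username]
```

      The caller would run the returned command only when it is not None.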

      We can also probably put a timed cache on the results from `grp.getgrall()` or
      `groups` in memory, configurable by the agent config file. This way, we would
      only call it once every hour or so.
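
      A minimal sketch of such a timed cache follows. The class name
      TimedCache, the ttl_seconds parameter and the injectable clock are all
      assumptions for illustration. Reading the TTL from the agent config file
      is left out.

```python
import time


class TimedCache(object):
    """Cache a single expensive result for a fixed number of seconds."""

    def __init__(self, loader, ttl_seconds=3600, clock=time.time):
        self._loader = loader      # callable producing the value, e.g. grp.getgrall
        self._ttl = ttl_seconds    # how long a loaded value stays valid
        self._clock = clock        # injectable so the expiry logic can be tested
        self._value = None
        self._expires_at = None

    def get(self):
        now = self._clock()
        # Reload on first use or once the cached value has expired.
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._loader()
            self._expires_at = now + self._ttl
        return self._value

    def invalidate(self):
        # Force a reload on the next get(), e.g. after useradd or usermod.
        self._expires_at = None
```

      For example, TimedCache(grp.getgrall).get() would enumerate groups at
      most once per hour per agent process. The cache would need to be
      invalidated whenever the agent itself changes users or groups.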

      Attachments

        1. AMBARI-14708.patch
          12 kB
          Andrew Onischuk

            People

              aonishuk Andrew Onischuk
              aonishuk Andrew Onischuk