We have noticed that C calls such as getpwuid_r sometimes end up making direct calls to the LDAP server. This is probably configuration/environment specific, but at Yahoo! the password entries are maintained by the LDAP server. To keep the LDAP servers from being overloaded with password lookups, we run a daemon called nscd on all the compute nodes, which caches the results of such lookups. Calls such as getpwuid_r should terminate at the local nscd daemon, but if, for whatever reason, nscd is down on the node, the calls end up talking to the LDAP server directly. Apparently, nscd is not that stable...
We have seen this happen at Yahoo!, and on a couple of occasions it brought down the LDAP servers. So I was wondering whether we should reduce the number of calls to getpwuid_r and similar functions by caching the resolutions in Hadoop. Thoughts?
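
To make the idea concrete, here is a rough sketch of what such a cache could look like on the native side. It is only an illustration, not existing Hadoop code: the names uid_cache_t, uid_cache_init and cached_uid_to_name, the slot count, and the fixed 4096-byte lookup buffer are all assumptions made for the example. It keeps a small direct-mapped table of uid-to-name entries with a time-to-live, and only falls through to getpwuid_r on a miss or an expired entry.

```c
#include <assert.h>
#include <pwd.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <time.h>

#define UID_CACHE_SLOTS 64   /* hypothetical table size */
#define UID_NAME_MAX    256  /* hypothetical max user-name length */

typedef struct {
    int    valid;               /* 1 if this slot holds an entry */
    uid_t  uid;                 /* uid the entry was resolved for */
    time_t expires;             /* entry is stale at or after this time */
    char   name[UID_NAME_MAX];  /* cached user name */
} uid_cache_entry_t;

typedef struct {
    uid_cache_entry_t slots[UID_CACHE_SLOTS];
    time_t        ttl_seconds;  /* how long a resolution stays valid */
    unsigned long hits;         /* lookups answered from the cache */
    unsigned long misses;       /* lookups that called getpwuid_r */
} uid_cache_t;

static void uid_cache_init(uid_cache_t *cache, time_t ttl_seconds)
{
    assert(cache != NULL);
    memset(cache, 0, sizeof *cache);
    cache->ttl_seconds = ttl_seconds;
}

/*
 * Resolve uid to a user name, consulting the cache first.
 * Returns 0 on success and -1 if the uid could not be resolved.
 * Not thread-safe: a multi-threaded caller would need its own lock.
 */
static int cached_uid_to_name(uid_cache_t *cache, uid_t uid,
                              char *out, size_t out_len)
{
    assert(cache != NULL && out != NULL && out_len > 0);

    /* Direct-mapped: each uid maps to exactly one slot. */
    uid_cache_entry_t *e = &cache->slots[(unsigned long)uid % UID_CACHE_SLOTS];
    time_t now = time(NULL);

    if (e->valid && e->uid == uid && now < e->expires) {
        cache->hits++;
        snprintf(out, out_len, "%s", e->name);
        return 0;
    }

    /* Miss or stale entry: this is the call that may reach nscd or LDAP. */
    cache->misses++;
    struct passwd pwd;
    struct passwd *result = NULL;
    char buf[4096];  /* simplification: no retry with a larger buffer on ERANGE */
    if (getpwuid_r(uid, &pwd, buf, sizeof buf, &result) != 0 || result == NULL) {
        return -1;   /* failures are not cached in this sketch */
    }

    snprintf(e->name, sizeof e->name, "%s", result->pw_name);
    e->uid = uid;
    e->expires = now + cache->ttl_seconds;
    e->valid = 1;

    snprintf(out, out_len, "%s", e->name);
    return 0;
}
```

With something like this, repeated lookups for the same uid within the TTL never leave the process, so an nscd outage would cost at most one LDAP query per uid per TTL period per process. The trade-off is that a changed or deleted account can stay visible until its entry expires, so the TTL would probably need to be configurable.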