Hadoop Common / HADOOP-10442

Group look-up can cause segmentation fault when certain JNI-based mapping module is used.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.3.0, 2.4.0
    • Fix Version/s: 2.4.0
    • Component/s: None
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      When JniBasedUnixGroupsNetgroupMapping or JniBasedUnixGroupsMapping is used, segmentation faults occur very often. The same system ran 2.2 for months without any problem, but it started crashing as soon as it was upgraded to 2.3. This resulted in multiple name node crashes per day.

      The server was running nslcd (nss-pam-ldapd-0.7.5-15.el6_3.2). We did not see this problem on the servers running sssd.

      There was one change in the C code, and it modified the return code handling after the getgrouplist() call. With that change, if the function returns 0 or a negative value less than -1, the code calls realloc() instead of returning failure.
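The return-code handling described above can be sketched as a small decision function. This is a hedged illustration only, not the code in hadoop_user_info.c: the enum, the function name classify_grouplist_result, and the extra check that *ngroups actually grew are all assumptions made for this example. It treats a positive return as success, -1 with a larger *ngroups as a request to grow the buffer, and everything else (0, or a negative value below -1) as failure.

```c
#include <assert.h>

/* Hypothetical names, used only for this sketch. */
enum grouplist_action {
    GL_DONE,  /* the buffer holds the complete group list */
    GL_GROW,  /* the buffer was too small; reallocate and retry */
    GL_FAIL   /* unexpected return value; report an error */
};

/*
 * ret:      value returned by getgrouplist()
 * capacity: number of gid_t slots that were passed in via *ngroups
 * ngroups:  value left in *ngroups after the call
 */
static enum grouplist_action classify_grouplist_result(int ret, int capacity,
                                                       int ngroups)
{
    assert(capacity >= 0);
    if (ret > 0) {
        return GL_DONE;
    }
    /* Only the documented "buffer too small" case leads to realloc().
     * Requiring ngroups > capacity is an extra guard assumed here so that
     * a misbehaving lookup backend cannot cause an endless retry loop. */
    if (ret == -1 && ngroups > capacity) {
        return GL_GROW;
    }
    /* 0 or a negative value other than -1: fail instead of reallocating. */
    return GL_FAIL;
}
```

A caller would switch on the result after each getgrouplist() call and reach realloc() only for GL_GROW.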

        Activity

        Kihwal Lee added a comment -

        The return code handling was modified in HADOOP-10087. This is the only change in the JNI user-group mapping modules between 2.2 and 2.3.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12636983/HADOOP-10442.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/3721//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/3721//console

        This message is automatically generated.

        Chris Nauroth added a comment -

        +1 for the patch. Thank you, Kihwal.

        -1 tests included. The patch doesn't appear to include any new or modified tests.

        This is not something easily covered with an automated test. Have you been able to verify the patch manually in one of your clusters that had the segfaults?

        Kihwal Lee added a comment -

        A 2.3 NN has been running with this fix for some time. The NN crashed every 3-5 hours before this.

        Chris Nauroth added a comment -

        That sounds great, Kihwal. I think we can commit this and resolve the blocker.

        Jonathan Eagles added a comment -

        +1. Checking this into trunk, branch-2, branch-2.4

        Hudson added a comment -

        SUCCESS: Integrated in Hadoop-trunk-Commit #5418 (See https://builds.apache.org/job/Hadoop-trunk-Commit/5418/)
        HADOOP-10442. Group look-up can cause segmentation fault when certain JNI-based mapping module is used. (Kihwal Lee via jeagles) (jeagles: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1582451)

        • /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
        • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/security/hadoop_user_info.c
        Colin Patrick McCabe added a comment -

        Thanks for this patch, Kihwal.

        To be honest, I find the behavior of getgrouplist that you are seeing to be puzzling. The man page doesn't describe any negative return codes other than -1.

        RETURN VALUE
               If the number of groups of which user is a member is less than or equal to *ngroups, then the value *ngroups is returned.
        
               If the user is a member of more than *ngroups groups, then getgrouplist() returns -1.  In this case the value returned in *ngroups can  be  used  to
               resize the buffer passed to a further call getgrouplist().
        

        What negative return code did you see besides -1? I guess what you're seeing is undocumented, and possibly a bug in nslcd (or the man page?)

        Also, looking at this more closely, I believe we mishandle the case where the user is a member of no groups. This would be a pretty odd configuration (I wonder if it's possible?). Just to be sure, I think we should consider getgrouplist returning 0 to be ok.
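As a rough sketch of the handling suggested in this comment, the loop below follows the documented glibc semantics quoted above: a non-negative return (including 0) is accepted as success, -1 triggers a resize using the value left in *ngroups, and anything else is an error. The helper name lookup_groups, the initial capacity of 8, and the retry limit are assumptions for illustration and are not taken from the Hadoop source.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <grp.h>
#include <stdlib.h>
#include <sys/types.h>

/*
 * Hypothetical helper: fetch the group IDs for a user.
 * On success returns 0 and stores a malloc()ed array in *out (caller frees)
 * and its length in *count. On failure returns an errno-style code.
 */
static int lookup_groups(const char *user, gid_t primary,
                         gid_t **out, int *count)
{
    assert(user != NULL && out != NULL && count != NULL);

    int capacity = 8;  /* assumed starting size */
    gid_t *gids = malloc((size_t)capacity * sizeof(gid_t));
    if (gids == NULL) {
        return ENOMEM;
    }

    /* Bounded retries so a misbehaving backend cannot loop forever. */
    for (int attempt = 0; attempt < 4; attempt++) {
        int ngroups = capacity;
        int ret = getgrouplist(user, primary, gids, &ngroups);
        if (ret >= 0) {
            /* Success; a result of 0 groups is treated as valid. */
            *out = gids;
            *count = ret;
            return 0;
        }
        if (ret != -1 || ngroups <= capacity) {
            break;  /* undocumented return value, or no usable size hint */
        }
        gid_t *bigger = realloc(gids, (size_t)ngroups * sizeof(gid_t));
        if (bigger == NULL) {
            free(gids);
            return ENOMEM;
        }
        gids = bigger;
        capacity = ngroups;
    }
    free(gids);
    return EIO;
}
```

Whether 0 should count as success is the open question in this thread; the sketch takes the lenient view and leaves it to the caller to decide what an empty list means.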

        Hudson added a comment -

        FAILURE: Integrated in Hadoop-Yarn-trunk #523 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/523/)
        HADOOP-10442. Group look-up can cause segmentation fault when certain JNI-based mapping module is used. (Kihwal Lee via jeagles) (jeagles: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1582451)

        • /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
        • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/security/hadoop_user_info.c
        Hudson added a comment -

        FAILURE: Integrated in Hadoop-Hdfs-trunk #1715 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1715/)
        HADOOP-10442. Group look-up can cause segmentation fault when certain JNI-based mapping module is used. (Kihwal Lee via jeagles) (jeagles: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1582451)

        • /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
        • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/security/hadoop_user_info.c
        Hudson added a comment -

        FAILURE: Integrated in Hadoop-Mapreduce-trunk #1740 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1740/)
        HADOOP-10442. Group look-up can cause segmentation fault when certain JNI-based mapping module is used. (Kihwal Lee via jeagles) (jeagles: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1582451)

        • /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
        • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/security/hadoop_user_info.c
        Kihwal Lee added a comment -

        Colin Patrick McCabe: I also think the version of nslcd we used is buggy. The return code handling before your change was just masking it, but it likely had other side effects. I observed many lookup timeouts in NN prior to crashes, while my own program calling the same libc functions running on the same box at the same time had no issue. The nslcd lookup timeout was configured to be 20 seconds in /etc/nslcd.conf.

        12:15:21,106 WARN security.Groups: Potential performance problem: getGroups(user=xxxx) took 20020 milliseconds.
        12:15:21,107 WARN security.UserGroupInformation: No groups available for user xxxx

        Also, looking at this more closely, I believe we mishandle the case where the user is a member of no groups. This would be a pretty odd configuration (I wonder if it's possible?).

        Getting no groups after a successful getpwnam() can probably only happen when the user was removed in between the two calls. All other cases might be considered as errors. I saw cases of an admin user getting permission refused for certain operations. It was fixed after the refresh command was issued. It must have hit the no-group error when building the ACL, and the result was negatively cached. If it didn't do negative caching, user-level retries would have worked.

        So, the solution might be to let the native code return 0 even on error conditions, as you suggested, but to make the netgroup modules skip negative caching, that is, not cache the result when a valid user name has no netgroups.
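The idea of avoiding negative caching can be illustrated with a toy cache that simply refuses to store empty results, so a later retry performs a fresh lookup. This is only a sketch in C for consistency with the native module; Hadoop's actual group cache lives in Java, and every name here (group_cache, group_cache_put, group_cache_get, the slot and name sizes) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define CACHE_SLOTS 16
#define USER_NAME_LEN 32

struct group_cache_entry {
    bool used;
    char user[USER_NAME_LEN];
    int ngroups;  /* number of groups found for this user */
};

struct group_cache {
    struct group_cache_entry slots[CACHE_SLOTS];
};

/* Store a lookup result. Returns true if it was cached.
 * Empty results are never cached, so they cannot stick. */
static bool group_cache_put(struct group_cache *c, const char *user,
                            int ngroups)
{
    assert(c != NULL && user != NULL);
    if (ngroups <= 0 || strlen(user) >= USER_NAME_LEN) {
        return false;
    }
    struct group_cache_entry *free_slot = NULL;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        struct group_cache_entry *e = &c->slots[i];
        if (e->used && strcmp(e->user, user) == 0) {
            e->ngroups = ngroups;  /* refresh an existing entry */
            return true;
        }
        if (!e->used && free_slot == NULL) {
            free_slot = e;
        }
    }
    if (free_slot == NULL) {
        return false;  /* cache full; this sketch does not evict */
    }
    free_slot->used = true;
    strcpy(free_slot->user, user);  /* length checked above */
    free_slot->ngroups = ngroups;
    return true;
}

/* Returns the cached group count, or -1 on a cache miss. */
static int group_cache_get(const struct group_cache *c, const char *user)
{
    assert(c != NULL && user != NULL);
    for (int i = 0; i < CACHE_SLOTS; i++) {
        const struct group_cache_entry *e = &c->slots[i];
        if (e->used && strcmp(e->user, user) == 0) {
            return e->ngroups;
        }
    }
    return -1;
}
```

With this policy, a transient "no groups" answer results in a cache miss on the next request rather than a stale denial that persists until a manual refresh.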


          People

          • Assignee: Kihwal Lee
          • Reporter: Kihwal Lee
          • Votes: 0
          • Watchers: 7
