Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.22.0
    • Fix Version/s: 0.22.0
    • Component/s: None
    • Labels: None
    • Environment:
      RHEL 6.0 "Santiago"

    • Hadoop Flags: Reviewed
    • Release Note:
      Adds a new configuration hadoop.work.around.non.threadsafe.getpwuid which can be used to enable a mutex around this call to work around thread-unsafe implementations of getpwuid_r. Users should consult http://wiki.apache.org/hadoop/KnownBrokenPwuidImplementations for a list of such systems.

      Description

      Due to the following bug in SSSD, functions like getpwuid_r are not thread-safe in RHEL 6.0 if sssd is specified in /etc/nsswitch.conf (as it is by default):

      https://fedorahosted.org/sssd/ticket/640

      This causes many fetch failures when the native libraries are available, since the SecureIO functions call getpwuid_r as part of fstat. With -Xcheck:jni enabled, I get the following trace on JVM crash:

          *** glibc detected *** /mnt/toolchain/JDK6u20-64bit/bin/java: free(): invalid pointer: 0x0000003575741d23 ***
          ======= Backtrace: =========
          /lib64/libc.so.6[0x3575675676]
          /lib64/libnss_sss.so.2(_nss_sss_getpwuid_r+0x11b)[0x7fe716cb42cb]
          /lib64/libc.so.6(getpwuid_r+0xdd)[0x35756a5dfd]
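
      For illustration only (not from the original report), a minimal reproducer sketch: POSIX requires getpwuid_r to be thread-safe, so the program below should never abort, yet on a system with the buggy nss module it can die with an invalid free like the one above.

      ```c
      /* Hypothetical reproducer sketch: hammer getpwuid_r from many threads.
       * Safe per POSIX; crashes only where the nss implementation is broken. */
      #include <pthread.h>
      #include <pwd.h>
      #include <stdio.h>
      #include <sys/types.h>
      #include <unistd.h>

      static void *hammer(void *unused) {
          struct passwd pwd, *result;
          char buf[4096];
          for (int i = 0; i < 100000; i++) {
              /* Reentrant lookup of the current user's passwd entry. */
              if (getpwuid_r(getuid(), &pwd, buf, sizeof(buf), &result) != 0)
                  perror("getpwuid_r");
          }
          return NULL;
      }

      int main(void) {
          pthread_t threads[8];
          for (int i = 0; i < 8; i++)
              pthread_create(&threads[i], NULL, hammer, NULL);
          for (int i = 0; i < 8; i++)
              pthread_join(threads[i], NULL);
          printf("no crash\n");
          return 0;
      }
      ```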
      Attachments

      1. hadoop-7156.txt (5 kB, Todd Lipcon)
      2. hadoop-7156.txt (7 kB, Todd Lipcon)
      3. hadoop-7156.txt (8 kB, Todd Lipcon)
      4. hadoop-7156.txt (8 kB, Todd Lipcon)

        Activity

        Todd Lipcon created issue -
        Todd Lipcon added a comment -

        Filed a RHEL bug at: https://bugzilla.redhat.com/show_bug.cgi?id=680367

        In the meantime we might consider adding a mutex around this call which can be #ifdeffed in on this platform as a workaround.

        Todd Lipcon added a comment -

        Here's a heinous workaround for this issue:

        When initializing the native library, we check for the problematic nss module; if it's on the system, we're likely to run into this issue, so the patch adds a lock around the getpwuid_r calls.

        Do people think this hack is the right approach? My other thought was to add a new boolean config flag like hadoop.nativeio.workaround.hadoop7156 or some nonsense like that, and if that flag is set, add the monitor lock.
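
        As a concrete illustration of the idea above, a minimal sketch with hypothetical names (the real change would live in NativeIO.c, and the path check is just one possible detection heuristic):

        ```c
        /* Sketch: serialize getpwuid_r through a process-wide mutex when the
         * problematic nss module is detected at init time. Hypothetical names,
         * not the actual Hadoop code. */
        #include <pthread.h>
        #include <pwd.h>
        #include <sys/types.h>
        #include <unistd.h>

        static pthread_mutex_t pwuid_lock = PTHREAD_MUTEX_INITIALIZER;
        int lock_pwuid_calls = 0;  /* process-wide flag */

        /* Called once when the native library is initialized. */
        void nativeio_init(void) {
            /* Crude detection: if the sssd nss module is installed, assume we
             * may hit the bug and enable the lock. */
            if (access("/lib64/libnss_sss.so.2", F_OK) == 0)
                lock_pwuid_calls = 1;
        }

        /* Drop-in replacement for direct getpwuid_r calls. */
        int guarded_getpwuid_r(uid_t uid, struct passwd *pwd, char *buf,
                               size_t buflen, struct passwd **result) {
            if (lock_pwuid_calls)
                pthread_mutex_lock(&pwuid_lock);
            int rc = getpwuid_r(uid, pwd, buf, buflen, result);
            if (lock_pwuid_calls)
                pthread_mutex_unlock(&pwuid_lock);
            return rc;
        }
        ```

        The config-flag alternative mentioned above would simply set lock_pwuid_calls from a Hadoop configuration value instead of auto-detecting.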

        Todd Lipcon made changes -
        Attachment hadoop-7156.txt [ 12472471 ]
        Todd Lipcon added a comment -

        Another possibility is to look in /etc/redhat-release and match the contents against known-buggy releases.
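
        A sketch of what that check might look like (hypothetical; the release strings would have to track whatever the known-buggy list ends up being):

        ```c
        /* Hypothetical sketch: enable the workaround when /etc/redhat-release
         * matches a known-buggy distribution release string. */
        #include <regex.h>
        #include <stdio.h>

        static int is_known_buggy_release(void) {
            char line[256] = "";
            FILE *f = fopen("/etc/redhat-release", "r");
            if (f == NULL)
                return 0;  /* not a Red Hat style system */
            if (fgets(line, sizeof(line), f) == NULL)
                line[0] = '\0';
            fclose(f);

            regex_t re;
            /* e.g. "Red Hat Enterprise Linux ... release 6.0" or "CentOS release 6.0" */
            if (regcomp(&re, "(Red Hat Enterprise Linux|CentOS).*release 6\\.0",
                        REG_EXTENDED | REG_NOSUB) != 0)
                return 0;
            int buggy = (regexec(&re, line, 0, NULL, 0) == 0);
            regfree(&re);
            return buggy;
        }
        ```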

        Allen Wittenauer added a comment -

        Again, we're patching around a particular bug in a particular OS, just as we did in HADOOP-7115. This trend is very disturbing.

        If we really want to go down this road, then yes, this should be compile time. But for now, I'd rather just advertise "RHEL 6.0 is broken; don't use it" just like we do for JREs.

        Todd Lipcon added a comment -

        Apparently no one out there who implements pam modules knows that getpwuid_r is supposed to be reentrant. A user saw this same issue with VAS:

        /lib64/libc.so.6[0x3757c7230f]
        /lib64/libc.so.6(cfree+0x4b)[0x3757c7276b]
        /opt/quest/lib64/libvas.so.4(vassym_sqlite3FreeX+0xe)[0x2aaaab2bd15e]
        /opt/quest/lib64/libvas.so.4[0x2aaaab2ac9b9]
        /opt/quest/lib64/libvas.so.4(vassym_sqlite3OsClose+0x93)[0x2aaaab2ad6d3]
        /opt/quest/lib64/libvas.so.4(vassym_sqlite3pager_close+0x4d)[0x2aaaab2af1cd]
        /opt/quest/lib64/libvas.so.4(vassym_sqlite3BtreeClose+0x15)[0x2aaaab29a235]
        /opt/quest/lib64/libvas.so.4(vassym_sqlite3_close+0x15b)[0x2aaaab2ab62b]
        /opt/quest/lib64/libvas.so.4(vassql_deinit+0xdf)[0x2aaaab3158de]
        /opt/quest/lib64/libvas.so.4[0x2aaaab2f3ef9]
        /opt/quest/lib64/libvas.so.4(vascache_deinit+0x30)[0x2aaaab2f438f]
        /lib64/libnss_vas3.so.2[0x2aaaab027240]
        /lib64/libnss_vas3.so.2(_nss_vas3_getXXent_release_tsd+0x10e)[0x2aaaab0284ff]
        /lib64/libnss_vas3.so.2[0x2aaaab02c3c0]
        /lib64/libnss_vas3.so.2(_nss_vas3_getpwuid_r+0x36)[0x2aaaab02d242]
        /lib64/libc.so.6(getpwuid_r+0xa5)[0x3757c99515]

        Todd Lipcon added a comment -

        > But for now, I'd rather just advertise "RHEL 6.0 is broken; don't use it" just like we do for JREs.

        Unfortunately many of us are not in a position to do this - RHEL 6.0 is a must for us, regardless of some bugs it might have. Same with support for Vintela Authentication Services (VAS) which has a similar bug. Asking users to switch their entire OS or auth system is not an option.

        I think there are three workable options here from my perspective:

        1) Always lock around getpwuid_r. Devaraj added a cache for this function as part of another JIRA (HADOOP-7115), so it shouldn't be a big performance issue.
        2) Add a compile-time macro like LOCK_AROUND_PWUID, which, when set, will add the monitor lock around these calls.
        3) Add a runtime Hadoop configuration option like hadoop.workaround.broken.getpwuid, which when enabled adds the lock.

        Which, if any, of these seem acceptable to you?

        Since we've found that this isn't RHEL 6 specific, but in fact occurs with other broken pieces of software as well, I'm leaning towards option #1 or #3.
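
        For option #3, the boolean would be read from the Java configuration and pushed down into the native library through JNI, along these lines (hypothetical method name; see the guarded-call sketch earlier):

        ```c
        #include <jni.h>

        extern int lock_pwuid_calls;  /* flag guarding getpwuid_r, per earlier sketch */

        /* Hypothetical JNI hook: Java reads the boolean config key and pushes
         * it into the native library before any secure-IO calls are made. */
        JNIEXPORT void JNICALL
        Java_org_apache_hadoop_io_nativeio_NativeIO_setWorkaroundEnabled(
            JNIEnv *env, jclass clazz, jboolean enabled) {
            lock_pwuid_calls = (enabled == JNI_TRUE);
        }
        ```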

        Allen Wittenauer added a comment -

        If someone wants to use non-conforming software, that is their problem, not ours. I'd much rather shame a company into fixing it than pretending that everything is a-ok. Yes, I realize that is idealistic.

        In the case of RHEL 6, it is what? 2 months old? Like any .0, it is going to have growing pains during its first few months that should get fixed by the vendor. As for VAS, I'm not really sure I care.

        In the case of #1, what is the point of mutexes around a thread-safe function? Why not just call the unsafe one and avoid the extra performance hit on machines that follow POSIX?

        In the case of #3, I'll be OK with this if we also add a huge warning in the task log telling the user that their system isn't POSIX compliant and to call their vendor to get it fixed.

        Todd Lipcon added a comment -

        I'll assume you're joking about the "huge warning" and implement option #3. Thanks Allen.

        Todd Lipcon added a comment -

        Here's a patch which implements the configuration option.

        I did not add it to core-default since it's a corner case... does that seem reasonable or should it be documented?
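
        For readers landing here later: enabling the workaround ends up looking like the excerpt below in core-site.xml. The property name is the one from this issue's final release note; the snippet itself is only illustrative.

        ```xml
        <!-- Illustrative core-site.xml entry; property name per the release
             note on this issue. Enables a mutex around getpwuid_r calls. -->
        <property>
          <name>hadoop.work.around.non.threadsafe.getpwuid</name>
          <value>true</value>
        </property>
        ```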

        Todd Lipcon made changes -
        Attachment hadoop-7156.txt [ 12472902 ]
        Todd Lipcon made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Greg Roelofs added a comment -

        > I did not add it to core-default since it's a corner case... does that seem reasonable or should it be documented?

        Please please please...always document things like this. Picture it from the perspective of some new user who hasn't discovered the lists yet (or hasn't been subscribed for very long, and then only on the -user list(s)): they know they're running RHEL-6 and probably will notice a config item that explicitly mentions it (which I'd recommend, along with CentOS and any other known distro versions), but they're much less likely to know what's up with a random glibc crash several months later.

        Todd Lipcon added a comment -

        Updated version includes docs in core-default.xml.

        Do you agree with the current test case? It will currently fail with a JVM exit on RHEL 6. Should we set the workaround variable to true in the test settings?

        Another option is to add a bit of code to apply some regexes to /etc/redhat-release and enable the workaround by default if running RHEL or CentOS 6.

        Todd Lipcon made changes -
        Attachment hadoop-7156.txt [ 12473030 ]
        Allen Wittenauer added a comment -

        > have broken implementations of

        Let's call it for what it is: POSIX non-compliance.

        > Systems known to exhibit this issue include Red Hat Enterprise Linux 6.0,
        > CentOS 6.0, and Vintela Authentication Services.

        Why is this text in here? What happens when RHEL 6 gets this fixed? Do we make a new release to pull the text out? To me, it makes much more sense to include a link to a wiki page where we can keep the information up-to-date.

        Todd Lipcon added a comment -

        > What happens when RHEL 6 gets this fixed?

        That will be RHEL 6.1, not RHEL 6.0. So this should remain accurate.
        Happy to change "Vintela Authentication Services" to "some version of Vintela Authentication Services".

        > it makes much more sense to include a link to a wiki page where we can keep the information up-to-date

        In my experience, we do a really bad job of keeping the wiki up to date. Greg, what do you think?

        Eli Collins added a comment -

        The approach and code look good to me. I would leave it as is; no need to add autodetection, but we should document it in core-default so users are aware of the issue.

        I think leaving the workaround disabled for the tests is the right thing; this way the test fails on a broken system, which will help users identify that they need to enable the workaround.

        Also, the description seems appropriate to me. This isn't about POSIX compliance; it's just a bug in a library function that will probably be fixed.

        Greg Roelofs added a comment -

        Doh! FF crashed while I was replying, sigh. Switching to e-mail:

        > In my experience, we do a really bad job of keeping the wiki up to date. Greg, what do you think?

        I agree--we're much better at keeping the code up to date (frequently in parallel across multiple branches) than at keeping the wiki current.

        I think the XML config text is fine; you could optionally prefix it with "As of March 2011, systems known to ..." as a hint to users or future versions of us to recheck it if significant time has passed. The comment in NativeIO.c probably should be modified; perhaps "monitor used for working around a bug in the sssd security daemon, which was observed in getpwuid_r() on RHEL 6.0," or words to that effect. (Need not be that verbose, of course.)

        I also agree with Eli that we can leave the workaround disabled for tests. It might be worthwhile to add a log message at the start that "this test may fail (crash) with an invalid free() on some systems; see HADOOP-7156 for details." Again, feel free to word it however you wish.

        Trivial grammo: "workaround" is a noun; the verb form is "work around" (similar to layout, backup, setup, cleanup, checkin, cutoff, etc.). The various variable names would be more proper if they reflected this (e.g., WORK_AROUND_NON_THREADSAFE_CALLS_KEY, workAroundNonThreadSafePasswdCalls [or workAroundNonThreadsafePasswdCalls, since you're using "threadsafe" as a single word elsewhere]), but I won't fuss if you leave them as is.

        Allen Wittenauer added a comment -

        > That will be RHEL 6.1, not RHEL 6.0. So this should remain accurate.

        If I rpm install just a new pam from the Red Hat repo, what version am I running?

        > I agree--we're much better at keeping the code up to date (frequently in
        > parallel across multiple branches ) than at keeping the wiki current.

        Really? We haven't had a release marked stable in over a year. (Releases outside Apache do not count.)

        Greg Roelofs added a comment -

        > Really? We haven't had a release marked stable in over a year. (Releases outside Apache do not count.)

        How many of the "currently relevant" wiki pages have been updated in that year-plus? (I haven't checked, but the one or two "getting started" ones I looked at way back when were incomplete at best, and possibly wrong in places, so I'm guessing the answer is "not many." That's certainly been my repeated experience with internal, corporate wikis.)

        In my experience, the farther the docs are from the code, the less likely they are to be current and correct. That doesn't mean there can't be additional forms of non-repo docs, but docs within the source-code repo should be the priority. Many of us live in the repo, employmentally speaking. (New word. Cool.)

        Allen Wittenauer added a comment -

        > How many of the "currently relevant" wiki pages have been updated in that year-plus?

        Considering how little has changed since Apache 0.20.0 was released almost two years ago, not many, because there was no need.

        This JIRA is a great example of something that will likely be long fixed before an actual Apache release with the fix sees the light of day. Which is why I'm mostly opposed to it going in, but not enough to -1 it. I think we'll look like fools when a year down the road we have a reference to a bug that was long fixed, but whatever.

        > Many of us live in the repo, employmentally speaking.

        In other words, "Who gives a damn about the people who are not running some form of trunk and are not watching JIRAs religiously."

        Greg Roelofs added a comment -

        That wasn't my personal experience (with the setup-related ones), but admittedly it's a tiny sample.

        Dude, I'm still running a 2004-era distro on one of my desktop boxes! Sure, that's unusual, but ancient servers in production definitely aren't. The more you have, the harder it is to validate and move to a new version. You've got, what, a few hundred servers to worry about? We've got a few hundred thousand. (Not all Hadoop-related, I mean, just in general.) I have no idea what the overall mix is, but I was aware of both RHEL 5.x and 4.x subsets up until fairly recently (the latter also being 2004-era; the former more like 2007, I think), and I wouldn't be surprised if there were still *BSD boxes lurking somewhere. I don't think there are any RHEL 6.x yet.

        So no, OS-related docs don't get stale that quickly, at least not for some of us.

        > In other words, "Who gives a damn about the people who are not running some form of trunk and are not watching JIRAs religiously."

        Not at all. It's simply a statement that those working most closely with the code and XML configs--i.e., those who do watch JIRAs and the mailing lists religiously--are more likely to keep things updated if they're nearby/noticeable. We all have limited attention-hours to devote to maintenance, so anything that makes it easier, faster, or more likely to happen is a Good Thing.

        And I know you're aware that the default state for documentation is "nonexistent." Especially where developers are concerned.

        (Btw, wasn't there a recent suggestion to automatically pull config keys and descriptions out of the foo-default.xml files and publish them? With something like that in place, the docs you want would fall out of this and similar patches immediately. One of my mottos for the last few years has been, "if it can be automated, it should be automated.")

        I guess this has drifted a bit; perhaps further discussion should go to one of the lists?

        Allen Wittenauer added a comment -

        Actually, rather than let you dig Yahoo! in deeper, I'm just going to -1 the patch due to the config description, on the grounds that it is both too generic and too specific:

        • Are we going to update this list if Debian/Scientific Linux/Mandriva/... are found to be non-POSIX compliant?
        • If yes, are we going to push a new Apache release when this list is updated?
        • What does RHEL 6.0 actually mean? If I install a new pam rpm that fixes the issue, am I still running RHEL 6.0?

        A living document for when to use this as an option is going to be much better over the long haul than a static file.

        Todd Lipcon added a comment -

        I will answer all of your questions and hope you will withdraw your -1:

        > Are we going to update this list if Debian/Scientific Linux/Mandriva/... are found to be non-POSIX compliant?

        I will happily volunteer to +1 and commit a patch that anyone submits to update the list. Please feel free to do so. The list doesn't claim to be complete, only to provide some examples of systems where this is the case.

        > If yes, are we going to push a new Apache release when this list is updated?

        No, like any other documentation improvement it should wait for the next release. Our definitive documentation lives in the source tree. The fact that some of it is in the wiki doesn't change that.

        > What does RHEL 6.0 actually mean? If I install a new pam rpm that fixes the issue, am I still running RHEL 6.0?

        I would assume RHEL 6.0 means that you are not installing packages that are not part of the RHEL 6.0 release. If you went and installed a random RPM that you built from source or got from some other vendor, you're no longer running stock RHEL 6.0. Again, the docs are not meant to be complete. Assumedly, if you've updated your pam manually to work around this issue, you'd know that, and you wouldn't turn on the config option!

        And, seriously, with the HADOOP-7115 cache in place, this becomes a really rare race condition. Is it really worth making such a big deal? I'd like to move on to fixing other bugs, please.

        Allen Wittenauer added a comment -

        > No, like any other documentation improvement it should wait for the
        > next release. Our definitive documentation lives in the source tree.
        > The fact that some of it is in the wiki doesn't change that.

        I agree to a certain extent. But listing a moving target as suffering from a problem that exists today but may not exist tomorrow, in a release of our software that is still months out, is a mistake. CentOS 6.0 doesn't even exist yet in any official form, yet we're declaring it bugged.

        > Assumedly, if you've updated your pam manually to work around this issue, you'd know that, and you wouldn't turn on the config option!

        What if I am applying the 6-updates repo after a kickstart? Are packages from https://rhn.redhat.com/errata/rhel-server-6-errata.html not considered part of RHEL 6? I suspect Red Hat would disagree.

        I realize that the majority of committers don't actually use an Apache release, so this doesn't seem like a big deal since you folks release seemingly monthly, if not more frequently. I also realize that, for the various commercial interests involved, it behooves them to get any and all patches committed to show that they are "contributing". The reality is that by the time this patch shows up in an actual Apache release, it is bound to be outdated... and that's where I take offense.

        Todd Lipcon added a comment -

        Made a wiki page and linked to it.

        If you don't like this patch or the wiki page, please edit it and contribute a version you find acceptable. Otherwise I will consider this version acceptable and commit tomorrow.

        Todd Lipcon made changes -
        Attachment hadoop-7156.txt [ 12473179 ]
        Eli Collins added a comment -

        Hey Allen,

        If you find the config description "both too generic and too specific," perhaps you could suggest a better one before vetoing the patch?

        Thanks,
        Eli

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12473179/hadoop-7156.txt
        against trunk revision 1080396.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/306//testReport/
        Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/306//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/306//console

        This message is automatically generated.

        Eli Collins added a comment -

        +1 for the latest patch

        Allen Wittenauer added a comment -

        Removing my -1.

        Todd Lipcon made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Hadoop Flags [Reviewed]
        Release Note Adds a new configuration hadoop.work.around.non.threadsafe.getpwuid which can be used to enable a mutex around this call to work around thread-unsafe implementations of getpwuid_r. Users should consult http://wiki.apache.org/hadoop/KnownBrokenPwuidImplementations for a list of such systems.
        Resolution Fixed [ 1 ]
        Todd Lipcon added a comment -

        Thanks. Committed to trunk and 22

        Hudson added a comment -

        Integrated in Hadoop-Common-trunk-Commit #524 (See https://hudson.apache.org/hudson/job/Hadoop-Common-trunk-Commit/524/)
        HADOOP-7156. Workaround for unsafe implementations of getpwuid_r. Contributed by Todd Lipcon.

        Hudson added a comment -

        Integrated in Hadoop-Common-22-branch #32 (See https://hudson.apache.org/hudson/job/Hadoop-Common-22-branch/32/)
        HADOOP-7156. Workaround for unsafe implementations of getpwuid_r. Contributed by Todd Lipcon.

        Hudson added a comment -

        Integrated in Hadoop-Common-trunk #628 (See https://hudson.apache.org/hudson/job/Hadoop-Common-trunk/628/)
        HADOOP-7156. Workaround for unsafe implementations of getpwuid_r. Contributed by Todd Lipcon.

        Allen Wittenauer added a comment -

        Red Hat has released a fix for this as part of RHEL 6.1: http://rhn.redhat.com/errata/RHSA-2011-0560.html .

        Scott Carey added a comment -

        > Red Hat has released a fix for this as part of RHEL 6.1

        And months later, CentOS 6.1 is not out – I would venture that many Hadoop users are on CentOS (or Scientific Linux) and not RHEL.

        Thanks for fixing this! For years to come, I suspect that some will still have RHEL 6.0 or CentOS 6.0 servers floating around that might run into this.

        FWIW, we're happily using CentOS 6.0 with Hadoop right now.

        Konstantin Shvachko made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Transition history:
        • Open → Patch Available: 10d 17h 30m in source status, 1 execution, last by Todd Lipcon on 08/Mar/11 02:53
        • Patch Available → Resolved: 3d 15h 51m in source status, 1 execution, last by Todd Lipcon on 11/Mar/11 18:45
        • Resolved → Closed: 275d 11h 33m in source status, 1 execution, last by Konstantin Shvachko on 12/Dec/11 06:19

          People

          • Assignee: Todd Lipcon
          • Reporter: Todd Lipcon
          • Votes: 0
          • Watchers: 6
