Hadoop Common / HADOOP-4998

Implement a native OS runtime for Hadoop

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: native
    • Labels: None

      Description

      It would be useful to implement a JNI-based runtime for Hadoop to get access to the native OS runtime. This would allow us to stop relying on exec'ing bash to get access to information such as user-groups, process limits etc. and for features such as chown/chgrp (org.apache.hadoop.util.Shell).


          Activity

          Allen Wittenauer added a comment -

          Closing as fixed as libhadoop.so has all sorts of random ... uhh.. stuff in it now.

          Todd Lipcon added a comment -

          JNA is GPL and thus we may not depend on it within Hadoop (Apache 2 license)

          Rajiv Chittajallu added a comment -

          can we use this http://kenai.com/projects/jna-posix ?

          Eli Collins added a comment -

          Thanks for taking a look Todd, will incorporate your feedback. The code isn't polished (e.g. I just copied executeShellCommand and toString from UnixUserGroupInformation, the naming, e.g. getUsername vs getUserName, needs to be improved, and I haven't tested the C code on multiple platforms); I just want to get feedback on the overall approach first. The patch is just to show what a Platform class would look like and how it could be implemented. I think the patches should be staged in phases: e.g. first move the existing code that uses Shell over to the new API (without changing any functionality), then add new native implementations individually.

          Less specific: thoughts on making this into an interface, with an implementation for JniPlatformCall and ShellPlatformCall? This may be handy if someone wants to come along and implement WindowsPlatformCall or SolarisShellPlatformCall, etc.

          JNI supports multiple platforms so we should be able to have Posix and Windows implementations of libnativecall (PlatformCall.c). Perhaps though we make Platform an interface with JNI and Shell implementations.

          Tom White added a comment -

          Having getUserName() for the Java method and getUsername() for the native method is confusing, so I would introduce another class with the native methods (JniPlatformCall) and delegate to it from PlatformCall.

          Todd Lipcon added a comment -

          Specific comments:

          • In executeShellCommand, you refer to "groups" even though this is no longer groups-specific. I also think it's strange that this generic command automatically splits on whitespace for the caller - why not just return String, and then have them use String.split if they want?
          • Instead of defining toString in this class, you can use o.a.h.util.StringUtil.join, right?
          • In PlatformCall.c, do you need to actually check the HAVE_FOO_H definitions? Seems to me that the compiler will itself give you the error, no need to #error
          • sysconf(3) doesn't list GETPW_R_SIZE_MAX as POSIX compliant and some googling leads me to believe it might not be supported on FreeBSD, which I think is useful to Y!. If we provide a too-small buffer, we'll get ERANGE and can just try again with a larger one until we stop getting that error.
          • If getpwuid_r returns non-zero, should we raise an exception rather than returning null?

          Less specific: thoughts on making this into an interface, with an implementation for JniPlatformCall and ShellPlatformCall? This may be handy if someone wants to come along and implement WindowsPlatformCall or SolarisShellPlatformCall, etc.

          Eli Collins added a comment -

          What do people think of introducing a "platform" class that

          • Provides a uniform API to the rest of Hadoop for accessing host platform information (eg du, df, username, etc)
          • The implementation makes a JNI call to a host platform interface if the native library is available, and falls back to using the shell otherwise.

          I've attached a patch that adds a PlatformCall class that provides getUsername and modified UnixUserGroupInformation to use it. PlatformCall's getUsername method, if the native library is loaded, makes a JNI call to a function (getUsername in PlatformCall.c) that uses getpwuid_r and geteuid to get the effective username, and shells out to whoami otherwise.

          If people like this approach I'll flesh out the patch to cover the other uses of Shell. The first step would be to refactor the code so that the uses of Shell that we want to replace are isolated to PlatformCall and then add additional native implementations.

          Philip Zeyliger added a comment -

          Figured I'd mention that Tomcat has some support for calling into APR:

          Javadoc: http://tomcat.apache.org/tomcat-6.0-doc/api/org/apache/tomcat/jni/package-tree.html
          Google codesearch link: http://www.google.com/codesearch/p?hl=en#cM_OVOKybvs/tomcat/tomcat-6/v6.0.10/src/apache-tomcat-6.0.10-src.zip|KNqCNnRERSg/apache-tomcat-6.0.10-src/java/org/apache/tomcat/jni/User.java&q=tomcat%20apr&d=10

          Koji Noguchi added a comment -

          > Running a shell should use vfork and exec, and shouldn't double the memory use, should it?

          Created separate Jira. HADOOP-5059.

          Steve Loughran added a comment -

          I agree; the memory use for an exec() is odd and needs to be looked at.

          Incidentally, one thing we are fond of doing is exec-ing a long-lived shell rather than running a shell script. This lets us use SSH to connect to nearby hosts, or to boost from being untrusted to root, but it may also have memory consumption benefits. You just need to keep that single shell connection open and issue commands down its IO streams.

          Doug Cutting added a comment -

          > when a primary or secondary namenode wants to fork to run bash, it requires double memory than what it's currently using.

          Running a shell should use vfork and exec, and shouldn't double the memory use, should it? Let's figure out whether there's some other easily fixable bug here first.

          Arun C Murthy added a comment -

          Marco, we could use HADOOP-4656 in conjunction with this jira?

          Marco Nicosia added a comment -

          I am really close to raising this to a blocker for Hadoop 0.18.3, BUT, I'm worried this isn't the right ticket. The symptom we're struggling with is that when a primary or secondary namenode wants to fork to run bash, it requires double memory than what it's currently using.

          I don't care if the solution is JNI or some IPC to some other process to run shell processes, but it's an inconvenient issue to work around. Is this the right ticket to take on fixing that, or should we create a different JIRA?

          Allen Wittenauer added a comment -

          FWIW, we have already seen a condition where the secondary name node required more memory to operate than the primary name node to do a shell out due to a requirement to run whoami or id or whatever due to DFSClient being in the code path. (As it was explained to me and/or as I understood it). From an ops perspective, this is highly suspect....

          Steve Loughran added a comment -

          It's worth remembering that once you go to JNI, you are more at risk from memory leaks, pointer problems, race conditions and other C/C++ coding issues that can affect long-lived programs. The benchmarks would also need to track process memory consumption to make sure that switching to a JNI wrapper didn't cause the memory use of the app to grow, and it would be handy to have an option in production systems to switch from JNI to shell calls to see if it makes any observed problems go away.

          This isn't a -1 to a JNI, just a warning that there can often be a price.

          Incidentally, in past experiments of mine, a JNI call takes about 600 PII clock cycles round trip; this is a lot less than starting a process, but not entirely free.

          Raghu Angadi added a comment -

          > adding more optional optimizations there, on a case-by-case basis;

          yes. it should be separate issue, if at all needed. I only mentioned it as one more thing Hadoop could use Native system call access for...

          Doug Cutting added a comment -

          Another thing to consider is building on http://apr.apache.org/. This seems to have most of what we'd want, and we could ship with pre-built versions for linux, windows, etc, since APR includes these. Then we might get rid of the bash & cygwin requirements. If we want to go this way it would be good to do an inventory of all the places we use bash and see how many APR might replace.

          Doug Cutting added a comment -

          > If we have access to poll(), we would just us that for Hadoop's blocking IO on non-blocking sockets

          Should each of these be a separate issue?

          We currently have an optional libhadoop. If we want to add more to it, I can see proceeding in one of a few ways:

          • adding more optional optimizations there, on a case-by-case basis; or
          • replace all shell access with native code, replacing the reliance on bash with reliance on a native library.

          But adding a few more optional optimizations doesn't seem like a single coherent issue and would better be addressed by more specific jiras, no?

          Raghu Angadi added a comment -

          Another use case:

          If we have access to poll(), we would just use that for Hadoop's blocking IO on non-blocking sockets rather than Sun's epoll-based implementation, and avoid requiring an extra 3 fds for each thread that is blocked; this simplifies the implementation as well.

          Doug Cutting added a comment -

          > Are you asking whether those invocations take a big portion of execution time in the real application?

          Yes, precisely.

          If HADOOP-4656 is the motivating case, then, after HADOOP-4656, the namenode might become nearly unusably slow without native code, which would be a significant negative change. So, for HADOOP-4656, we might instead change the default to, e.g., read /etc/group rather than exec a command.

          In general, it's best to design things so that we do not require native code for decent performance.

          Hong Tang added a comment -

          @Doug

          It seems to be common sense that process invocation is orders of magnitude more expensive than function calls (or system calls).

          Are you asking whether those invocations take a big portion of execution time in the real application?

          Doug Cutting added a comment -

          So, since performance is the reason, do we have a benchmark that shows this as significant? If not, we'll need one, right?

          Arun C Murthy added a comment -

          Yes to both.

          We would need to maintain a shell-based implementation for platforms which do not have the native implementations.

          Having a JNI-based runtime would let us access runtime information in a significantly more performant manner. E.g. HADOOP-4656 could be implemented by directly calling a native method which fetches groups via POSIX APIs; the absence of this forces us to cache the output of the 'groups' shell command.

          Doug Cutting added a comment -

          Arun> It would be useful to implement a JNI-based runtime [ ... ]

          What do you mean by 'useful'? Is this about performance?

          Arun> This would allow us to stop relying on exec'ing bash [ ... ]

          It would only allow this on platforms where the native library is built, right? So, unless we intend to support this on Windows, Solaris and MacOS, we still have to maintain the shell-based implementations too.

          Vinod Kumar Vavilapalli added a comment -

          +1.

          Of late, there have been many instances in which Mapred, particularly the TT side, started to depend on platform specific features. I guess the handling of these can be made much easier with such a native runtime. Some of them I can immediately think of are memory management of tasks by the TT (HADOOP-3581), starting tasks via job control (HADOOP-2721), and starting tasks as the job submitting user (HADOOP-4490). But I am not very sure as to which of these and how much of it can be helped by a native runtime.


            People

            • Assignee:
              Arun C Murthy
              Reporter:
              Arun C Murthy
            • Votes:
              0
              Watchers:
              28
