Issue Details (XML | Word | Printable)

Key: HADOOP-4998
Type: New Feature New Feature
Status: Open Open
Priority: Major Major
Assignee: Arun C Murthy
Reporter: Arun C Murthy
Votes: 0
Watchers: 19
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

Implement a native OS runtime for Hadoop

Created: 08/Jan/09 11:00 PM   Updated: Yesterday 04:21 AM
Return to search
Component/s: native
Affects Version/s: None
Fix Version/s: 0.21.0

Time Tracking:
Not Specified

Issue Links:
Reference
 


 Description  « Hide
It would be useful to implement a JNI-based runtime for Hadoop to get access to the native OS runtime. This would allow us to stop relying on exec'ing bash to get access to information such as user-groups, process limits etc. and for features such as chown/chgrp (org.apache.hadoop.util.Shell).

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Vinod K V added a comment - 09/Jan/09 04:53 AM
+1.

Of late, there have been many instances in which Mapred, particualrly the TT side, started to depend on platform specific features. I guess the handling of these can be made much easier with such a native runtime. Some of them I can immediately think of are memory management of tasks by TT(HADOOP-3581), starting tasks via job control(HADOOP-2721), starting tasks as the job submitting user(HADOOP-4490). But I am not very sure as to which of these and how much of it can be helped by a native runtime.


Doug Cutting added a comment - 09/Jan/09 05:18 AM
Arun> It would be useful to implement a JNI-based runtime [ ... ]

What do you mean by 'useful'? Is this about performance?

Arun> This would allow us to stop relying on exec'ing bash [ ... ]

It would only allow this on platforms where the native library is built, right? So, unless we intend to support this on Windows, Solaris and MacOS, we still have to maintain the shell-based implementations too.


Arun C Murthy added a comment - 09/Jan/09 08:18 AM
Yes to both.

We would need to maintain a shell-based implementation for platforms which do not the native implementations.

Having a JNI-based runtime would let us access runtime information in a significantly more performant manner. For e.g. HADOOP-4656 could be implemented by directly calling a native method which fetches groups via posix apis. The absence of which forces us to cache the output of 'groups' shell command.


Doug Cutting added a comment - 09/Jan/09 05:35 PM
So, since performance is the reason, do we have a benchmark that shows this as significant? If not, we'll need one, right?

Hong Tang added a comment - 09/Jan/09 06:49 PM
@Doug

It seems to be common sense that process invocation is orders of magnitude more expensive than function calls (or system calls).

Are you asking whether those invocations take a big portion of execution time in the real application?


Doug Cutting added a comment - 09/Jan/09 07:18 PM
> Are you asking whether those invocations take a big portion of execution time in the real application?

Yes, precisely.

If HADOOP-4656 is the motivating case, then, after HADOOP-4656, the namenode might become nearly unusably slow without native code, which would be a significant negative change. So, for HADOOP-4656, we might instead change the default to, e.g., read /etc/group rather than exec a command.

In general, it's best to design things so that we do not require native code for decent performance.


Raghu Angadi added a comment - 09/Jan/09 07:43 PM
Another use case:

If we have access to poll(), we would just us that for Hadoop's blocking IO on non-blocking sockets rather than Sun's epoll based implementation and avoid requiring extra 3 fds for each thread that is blocked.. this simplifies the implementation as well.


Doug Cutting added a comment - 09/Jan/09 08:03 PM
> If we have access to poll(), we would just us that for Hadoop's blocking IO on non-blocking sockets

Should each of these be a separate issue?

We currently have an optional libhadoop. If we want to add more to it, I can see proceeding in one of a few ways:

  • adding more optional optimizations there, on a case-by-case basis; or
  • replace all shell access with native code, replacing the reliance on bash with reliance on a native library.

But adding a few more optional optimizations doesn't seem like a single coherent issue and would better be addressed by more specific jiras, no?


Doug Cutting added a comment - 09/Jan/09 08:14 PM
Another thing to consider is building on http://apr.apache.org/. This seems to have most of what we'd want, and we could ship with pre-built versions for linux, windows, etc, since APR includes these. Then we might get rid of the bash & cygwin requirements. If we want to go this way it would be good to do an inventory of all the places we use bash and see how many APR might replace.

Raghu Angadi added a comment - 09/Jan/09 08:15 PM
> adding more optional optimizations there, on a case-by-case basis;

yes. it should be separate issue, if at all needed. I only mentioned it as one more thing Hadoop could use Native system call access for...


Steve Loughran added a comment - 12/Jan/09 10:39 AM
It's worth remembering that once you go to JNI, you are more at risk from memory leaks, pointer problems, race conditions and other C/C++ coding issues that can affect long-lived programs. The benchmarks would also need to track process memory consumption to make sure that switching to a JNI wrapper didn't cause the memory use of the app to grow, and it would be handy to have an option in production systems to switch from JNI to shell calls to see if it makes any observed problems go away.

This isn't a -1 to a JNI, just a warning that there can often be a price.

Incidentally, in past experiments of mine, a JNI call takes about 600 PII clock cycles cycles round trip; this is a lot less than starting a process, but not entirely free.


Allen Wittenauer added a comment - 12/Jan/09 07:55 PM
FWIW, we have already seen a condition where the secondary name node required more memory to operate than the primary name node to do a shell out due to a requirement to run whoami or id or whatever due to DFSClient being in the code path. (As it was explained to me and/or as I understood it). From an ops perspective, this is highly suspect....

Marco Nicosia added a comment - 14/Jan/09 06:17 PM
I am really close to raising this to a blocker for Hadoop 0.18.3, BUT, I'm worried this isn't the right ticket. The symptom we're struggling with is that when a primary or secondary namenode wants to fork to run bash, it requires double memory than what it's currently using.

I don't care if the solution is JNI or some IPC to some other process to run shell processes, but it's an inconvenient issue to work around. Is this the right ticket to take on fixing that, or should we create a different JIRA?


Arun C Murthy added a comment - 14/Jan/09 06:35 PM
Marco, we could use HADOOP-4656 in conjuction with this jira?

Doug Cutting added a comment - 14/Jan/09 09:02 PM
> when a primary or secondary namenode wants to fork to run bash, it requires double memory than what it's currently using.

Running a shell should use vfork and exec, and shouldn't double the memory use, should it? Let's figure out whether there's some other easily fixable bug here first.


Steve Loughran added a comment - 15/Jan/09 11:35 AM
I agree; the memory use for an exec() is odd and needs to be looked at.

Incidentally, one thing we are fond of doing is exec-ing a long-lived shell rather than running a shell script. This lets us use SSH to connect to nearby hosts, or to boost from being untrusted to root, but it may also have memory consumption benefits. You just need to keep that single shell connection open and issue commands down its IO streams.


Koji Noguchi added a comment - 15/Jan/09 07:39 PM

Running a shell should use vfork and exec, and shouldn't double the memory use, should it?

Created separate Jira. HADOOP-5059.


Philip Zeyliger added a comment - 01/Dec/09 04:21 AM
Figured I'd mention that Tomcat has some support for calling into APR:

Javadoc: http://tomcat.apache.org/tomcat-6.0-doc/api/org/apache/tomcat/jni/package-tree.html
Google codesearch link: http://www.google.com/codesearch/p?hl=en#cM_OVOKybvs/tomcat/tomcat-6/v6.0.10/src/apache-tomcat-6.0.10-src.zip|KNqCNnRERSg/apache-tomcat-6.0.10-src/java/org/apache/tomcat/jni/User.java&q=tomcat%20apr&d=10