Arun> It would be useful to implement a JNI-based runtime [ ... ]
What do you mean by 'useful'? Is this about performance?
Arun> This would allow us to stop relying on exec'ing bash [ ... ]
It would only allow this on platforms where the native library is built, right? So, unless we intend to support this on Windows, Solaris and MacOS, we still have to maintain the shell-based implementations too.
Yes to both.
We would need to maintain a shell-based implementation for platforms which do not have the native implementation. Having a JNI-based runtime would let us access runtime information in a significantly more performant manner. For example, HADOOP-4656 could be implemented by directly calling a native method which fetches groups via POSIX APIs. Without such a method, we are forced to cache the output of the 'groups' shell command.
So, since performance is the reason, do we have a benchmark that shows this as significant? If not, we'll need one, right?
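To make the caching point concrete, here is a minimal sketch of a cache placed in front of a shell-based group lookup. The class and method names are hypothetical and this is not Hadoop's actual code. The shell loader uses `id -Gn <user>` rather than `groups` because its output is a plain space-separated list; that substitution is an assumption of this sketch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical cache in front of an expensive group lookup (not Hadoop's actual code). */
final class CachedGroupLookup {
    private final Map<String, List<String>> cache = new ConcurrentHashMap<>();
    private final Function<String, List<String>> loader;

    CachedGroupLookup(Function<String, List<String>> loader) {
        this.loader = loader;
    }

    /** Returns cached groups, invoking the loader only on the first request per user. */
    List<String> groupsFor(String user) {
        return cache.computeIfAbsent(user, loader);
    }

    /** Shell-based loader (assumption: `id -Gn` is available): forks a process per call. */
    static List<String> viaShell(String user) {
        try {
            Process p = new ProcessBuilder("id", "-Gn", user).redirectErrorStream(true).start();
            try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
                String line = r.readLine();
                if (p.waitFor() != 0 || line == null) {
                    throw new IllegalStateException("group lookup failed for " + user);
                }
                return Arrays.asList(line.trim().split("\\s+"));
            }
        } catch (IOException e) {
            throw new IllegalStateException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

A caller would construct it as `new CachedGroupLookup(CachedGroupLookup::viaShell)`. The cache never expires entries here, so group changes are not picked up; a native lookup would make that trade-off unnecessary.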
@Doug
It seems to be common sense that process invocation is orders of magnitude more expensive than function calls (or system calls). Are you asking whether those invocations take a big portion of execution time in the real application?
> Are you asking whether those invocations take a big portion of execution time in the real application?
Yes, precisely. If HADOOP-4656 is the motivating case, then, after HADOOP-4656, the namenode might become nearly unusably slow without native code, which would be a significant negative change. So, for HADOOP-4656, we might instead change the default to, e.g., read /etc/group rather than exec a command. In general, it's best to design things so that we do not require native code for decent performance.
Another use case:
If we have access to poll(), we would just use that for Hadoop's blocking IO on non-blocking sockets rather than Sun's epoll-based implementation, and avoid requiring 3 extra fds for each thread that is blocked. This simplifies the implementation as well.
> If we have access to poll(), we would just use that for Hadoop's blocking IO on non-blocking sockets
Should each of these be a separate issue? We currently have an optional libhadoop. If we want to add more to it, I can see proceeding in one of a few ways:
But adding a few more optional optimizations doesn't seem like a single coherent issue and would better be addressed by more specific jiras, no? Another thing to consider is building on http://apr.apache.org/
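As an illustration of what "optional" native code means in practice, the following sketch tries to load a JNI library and falls back to the shell path when the library is not available. The class name, the enum, and the `forceShell` flag are hypothetical; the flag models an operator switch for ruling out native-code problems, an option also raised in this thread.

```java
/** Hypothetical selector between a native code path and the shell-based one. */
final class RuntimeBackend {
    enum Kind { NATIVE, SHELL }

    private RuntimeBackend() {}

    /** Tries to load the named JNI library; returns false instead of failing. */
    static boolean tryLoad(String libName) {
        try {
            System.loadLibrary(libName);
            return true;
        } catch (UnsatisfiedLinkError | SecurityException e) {
            return false; // library not built for, or not permitted on, this platform
        }
    }

    /** Picks the backend; forceShell skips the native library even when it is present. */
    static Kind select(String libName, boolean forceShell) {
        if (forceShell) {
            return Kind.SHELL;
        }
        return tryLoad(libName) ? Kind.NATIVE : Kind.SHELL;
    }
}
```

With this shape, platforms without the native build keep working through the shell-based implementation, which is why that implementation still has to be maintained.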
> adding more optional optimizations there, on a case-by-case basis;
Yes, it should be a separate issue, if needed at all. I only mentioned it as one more thing Hadoop could use native system call access for...
It's worth remembering that once you go to JNI, you are more at risk from memory leaks, pointer problems, race conditions and other C/C++ coding issues that can affect long-lived programs. The benchmarks would also need to track process memory consumption to make sure that switching to a JNI wrapper didn't cause the memory use of the app to grow, and it would be handy to have an option in production systems to switch from JNI to shell calls to see if it makes any observed problems go away.
This isn't a -1 to JNI, just a warning that there can often be a price. Incidentally, in past experiments of mine, a JNI call takes about 600 PII clock cycles round trip; this is a lot less than starting a process, but not entirely free.
FWIW, we have already seen a condition where the secondary name node required more memory to operate than the primary name node, because it had to shell out to run whoami or id or whatever, due to DFSClient being in the code path. (As it was explained to me and/or as I understood it.) From an ops perspective, this is highly suspect....
I am really close to raising this to a blocker for Hadoop 0.18.3, BUT, I'm worried this isn't the right ticket. The symptom we're struggling with is that when a primary or secondary namenode wants to fork to run bash, it requires double the memory it's currently using.
I don't care if the solution is JNI or some IPC to some other process to run shell processes, but it's an inconvenient issue to work around. Is this the right ticket to take on fixing that, or should we create a different JIRA?
Marco, we could use HADOOP-4656 in conjunction with this jira?
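For reference, the /etc/group alternative suggested earlier in this thread for HADOOP-4656 could look roughly like the sketch below. The class name is hypothetical. It assumes the usual `name:password:gid:member1,member2` line format and only finds groups that list the user explicitly as a member; the primary group recorded in /etc/passwd, and directory services such as NIS or LDAP, are not covered.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: resolves a user's groups from /etc/group-style lines without exec'ing. */
final class EtcGroupLookup {
    private EtcGroupLookup() {}

    /** Returns the names of all groups whose member list contains the given user. */
    static List<String> groupsFor(String user, List<String> lines) {
        List<String> groups = new ArrayList<>();
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blanks and comments
            }
            String[] fields = line.split(":", -1); // -1 keeps a trailing empty member list
            if (fields.length < 4) {
                continue; // malformed entry
            }
            for (String member : fields[3].split(",")) {
                if (member.trim().equals(user)) {
                    groups.add(fields[0]);
                    break;
                }
            }
        }
        return groups;
    }
}
```

A caller could pass in `java.nio.file.Files.readAllLines(java.nio.file.Paths.get("/etc/group"))`, which needs no fork and so avoids the memory spike described above.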
> when a primary or secondary namenode wants to fork to run bash, it requires double the memory it's currently using.
Running a shell should use vfork and exec, and shouldn't double the memory use, should it? Let's figure out whether there's some other easily fixable bug here first.
I agree; the memory use for an exec() is odd and needs to be looked at.
Incidentally, one thing we are fond of doing is exec-ing a long-lived shell rather than running a shell script. This lets us use SSH to connect to nearby hosts, or to boost from being untrusted to root, but it may also have memory consumption benefits. You just need to keep that single shell connection open and issue commands down its IO streams.
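A rough sketch of the long-lived shell idea: start one shell process, keep its streams open, and send each command down stdin, reading output until a sentinel line appears. The class name and the sentinel are hypothetical, a POSIX `sh` on the PATH is assumed, and each command's output is assumed to end with a newline.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical wrapper around one long-lived shell process; assumes a POSIX sh on the PATH. */
final class LongLivedShell implements AutoCloseable {
    // Sentinel line marking the end of a command's output; assumed absent from real output.
    private static final String MARKER = "__END_OF_COMMAND__";
    private final Process process;
    private final BufferedWriter stdin;
    private final BufferedReader stdout;

    LongLivedShell() throws IOException {
        // The only fork happens here, once, rather than once per command.
        process = new ProcessBuilder("sh").redirectErrorStream(true).start();
        stdin = new BufferedWriter(
                new OutputStreamWriter(process.getOutputStream(), StandardCharsets.UTF_8));
        stdout = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8));
    }

    /** Sends one command down the shell's stdin and collects output lines up to the sentinel. */
    synchronized List<String> run(String command) throws IOException {
        stdin.write(command);
        stdin.newLine();
        stdin.write("echo " + MARKER);
        stdin.newLine();
        stdin.flush();
        List<String> lines = new ArrayList<>();
        String line;
        while ((line = stdout.readLine()) != null && !line.equals(MARKER)) {
            lines.add(line);
        }
        return lines;
    }

    @Override
    public void close() throws IOException {
        stdin.close();     // EOF on stdin lets the shell exit
        process.destroy(); // make sure the process is gone
    }
}
```

If the shell is started early, while the JVM heap is still small, later commands reuse that one process instead of forking from a large namenode.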
Created separate Jira: HADOOP-5059.
Of late, there have been many instances in which Mapred, particularly the TT side, started to depend on platform-specific features. I guess the handling of these can be made much easier with such a native runtime. Some of them I can immediately think of are memory management of tasks by the TT (HADOOP-3581), starting tasks via job control (HADOOP-2721), and starting tasks as the job-submitting user (HADOOP-4490). But I am not very sure which of these can be helped by a native runtime, or by how much.