What makes me concerned is that the code has to bring in a lot more dependencies in plain C, which has a high maintenance cost.
Currently, the libraries we depend on are: libuv, for portability primitives; protobuf-c, for protobuf functionality; expat, for XML parsing; and liburiparser, for parsing URIs. None of that functionality is provided by the C++ standard library, so your statement is false.
For example, this patch at least contains implementations of linked lists, splay trees, hash tables, and rb trees. There is a lot of overhead in implementing, reviewing and testing the code.
A lot of this code is not new. For example, we were using tree.h (which implements splay trees and rb trees), previously in libhdfs. The maintenance burden was not high. In fact, it was zero, because we never had to fix a bug in tree.h. So once again, your statement is just false.
htable.c got a review because it is new code. I would hardly call reviewing new code a "maintenance burden." And anyway, there is a standard POSIX way to use hash tables: the hcreate, hsearch, and hdestroy functions (glibc also offers reentrant hcreate_r, hsearch_r, and hdestroy_r variants as GNU extensions). We would like to use the standard way, but Windows doesn't implement these functions.
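For reference, here is a minimal sketch of what the POSIX hash table interface looks like in use. The helper names (table_put, table_get, demo_posix_htable) and the sample keys are made up for illustration and are not part of the patch. Note that this interface manages a single process-wide table, and that key ownership rules for hdestroy() can differ between implementations, so check your platform's man page before relying on it.

```c
#include <search.h>  /* POSIX: hcreate, hsearch, hdestroy */
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: store value under key in the process-wide table.
 * Returns 0 on success, -1 if the entry could not be added. */
static int table_put(const char *key, const char *value)
{
    ENTRY item;
    item.key = (char *)key;      /* hsearch keeps this pointer, not a copy */
    item.data = (void *)value;
    return hsearch(item, ENTER) != NULL ? 0 : -1;
}

/* Hypothetical helper: return the value stored under key, or NULL. */
static const char *table_get(const char *key)
{
    ENTRY query;
    ENTRY *found;
    query.key = (char *)key;
    query.data = NULL;
    found = hsearch(query, FIND);
    return found != NULL ? (const char *)found->data : NULL;
}

/* Creates the table, exercises it, and destroys it again.
 * Returns 0 if every step behaved as expected, -1 otherwise. */
int demo_posix_htable(void)
{
    int ok;
    if (hcreate(16) == 0)        /* hcreate returns 0 on failure */
        return -1;
    ok = table_put("alpha", "1") == 0 &&
         table_put("beta", "2") == 0 &&
         table_get("alpha") != NULL &&
         strcmp(table_get("alpha"), "1") == 0 &&
         table_get("missing") == NULL;
    hdestroy();
    return ok ? 0 : -1;
}
```

Calling demo_posix_htable() from main() is enough to try it on a POSIX system. On Windows the <search.h> hash functions are not available, which is the gap htable.c fills.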
For example, are you considering supporting filenames in Unicode? In that case, I think libicu might need to be brought into the picture.
First of all, the question of whether we should use libicu is independent of the question of whether we should use C++. libicu has a C interface, and the standard C++ libraries and runtime don't provide any unicode functionality beyond what the standard C libraries provide.
Second of all, I see no reason to use libicu. All the strings we are dealing with are UTF-8 supplied to and from protobuf. This means that they are null-terminated and can be printed and handled with existing string functions. libicu might come into the picture if we wanted to start normalizing unicode strings or using wide character strings. But we don't need or want to do that.
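To make the point about UTF-8 concrete, here is a small sketch showing ordinary byte-oriented C string functions working on a non-ASCII filename. The helpers join_path and base_name and the sample filename are hypothetical, not code from the patch. Searching for '/' with strrchr is safe because every byte of a multi-byte UTF-8 sequence has its high bit set, so it can never be mistaken for an ASCII character.

```c
#include <stdio.h>
#include <string.h>

/* "日本語.txt" spelled out as UTF-8 bytes, so the encoding of this
 * source file does not matter. 9 bytes of kanji + 4 bytes of ".txt". */
#define UTF8_NAME "\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e.txt"

/* Hypothetical helper: write "dir/name" into out.
 * Returns 0 on success, -1 if the result would not fit. */
int join_path(char *out, size_t out_len, const char *dir, const char *name)
{
    int n = snprintf(out, out_len, "%s/%s", dir, name);
    return (n >= 0 && (size_t)n < out_len) ? 0 : -1;
}

/* Hypothetical helper: return the component after the last '/'. */
const char *base_name(const char *path)
{
    const char *slash = strrchr(path, '/');
    return slash != NULL ? slash + 1 : path;
}
```

Here strlen(UTF8_NAME) counts bytes, not characters, which is exactly what buffer sizing and protobuf serialization need. Counting characters or normalizing would be a different problem, and that is the one we don't have.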
It looks to me that it is much more compelling to implement the code in a more modern language, say, c++11, where much of the headache right now is taken away by a mature standard library.
C++ first came on the scene in 1983. That is 31 years ago. C++ may be a lot of things, but "modern" isn't one of them. I was a C++ programmer for 10 years. I know the language about as well as anyone can. I specifically chose C for this project because of a few things.
Firstly, the challenge of maintaining a consistent C++ coding style is very, very large. This is true even when everyone is a professional C++ programmer working under the same roof. For a project like Hadoop, where C/C++ is not everyone's first language, the challenge is just unsupportable. The C++ learning curve is just much steeper than C's. You have to know everything you have to know for C, plus a lot of very tricky things that are unique to C++.
There are a lot of contentious issues in the community: use exceptions, or don't use exceptions? Use global constructors, or don't use global constructors? Use boost, or don't use boost? Use C++14 or use some older standard? Use runtime type information (dynamic_cast, typeid), or don't use runtime type information? Operator overloading, or no operator overloading?
There are reasonable arguments for each of these positions. For example, exceptions harm performance because of the need to maintain data for stack unwinding (see http://preshing.com/20110807/the-cost-of-enabling-exception-handling/). That's just if you don't use them... if you do use them, exceptions turn out to be a lot slower than return codes. They also can make code difficult to follow. C++ doesn't have checked exceptions, so you can never really know what any function will throw. For this reason, some fairly smart people at Google have decided to ban exceptions from their coding standard. This, in turn, means that it's difficult for libraries to throw exceptions, since open source projects using the Google coding standard (and there are a lot of them) can't deal with exceptions. Of course, without exceptions, certain things in C++ are very hard to do. (By the way, I'm not interested in having the argument for/against exceptions here, just in noting that there is huge fragmentation here and reasonable people on both sides.)
A similar story could be told about all the other choices. The net effect is that we have to police a very large set of arbitrary style decisions that just wouldn't come up at all if we just used C.
C++ library APIs have binary compatibility issues. A lot of them. Just take a look at http://techbase.kde.org/Policies/Binary_Compatibility_Issues_With_C++. Again, how are we going to ensure that everyone follows these rules? It's nearly impossible. Considering the number of issues we've had maintaining API compatibility in Java, with Java's much simpler rules, this is just a deal-breaker. Whereas with C, the rules for maintaining binary compatibility are very simple.
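As an illustration of how simple the C rules can be, here is a sketch of the usual opaque-handle pattern. All names here (demo_conn and its functions) are hypothetical and do not come from libhdfs or the patch. Callers only ever hold a pointer to an incomplete type, so the library can add fields to the struct in a later release without changing the ABI that callers were compiled against.

```c
#include <stdlib.h>

/* ---- What the public header would contain ---- */
struct demo_conn;                               /* opaque to callers */
struct demo_conn *demo_conn_new(int port);
int demo_conn_port(const struct demo_conn *conn);
void demo_conn_free(struct demo_conn *conn);

/* ---- What stays private inside the library ---- */
struct demo_conn {
    int port;
    int retries;    /* imagine this was added in a later release;
                     * callers never see the layout, so nothing breaks */
};

struct demo_conn *demo_conn_new(int port)
{
    struct demo_conn *conn = calloc(1, sizeof(*conn));
    if (conn == NULL)
        return NULL;
    conn->port = port;
    conn->retries = 3;  /* arbitrary default for the sketch */
    return conn;
}

int demo_conn_port(const struct demo_conn *conn)
{
    return conn->port;
}

void demo_conn_free(struct demo_conn *conn)
{
    free(conn);
}
```

The remaining rules are roughly: don't remove or change the signature of an exported function, and don't expose struct layouts you might want to change. Compare that with the length of the KDE list.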
C is available on every platform out there, even AIX. C++11 is only available on a subset of those platforms. This is another advantage of plain old C.
But more importantly, it's easier to bind other higher-level languages to C than it is to C++. For example, in Python you can use ctypes to easily wrap a C library. https://docs.python.org/2/library/ctypes.html. Do you want to use ctypes with C++? Then you're out of luck. http://stackoverflow.com/questions/1615813/how-to-use-c-classes-with-ctypes. A similar story could be told about golang, and most other high-level languages. You have to write a lot of boilerplate to wrap C++, and almost none for C.
If we were writing a new daemon or something, then I might consider C++, even C++11. Yes, C++11 added some good things. auto was a good idea (borrowed from golang or someplace), and move constructors are nice. But none of it addresses the problems above, and all of it just adds more complexity for people to master. What we are writing is just a client, and it's not that thick. Especially the YARN client, which just makes some RPCs and that's it. And the code is nearly done.
I'm not interested in having a language flamewar here. C has some advantages, and C++ has another set. For this particular project, the former outweigh the latter. I'm very familiar with C++ and I don't need a lecture on its advantages, having been a user for a decade.
If you are interested in writing a C++ interface for libhdfs or libyarn, then by all means do that. I think this interface should be in a header file only, to avoid the binary compatibility issues I mentioned earlier. Since the header file would be compiled by each client, we would be free to change it whenever we liked without worrying about binary compatibility.