What will the project structure look like? A separate top-level hadoop-native-client project? Or separate code files in the existing common/hdfs/yarn directories?
I think a separate top-level project is best, since this will allow the YARN native client and the HDFS native client to share code much more easily. They will have a lot of shared code. We can have a Maven profile that causes this subproject to be built.
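To sketch what the opt-in build might look like, here is a hypothetical Maven profile (the profile id and module name are illustrative, not decided) that would pull the native-client subproject into the build only when requested, e.g. with "mvn package -Pnative-client":

```xml
<!-- Hypothetical sketch in the parent pom.xml; names are illustrative. -->
<profiles>
  <profile>
    <id>native-client</id>
    <modules>
      <!-- Built only when the profile is activated. -->
      <module>hadoop-native-client</module>
    </modules>
  </profile>
</profiles>
```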
Why the names libhdfs-core.so and libyarn-core.so? It's a client library, so "core" doesn't sound right.
I guess my thinking here is that these libraries are speaking the core hadoop protocol. I am open to other names if you have something better. One problem with choosing a name is that "libhdfs" and "libhadoop" are already taken. We also already have directories named "native," so that would be confusing as well. We also need a name that is fairly short, since it will appear in header file names, object names, etc. etc. We could do "libhdfs-ng.so", I guess.
In short, what libraries do you plan to use?
libuv, libprotobuf-c, something for XML parsing, something for URI parsing.
CMake already has a test driver called CTest, so we can use that.
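As a minimal sketch of how that could look (target and file names here are hypothetical, not the actual project layout), tests are just executables registered with add_test and run via "ctest" or "make test":

```cmake
# Hypothetical CMakeLists.txt fragment; names are illustrative.
enable_testing()

add_executable(test_hdfs_core test/test_hdfs_core.c)
target_link_libraries(test_hdfs_core hdfs-core)

# CTest runs the binary and treats a nonzero exit code as a failure.
add_test(NAME test_hdfs_core COMMAND test_hdfs_core)
```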
I'd like the library to be lightweight. Some people just want a header file and a statically linked library (a few MB in size) to read from and write to HDFS, so the heavier features, such as an XML library for config file parsing, URI parsing for cross-FileSystem symlinks, and a thread pool, should be optional, not required.
I agree that having an option for static linking would be good. We also need to think carefully about compatibility and what the header file will look like.
The reason for supporting config file parsing is that we want this library to be a drop-in replacement for libhdfs.so. libhdfs.so is a JNI-based library used by a lot of C and C++ projects such as fuse_dfs and Impala, and it reads configuration XML files in the usual way simply by invoking the Java Configuration code. If this library is not a drop-in replacement for libhdfs.so, most projects simply will not be able to use it. The other reason for supporting config file parsing is that, well, you need some way of configuring the client! If we end up re-inventing the configuration wheel in a different way, that will not be good for anyone.
Some clients may not want to read XML files, but simply set all the configuration keys themselves. That's fine, and we can support this. We can even make the XML-reading code optional if you want.
Thread pools and async I/O, I'm afraid, are something we can't live without. The HDFS client needs to do certain operations in the background. If you study the existing DFSOutputStream code, you'll see that DFSOutputStream does transfers in the background while the client continues to fill a buffer. This is essential for good performance, since otherwise we'd have to stop and wait for the packet to be written to all 3 datanodes in the pipeline every time our 64 KB chunk filled up. Take a look at the existing HDFS client code to get a sense of what a native client would be like.