Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: HADOOP-10388
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      A pure native Hadoop client has the following use cases/advantages:
      1. writing YARN applications in C++
      2. direct access to HDFS, without extra proxy overhead compared to the web/NFS interfaces (see the sketch below this list)
      3. wrapping the native library to support more languages, e.g. Python
      4. lightweight: a small footprint compared to the several hundred MB of the JDK and Hadoop libraries with their various dependencies
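
      To make use case 2 concrete: the JNI-based libhdfs already exposes a C API (hdfs.h), and a drop-in native replacement (discussed in the comments below) would let a small program like the following read from HDFS without a JVM. This is only an illustrative sketch; the path and connection parameters are placeholders.

          #include <fcntl.h>   /* O_RDONLY */
          #include <stdio.h>
          #include "hdfs.h"    /* the existing libhdfs header that a native client would mirror */

          int main(void)
          {
              /* "default" picks up fs.defaultFS from the loaded configuration. */
              hdfsFS fs = hdfsConnect("default", 0);
              if (!fs) {
                  fprintf(stderr, "failed to connect to HDFS\n");
                  return 1;
              }
              /* Placeholder path, for illustration only. */
              hdfsFile in = hdfsOpenFile(fs, "/tmp/example.txt", O_RDONLY, 0, 0, 0);
              if (!in) {
                  fprintf(stderr, "failed to open file\n");
                  hdfsDisconnect(fs);
                  return 1;
              }
              char buf[4096];
              tSize n = hdfsRead(fs, in, buf, sizeof(buf));
              printf("read %d bytes\n", (int)n);
              hdfsCloseFile(fs, in);
              hdfsDisconnect(fs);
              return 0;
          }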

        Issue Links

        1. Native RPCv9 client (Sub-task, Resolved, Colin Patrick McCabe)
        2. fix hadoop native client CMakeLists.txt issue with older cmakes (Sub-task, Resolved, Wenwu Peng)
        3. Fix some minors error and compile on macosx (Sub-task, Resolved, Binglin Chang)
        4. Add username to native RPCv9 client (Sub-task, Resolved, Colin Patrick McCabe)
        5. Add unit test case for net in hadoop native client (Sub-task, Resolved, Wenwu Peng)
        6. Fix some minor typos and add more test cases for hadoop_err (Sub-task, Resolved, Wenwu Peng)
        7. Native Hadoop Client: make clean should remove pb-c.h.s files (Sub-task, Resolved, Binglin Chang)
        8. add pom.xml infrastructure for hadoop-native-core (Sub-task, Resolved, Binglin Chang)
        9. Implement Namenode RPCs in HDFS native client (Sub-task, Resolved, Colin Patrick McCabe)
        10. Implement C code for parsing Hadoop / HDFS URIs (Sub-task, Resolved, Colin Patrick McCabe)
        11. native code for reading Hadoop configuration XML files (Sub-task, Resolved, Colin Patrick McCabe)
        12. Fix initialization of hrpc_sync_ctx (Sub-task, Resolved, Binglin Chang)
        13. limit symbol visibility in libhdfs-core.so and libyarn-core.so (Sub-task, Resolved, Colin Patrick McCabe)
        14. Native Hadoop Client:add unit test case for call&client_id (Sub-task, Resolved, Wenwu Peng)
        15. implement TCP connection reuse for native client (Sub-task, Resolved, Colin Patrick McCabe)
        16. Fix namenode-rpc-unit warning reported by memory leak check tool(valgrind) (Sub-task, Resolved, Wenwu Peng)
        17. ndfs hdfsDelete should check the return boolean (Sub-task, Resolved, Colin Patrick McCabe)
        18. ndfs: need to implement umask, pass permission bits to hdfsCreateDirectory (Sub-task, Resolved, Colin Patrick McCabe)
        19. native client: refactor URI code to be clearer (Sub-task, Resolved, Colin Patrick McCabe)
        20. Implement listStatus and getFileInfo in the native client (Sub-task, Resolved, Colin Patrick McCabe)
        21. native client: implement hdfsMove and hdfsCopy (Sub-task, Resolved, Colin Patrick McCabe)
        22. native client: split ndfs.c into meta, file, util, and permission (Sub-task, Resolved, Colin Patrick McCabe)
        23. Implement DataTransferProtocol in libhdfs-core.so (Sub-task, Open, Unassigned)
        24. Pure Native Client: implement C code native_mini for YARN for unit test (Sub-task, Open, Wenwu Peng)
        25. native client: parse Hadoop permission strings (Sub-task, Open, Unassigned)
        26. hconf.c: fix bug where we would sometimes not try to load multiple XML files from the same path (Sub-task, Patch Available, Colin Patrick McCabe)
        27. implement ndfs_get_hosts (Sub-task, In Progress, Colin Patrick McCabe)

          Activity

          Colin Patrick McCabe added a comment -

          This seems like a good idea. I would add another advantage to the list: a C or C++ library could make it easier to debug crashes than a JNI one.

          One issue is that we could be talking about a lot of code here. I think we should decide on a coding style to use so that everyone can contribute and understand the code that others have written. If C++, I would suggest the Google C++ style guide. See http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml

          Pure C is also a viable choice, in which case I would recommend something more like the Linux kernel style guide (but with a 4-space indent).
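
          As a formatting illustration only (the names below are invented, not project code), a function in roughly that style -- kernel-ish conventions, but with a 4-space indent -- would look like:

              /* Hypothetical example purely to show the suggested formatting. */
              struct hrpc_call {
                  int id;
                  const char *method;
              };

              static int hrpc_call_validate(const struct hrpc_call *call)
              {
                  if (!call || !call->method)
                      return -1;      /* kernel style: bail out early on error */
                  if (call->id < 0)
                      return -1;
                  return 0;
              }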

          Chris Nauroth added a comment -

          I believe there are still no plans for the Microsoft compiler to provide full support for the most recent C standards. If we choose C, then we might have to stick to an old dialect, like we currently do in libhadoop.so/hadoop.dll.

          Arpit Agarwal added a comment -

          I believe there are still no plans for the Microsoft compiler to provide full support for the most recent C standards.

          VS 2013 claims C99 support. No idea whether this applies to the free compiler included with the Windows SDK.

          Colin Patrick McCabe added a comment -

          I believe there are still no plans for the Microsoft compiler to provide full support for the most recent C standards. If we choose C, then we might have to stick to an old dialect, like we currently do in libhadoop.so/hadoop.dll.

          I agree. The kernel coding standard is designed around C89, which makes it good for this use-case.

          VS 2013 claims C99 support. No idea whether this applies to the free compiler included with the Windows SDK.

          That link says they are going to support a few specific features. I'm not aware of any claims that they will have full C99 support; did I miss something?

          We might consider allowing designated initializers (I think the kernel coding style makes use of them). C99 variable-length arrays on the stack, however, are something to avoid, because you don't want stack overflows (the Google coding standard bans them for this reason).
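
          To make those two points concrete (an illustrative snippet with invented names, not project code): the first initializer shows the C99 designated-initializer form that could be allowed, while the variable-length array shows the pattern to avoid.

              #include <string.h>

              struct rpc_options {
                  int timeout_ms;
                  int max_retries;
                  const char *user;
              };

              void example(int n)
              {
                  /* C99 designated initializer: fields named explicitly, the rest zeroed. */
                  struct rpc_options opts = {
                      .timeout_ms = 30000,
                      .user = "hdfs",
                  };
                  (void)opts;

                  /* C99 variable-length array: its size depends on a runtime value, so a
                   * large 'n' can overflow the stack -- this is what we would ban. */
                  char scratch[n];
                  memset(scratch, 0, n);
              }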

          Binglin Chang added a comment -

          Although C is viable, I would suggest using C++11; that will help us get rid of a lot of dependencies and make the code smaller.
          I was writing a client just for fun. It uses C++11, depends on protobuf, json-c, sasl2, gtest and cmake, and is about 8k LOC. It is on GitHub now: https://github.com/decster/libhadoopclient.
          Hope some of the code can be useful here.

          Colin Patrick McCabe added a comment -

          C++ is viable, but we need to remember that C++11 is not a coding standard. We need to adopt a coding standard such as the Google Coding Standard and follow it closely for C++ to be a practical choice.

          I also think we should be able to compile this code on Red Hat 6, since that's the platform that a lot of our users are using (actually, some of our users are using RHEL5, but I think we can safely assume that the sun will set on that one before this is ready.) This means certain bleeding-edge C++ features will be unavailable. But on the plus side, we won't use language features that later turn out to be problematic (it's nice not being an early adopter sometimes.)

          I also wrote some C++ code in this direction a while ago. I never finished it, though. I'll see if I can dig it up.

          Binglin Chang added a comment -

          About the coding standard, Google's is mostly fine.
          I mention C++11 mostly for the new standard libraries (thread, lock/condition, random, unique_ptr/shared_ptr, regex), so we can avoid writing a lot of common utility code. It's fine if we use Boost instead, and we could provide typedefs so that either C++11 or Boost is an option: old compilers can use Boost, and new compilers can avoid the Boost dependency.
          Agree with Colin; I tend to avoid "fancy" language features such as lambdas, templates, and std::function.
          For compatibility the code should be plain and simple, especially in the public API; C++ does not have good binary compatibility (mainly due to virtual methods), so we need to be careful.
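
          One common way to keep a public API binary-compatible is to expose only an opaque handle plus plain functions, so no struct layout (or vtable) ever crosses the library boundary. A minimal sketch with invented names, not the actual API of this project:

              #include <stdlib.h>
              #include <string.h>

              /* Public surface: callers only ever see the opaque handle and the
               * functions, so the struct layout can change without breaking the ABI. */
              typedef struct hdfs_client hdfs_client;    /* illustrative name */

              /* Private definition; would live in the .c file, never in the header. */
              struct hdfs_client {
                  char *namenode_uri;
                  int   connected;
              };

              hdfs_client *hdfs_client_new(const char *namenode_uri)
              {
                  hdfs_client *c = calloc(1, sizeof(*c));
                  if (!c)
                      return NULL;
                  c->namenode_uri = strdup(namenode_uri);
                  return c;
              }

              void hdfs_client_free(hdfs_client *c)
              {
                  if (!c)
                      return;
                  free(c->namenode_uri);
                  free(c);
              }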

          Steve Loughran added a comment -
          1. +1 to avoid leading-edge features that need practical experience in writing modern C++ code to make use of. Some of us committers may have C/C++ skills, but they'll not be current.
          2. I'd like it to build on OS X so that Mac builds catch regressions, even if it isn't for production.
          3. I'm not up to date with C++ test frameworks; xUnit code always played up in the past (2003) as tests weren't independent enough, and resource leakage never got picked up. Something good that Jenkins could run against MiniDFS and MiniYARN clusters would be nice.

          C++ does not have good binary compatibility (mainly due to virtual methods), so we need to be careful.

          Assume destination-side builds: source RPMs & debs, with specific-version binaries produced by downstream packagers. Windows builds, which we should also support if possible, will need locking down by the Windows developer groups. That is, "no guarantee of binary compatibility across versions for now".

          Binglin Chang added a comment -

          I'd like it to build on OS X so that Mac builds catch regressions, even if it isn't for production.

          Agree. Mac OS X is more like FreeBSD. I do most of my coding on a Mac, so I can help make sure the Mac build and tests stay healthy.

          I'm not up to date with C++ test frameworks

          Although I haven't tried other test frameworks, I would recommend gtest; it is small and convenient (just a .cc file that can be embedded into the test program). If we are using the Google C++ coding standard and protobuf, using another Google framework seems natural.

          Colin Patrick McCabe added a comment -

          Binglin said: I mention C++11 mostly for the new standard libraries (thread, lock/condition, random, unique_ptr/shared_ptr, regex), so we can avoid writing a lot of common utility code. It's fine if we use Boost instead, and we could provide typedefs so that either C++11 or Boost is an option: old compilers can use Boost, and new compilers can avoid the Boost dependency.

          shared_ptr is in TR1; every compiler in use today should have it. random is pretty straightforward with rand_r -- hardly a reason to pull in dependencies. For the rest of the stuff, we should just have thin wrappers around the POSIX or Windows functions, I think.

          I don't think we should depend on Boost at any point, since it introduces too many compatibility issues. Boost simply doesn't maintain good compatibility across versions. And then there are issues like what happens if the code using your library is also linking against a different version of Boost? It just doesn't work very well.

          It's important to remember that we're writing a library here that clients will use, not a stand-alone application. That means we need to be careful not to assume too much about the context we're running in. Ideally, we'd have only the dependencies that we really need, and we'd provide the ability to shut down the library or run multiple instances of it from different threads of the client application.

          Steve said: I'd like it to build on OS X so that Mac builds catch regressions, even if it isn't for production.

          Yeah, it would be nice to have a cross-platform client. I don't have easy access to MacOS (it's proprietary and I don't run it, although some of my co-workers do), but I do like to compile things on FreeBSD to see how things go. We should keep portability in mind.

          Although I haven't tried other test frameworks, I would recommend gtest; it is small and convenient (just a .cc file that can be embedded into the test program). If we are using the Google C++ coding standard and protobuf, using another Google framework seems natural.

          Yeah, gtest would be a nice test framework for this.

          Binglin Chang added a comment -

          Hi Colin, I see you have assigned all the JIRAs to yourself now. Thanks for taking on this effort.
          I created this JIRA mainly because I want to help with some of the development here. Do you have a plan/idea for how to proceed with the work?

          Colin Patrick McCabe added a comment -

          I was going to post an RPC client using libuv. libuv is nice because it's cross-platform (including UNIX and Windows), MIT-licensed, and has platform wrapper functions like uv_thread_create and uv_mutex_lock, etc. so we won't have to write our own platform stuff for Linux, Windows, etc. libuv also supports async (TCP) I/O, which I would like to have in the RPC client to provide flexibility.
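
          For illustration, a minimal sketch of those libuv wrappers in use (the shared counter is just a stand-in workload, not project code):

              #include <stdio.h>
              #include <uv.h>

              static uv_mutex_t lock;
              static int counter;

              static void worker(void *arg)
              {
                  (void)arg;
                  for (int i = 0; i < 1000; i++) {
                      uv_mutex_lock(&lock);    /* same call on Linux, Windows, BSD, ... */
                      counter++;
                      uv_mutex_unlock(&lock);
                  }
              }

              int main(void)
              {
                  uv_thread_t threads[4];

                  uv_mutex_init(&lock);
                  for (int i = 0; i < 4; i++)
                      uv_thread_create(&threads[i], worker, NULL);
                  for (int i = 0; i < 4; i++)
                      uv_thread_join(&threads[i]);
                  uv_mutex_destroy(&lock);

                  printf("counter = %d\n", counter);    /* expect 4000 */
                  return 0;
              }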

          I think the best structure is to start with the RPC library in hadoop-common, and then perhaps work on a native HDFS client that uses it in hadoop-hdfs. There will be a lot to do there and we can split up the work. I'm going to try to post something for RPC by next week.

          Arun C Murthy added a comment -

          +1 for the effort, long overdue.

          Agree that starting with RPC client is the right first step.

          I also have a barebones client for golang: https://github.com/hortonworks/gohadoop which I'm happy to throw in if there is sufficient interest.

          Binglin Chang added a comment -

          Hi Colin, what's the status of the work now? Could you post some of the work so others can cooperate?

          Colin Patrick McCabe added a comment -

          Hi Binglin,

          I posted the RPC code to HADOOP-10389. It still needs a little more testing! I also posted some more subtasks and created a branch. Would you like to be a branch committer for the HADOOP-10388 branch?

          Binglin Chang added a comment -

          Thanks for posting this, Colin; I'm looking into the code right now. wenwupeng and I both got branch committer invitations today. He is interested in providing more tests for the feature.
          About the code and the created sub-JIRAs, here are some initial questions:

          1. What will the project structure look like? A separate top-level hadoop-native-client project? Or separate code files in the existing common/hdfs/yarn dirs?
          2. Why the names libhdfs-core.so and libyarn-core.so? It's a client library, which doesn't sound like "core".
          3. I'm surprised the code turned to pure C. It seems that because of this, we are introducing unusual libraries and tools (protobuf-c, whose last release was in 2011, and the tool shorten). And about the test library: is the C++ library gtest not going to be used either? In short, what libraries are planned to be used?
          4. I'd like the library to be lightweight; some people just want a header file and a statically linked library (a few MB in size) to be able to read/write from HDFS, so some heavy features -- the XML library (config file parsing), URI parsing (cross-FileSystem symlinks), thread pools -- had better be optional, not required.
          Colin Patrick McCabe added a comment -

          What will the project structure look like? A separate top-level hadoop-native-client project? Or separate code files in the existing common/hdfs/yarn dirs?

          I think a separate top-level project is best, since this will allow the YARN native client and the HDFS native client to share code much more easily. They will have a lot of shared code. We can have a Maven profile that causes this subproject to be built.

          Why the names libhdfs-core.so and libyarn-core.so? It's a client library, which doesn't sound like "core".

          I guess my thinking here is that these libraries are speaking the core hadoop protocol. I am open to other names if you have something better. One problem with choosing a name is that "libhdfs" and "libhadoop" are already taken. We also already have directories named "native," so that would be confusing as well. We also need a name that is fairly short, since it will appear in header file names, object names, etc. etc. We could do "libhdfs-ng.so", I guess.

          In short, what libraries are planned to be used?

          libuv, libprotobuf-c, something for XML parsing, something for URI parsing.
          CMake already has a unit test framework called CTest, so we use that.

          I'd like the library to be lightweight; some people just want a header file and a statically linked library (a few MB in size) to be able to read/write from HDFS, so some heavy features -- the XML library (config file parsing), URI parsing (cross-FileSystem symlinks), thread pools -- had better be optional, not required.

          I agree that having an option for static linking would be good. We also need to think carefully about compatibility and what the header file will look like.

          The reason for supporting config file parsing is that we want this library to be a drop-in replacement for libhdfs.so. libhdfs.so is a JNI-based library used by a lot of C and C++ projects such as fuse_dfs and Impala, and it reads configuration XML files in the usual way just by invoking the Java Configuration code. If this library is not a drop-in replacement for libhdfs.so, most projects simply will not be able to use it. The other reason for supporting config file parsing is that, well, you need some way of configuring the client! If we end up re-inventing the configuration wheel in a different way, that will not be good for anyone.

          Some clients may not want to read XML files, but simply set all the configuration keys themselves. That's fine, and we can support this. We can even make the XML-reading code optional if you want.
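
          For instance, with the builder interface already in libhdfs's hdfs.h (which a drop-in replacement would presumably keep), a client can skip the XML files and set keys in code. A minimal sketch; the hostname and key values are placeholders:

              #include <stdio.h>
              #include "hdfs.h"

              int main(void)
              {
                  struct hdfsBuilder *bld = hdfsNewBuilder();

                  /* Point at the namenode and set individual keys programmatically,
                   * instead of loading core-site.xml / hdfs-site.xml. */
                  hdfsBuilderSetNameNode(bld, "namenode.example.com");   /* placeholder host */
                  hdfsBuilderConfSetStr(bld, "dfs.client.read.shortcircuit", "false");

                  /* hdfsBuilderConnect frees the builder whether or not it succeeds. */
                  hdfsFS fs = hdfsBuilderConnect(bld);
                  if (!fs) {
                      fprintf(stderr, "connect failed\n");
                      return 1;
                  }
                  hdfsDisconnect(fs);
                  return 0;
              }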

          Thread pools and async I/O, I'm afraid, are something we can't live without. The HDFS client needs to do certain operations in the background. If you study the existing DFSOutputStream code, you'll see that the DFSOutputStream does transfers in the background while the client continues to fill a buffer. This is essential to get good performance, since otherwise we'd have to stop and wait for the packet to be written to all 3 datanodes in the pipeline every time our 64kb chunk filled up. Take a look at the existing HDFS client code to get a sense for what a native client would be like.

          Binglin Chang added a comment -

          We can even make the XML-reading code optional if you want.

          Sure; for compatibility I guess adding XML support is fine. But to keep strict compatibility we may need to support all of the javax XML / Hadoop config features, and I'm not sure libexpat/libxml2 support all of those; a lot of effort may be spent on this, so I think it is better to make it optional and do it later.

          Thread pools and async I/O, I'm afraid, are something we can't live without.

          I also prefer to use async I/O and threads for performance reasons; the code I published on GitHub already has a working HDFS client with read/write, and its HDFS output stream uses an additional thread.
          What I was saying is that the use of extra threads should be limited. In the Java client, simply reading/writing an HDFS file uses too many threads (RPC socket read/write, data transfer socket read/write, other misc executors, the lease renewer, etc.). Since we use async I/O, the thread count should be sharply reduced.

          Zhanwei Wang added a comment -

          Hi all

          I have open-sourced libhdfs3, which is a native C/C++ client developed by Pivotal and used in HAWQ. See HDFS-6994.

          Thanh Do added a comment -

          Colin, this is great stuff. Thanks for doing this!

          Colin Patrick McCabe added a comment -

          Thanks, Thanh Do. But there are some equally important people here: thank Zhanwei Wang for his contributions, and Abraham Elmahrek and all the other people who have reviewed things!


            People

            • Assignee: Colin Patrick McCabe
            • Reporter: Binglin Chang
            • Votes: 0
            • Watchers: 43
