Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: nfs
    • Labels: None

      Description

      Accessing HDFS is usually done through the HDFS client or WebHDFS. The lack of seamless integration with the client's file system makes it difficult for users, and impossible for some applications, to access HDFS. NFS interface support is one way for HDFS to provide such easy integration.

      This JIRA is to track NFS protocol support for accessing HDFS. With the HDFS client, WebHDFS and the NFS interface, HDFS will be easier to access and able to support more applications and use cases.

      We will upload the design document and the initial implementation.

      1. HADOOP-NFS-Proposal.pdf (221 kB, Brandon Li)
      2. HDFS-4750.patch (371 kB, Brandon Li)
      3. nfs-trunk.patch (368 kB, Brandon Li)

        Issue Links

        1. Provide HDFS based NFSv3 and Mountd implementation (Sub-task, Closed, Brandon Li)
        2. Add script changes/utility for starting NFS gateway (Sub-task, Closed, Brandon Li)
        3. Add NFS server export table to control export by hostname or IP range (Sub-task, Closed, Jing Zhao)
        4. Use enum for nfs constants (Sub-task, Closed, Tsz Wo Nicholas Sze)
        5. Move IO operations out of locking in OpenFileCtx (Sub-task, Closed, Jing Zhao)
        6. Support symlink operations (Sub-task, Closed, Brandon Li)
        7. Include hadoop-nfs and hadoop-hdfs-nfs into hadoop dist for NFS deployment (Sub-task, Closed, Brandon Li)
        8. Change hdfs-nfs parent project to hadoop-project (Sub-task, Closed, Brandon Li)
        9. Support file append in NFSv3 gateway to enable data streaming to HDFS (Sub-task, Closed, Brandon Li)
        10. Add namespace ID and snapshot ID into fileHandle to support Federation and Snapshot (Sub-task, Open, Unassigned)
        11. Refactor o.a.h.nfs to support different types of authentications (Sub-task, Closed, Jing Zhao)
        12. Support dotdot name in NFS LOOKUP operation (Sub-task, Closed, Brandon Li)
        13. Fix array copy error in Readdir and Readdirplus responses (Sub-task, Closed, Brandon Li)
        14. Change FSDataOutputStream to HdfsDataOutputStream for opened streams to fix type cast error (Sub-task, Closed, Brandon Li)
        15. Create a test framework to enable NFS end to end unit test (Sub-task, Open, Unassigned)
        16. MNT EXPORT should give the full group list which can mount the exports (Sub-task, Closed, Brandon Li)
        17. Improve WriteManager for processing stable write requests and commit requests (Sub-task, Resolved, Jing Zhao)
        18. NFS should create input stream for a file and try to share it with multiple read requests (Sub-task, Closed, Haohui Mai)
        19. Handle race condition for writes (Sub-task, Resolved, Brandon Li)
        20. Add more debug trace for NFS READ and WRITE (Sub-task, Closed, Brandon Li)
        21. Include NFS jars in the maven assembly (Sub-task, Resolved, Unassigned)
        22. Refactor RpcMessage and NFS3Response to support different types of authentication information (Sub-task, Closed, Jing Zhao)
        23. Introduce RpcInfo to decouple XDR classes from the RPC API (Sub-task, Resolved, Haohui Mai)
        24. Move RpcFrameDecoder out of the public API (Sub-task, Closed, Haohui Mai)
        25. Remove excessive copying due to XDR (Sub-task, Resolved, Haohui Mai)
        26. Fix dumper thread which may die silently (Sub-task, Closed, Brandon Li)
        27. Stable write is not handled correctly in someplace (Sub-task, Closed, Brandon Li)
        28. Support client which combines appended data with old data before sends it to NFS server (Sub-task, Resolved, Brandon Li)
        29. COMMIT request should not block (Sub-task, Closed, Brandon Li)
        30. Make Hadoop nfs server port and mount daemon port configurable (Sub-task, Resolved, Jinghui Wang)
        31. Close idle connections in portmap (Sub-task, Closed, Haohui Mai)
        32. fix readdir and readdirplus for large directories (Sub-task, Closed, Brandon Li)
        33. should do hsync for a commit request even there is no pending writes (Sub-task, Closed, Brandon Li)
        34. add HDFS NFS user guide (Sub-task, Closed, Brandon Li)
        35. Add OpenFileCtx cache (Sub-task, Closed, Brandon Li)
        36. Add more unit tests for the inputstream cache (Sub-task, Open, Unassigned)
        37. Add more unit tests for the data streaming (Sub-task, Open, Unassigned)
        38. COMMIT handler should update the commit status after sync (Sub-task, Closed, Brandon Li)

          Activity

          Brandon Li added a comment -

          Uploaded the initial design doc.

          Brock Noland added a comment -

          Hi Brandon,

          Great to see this proposal! I am fine with using a new JIRA for this, but if we do so, should HDFS-252 be closed as a duplicate? As you know, I created an Apache-licensed NFS4 proxy for HDFS (https://github.com/cloudera/hdfs-nfs-proxy). I have a couple of questions/comments:

          The proposal says that waiting "10 milliseconds" should be able to convert most writes to sequential writes. I am curious whether this has been tested under load on modern kernels. The reason I ask is that I found the NFS4 proxy often has to wait much longer than 10 milliseconds to receive the prerequisite writes. It's possible that behavior is NFS4 only.

          Before implementing the NFS4 proxy I implemented an NFS3 proxy as you propose. Unfortunately I deleted the git repo when I became frustrated with the mismatch between NFS3 and HDFS semantics. If I remember correctly, one example was that when I had a small file, a small append resulted in a write of the entire file. I cannot remember exactly how it behaved with larger files. Have you encountered this? If so, how will it be handled?

          Another problem I ran into was that since NFS3 doesn't have a close, I was never sure when to close the HDFS file handle. I see that you plan to handle this by closing idle file handles. I thought about this approach as well, but my concern was that it will often be the case that there is data which has not been "synced" to HDFS when the native program has closed the file. Therefore there are races with other clients being able to see that data. I am not 100% up to date on when a file length is updated in HDFS, but I believe there is a similar issue with the length metadata as well. How will this be handled?

          Once again, great work on the proposal!

          Cheers,
          Brock

          Brandon Li added a comment -

          Hi Brock, I was about to send you the link to this JIRA. Glad you already noticed it.
          I've resolved HDFS-252 as a dup.
          I studied your NFSv4 implementation and learned a lot from it. Thank you!

          "10 milliseconds" is the time from the reference paper. In the initial implementation, we used 10 seconds just to be on the safe side.

          Regarding small file append, the write starts from the correct offset in the tests I observed. For example, when I tried "echo abcd >> /mnt_test/file_with_5bytes", the write request started at offset 5. In the initial file loading tests with Linux/Mac clients, we haven't encountered the problem you mentioned so far.

          For the second question, as long as the second user uses the NFS gateway to read the closed file, the second user should be able to get the data buffered in the NFS gateway. For opened files, the NFS gateway also saves their latest file size. When it serves a getattr request, it gets the file attributes from HDFS and then updates the file length based on the cached length.

          Thanks!
          Brandon
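
          The following is a minimal sketch of the getattr handling described above. The openFileSizeCache map is a hypothetical stand-in for the gateway's real bookkeeping of open streams, not the actual implementation; it overlays the cached length of a still-open file on the attributes fetched from HDFS.

            import java.io.IOException;
            import java.util.Map;
            import java.util.concurrent.ConcurrentHashMap;

            import org.apache.hadoop.fs.FileStatus;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;

            class CachedGetattr {
              private final FileSystem hdfs;
              // path -> latest size of files still open for write through the gateway
              private final Map<String, Long> openFileSizeCache = new ConcurrentHashMap<String, Long>();

              CachedGetattr(FileSystem hdfs) {
                this.hdfs = hdfs;
              }

              // Record the size the gateway has buffered/streamed so far for an open file.
              void updateCachedSize(Path path, long size) {
                openFileSizeCache.put(path.toString(), size);
              }

              // Length to report for a getattr request: prefer the cached size when it is
              // ahead of what the NameNode currently reports.
              long lengthForGetattr(Path path) throws IOException {
                FileStatus status = hdfs.getFileStatus(path);
                Long cached = openFileSizeCache.get(path.toString());
                return (cached != null && cached > status.getLen()) ? cached : status.getLen();
              }
            }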

          Todd Lipcon added a comment -

          For the second question, as long as the second user uses NFS gateway to read the closed file, the second user should be able to get the data buffered in NFS gateway

          This precludes having multiple NFS gateways in operation simultaneously for increased throughput, right? I'd imagine the typical deployment scenario would be to run an NFS gateway on every machine and then have them mount localhost in order to avoid a bottleneck scenario. Even in a data loading situation, I'd expect a set of several "gateway nodes" to be used in round-robin in order to increase ingest throughput beyond what a single host can handle.

          Brandon Li added a comment -

          This precludes having multiple NFS gateways in operation simultaneously for increased throughput, right?

          Not necessarily; it depends on the workload and the application requirements.

          Even a regular NFS server mounted by multiple clients can have the same issue. One way to synchronize a client-B-read-after-client-A-write is to use the NFS Lock Manager (NLM) protocol, along with the Network Status Monitor (NSM) protocol. For the first phase, that seems a bit of overkill for the use cases we want to support.

          Even in a data loading situation, I'd expect a set of several "gateway nodes" to be used in round-robin in order to increase ingest throughput beyond what a single host can handle.

          What I want to point out here, as also noted in the proposal, is that one benefit of NFS support is to make it easier to integrate HDFS into the client's file system namespace. The performance of the NFS gateway is usually lower than using DFSClient directly.

          Loading files through the NFS gateway can be faster than DFSClient only in a few cases, such as unstable writes that are not immediately followed by a commit.

          That said, its performance can be improved in the future in a few ways, such as better caching and pNFS support.

          Brock Noland added a comment -

          Hi Brandon,

          Thank you for the quick response!

          "10 milliseconds" is the time from the reference paper. In the initial implementation, we used 10 seconds just to be on the safe side.

          What happens if the 10 seconds expires and the prerequisite write has not been received? The biggest issue I had when moving the proxy from basically working to handling multiple heavy use clients was memory consumption while waiting for pre-requisite writes. I eventually had to write pending writes to a file. This is documented in this issue https://github.com/brockn/hdfs-nfs-proxy/issues/7

          Regarding small file append, the write starts from the correct offset in the tests I observed. For example, when I tried "echo abcd >> /mnt_test/file_with_5bytes", the write request started at offset 5. In the initial file loading tests with Linux/Mac clients, we haven't encountered the problem you mentioned so far.

          Interesting, what version of Linux have you tried? I believe I was using RHEL 5.X.

          For the second question, as long as the second user uses the NFS gateway to read the closed file, the second user should be able to get the data buffered in the NFS gateway. For opened files, the NFS gateway also saves their latest file size. When it serves a getattr request, it gets the file attributes from HDFS and then updates the file length based on the cached length.

          Cool, my question was more around how we are going to make our users aware of this limitation. I could imagine many users believing that once they have closed a file via NFS, that exact file will be available via one of the other APIs. We'll need to make this limitation blatantly obvious to users, otherwise it will likely become a support headache.

          Additionally, is there anything the user can do to force the writes? i.e. If the user has control over the program, could they do a fsync(fd) to force the flush?

          Cheers,
          Brock

          Daryn Sharp added a comment -

          This part seems a bit worrisome:

          The solution is to close the stream after it’s idle(no write) for a certain period(e.g., 10 seconds). The subsequent write will become append and open the stream again.

          This is very semantically wrong. If another client appended to the file in the interim, the file position should not implicitly move to the end of the file. Assuming the proposed approach is otherwise valid: when the client attempts to write again via append, it should throw an exception if the file size is greater than the client's current position in the stream. Even that breaks POSIX semantics, but it's "less wrong" by not causing the potential for garbled data.
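
          A rough sketch of the check Daryn suggests, assuming the gateway tracks the offset the client last wrote to; the class and method names below are illustrative, not the actual gateway code.

            import java.io.IOException;

            import org.apache.hadoop.fs.FSDataOutputStream;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;

            class IdleReopenCheck {
              // Reopen an idle-closed file for append, but refuse if the file grew past the
              // offset this client last wrote to (i.e., someone else appended in between).
              static FSDataOutputStream reopenForAppend(FileSystem hdfs, Path path, long clientOffset)
                  throws IOException {
                long currentLen = hdfs.getFileStatus(path).getLen();
                if (currentLen > clientOffset) {
                  throw new IOException("Stale offset " + clientOffset + " for " + path
                      + ": file length is now " + currentLen);
                }
                return hdfs.append(path);
              }
            }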

          Allen Wittenauer added a comment -

          What are the plans around RPCSEC and GSSAPI mapping capabilities? While I recognize that these are optional to the NFSv3 specs, a lot of folks need them in order to use this feature. It is probably also worth pointing out that NFSv4 and higher fix this mistake and require the security pieces to be there in order to be RFC compliant. So if we want to implement pNFS, we have to do this work anyway.

          Brock Noland added a comment -

          I didn't spend too much time looking at NFSv3 security but FWIW the NFS4 proxy supports Kerberos in privacy mode. This code might be of use.

          Brandon Li added a comment -

          What are the plans around RPCSEC and GSSAPI mapping capabilities?

          The initial implementation doesn't have it, but I agree it would be nice to support it sooner rather than later.

          ...This code might be of use.

          Sounds like a plan

          Brandon Li added a comment -

          This is very semantically wrong. If another client appended to the file in the interim, the file position should not implicitly move to the end of the file.

          When the stream is closed, the file size is updated in HDFS. Before it's closed, the same client still holds the lease.

          Assuming the proposed approach is otherwise valid: when the client attempts to write again via append, it should throw an exception if the file size is greater than the client's current position in the stream. Even that breaks POSIX semantics, but it's "less wrong" by not causing the potential for garbled data.

          If the file is appended to by another client, the first client's new write before the file's <EOF> becomes a random write and would fail with an exception. What breaks POSIX semantics here is that random writes are not supported.

          Daryn Sharp added a comment -

          If the file is appended to by another client, the first client's new write before the file's <EOF> becomes a random write and would fail with an exception. What breaks POSIX semantics here is that random writes are not supported.

          Ok, we're in 100% agreement. The doc is just ambiguous.

          Brandon Li added a comment -

          @Brock

          What happens if the 10 seconds expires and the prerequisite write has not been received? The biggest issue I had when moving the proxy from basically working to handling multiple heavy use clients was memory consumption while waiting for pre-requisite writes. I eventually had to write pending writes to a file. This is documented in this issue https://github.com/brockn/hdfs-nfs-proxy/issues/7

          The pending write requests will fail after the timeout. Saving pending writes in files can help in some cases, but it also introduces some problems. First, it doesn't eliminate the problem: the prerequisite write may never arrive if 10 seconds is not long enough, and even if the prerequisite write finally arrives, the accumulated writes in the file may have timed out. Secondly, it makes the server stateful (or gives it more state information). To support HA later, we would have to move the state information from one NFS gateway to another in order to recover. If the state recovery takes too long to finish, it can cause the clients' new requests to fail.
          More testing and research work is needed here.

          Interesting, what version of Linux have you tried? I believe I was using RHEL 5.X.

          CentOS 6.3 and Mac OS X 10.7.5.

          Additionally, is there anything the user can do to force the writes? i.e. If the user has control over the program, could they do a fsync(fd) to force the flush?

          fsync could trigger an NFS COMMIT, which will sync and persist the data.

          I could imagine many users believing that once they have closed a file via NFS, that exact file will be available via one of the other APIs. We'll need to make this limitation blatantly obvious to users, otherwise it will likely become a support headache.

          If the application expects that, after closing a file through one NFS gateway, the new data is immediately available to all other NFS gateways, the application should do a sync call after close.

          This is not a limitation specific to this NFS implementation. POSIX close doesn't sync data implicitly.
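
          As a concrete example of the sync call mentioned above, a client program writing through the NFS mount can fsync before closing. This is only a sketch; /mnt_hdfs is a hypothetical mount point.

            import java.io.FileOutputStream;
            import java.io.IOException;
            import java.nio.charset.StandardCharsets;

            public class WriteAndSync {
              public static void main(String[] args) throws IOException {
                FileOutputStream out = new FileOutputStream("/mnt_hdfs/data/part-0001");
                try {
                  out.write("some record\n".getBytes(StandardCharsets.UTF_8));
                  // fsync(2) on the NFS client turns into an NFS COMMIT, so the data is
                  // persisted by the gateway before the stream is closed.
                  out.getFD().sync();
                } finally {
                  out.close();
                }
              }
            }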

          Todd Lipcon added a comment -

          If the application expects that, after closing a file through one NFS gateway, the new data is immediately available to all other NFS gateways, the application should do a sync call after close.
          This is not a limitation specific to this NFS implementation. POSIX close doesn't sync data implicitly.

          I don't think this is right. POSIX doesn't ensure that close() syncs data (makes it durable), but NFS does require that close() makes it visible to other clients (so-called "close-to-open consistency"):

          The NFS standard requires clients to maintain close-to-open cache coherency when multiple clients access the same files [5, 6, 10]. This means flushing all file data and metadata changes when a client closes a file, and immediately and unconditionally retrieving a file's attributes when it is opened via the open() system call API. In this way, changes made by one client appear as soon as a file is opened on any other client.

          (from http://www.citi.umich.edu/projects/nfs-perf/results/cel/dnlc.html)

          Hari Mankude added a comment -

          Implementing writes might not be easy. The client implementations in various kernels do not guarantee that writes are issued in sequential order. Page flushing algorithms try to find contiguous pages (offsets), but there are other factors in play as well, so the writes coming from the client are not necessarily sequential, as HDFS requires them to be. This is true whether the writes are coming in lazily from the client or due to a sync() before close(). A possible solution is for the NFS gateway on the DFS client to cache and reorder the writes to be sequential. But this might still result in "holes" which HDFS cannot handle. Also, the cache requirements might not be trivial and might require a flush to local disk.

          NFS interfaces are very useful for reads.

          Brandon Li added a comment -

          ...but NFS does require that close() makes it visible to other clients (so-called "close-to-open consistency")

          The protocol provides no facility to guarantee that cached data is consistent with the data on the server, but "close-to-open consistency" is recommended for implementations.

          Brandon Li added a comment -

          Hi folks,

          I plan to split and upload the initial implementation to 4 JIRAs (HADOOP-9509, HADOOP-9515, HDFS-4762, HDFS-4763). These changes are independent of the current Hadoop code base. But if it's preferred to do the change in a different branch, please let me know.

          Thanks,
          Brandon

          Suresh Srinivas added a comment -

          But, if it's preferred to do the change in a different branch, please let me know.

          Since these changes do not render trunk unstable, I am okay with not having a branch for this development. If I do not hear a differing opinion, I will start reviewing and merging this patch next week.

          Hari Mankude added a comment -

          I would recommend thinking through NFS write operations. The client does caching, and the page cache can result in lots of weirdness. For example, as long as the data is cached in the client's page cache, the client can do random writes and overwrites. When the page cache is flushed to the HDFS data store, some writes would fail (those that translate to overwrites in HDFS) while others might succeed (those whose offsets happen to be appends).

          An alternative to consider for supporting NFS writes is to require clients to do NFS mounts with directio enabled. Directio will bypass the client cache and might alleviate some of the funky behavior.

          Todd Lipcon added a comment -

          Looking at some of the patches that have been posted, it appears that this project is entirely new/separate code from the rest of Hadoop. What is the purpose of putting it in Hadoop proper rather than proposing it as a separate project (e.g., in the incubator)? Bundling it with Hadoop has the downside that it makes our releases even bigger, whereas the general feeling of late has been that we should try to keep things out of 'core' (e.g., we removed a bunch of former contrib projects).

          Brock Noland added a comment -

          Hari,

          Yes, this is a major concern. Any clients will have to be well behaved and explicitly not perform random writes on the client side. With NFS4, as long as the client application is not performing random writes, the Linux NFS4 client does not appear to attempt any random writes. As I mentioned earlier, I thought I saw the Linux NFS3 client perform random writes for a sequentially writing client. Hopefully I was mistaken.

          In regard to handling the normal write reordering which occurs under both NFS3 and 4, the approach I took in the NFS4 proxy was to buffer any non-sequential writes until I could write them sequentially. As you said previously, this can lead to unbounded memory consumption. Therefore, in the NFS4 proxy, if a write request is more than 1 MB away from the current file offset, it's written out to a log file and the in-memory object simply stores the file name, offset, and length. I've found this method to work quite well (see the sketch below).

          Brock
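
          A simplified sketch of the buffering strategy Brock describes; the 1 MB threshold is taken from his comment, while the class and field names are made up for illustration.

            import java.io.IOException;
            import java.io.RandomAccessFile;
            import java.util.Map;
            import java.util.TreeMap;

            class PendingWriteBuffer {
              private static final long SPILL_THRESHOLD = 1L << 20;       // 1 MB ahead of the current offset
              private final Map<Long, byte[]> inMemory = new TreeMap<Long, byte[]>();  // file offset -> data
              private final Map<Long, long[]> spilled = new TreeMap<Long, long[]>();   // file offset -> {spill position, length}
              private final RandomAccessFile spillFile;

              PendingWriteBuffer(String spillPath) throws IOException {
                this.spillFile = new RandomAccessFile(spillPath, "rw");
              }

              // Buffer an out-of-order WRITE until its offset becomes sequential.
              synchronized void add(long fileOffset, byte[] data, long currentFileOffset) throws IOException {
                if (fileOffset - currentFileOffset > SPILL_THRESHOLD) {
                  long pos = spillFile.length();
                  spillFile.seek(pos);
                  spillFile.write(data);                                    // spill the bytes to local disk
                  spilled.put(fileOffset, new long[] { pos, data.length }); // keep only bookkeeping in memory
                } else {
                  inMemory.put(fileOffset, data);                           // close enough: keep it in memory
                }
              }
            }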

          Allen Wittenauer added a comment -

          Have we run any of these against SPEC SFS? What does iozone do with this? Any clients besides Linux and Mac OS X? (FWIW: OS X's NFS client has always been a bit flaky...) Have we thought about YANFS support?

          Andrew Purtell added a comment -

          What does iozone do with this?

          This is a great question. Or fio, or another of the usual.

          Brandon Li added a comment -

          @Allen,Andrew

          This is a great question. Or fio, or another of the usual.

          The code is in its very early stage and we have done little performance testing. We did some tests with Cthon04 and NFStest (from NetApp). We will do some performance tests once the code is relatively stable.

          Andrew Purtell added a comment -

          We will do some performance tests once the code is relatively stable.

          Would be happy to help with that when you think the code is ready.

          Han Xiao added a comment -

          This feature is interesting.
          However, I'm sorry for not having understood the description in the PDF that "Hadoop+FUSE could be used to provide an NFS interface for HDFS. However it has many known problems and limitations."
          Could this be explained in more detail, or could a link be provided? We want to use a function like this to do backup tasks with normal backup software, so this information is very important to us.
          Thank you.

          Brandon Li added a comment -

          Uploaded the patch.
          Before splitting it into a few JIRAs, I have temporarily put the NFS implementation only under hdfs to make one patch. The test classes are not included.

          Some subsequent JIRAs will be filed later to address security, stability and other issues.

          To do some tests with the current code, make sure to stop the NFS service provided by the platform and keep rpcbind (or portmap) running, then:
          1. Start HDFS.
          2. Start the NFS gateway using "hadoop nfs3". The NFS gateway has both mountd and nfsd. It has one export, the HDFS root "/", rw to everyone.
          3. Mount the export on the client, using options such as "-o soft,vers=3,proto=tcp,nolock". Make sure the users on the client and server hosts are in sync, since the NFS gateway uses AUTH_SYS authentication.

          Brandon Li added a comment -

          @Andrew

          Would be happy to help with that when you think the code is ready.


          Thanks! I only did the cthon04 basic test (no symlink) with the uploaded patch on CentOS. Please feel free to give it a try.

          Brandon Li added a comment -

          @Allen

          Any clients besides Linux and Mac OS X? (FWIW: OS X's NFS client has always been a bit flaky...) Have we thought about YANFS support?


          Weeks ago, we did some manual tests with the Windows NFSv3 client before we changed the RPC authentication support from AUTH_NULL to AUTH_SYS. We didn't try it again after the change. Mapping Windows users to Unix users may be needed to test it again.

          We looked at a few other NFS implementations. Eventually we decided to implement one ourselves. The major reason is that the NFS gateway has to work around a few HDFS limitations and is also tightly coupled with HDFS protocols.

          Brandon Li added a comment -

          @Todd

          What is the purpose of putting it in Hadoop proper rather than proposing it as a separate project (eg in the incubator)?


          What we were thinking was that, as mentioned above, the NFS gateway is tightly coupled with HDFS protocols, and the current code is still kept at a small size. Also, some of the code is general enough (e.g., the oncrpc implementation) to be used by other possible projects.

          Brandon Li added a comment -

          @Hari

          An alternative to consider to support NFS writes is to require clients do NFS mounts with directio enabled. Directio will bypass client cache and might alleviate some of the funky behavior.


          Yes, directIO could help reduce kernel-reordered writes. Solaris supports it with the forcedirectio mount option; Linux does not seem to have a corresponding mount option.

          Brandon Li added a comment -

          @Han

          ..not having understood the description in the pdf that "Hadoop+FUSE could be used to provide an NFS interface for HDFS. However it has many known problems and limitations."


          For example, FUSE is not inode based like NFS. FUSE usually uses the path to generate the NFS file handle, and its path-handle mapping can make the host run out of memory. Even if it can work around the memory problem, it can have correctness issues: FUSE may not be aware that a file's path has been changed by other means (e.g., the hadoop CLI). Also, if FUSE is used on the client side, each NFS client has to install a client component, which so far runs only on Linux. ...

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12581119/nfs-trunk.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 javac. The applied patch generated 1371 javac compiler warnings (more than the trunk's current 1366 warnings).

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 19 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4343//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/4343//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
          Javac warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/4343//artifact/trunk/patchprocess/diffJavacWarnings.txt
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4343//console

          This message is automatically generated.

          Andrew Purtell added a comment -

          Brandon Li

          Would be happy to help with that when you think the code is ready.

          Thanks! I only did the cthon04 basic test (no symlink) with the uploaded patch on CentOS. Please feel free to give it a try.

          We had reserved some time this week over here to give this a spin, but unfortunately something has come up. Maybe there will still be an opportunity to be helpful down the road. Apologies.

          Brandon Li added a comment -

          No problem, Andrew.
          We will use the time to fix more bugs and get some review done first.

          Brandon Li added a comment -

          Updated the patch. Changed the hadoop script to be able to start the package-included portmap, in case NFS can't register with the rpcbind/portmap provided by the platform due to security level requirements.

          Brock Noland added a comment -

          Brandon,

          I took a quick look.

          1) I see some e.printStackTrace(); calls placed directly alongside LOG.level() calls.
          2) I see a good number of LOG.level(msg + e); calls, which eat the exception's stack trace (see the sketch after this list).
          3) I don't see any concept of controlling export by hostname or IP range. FWIW, that code can probably be taken directly from the NFS4 proxy.
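
          The sketch referred to in item 2: pass the Throwable as a second argument so the stack trace is kept, using commons-logging as the rest of Hadoop does. The class names here are illustrative only.

            import org.apache.commons.logging.Log;
            import org.apache.commons.logging.LogFactory;

            class RequestHandler {
              private static final Log LOG = LogFactory.getLog(RequestHandler.class);

              void handle(Runnable request) {
                try {
                  request.run();
                } catch (RuntimeException e) {
                  // Not: e.printStackTrace();  and not: LOG.error("request failed: " + e);
                  LOG.error("request failed", e);   // keeps the full stack trace in the log
                }
              }
            }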

          the NFS gateway is tightly coupled with HDFS protocols

          Can you speak to the reason you chose to use DFSClient directly as opposed to using FileSystem?

          Brock

          Brandon Li added a comment -

          Hi Brock,
          Thanks for the review!

          I don't see any concept of controlling export by hostname or ip range. FWWIW, that code can probably be taken directly from the NFS4 proxy.

          Cool, thanks!

          Can you speak to the reason you chose to use DFSClient directly as opposed to using FileSystem?

          DFSClient provides finer control of HDFS RPC parameters. Also, it could be easier to add new interfaces to DFSClient than to FileSystem in case we need some special support from HDFS for NFS. The drawback is that we have to provide for ourselves the utilities that exist in FileSystem but not in DFSClient, e.g., statistics and caching.
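
          For readers unfamiliar with the distinction, the two entry points look roughly like this. This is only a sketch: the NameNode URI is a placeholder, and DFSClient is an internal class whose constructors may differ across releases.

            import java.net.URI;

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FileStatus;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;
            import org.apache.hadoop.hdfs.DFSClient;
            import org.apache.hadoop.hdfs.protocol.HdfsFileStatus;

            public class TwoEntryPoints {
              public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                URI nn = URI.create("hdfs://namenode:8020");   // placeholder NameNode address

                // The generic FileSystem API: gets caching, statistics, etc. for free.
                FileSystem fs = FileSystem.get(nn, conf);
                FileStatus viaFs = fs.getFileStatus(new Path("/tmp"));

                // DFSClient talks to HDFS directly, with finer control over RPC behavior,
                // but without the FileSystem-level utilities.
                DFSClient client = new DFSClient(nn, conf);
                HdfsFileStatus viaClient = client.getFileInfo("/tmp");

                System.out.println(viaFs.getLen() + " " + viaClient.getLen());
              }
            }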

          Suresh Srinivas added a comment -

          Brandon Li, if this work is completed, given all the JIRAs going in, can you please merge this to branch-2.1?

          Brandon Li added a comment -

          I've merged HADOOP-9009, HADOOP-9515, HDFS-4762 and HDFS-4948 into branch-2 and branch-2.1.

          Brandon Li added a comment -

          The NFS gateway has been released with 2.2.0.
          This JIRA has continued to be used for additional bug fixes and code refactoring. There are still a few minor features to be added in the future.

          I am wondering if we should close this JIRA and open new ones for these new features, such as sub-directory mount support and Kerberos authentication.


            People

            • Assignee: Brandon Li
            • Reporter: Brandon Li
            • Votes: 0
            • Watchers: 43

              Dates

              • Created:
                Updated:
                Resolved:

                Development