Hadoop HDFS / HDFS-2631

Add optional libwebhdfs support to fuse-dfs

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: fuse-dfs, webhdfs
    • Labels:
      None

      Description

      We should port the implementation of fuse-dfs to use the webhdfs protocol. This has a number of benefits:

      • Compatibility - allows a single fuse client to work across server versions
      • Works with both WebHDFS and Hoop since they are protocol compatible
      • Removes the overhead related to libhdfs (forking a jvm)
      • Makes it easier to support features like security
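For readers unfamiliar with the protocol, the sketch below shows how a client might form WebHDFS REST requests. It is only an illustration and is not taken from any patch on this issue. The `/webhdfs/v1` prefix and the operation names follow the WebHDFS REST API; the helper name `webhdfs_url`, the `HTTP_METHOD` table, the host name, and the port are assumptions made for the example.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL for one file system operation (hypothetical helper)."""
    if not path.startswith("/"):
        raise ValueError("path must be absolute")
    # "op" selects the operation; extra parameters (offset, length, ...) follow it.
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# HTTP method used by a few common WebHDFS operations.
HTTP_METHOD = {
    "OPEN": "GET",
    "GETFILESTATUS": "GET",
    "LISTSTATUS": "GET",
    "CREATE": "PUT",
    "MKDIRS": "PUT",
    "RENAME": "PUT",
    "APPEND": "POST",
    "DELETE": "DELETE",
}

if __name__ == "__main__":
    # Host and port are placeholders, not values from this issue.
    url = webhdfs_url("namenode.example.com", 50070, "/user/eli", "LISTSTATUS")
    print(HTTP_METHOD["LISTSTATUS"], url)
```

A FUSE client built this way would issue one such HTTP request per file system call, which is the source of both the portability benefit and the performance concern raised in the comments.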

      Attachments

      1. HDFS-2631.1.patch (1.37 MB, Jaimin D Jetly)
      2. HDFS-2631.patch (1.37 MB, Jaimin D Jetly)

          Activity

          Tsz Wo Nicholas Sze added a comment -

          Eli, this is a great idea!

          Todd Lipcon added a comment -

          I'm a little confused: why is this a good idea? Seems like it's likely to end up much slower than the current implementation. I'd prefer it as another option, rather than a "rewrite".

          Eli Collins added a comment -

          I think it's probably best to write a libhdfs-compatible library that uses webhdfs (HDFS-2656), keep the existing fuse-dfs code as is, and have the ability to swap in the new library.

          Todd Lipcon added a comment -

          That seems reasonable. I think it's a given that we need to keep the original libhdfs for performance. Having a libhdfs-alike that goes over HTTP seems reasonable enough, but it is not always preferable. To speak to each of the original points:

          "Compatibility - allows a single fuse client to work across server versions"

          We need to address compatibility for clients in general. Our Java client (and hence libhdfs) needs this just as much as fuse.

          "Works with both WebHDFS and Hoop since they are protocol compatible"

          I guess this is an advantage, but given that libhdfs already wraps arbitrary hadoop filesystems, we already have this capability.

          "Removes the overhead related to libhdfs (forking a jvm)"

          fuse is a long-running client, so the fork overhead seems minimal. Recent improvements in libhdfs have also cut out most of the copying overhead.

          "Makes it easier to support features like security"

          Perhaps - but libhdfs needs security anyway, so I don't think it buys us much.

          Jaimin D Jetly added a comment -

          This is an initial patch for the fuse-webhdfs contrib project. Work is in progress.
          You can review the code and read the README file for installation instructions.
          The project supports all directory-level commands and a limited set of file-level commands: read (more), write (redirection >), append (>>), and find.

          Other functions that could be implemented in the future are statfs, symlink, and access.

          For now, one needs to manually download the libcurl and libjson libraries and run configure and make (see the README). I will automate this through ant in the next patch.
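As an illustration of why such a client needs both an HTTP library and a JSON library, the sketch below translates a WebHDFS GETFILESTATUS response into the kind of fields a FUSE getattr callback has to fill in. It is not code from the patch (which is written in C); the function name `file_status_to_attrs` and the sample values are assumptions, and treating every non-directory as a regular file is a simplification.

```python
import json
import stat


def file_status_to_attrs(body):
    """Translate a WebHDFS GETFILESTATUS JSON body into stat-like fields (hypothetical helper)."""
    status = json.loads(body)["FileStatus"]
    # Simplification: anything that is not a directory is treated as a regular file.
    kind = stat.S_IFDIR if status["type"] == "DIRECTORY" else stat.S_IFREG
    return {
        # "permission" is an octal string such as "755".
        "st_mode": kind | int(status["permission"], 8),
        "st_size": status["length"],
        # WebHDFS reports times in milliseconds; stat uses seconds.
        "st_mtime": status["modificationTime"] // 1000,
    }


# Sample response body with made-up values.
SAMPLE = (
    '{"FileStatus": {"type": "DIRECTORY", "permission": "755", "length": 0, '
    '"modificationTime": 1320173277227, "owner": "eli", "group": "supergroup", '
    '"pathSuffix": ""}}'
)

if __name__ == "__main__":
    print(file_status_to_attrs(SAMPLE))
```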

          Jaimin D Jetly added a comment -

          This is the second, updated patch. I have replaced the json-c library with the Jansson library and fixed minor bugs. For installation, compilation, and execution of the project, refer to the README file.

          Colin Patrick McCabe added a comment -

          Hi Jaimin,

          It's great that you're working on this.

          I think it would be best if you kept the existing libhdfs API. That way, users can easily switch back and forth between the JNI based libhdfs and your webhdfs-based libhdfs. If you do not do this, all applications will have to be rewritten, which may limit the number of people who can use your work.

          In a similar vein, I think you should avoid changing fuse-dfs in this patch (it would definitely make it a lot smaller). And if you implement the existing API, then obviously there's no reason to modify FUSE at all.

          Finally, we're using CMake now so you should update your patch to make use of that. CMake is very straightforward. Let me know if you have any questions or if you want to see an example CMakeLists.txt.

          Eli Collins added a comment -

          Agree with Colin's suggestions. A WebHDFS-based implementation of libhdfs would be useful beyond fuse-dfs.

          Suresh Srinivas added a comment -

          Colin, I have already commented to this effect on HDFS-2656.

          Jaimin D Jetly added a comment -

          Hi Colin,
          This patch does not replace or alter fuse-dfs (which uses the JNI-based libhdfs), and it does not use the existing libhdfs API.

          The implementation in the patch uses its own API (based on the libcurl and Jansson libraries).

          On your last suggestion, I will certainly look into CMake.

          Colin Patrick McCabe added a comment -

          Hi Jaimin,

          I don't think we want to copy and paste all the fuse code to another directory just because we're relying on a different backend library. That would really increase the maintenance burden since we'd be fixing the same bugs in two places, etc.

          As Suresh said (both here and in HDFS-2656), we really do want to keep that existing API. fuse_dfs isn't the only libhdfs application out there!

          Let me know if there's anything I can do to help, whatever that may be. It would be really nice to have the option of running without a JVM in libhdfs...

          xiongwen added a comment -

          Hello Jaimin,
          Where can I download HDFS-2631.patch? Thanks for your attention!
          I plan to test the I/O performance of HDFS with filebench, including sequential read, random read, sequential write, and random write.
          I also think fuse-webhdfs may be better than fuse-dfs.

          Suresh Srinivas added a comment -

          Given that the libwebhdfs work from HDFS-2656 is committed, I am changing the title of this jira.

          Eli Collins added a comment -

          Isn't the plan to make libwebhdfs compatible with libhdfs and then fuse-dfs can work with either libhdfs or libwebhdfs? I think we should keep the default to libhdfs since it's more stable and we've got a lot of users on it.

          Todd Lipcon added a comment -

          I still think this should be an additional choice but not phrased as a "rewrite" of the original fuse-dfs.

          Colin Patrick McCabe added a comment -

          Yeah. The goal should be to have a fuse_dfs binary that can work with either libwebhdfs.so or libhdfs.so. The interfaces are exactly the same and there should be no need for recompilation.

          Eli Collins added a comment -

          Updated the subject.

          @Jing, mind updating the affects and target version for this jira? One of the things that made HDFS-2656 hard to follow is that no target version was set so it was hard to see what release the change was intended for. If you'd like to shoot for v2 I'd set the affects version to 2.0 and the target version to 2.0.3.

          Thanks,
          Eli

          Suresh Srinivas added a comment -

          Eli, thanks for updating the summary.

          "One of the things that made HDFS-2656 hard to follow is that no target version was set so it was hard to see what release the change was intended for."

          Not sure I understand this comment and why it was hard to follow. When nothing is provided in these fields, is it not intended for trunk?

          Jing Zhao added a comment -

          Thanks for the update, Eli! Since I am only targeting trunk for this feature right now, maybe we can leave the target version blank for the time being.

          Eli Collins added a comment -

          "Not sure I understand this comment and why it was hard to follow. When nothing is provided in these fields is it not intended for trunk?"

          It's not clear. Often having it unset means people don't know what to set it to, and trunk issues are set to 3.0. For this jira the affects version was set to 3.0 but without a target version, then the affects version was unset, then the fix version was set to 2.0.3-alpha when it was committed, then the fix version was changed to 3.0 today, so it was hard to see where it was going.

          Suresh Srinivas added a comment -

          "It's not clear, often having it unset means people don't know what to set it to, and trunk issues are set to 3.0. For this jira the affects version was set to 3.0 but w/o a target version, then the affects version was unset, then the fix version was set to 2.0.3-alpha when it was committed, then the fix version was changed 3.0 today, so it was hard to see where it was going."

          If you mean it is hard to understand what the intended target is, I sort of see that. But following the changes happening in jira etc. should not be an issue.

          Eli Collins added a comment -

          Yea, that's why we use the target version.


            People

            • Assignee: Jing Zhao
            • Reporter: Eli Collins
            • Votes: 0
            • Watchers: 16
