Hadoop Common
HADOOP-5010

Document HTTP/HTTPS methods to read directory and file data

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 0.18.0
    • Fix Version/s: None
    • Component/s: documentation
    • Labels:
      None

      Description

      In HADOOP-1563, Doug Cutting wrote:

      The URI for this should be something like hftp://host:port/a/b/c, since, while HTTP will be used as the transport, this will not be a FileSystem for arbitrary HTTP urls.

      Recently, we've been talking about implementing an HDFS proxy (HADOOP-4575) which would be a secure way to make HFTP/HSFTP available. In so doing, we may even remove HFTP/HSFTP from being offered on the HDFS itself (that's another discussion).

      In the case of the HDFS proxy, does it make sense to do away with the artificial HFTP/HSFTP protocols, and instead simply offer standard HTTP and HTTPS? That would allow non-HDFS-specific clients, as well as using various standard HTTP infrastructure, such as load balancers, etc.

      NB, to the best of my knowledge, HFTP is only documented on the distcp page, and HSFTP is not documented at all?

      1. 5010-0.patch
        0.7 kB
        Chris Douglas

        Activity

        Eli Collins added a comment -

        HDFS proxy was removed, hftp is documented.

        Doug Cutting added a comment -

        So is the expectation that this will work for world-readable files? Should we document that? Marco, is this sufficient?

        Chris Douglas added a comment -

        > So, if this is a documentation request, then we need to decide whether we indeed want to document this HTTP-based protocol.

        IMHO, hftp should live while it is necessary for cross-version compatibility. That alone will keep it around longer than we might otherwise. This adds a line to the "web interface" section of the HDFS user guide, noting the syntax for the FileDataServlet but ignoring any permissions issues.

        Doug Cutting added a comment -

        So, if this is a documentation request, then we need to decide whether we indeed want to document this HTTP-based protocol. If we do, that will encourage folks to build other tools that use it, and we will need to support it longer-term than we might otherwise. If only GET of a file's content is required, then perhaps that's all we should document?

        Marco Nicosia added a comment -

        > Counting the listPaths servlet, there are already two interfaces for browsing HDFS over HTTP, aren't there? This seems to be asking for a way to manipulate HDFS without the Hadoop jar. If reading is sufficient, then the HFTP servlets should suffice for hand-rolled tools

        Reading is sufficient (from my original request). I didn't know that there's a combination of HTTP requests which will allow an http client to get directory listings and file data.

        Does listPaths and the .../data/... component respect the dfs.web.ugi directive? (But then, this is what HDFS proxy was invented for, so permissions should be a non-issue.) When Hadoop becomes kerberized, these servlets need to require credentials over HTTP.

        > but they need to be documented.

        Yes. Switching this bug to a documentation task.

        Kan Zhang added a comment - edited

        > Can you say more about the requirement?

        We have an internal interface by which clients can query a metadata server to get file locations and other metadata, and then use HTTPS clients like curl to retrieve the actual files from datastore servers. We want hdfsproxy to act as datastore servers. Directory listings are not required. Neither is support for browsers. Just using HTTP GET request to fetch a file. Could be a thin wrapper around existing streamFile servlet. Some convention is needed, like how to specify the cluster that the file is stored on since we want one hdfsproxy to proxy for multiple HDFS clusters.
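A minimal sketch of the kind of convention being suggested. Everything here is hypothetical: the cluster-as-first-path-component scheme, the host, and the port are invented for illustration; only the streamFile servlet name comes from the comment above, and no such convention was agreed in this thread.

```java
// Hypothetical sketch: one way an hdfsproxy URL could name both the target
// cluster and the file, so that a plain HTTPS client (e.g. curl) needs no
// Hadoop jar. The cluster-in-path convention is invented for illustration.
public class ProxyUrlSketch {
    static String fileUrl(String proxyHost, int proxyPort,
                          String cluster, String path) {
        return "https://" + proxyHost + ":" + proxyPort
             + "/streamFile/" + cluster + path;
    }

    public static void main(String[] args) {
        // A standard client would simply issue an HTTP GET against this URL.
        System.out.println(fileUrl("proxy.example.com", 8443,
                                   "clusterA", "/user/kan/file.dat"));
        // prints https://proxy.example.com:8443/streamFile/clusterA/user/kan/file.dat
    }
}
```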

        Chris Douglas added a comment -

        > HFTP is only documented on the distcp page, and HSFTP is not documented at all?

        HSFTP is the same protocol- the same server- over an SSL connector; we can speak of them interchangeably. The HFTP protocol is not documented outside of its FileSystem implementation, which should be remedied, but the premise for this issue seems ill defined.

        I don't know to what "plain", "pure", and "standard" HTTP refers in a filesystem context, if not adherence to an RFC for which there are already tools. If not WebDAV, then either some other standard must be chosen, or we define our own conventions for listing directories, writing/appending to files, deleting resources, managing permissions, etc. Unless we also want to write a client- which returns us to where we started- are there better options than picking a standard and (partially?) implementing it?

        > Here the focus seems to be on a servlet that implements the server-side of this for HDFS. That seems reasonable. It would also be browsable, which is nice.

        Counting the listPaths servlet, there are already two interfaces for browsing HDFS over HTTP, aren't there? This seems to be asking for a way to manipulate HDFS without the Hadoop jar. If reading is sufficient, then the HFTP servlets should suffice for hand-rolled tools, but they need to be documented.

        Doug Cutting added a comment -

        > There is currently no external way to push data to an HDFS nor pull from an HDFS using an existing standard; instead anyone wishing to do so must install HDFS clients on computers that do not otherwise run Hadoop software.

        Or simply run something like 'ssh foo distcp ...', where foo is a host in the cluster. It would be better to know more about the requirement.

        > add a pure HTTP support for retrieving files using standard HTTP clients like curl

        https://issues.apache.org/jira/browse/HADOOP-1563?focusedCommentId=12510760#action_12510760

        In that comment I suggest a convention for encoding directory listings as links in the HTML of slash-ending URLs. I also provided a patch there that implements a client for this. Here the focus seems to be on a servlet that implements the server-side of this for HDFS. That seems reasonable. It would also be browsable, which is nice.

        > we actually have a requirement for it in Yahoo

        Can you say more about the requirement? Are directory listings required? Is other file status information required? Some file status can be done in HTTP (e.g., the last-modified header), but some does not have a natural place (e.g., owner, group & permissions).

        Kan Zhang added a comment -

        I suggest
        1. We keep the HSFTP interface on hdfsproxy as is, so that existing filesystem clients like distcp can continue to work.
        2. In the short term, add a pure HTTP support for retrieving files using standard HTTP clients like curl. This may fall short of a full-fledged system like WebDAV. But it's very useful by itself (we actually have a requirement for it in Yahoo) and a good starting point.

        Marco Nicosia added a comment -

        > Distcp is the best tool today for this. How is it insufficient?

        Distcp works for pulling data from a source, or pushing data to a destination. In both cases, distcp implies running a Hadoop job. There is currently no external way to push data to an HDFS nor pull from an HDFS using an existing standard; instead anyone wishing to do so must install HDFS clients on computers that do not otherwise run Hadoop software.

        > That's possible. An appropriate HTTP-based standard for filesystem access might be WebDav.

        > Implementing an accepted standard is a more ambitious project.

        I remember previous attempts to make WebDav available, and recognize that as an ambitious goal.

        My naive thought is that HFTP is very close to a much simpler feature. The main purpose of the HDFS proxy could be to make HDFS files available to a standard web client (curl, Net::HTTP, etc) to retrieve file listings and file contents from the HDFS proxy without installing an HDFS client, which is required to speak H{S}FTP.

        The only difference is that HDFS proxy/H{S}FTP have invented an internal way of exposing this data where existing standards could have been used?

        Doug Cutting added a comment -

        > use some well accepted standard

        That's possible. An appropriate HTTP-based standard for filesystem access might be WebDav.

        HFTP was designed to meet a particular goal: cross-version filesystem access. Implementing an accepted standard is a more ambitious project.

        > Currently, one of the least obvious points of integration is how to get data both onto, and back off of, an HDFS.

        Distcp is the best tool today for this. How is it insufficient? We have an FTP FileSystem implementation, so we can import data from external systems that way. One can also use file: uri's to import data from NFS. We could implement a WebDav filesystem, so that folks could dav: URI's to import and export datasets from web servers that have mod_dav installed. Would that help?

        Marco Nicosia added a comment -

        > So maybe all we need is better documentation of what's passed over HTTP?

        If there's a guarantee that subsequent connections need not always connect to the same server (ie, any session over the protocol is managed either via a single continuous HTTP/1.1 connection, cookies, or some other session management), then yes, more documentation on how the HTTP protocol is used will allow "creative" admins to use existing HTTP infrastructure in their Hadoop deployments.

        > I think we cannot simply use standard HTTP because it does not support file system access.

        If the limitation is that HTTP doesn't specify how to get/put structured data (such as a directory listing), why not use some well accepted standard, such as REST?

        The reason I'm pushing for this is that the closer Hadoop comes to presenting some standards-compliant interface, the easier it becomes for users to integrate Hadoop into existing infrastructure(s). Currently, one of the least obvious points of integration is how to get data both onto, and back off of, an HDFS.

        Tsz Wo Nicholas Sze added a comment -

        >.. does it make sense to do away with the artificial HFTP/HSFTP protocols, and instead simply offer standard HTTP and HTTPS?

        HFTP is a file system interface, which is currently implemented with HTTP. I agree if you say that the name HFTP is bad or misleading.

        The "artificial" part of HFTP defines the way of accessing the file system. For example,

        Path p = new Path("hftp://namenode:port/foo/bar");
        FileStatus[] statuses = p.getFileSystem(conf).listStatus(p);
        ...
        

        the code above actually accesses "http://namenode:port/listPaths/foo/bar?ugi=user,groups", where listPaths is a servlet running on the NameNode and ugi is a parameter. The NameNode then replies to the HFTP client with the output in XML format. The HFTP client constructs FileStatus objects and returns them. Without the HFTP interface, clients have to know all the details, including the servlet name, URL parameters, XML format, etc., in order to access the file system.
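To make that convention concrete, here is a hedged sketch of the URL such a client constructs. The listPaths servlet name and the ugi=user,groups parameter come from this comment; the helper class and the 50070 port are illustrative assumptions, not a documented API.

```java
// Sketch only: builds the listPaths URL described above. The servlet name
// and ugi parameter come from this discussion; the class name and the
// default port are assumptions for illustration.
public class HftpUrlSketch {
    static String listPathsUrl(String host, int port, String path,
                               String user, String... groups) {
        // The ugi parameter is a comma-separated user,group1,group2,... list.
        StringBuilder ugi = new StringBuilder(user);
        for (String g : groups) {
            ugi.append(',').append(g);
        }
        return "http://" + host + ":" + port + "/listPaths" + path
             + "?ugi=" + ugi;
    }

    public static void main(String[] args) {
        // What listing hftp://namenode:50070/foo/bar maps to on the wire:
        System.out.println(listPathsUrl("namenode", 50070, "/foo/bar",
                                        "user", "groups"));
        // prints http://namenode:50070/listPaths/foo/bar?ugi=user,groups
    }
}
```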

        I think we cannot simply use standard HTTP because it does not support file system access.

        Doug Cutting added a comment -

        > simply offer standard HTTP and HTTPS

        HFTP and HSFTP are just internal naming schemes, a way to encode HDFS file names but indicate that a different mechanism should be used to access them.

        > That would allow non-HDFS-specific clients, as well as using various standard HTTP infrastructure, such as load balancers, etc.

        We already use HTTP and HTTPS as the transport for HFTP and HSFTP. So maybe all we need is better documentation of what's passed over HTTP?


  People

  • Assignee: Unassigned
  • Reporter: Marco Nicosia
  • Votes: 0
  • Watchers: 9