Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Incompatible change, Reviewed
    • Release Note: Restructured the package hadoop.dfs.

      Description

      This Jira proposes restructuring the package hadoop.dfs.

      1. Move all server-side and internal protocols (NN-DN etc.) to hadoop.dfs.server.*

      2. Further breakdown of dfs.server.

      • dfs.server.namenode.*
      • dfs.server.datanode.*
      • dfs.server.balancer.*
      • dfs.server.common.* - stuff shared between the various servers
      • dfs.protocol.* - internal protocol between DN, NN and Balancer etc.

      3. Client interface:

      • hadoop.dfs.DistributedFileSystem.java
      • hadoop.dfs.ChecksumDistributedFileSystem.java
      • hadoop.dfs.HftpFileSystem.java
      • hadoop.dfs.protocol.* - the client side protocol
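
      Applications are expected to reach HDFS only through the generic FileSystem API, so this restructuring should be invisible to them. A minimal usage sketch (the path is hypothetical; the API calls are the standard Hadoop ones):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class ListRoot {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // Resolves to DistributedFileSystem when the default fs is HDFS;
          // the application never names an hdfs class directly, so package
          // moves on the server side do not affect it.
          FileSystem fs = FileSystem.get(conf);
          for (FileStatus stat : fs.listStatus(new Path("/"))) {
            System.out.println(stat.getPath());
          }
        }
      }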
      Attachments

      1. 2885_run_svn_add_new_file.sh
        0.7 kB
        Sanjay Radia
      2. 2885_run_svn_add_new_file.sh
        0.6 kB
        Sanjay Radia
      3. 2885_run1_svn-commands.sh
        9 kB
        Sanjay Radia
      4. 2885_run1_svn-commands.sh
        9 kB
        Sanjay Radia
      5. 2885_run2_svn-commands.sh
        7 kB
        Sanjay Radia
      6. 2885_run2_svn-commands.sh
        7 kB
        Sanjay Radia
      7. HADOOP-2885_2.patch
        320 kB
        Sanjay Radia
      8. HADOOP-2885_v51.patch
        377 kB
        Sanjay Radia
      9. Prototype dfs package.png
        61 kB
        Sanjay Radia

        Issue Links

          Activity

          sanjay.radia Sanjay Radia added a comment -

          The attached file shows a screenshot of the prototyped package structure under Eclipse.

          cutting Doug Cutting added a comment -

          It would be more consistent to move the dfs package to org.apache.hadoop.fs.hdfs, and to rename the DistributedFileSystem class to be HDFS. There should be few compatibility issues with this, since applications should not refer directly to hdfs classes. If needed, we could possibly create a org.apache.hadoop.dfs.DistributedFileSystem subclass of org.apache.hadoop.fs.hdfs.HDFS for one release.
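
          A minimal sketch of the one-release compatibility shim suggested here, assuming the hypothetical relocated class org.apache.hadoop.fs.hdfs.HDFS:

          package org.apache.hadoop.dfs;

          /**
           * Hypothetical shim: code that still names the old dfs package keeps
           * compiling for one release, while all behavior lives in the
           * relocated class it extends.
           * @deprecated use org.apache.hadoop.fs.hdfs.HDFS instead.
           */
          @Deprecated
          public class DistributedFileSystem extends org.apache.hadoop.fs.hdfs.HDFS {
          }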

          The src/java directory would better be split not in two, but in three: src/java/{core,mapred,hdfs}. Splitting HDFS into its own tree will help keep the many internal APIs made public by this restructuring from appearing in end-user javadocs, and also better reflect system layering.

          tomwhite Tom White added a comment -

          +1 in general on this reorganisation.

          dfs.protocol.* - internal protocol between DN, NN and Balancer etc.

          This would be better as dfs.server.protocol (or actually hdfs.server.protocol).

          sanjay.radia Sanjay Radia added a comment -

          There was a typo in my description: the internal protocols package was supposed to be dfs.server.protocol (as in the Eclipse package display in my attached prototype).

          owen.omalley Owen O'Malley added a comment -

          I think that org.apache.hadoop.hdfs.* is better than org.apache.hadoop.fs.hdfs.*. However, I'm not adamant about it.

          I do feel strongly about how this interacts with src directory splitting.

          core:
          org.apache.hadoop.{io,conf,ipc,util,fs}

          hdfs:
          org.apache.hadoop.hdfs (or fs.hdfs)

          mapreduce:
          org.apache.hadoop.mapred

          You can't put DistributedFileSystem and DFSClient in separate src directories without making a cyclic dependence, and that is very bad. Therefore, I think they both need to be in the hdfs src tree. I think it is less confusing to have the src trees not overlap packages, and therefore it would be better to have it in org.apache.hadoop.hdfs. I would even propose merging DFSClient and DistributedFileSystem into a single class...
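
          To make the cycle concrete, a toy sketch with hypothetical names standing in for FileSystem, DFSClient and DistributedFileSystem (shown in one file here; imagine each class compiled in the source tree noted in its comment):

          abstract class FileSystemBase {}             // src/core: the abstract FileSystem API

          class ClientInHdfs {                         // src/hdfs: the RPC client
            FileSystemBase owner;                      // hdfs tree -> core tree
          }

          class WrapperInCore extends FileSystemBase { // src/core: the wrapper
            ClientInHdfs client = new ClientInHdfs();  // core tree -> hdfs tree: a cycle
          }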

          The kfs and s3 could stay in core because they are very thin wrappers over their respective native file systems.

          sanjay.radia Sanjay Radia added a comment -

          > You can't put DistributedFileSystem and DFSClient in separate src directories without making a cyclic dependence and that is very bad. Therefore, I think they both need to be in the hdfs src tree. I think it is less confusing to have the src trees not overlap packages and therefore it would be better to have it in org.apache.hadoop.hdfs. I would even propose merging DFSClient and DistributedFileSystem into a single class...

          Some pros and cons:
          If you leave DistributedFileSystem and DFSClient in hdfs, then client applications will need to link against a jar from hdfs, i.e. client apps will link against core.jar and hdfs_client.jar. The advantage of this scheme is that the src tree for core will not show any of the source that the client isn't supposed to use.

          If you leave those two in core, then the advantage is that the client has to link against only one jar: core.jar. The disadvantage of this scheme is that the javadoc for core must explicitly exclude the hdfs client-side classes, which happen to sit in the same src tree.

          sanjay.radia Sanjay Radia added a comment -

          FSConstants will need to be refactored as part of this restructure. Its constants fall into three groups:

          • for the server side (NN, DN, Balancer etc.)
            • many of the individual constants can move to the appropriate classes, but there are probably some that are common across NN and DN
          • for the client-side classes (DistributedFileSystem and DFSClient)
            • opcodes for the client RPC, timeouts etc.
          • for applications - these should be moved to hadoop.fs or to config defaults
            • default block size, max path name etc.
            • are these properties of hadoop.fs or of hdfs?

          There are probably applications that use some of these constants, and hence we will need to deprecate FSConstants; a sketch of the deprecation path follows below.
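
          A hedged sketch of that deprecation path; the constant shown and its placement are illustrative, and the real list would migrate out of the existing interface:

          package org.apache.hadoop.dfs;

          /**
           * Sketch only: the old grab-bag interface survives one release so
           * that existing FSConstants.* references keep resolving, while
           * individual constants move to the classes (or config defaults)
           * that own them.
           * @deprecated constants are moving to their owning classes.
           */
          @Deprecated
          public interface FSConstants {
            // Application-facing default, a candidate for config defaults or
            // hadoop.fs; 64 MB matches the dfs.block.size default of the time.
            long DEFAULT_BLOCK_SIZE = 64 * 1024 * 1024;
            // Server-side constants (heartbeat intervals, layout version, ...)
            // would instead move under dfs.server.common.
          }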

          sanjay.radia Sanjay Radia added a comment - edited

          Here are the 3 proposals on the table, with their pros and cons.

          Terminology: I refer to impls of FileSystem (e.g. DistributedFileSystem) as the wrapper.

          Proposal 1: No HDFS in core

          src/core

          org.apache.hadoop.{io,conf,ipc,util,fs}
          fs contains the kfs, s3 wrappers etc. BUT no HDFS classes.
          FileSystem.get(conf) constructs DistributedFileSystem via dynamic class loading (see the sketch after this proposal's cons).

          src/hdfs

          org.apache.hadoop.fs.hdfs contains client side and server side
          Will generate 2 jars: hdfs-client.jar and hdfs-server.jar

          src/mapred

          org.apache.hadoop.mapred

          Pros:

          Can rev the HDFS client protocol by merely supplying a new jar.
          (Note that in practice this is not that useful in a distributed system, since you have to distribute the updated protocol jar to all machines running the application.)
          The hdfs protocol is not visible in the core src tree.
          javadoc == ALL the classes in core

          Cons:

          App needs 2 jars: core.jar and hdfs-client.jar
          Structure is not similar to fs.kfs and fs.s3
          Harder to make DistributedFileSystem public if we wish, since it is not sitting in core (I don't think we should make it public anyway)
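
          A minimal sketch of the dynamic loading step referenced above, assuming a config key of the form fs.<scheme>.impl; Configuration.getClass and ReflectionUtils.newInstance are existing Hadoop utilities, but the helper class here is hypothetical:

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.util.ReflectionUtils;

          public class FsLoaderSketch {
            // Resolve a FileSystem implementation by URI scheme, so that core
            // needs no compile-time dependency on any HDFS class.
            static FileSystem load(String scheme, Configuration conf) {
              Class<?> clazz = conf.getClass("fs." + scheme + ".impl", null);
              if (clazz == null) {
                throw new IllegalArgumentException("No FileSystem impl configured for scheme: " + scheme);
              }
              return (FileSystem) ReflectionUtils.newInstance(clazz, conf);
            }
          }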

          Proposal 2: Client side HDFS [wrapper and protocol] in core

          src/core

          org.apache.hadoop.{io,conf,ipc,util,fs}
          fs.hdfs contains DistributedFileSystem and DFSClient
          fs contains the kfs, s3 wrappers etc.

          src/hdfs

          org.apache.hadoop.fs.hdfs contains server side only

          src/mapred

          org.apache.hadoop.mapred

          Pros:

          Apps need only one jar: core.jar
          Structure is partially similar to fs.kfs and fs.s3
          (Partially and not fully similar because DFSClient is in core's fs.hdfs; the other fs wrappers do not contain their protocols.)
          Easier to make DistributedFileSystem public if we wish, since it is sitting in core (I don't think we should make it public anyway)

          Cons:

          Revving the HDFS protocol requires updating core
          The hdfs protocol is visible in the core src tree
          core's javadoc will need to exclude DFSClient and DistributedFileSystem

          Proposal 3: HDFS Client Wrapper in core, HDFS protocol is separate

          src/core

          org.apache.hadoop.{io,conf,ipc,util,fs}
          fs.hdfs contains DistributedFileSystem (but NOT DFSClient)
          Structure is similar to fs.kfs and fs.s3 in that a wrapper for each file system sits in core's fs.

          src/hdfs

          org.apache.hadoop.fs.hdfs contains server side and DFSClient
          Will generate two jars

          src/mapred

          org.apache.hadoop.mapred

          Pros:

          Can rev the HDFS client protocol by merely supplying a new jar
          The hdfs protocol is not visible in the core src tree
          Structure is similar to fs.kfs and fs.s3
          Easier to make DistributedFileSystem public if we wish, since it is sitting in core (I don't think we should make it public anyway)

          Cons:

          App needs core.jar and hdfs-client.jar
          Circular dependency between core.jar and hdfs-client.jar
          core's javadoc will need to exclude DistributedFileSystem

          chansler Robert Chansler added a comment -

          A weak vote for Proposal 1. All proposals are improvements on the present. Number 1 most nearly matches my intuition if starting from zero lines of code.

          eric14 eric baldeschwieler added a comment -

          I'm struggling to understand all the implications of this. My intuitions about goals...

          1) There should be a top level HDFS sub-project with the servers in it.
          2) Any fs.hdfs section of core should contain as little as possible.

          3) We need to think about reducing the thrash when we change the FS protocol. How do these proposals affect that? A goal should be to provide a stable HDFS interface that isolates Pig and other clients from FS protocol thrash. This is partially a Pig issue, but it would be terrific if we did not need to recompile a client to run against two dot releases of Hadoop. Do any of these get us closer? Can we think about this goal while discussing this reorg?

          sanjay.radia Sanjay Radia added a comment - edited

          All three proposals go towards making the interface explicit. If you look at the master/parent jira, you will see that this was one of the goals: interface separation and compatibility was one of the major motivations of this jira. The original proposal (in the description at the top) is closer to what you, Eric, are saying (but it was called dfs instead of hdfs).

          Also note that even when interface and impl are under one package, the src of the interface and the impl can be in separate src trees. Hence even though hdfs is under fs, all three proposals move the server part to a separate src tree. The three proposals differ in how much of the HDFS client wrapper is in the core src tree. Even proposal 1, which keeps the wrapper in src/hdfs, proposes that there be two jars.

          Most of Sun's Java interfaces and impls have different package roots (interface in java.foo, impl in com.sun.xxx.foo). The style above (keeping the impl under the same package but in a different src tree) is not uncommon for Apache projects (as Doug tells me). In our case, Apache is both the interface publisher and the impl publisher.

          sanjay.radia Sanjay Radia added a comment - edited

          As far as hadoop goes, the interface is fs.FileSystem.
          What is the interface of hdfs which implements fs.FileSystem?

          • fs.hdfs.DistributedFileSystem
          • fs.hdfs.theProtocol

          Even though we may consider the above two interfaces to be private, it is worth discussing which of the two is hdfs's interface. (See my note below about whether these two interfaces are considered public or private.)

          Analogy

          For NFS, the wire protocol is the interface. Proposal 2 would be the most suitable if we consider the HDFS protocol to be the interface. Proposal 1 would also be okay as long as hdfs supplies 2 jars; Proposal 1 has the advantage that there can be other impls of the client-side wrappers that talk the hdfs protocol (for example, other wrappers could do client-side caching while keeping the protocol the same).

          For Posix, libc is the interface. The system calls are like the protocol that libc uses to talk to the kernel. Each new version of Posix would ship new impls of libc and the system calls. Apps link dynamically with libc. In a distributed system, distributing a new wrapper to all clients is hard to do, since the clients are distributed and do not link dynamically with the wrapper. Jini, for example, provides a way for clients to pull the new wrapper by means of dynamic class loading across the wire (there were heated discussions over this in the Java community). We have no plans to dynamically load classes across the wire. But nonetheless, the OS view of its interface is a useful analogy. Proposal 1 would be most suitable for this view.

          BTW should DistributedFileSystem, DFSClient and the protocol be public or private interfaces?
          So far I don't see any reason to make any of these public (although we should make
          sure that the protocol remains compatible over time).

          cutting Doug Cutting added a comment -

          Sanjay asks: "which of the two interfaces is hdfs's interface?"

          For HDFS to date, the advertised public interface is fs.FileSystem. We've talked about someday, when we feel the wire protocol is stable, making it a public interface to permit Java-free clients, but we're not there yet. Making the wire protocol public will substantially impact its ability to evolve.

          (1) is my first choice.

          Folks can easily repackage jars, so the number of jars should not be a big factor in this. This issue is primarily about what's public and what's private, and HDFS's implementation should be private.

          The discrepancy from KFS and S3 seems reasonable: HDFS is explicitly designed to implement Hadoop's FileSystem API, while KFS and S3 are not, and need some adapter code. That adapter code is simple enough that we can include it in core. We do not include their entire implementation in core, and HDFS does not require adapter code, since it directly implements the FileSystem API. These differences account for the discrepancy.

          So I don't see any of (1)'s cons as significant.

          Eric says: "it would be terrific if we did not need to recompile a client to run against two dot releases of hadoop". That has more to do with the stability of the abstract FileSystem API than with changes to HDFS's wire protocol, and we should already guarantee it. Our back-compatibility goal is that, if an application compiles against release X without warnings, it should be able to upgrade to X+1 without recompilation, but will have to recompile and fix new warnings before upgrading to X+2. However, we've not always met this goal...

          dhruba dhruba borthakur added a comment -

          I vote for Proposal 1. It allows us to ship a new version of HDFS (client and server) without installing a "core" package. Regarding the question of whether the wire protocol or the FileSystem API is the "true" interface, I would say that the FileSystem API is the standard.

          At some future time, if Hadoop becomes so popular that it is widely used, Linux distributions might come pre-packaged with core.jar and hdfs-client.jar pre-installed. In that case, the HDFS wire protocol becomes sacrosanct and public. Option 1 allows this scenario too.

          sanjay.radia Sanjay Radia added a comment -

          My vote is also proposal 1.

          owen.omalley Owen O'Malley added a comment -

          I like 1.

          nidaley Nigel Daley added a comment -

          +1 for proposal 1.

          shv Konstantin Shvachko added a comment -

          Do we get namenode, datanode etc. packages with these proposals?
          Do we split the hdfs package into sub-packages, or do we just rename hadoop.dfs to hadoop.fs.hdfs?

          sanjay.radia Sanjay Radia added a comment -

          As per the above discussion, fs.FileSystem is the real public interface.
          Do we need to provide backward compatibility for dfs.DistributedFileSystem and dfs.DFSClient which are currently public?

          BTW as per proposal 1, the package name will change from dfs to hdfs.
          hdfs.DFSClient and hdfs.DistributedFileSystem are not for public use,
          although based on the internal needs of the hdfs package these classes may or may not be public (haven't quite figured out the details yet).

          szetszwo Tsz Wo Nicholas Sze added a comment -

          We should also fix HADOOP-1826 in this issue.

          sanjay.radia Sanjay Radia added a comment -

          Scripts to run the svn commands, and the patch, have been attached.

          The steps are:

          1) run
          2885_run1_svn-commands.sh

          2) Verify that src/hdfs/org/apache/hadoop/dfs contains NO FILES and ONLY the directories namenode and datanode.
          If this is not the case, then some new files have sneaked in since I created the script, and the script will have to be fixed to include these new files.

          If the dir is empty (except for namenode/metrics and datanode/metrics), then run the svn command:
          svn rm src/hdfs/org/apache/hadoop/dfs

          3) run
          2885_run2_svn-commands.sh

          4) Verify that src/test/org/apache/hadoop/dfs contains NO FILES.
          If this is not the case, then some new files have sneaked in since I created the script, and the script will have to be fixed to include these new files.

          If the dir is empty, then run the svn command:
          svn rm src/test/org/apache/hadoop/dfs

          5) run 'patch -p0 < HADOOP-2885.patch'

          6) Now add the new files to svn (these files contain classes that were split from existing files):
          run
          2885_run_svn_add_new_file.sh

          7) Rebuild and test.

          sanjay.radia Sanjay Radia added a comment -

          >As per the above discussion, fs.FileSystem is the real public interface.
          >Do we need to provide backward compatibility for dfs.DistributedFileSystem and dfs.DFSClient which are currently public?

          No one should be using these two dfs classes directly, because fs.FileSystem provides the necessary functionality.
          My suggestion is that we don't provide backward compatibility as part of this Jira. If we hear complaints from folks using these two dfs classes, then we can add the backward-compatible classes as a separate Jira in the current release. It would be useful to quickly surface users of these two "private" classes.

          Also, I will file a new Jira to fix the Javadoc build to remove the hdfs classes from the public javadoc (a blocker for the current release).

          owen.omalley Owen O'Malley added a comment -

          We should drop the ChecksumDistributedFileSystem, but that can be done as a separate patch.

          Other than that, it looks good. +1

          owen.omalley Owen O'Malley added a comment -

          I just committed this. Thanks, Sanjay!

          szetszwo Tsz Wo Nicholas Sze added a comment -

          The servlets generated from the jsp files are still using the old package name, for example org.apache.hadoop.dfs.dfshealth_jsp.

          hudson Hudson added a comment -

          Integrated in Hadoop-trunk #581 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/581/ )

            People

            • Assignee: sanjay.radia Sanjay Radia
            • Reporter: sanjay.radia Sanjay Radia
            • Votes: 0
            • Watchers: 3

              Dates

              • Created:
              • Updated:
              • Resolved:

                Development