Sanjay Radia added a comment - 22/Feb/08 10:38 PM
This attached file shows a screenshot of the prototyped package structure under Eclipse.
It would be more consistent to move the dfs package to org.apache.hadoop.fs.hdfs, and to rename the DistributedFileSystem class to be HDFS. There should be few compatibility issues with this, since applications should not refer directly to hdfs classes. If needed, we could possibly create a org.apache.hadoop.dfs.DistributedFileSystem subclass of org.apache.hadoop.fs.hdfs.HDFS for one release.
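The one-release compatibility subclass suggested above could look roughly like the sketch below. It is only an illustration: `FileSystemSketch` and its `scheme()` method are hypothetical stand-ins for the real FileSystem API, and all classes sit in one file here, whereas in Hadoop the old name would live in org.apache.hadoop.dfs and the new one in org.apache.hadoop.fs.hdfs.

```java
// Hypothetical stand-in for the abstract FileSystem API (not the real class).
abstract class FileSystemSketch {
    abstract String scheme();
}

// Stand-in for the proposed new class, org.apache.hadoop.fs.hdfs.HDFS.
class HDFS extends FileSystemSketch {
    @Override
    String scheme() {
        return "hdfs";
    }
}

// Stand-in for the old name, org.apache.hadoop.dfs.DistributedFileSystem.
// It adds no behavior; it exists only so that code which still names the
// old class keeps compiling (with a deprecation warning) for one release.
@Deprecated
class DistributedFileSystem extends HDFS {
}

@SuppressWarnings("deprecation")
class CompatShimDemo {
    public static void main(String[] args) {
        // Code written against the old class name still works...
        FileSystemSketch fs = new DistributedFileSystem();
        System.out.println(fs.scheme());
        // ...and the object is usable wherever the new class is expected.
        System.out.println(fs instanceof HDFS);
    }
}
```

Applications that only ever refer to the abstract interface would not notice the rename at all; the shim matters only for code that names the concrete class.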
The src/java directory would better be split not in two, but in three: src/java/{core,mapred,hdfs}. Splitting HDFS into its own tree will help keep the many internal APIs made public by this restructuring from appearing in end-user javadocs, and also better reflect system layering.
There was a typo in my description: internal protocols was supposed to be dfs.server.protocol (as in the Eclipse package display in my prototype attached).
I think that org.apache.hadoop.hdfs.* is better than org.apache.hadoop.fs.hdfs.*. However, I'm not adamant about it.
I do feel strongly about how this interacts with src directory splitting. core: hdfs: mapreduce: You can't put DistributedFileSystem and DFSClient in separate src directories without making a cyclic dependence and that is very bad. Therefore, I think they both need to be in the hdfs src tree. I think it is less confusing to have the src trees not overlap packages and therefore it would be better to have it in org.apache.hadoop.hdfs. I would even propose merging DFSClient and DistributedFileSystem into a single class... The kfs and s3 could stay in core because they are very thin wrappers over their respective native file systems.
> You can't put DistributedFileSystem and DFSClient in separate src directories without making a cyclic dependence and that is very bad. Therefore, I think they both need to be in the hdfs src tree. I think it is less confusing to have the src trees not overlap packages and therefore it would be better to have it in org.apache.hadoop.hdfs. I would even propose merging DFSClient and DistributedFileSystem into a single class...
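The cyclic dependence concern can be made concrete with a small graph check. The sketch below is an illustration only: the dependency edges are assumptions chosen to mirror the argument (the wrapper in core needing DFSClient in hdfs, while hdfs needs core), not a dump of the real build, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SourceTreeCycleCheck {
    // Returns true if the "depends on" graph between source trees has a cycle.
    static boolean hasCycle(Map<String, List<String>> deps) {
        Set<String> done = new HashSet<>();    // fully explored trees
        Set<String> onPath = new HashSet<>();  // trees on the current DFS path
        for (String tree : deps.keySet()) {
            if (visit(tree, deps, done, onPath)) {
                return true;
            }
        }
        return false;
    }

    private static boolean visit(String tree, Map<String, List<String>> deps,
                                 Set<String> done, Set<String> onPath) {
        if (onPath.contains(tree)) {
            return true;   // reached a tree already on the path: cycle
        }
        if (done.contains(tree)) {
            return false;  // already explored, no cycle through here
        }
        onPath.add(tree);
        for (String next : deps.getOrDefault(tree, Collections.<String>emptyList())) {
            if (visit(next, deps, done, onPath)) {
                return true;
            }
        }
        onPath.remove(tree);
        done.add(tree);
        return false;
    }

    // Assumed layout: DistributedFileSystem in core, DFSClient in hdfs.
    static Map<String, List<String>> splitLayout() {
        Map<String, List<String>> deps = new HashMap<>();
        deps.put("core", Arrays.asList("hdfs"));   // wrapper calls DFSClient
        deps.put("hdfs", Arrays.asList("core"));   // DFSClient uses fs, io, ipc
        deps.put("mapred", Arrays.asList("core", "hdfs"));
        return deps;
    }

    // Assumed layout: both classes live in the hdfs tree.
    static Map<String, List<String>> mergedLayout() {
        Map<String, List<String>> deps = new HashMap<>();
        deps.put("core", new ArrayList<String>());
        deps.put("hdfs", Arrays.asList("core"));
        deps.put("mapred", Arrays.asList("core", "hdfs"));
        return deps;
    }

    public static void main(String[] args) {
        System.out.println("split layout has cycle: " + hasCycle(splitLayout()));
        System.out.println("merged layout has cycle: " + hasCycle(mergedLayout()));
    }
}
```

Under these assumed edges, only the layout that keeps both classes in the hdfs tree can be built one tree at a time.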
Some pros and cons: If you leave those two in core then the advantage is that the client has to link against one jar: core.jar. FSConstants will need to be refactored as part of this restructure.
Those that are
There are probably applications that use some of these constants, and hence we will need to deprecate FSConstants.
Here are the 3 proposals on the table with their pros and cons:
Terminology: I am calling impls of FileSystem (e.g. DistributedFileSystem) the wrapper.

Proposal 1: No HDFS in core
  src/core    org.apache.hadoop.[io,conf,ipc,util,fs]
  src/hdfs    org.apache.hadoop.fs.hdfs contains client side and server side
  src/mapred  org.apache.hadoop.mapred
  Pros: Can rev the HDFS client protocol by merely supplying a new jar.
  Cons: App needs 2 jars: core.jar and hdfs-client.jar

Proposal 2: Client side HDFS [wrapper and protocol] in core
  src/core    org.apache.hadoop.[io,conf,ipc,util,fs]
  src/hdfs    org.apache.hadoop.fs.hdfs contains server side only
  src/mapred  org.apache.hadoop.mapred
  Pros: Apps need only one jar: core
  Cons: Revving the HDFS protocol requires updating core

Proposal 3: HDFS Client Wrapper in core, HDFS protocol is separate
  src/core    org.apache.hadoop.{io,conf,ipc,util,fs}
  src/hdfs    org.apache.hadoop.fs.hdfs contains server side and DFSClient
  src/mapred  org.apache.hadoop.mapred
  Pros: Can rev the HDFS client protocol by merely supplying a new jar
  Cons: App needs core jar and hdfs-client jar

A weak vote for Proposal 1.
All proposals are improvements on the present. Number 1 most nearly matches my intuition if starting from zero lines of code.
I'm struggling to understand all the implications of this. My intuitions about goals...
1) There should be a top level HDFS sub-project with the servers in it.
3) We need to think about reducing the thrash when we change the FS protocol. How do these affect that? A goal should be to provide a stable HDFS interface that isolates Pig and other clients from FS protocol thrash. This is partially a Pig issue, but it would be terrific if we did not need to recompile a client to run against two dot releases of hadoop. Do any of these get us closer? Can we think about this goal while discussing this reorg?
All three proposals go towards making the interface explicit. If you look at the master/parent jira
you will see that it was one of the goals. Interface separation and compatibility was one of the major motivations of this jira. The original proposal (it is in the description at the top) is closer to what you, Eric, are saying (but it was called dfs instead of hdfs). Also note that even when interface and impl are under one package, most of Sun's Java interfaces and impls have different package roots (interface in java.foo and impl in com.sun.xxx.foo). As far as hadoop goes, the interface is fs.FileSystem.
Even though we may consider the above two interfaces to be private, it is worth discussing which of the two interfaces is hdfs's interface. (See my note below about whether
Analogy: For Posix, libc is the interface. The system calls are like the protocol that libc uses to talk to the kernel. BTW should DistributedFileSystem, DFSClient and the protocol be public or private interfaces?
Sanjay asks: "which of the two interfaces is hdfs's interface?"
For HDFS to date, the advertised public interface is fs.FileSystem. We've said that someday, when we feel the wire protocol is stable, we might make it a public interface, to permit Java-free clients, but we're not there yet. Making the wire protocol public will substantially impact its ability to evolve.
(1) is my first choice. Folks can easily repackage jars, so the number of jars should not be a big factor in this. This issue is primarily about what's public and what's private, and HDFS's implementation should be private. The discrepancy from KFS and S3 seems reasonable: HDFS is explicitly designed to implement Hadoop's FileSystem API, while KFS and S3 are not, and need some adapter code. That adapter code is simple enough that we can include it in core. We do not include their entire implementation in core, and HDFS does not require adapter code, since it directly implements the FileSystem API. These differences account for the discrepancy. So I don't see any of (1)'s cons as significant.
Eric says: "it would be terrific if we did not need to recompile a client to run against two dot releases of hadoop". That has more to do with the stability of the abstract FileSystem API than with changes to HDFS's wire protocol. We should already guarantee that. Our back-compatibility goal is that, if an application compiles against release X without warnings, it should be able to upgrade to X+1 without recompilation, but will have to recompile and fix new warnings before upgrading to X+2. However, we've not always met this goal...
I vote for Proposal 1. It allows us to ship a new version of HDFS (client and server) without installing a "core" package. Regarding the question of whether the wire protocol or the FileSystem API is the "true" interface, I would say that the FileSystem API is the standard.
At some future time, if Hadoop becomes so popular that it is widely used, Linux distributions might come pre-packaged with core.jar and hdfs-client.jar pre-installed. In that case, the HDFS wire protocol becomes sacrosanct and public. Option 1 allows this scenario too.
Do we get namenode, datanode etc. packages with these proposals?
Do we split the hdfs package into sub-packages or we just rename hadoop.dfs into hadoop.fs.hdfs? As per the above discussion, fs.FileSystem is the real public interface.
Do we need to provide backward compatibility for dfs.DistributedFileSystem and dfs.DFSClient which are currently public? BTW as per proposal 1, the package name will change from dfs to hdfs. We should also fix HADOOP-1826 in this issue.
Scripts to run the svn commands and the patch have been attached.
The steps are:
1) run
2) Verify that src/hdfs/org/apache/hadoop/dfs contains NO FILES and ONLY the directories namenode and datanode. If the dir is empty (except for namenode/metrics & datanode.metrics) then run the svn command
3) run
4) Verify that src/test/org/apache/hadoop/dfs contains NO FILES. If the dir is empty then please run the svn command
5) run 'patch -p0 < HADOOP-2885.patch'
6) Now add the new files to svn (these files contain classes that were split from existing files)
7) Rebuild and test.
> As per the above discussion, fs.FileSystem is the real public interface.
No one should be using these two dfs classes directly because fs.FileSystem provides the necessary functionality. Also, I will file a new Jira to fix the build of the Javadoc to remove the hdfs classes from the public javadoc (with a blocker for current release).
We should drop the DistributedChecksumFileSystem, but that can be done as a separate patch.
Other than that, it looks good. +1
I just committed this. Thanks, Sanjay!
The servlets generated from jsp are still using the old package name. For example, org.apache.hadoop.dfs.dfshealth_jsp.
Integrated in Hadoop-trunk #581 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/581/)