> src/core/org.apache.hadoop.dfs
Do you really mean to use directories with dots in their names, or do you mean slashes? That's an unusual naming convention for Java source... What would the full path of an HDFS source file be?

Also, we should stop using the term 'dfs' and consistently use 'hdfs' instead. Yes, it is a little redundant, but that's okay: better redundant than inconsistent. Long ago, when we added URIs, we decided that 'dfs' was too ambiguous, and that 'hdfs' was the preferred name. We did not rename packages or classes then, to avoid the disruption, but that was always the intent.

>> src/core/org.apache.hadoop.dfs
> Do you really mean to use directories with dots in their names, or do you mean slashes? That's an unusual naming convention for java source...
> What would the full path of an HDFS source file be?
No, it would be slashes (the standard convention). I was trying to clarify where the package structure starts.

> Also, we should stop using the term 'dfs' and consistently use 'hdfs' instead. .....
The purpose of this Jira is to break down the hdfs package restructuring (HADOOP-2885) so that part of the work can be done separately. The first phase (this JIRA) would have NO package renames, only src tree changes. So at the end of phase 2 (HADOOP-2885), the structure will look as you have proposed.

I am trying to figure out what exactly needs to be done in this jira. My understanding :
Yes, your understanding is correct, Raghu.
The only undecided point is the location of the client side of hdfs: does it go in src/core or src/hdfs? HADOOP-2885 is discussing this issue. As soon as that decision is made you can finish this jira. Thanks.

Why make this change independent of HADOOP-2885? In either case, HADOOP-2885 will not be implementable as a patch file, but will require direct svn operations to rename files. Both issues will break other existing patch files. So why not combine them?
Perhaps what should be submitted for things like this is a shell script that's run before any patches are applied. This can do all of the 'svn mkdir' and 'svn mv' commands. Ideally hudson would even run such scripts, but, in the meantime, it will at least let others preview what's intended. So the procedure would be:
  sh svn-commands.sh
  patch -p 0 < patch.txt

> sh svn-commands.sh
> patch -p 0 < patch.txt
This sounds good. Regarding this being a separate issue from HADOOP-2885, I am not sure. Maybe this will be a good test run before the more complicated changes in HADOOP-2885.

> The only undecided is the location of the client-side of the hdfs - does it go in src/core or src/hdfs.
I will leave it as it is now (which agrees with Proposal 1 in HADOOP-2885).

> Why make this change independent of HADOOP-2885? In either case, HADOOP-2885 will not be implementable as a patch file, but will require direct svn
> operations to rename files. Both issues will break other existing patch files. So why not combine them?
2885 will require that we block other patches from being committed while 2885 is being completed. After this jira is completed, HADOOP-2885
After this jira, javadoc cleanup of core/mapred can start independent of 2885 (required for 0.17). Furthermore, there is a small possibility that 2885 might not make it into 0.17, while this jira is likely to be completed in 0.17.

svn-commands.sh and a patch for build.xml are attached.
To apply the changes, start with a clean checkout of the trunk and run the following commands at the top-level directory:
  $ sh svn-commands.sh
  $ patch -p0 < HADOOP-2916.patch
If this fails for any reason, try:
  $ svn revert -R src build.xml
  $ rm -rf src/core src/mapred src/hdfs
There are no changes made to any file under src/. The changes to build.xml could be better if I had more experience with ant. Essentially each instance of 'src/java' needs to be replaced by three directories: src/core, src/mapred, and src/hdfs. But this cannot be done using just a simple property that has these three values concatenated. Please suggest if there is a better way. If this patch looks good, I will provide a script that can convert patches for the current trunk to the new trunk.

I thought we also wanted to add a src/benchmarks directory in order to move all benchmarks from src/test, retaining the package structure.
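For readers who have not opened the attachment, the kind of pre-patch script discussed above might look roughly like the sketch below. It is not the attached svn-commands.sh. The `run` and `reorg_commands` helpers, the `DRY_RUN` variable, and the exact package-to-tree mapping (dfs to src/hdfs, mapred to src/mapred, everything else to src/core) are assumptions made for illustration. By default it only prints the svn commands, which matches the "let others preview what's intended" use.

```shell
#!/bin/sh
# Hypothetical sketch only; NOT the attached svn-commands.sh.
# Assumed mapping: org/apache/hadoop/dfs -> src/hdfs,
# org/apache/hadoop/mapred -> src/mapred, the rest of src/java -> src/core.

PKG=org/apache/hadoop

# Print the command by default; set DRY_RUN=0 to execute it
# inside an svn working copy.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

reorg_commands() {
    # Create the parent directories of the two new trees, level by level.
    run svn mkdir src/hdfs src/hdfs/org src/hdfs/org/apache src/hdfs/$PKG
    run svn mkdir src/mapred src/mapred/org src/mapred/org/apache src/mapred/$PKG
    # Move the packages that leave the old tree, keeping the package path.
    run svn mv src/java/$PKG/dfs src/hdfs/$PKG/dfs
    run svn mv src/java/$PKG/mapred src/mapred/$PKG/mapred
    # Whatever remains in src/java becomes the core tree.
    run svn mv src/java src/core
}
```

Calling `reorg_commands` prints five svn commands for review. Whether svn accepts the final directory move while the earlier moves are still uncommitted may depend on the svn version; that is not verified here.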
The patch looks fine. It does what is proposed. But I still question whether it is worthwhile to do this independently of HADOOP-2885. Performed as two steps we break DFS patches twice. This first step is completely automated (in a script), so it requires no freeze to trunk, and the script will not get stale. It would be least disruptive if source paths are only changed once. What is the downside of combining these?
Updated svn-commands.sh creates the src/benchmarks directory and moves src/test/gridmix there. There are more files that belong in src/benchmarks, but those might be more involved changes.

> The patch looks fine. It does what is proposed. But I still question whether it is worthwhile to do this independently of HADOOP-2885.
> Performed as two steps we break DFS patches twice.
> This first step is completely automated (in a script), so it requires no freeze to trunk, and the script will not get stale.
> It would be least disruptive if source paths are only changed once. What is the downside of combining these?
Since the second one requires a freeze, it just seems easier to cut down the freeze time by getting as much done beforehand as possible. Given that there is a script, most folks' patches are managed automatically, so in a sense the patches do not really "break".

> Given that there is a script, most folks patches are managed automatically so in a sense the patches do not really "break"
I'm not sure what you're arguing. This patch will break every src/java patch file, no? Any patch generated before this is committed will no longer apply after this is committed. HADOOP-2885 will similarly break all src/java/org/apache/hadoop/dfs patches. This patch will not get stale, since it is a simple script. So shouldn't we bundle this together with HADOOP-2885, so that we only break patches once, rather than break them twice? Am I missing something? Are you arguing that this won't break patches? Are you arguing that breaking patches twice isn't worse than breaking them once?

> This patch will break every src/java patch file, no?
Any such re-org will break most patches, irrespective of whether it's a small or big reorg. We have to deal with it. The attached covert-patch.sed script converts an old patch to a new one (usage: {sed -f covert-patch.sed < old.patch > new.patch}). With the larger re-org, where individual files move and probably get renamed, the conversion script needs to be a lot more complex. I am going to test the conversion a little bit. It should cover 99% of the patches. As far as I am concerned, if this smaller reorg makes sense, then it could go in irrespective of HADOOP-2885.
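To illustrate the idea of converting patch paths with sed, here is a minimal sketch. It is not the attached covert-patch.sed. The `convert_patch` function name and the path mapping (dfs to src/hdfs, mapred to src/mapred, anything else under src/java to src/core) are assumptions. It rewrites only the `Index:`, `---` and `+++` header lines so that patch bodies mentioning src/java are left alone.

```shell
# Hypothetical sketch; NOT the attached covert-patch.sed.
# Reads an old patch on stdin and writes the converted patch to stdout.
convert_patch() {
    # Lines that are not patch headers hit the bare "b" and are printed
    # unchanged. Header lines jump to the "fix" label, where the most
    # specific paths are rewritten first and src/core is the fallback.
    sed '
/^Index: /b fix
/^--- /b fix
/^+++ /b fix
b
:fix
s#src/java/org/apache/hadoop/dfs/#src/hdfs/org/apache/hadoop/dfs/#
s#src/java/org/apache/hadoop/mapred/#src/mapred/org/apache/hadoop/mapred/#
s#src/java/#src/core/#
'
}
```

Usage would be along the lines of `convert_patch < old.patch > new.patch`. One known gap in this sketch: a removed line whose own text starts with `-- ` looks like a `---` header and would be rewritten too.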
What is the current status of this jira? How does this jira look if we assume (the most likely case) that HADOOP-2885 will not be committed in the near future (not in 3-6 months)?
I would like to either resolve this or have it unassigned from me.

> What is the current status of this jira?
What are the benefits of these changes alone? Some related goals I have are:
Does this, as it stands, help these? Splitting the tree doesn't address the first much, since the real issue there is dependencies, not directory structure. But I guess it's a start. I can see some progress towards the second here too. Moving the HDFS server code into a separate tree means we can more easily exclude it from javadocs. But this does not address moving non-user mapred classes from the javadoc, does it? I think I'd prefer issues that more directly address these goals, or to more precisely state the goals and benefits of this issue.

This makes sense, Doug. I will let you guys discuss whether this patch has any benefits on its own and when it should go in. Since this is assigned to me, it is natural to think I should have the full context (w.r.t. HADOOP-2885 etc.). That's why I would prefer to be unassigned. I am not involved with the actual restructuring goals, plans, schedule, etc.

As you noted, Doug, it is a start towards the two goals.
In the past I considered doing this as a single big bang, but the issue was finding the right time for it. Splitting the Jira was proposed because I felt that it increased the chances of getting this work through by allowing incremental progress. In a corridor discussion earlier today, Arun and Owen suggested delaying this and 2885 to a release boundary. So the suggestion is to apply this patch just before the 0.18 feature freeze (i.e., as the last patch before the feature freeze).

> So the suggestion is to apply this patch just before the 0.18 feature freeze
Hmm. My tendency would be to do it just after a branch, i.e., shortly after a freeze. That's when trunk has the longest time to stabilize before it's next branched.

The reason Owen suggested this was that the patch would be in the trunk

Shouldn't we generally avoid big changes right before a branch? If we want it in 0.17, then we should do it now, not at the last minute, no? But I still don't see any advantage of doing this independently of HADOOP-2885...
> Doug Cutting - 19/May/08 11:12 AM
> Shouldn't we generally avoid big changes right before a branch? If we want it in 0.17, then we
> should do it now, not at the last minute, no? But I still don't see any advantage of doing this
> independently of HADOOP-2885...
-1 on including this in 0.17. It is too radical a change for a 'patch' release.

It is not meant for 0.17. The main question (once we have decided to commit) is whether this should be committed just before branching 0.18 or just after.
+1 for 'just before', so that we don't need to make two patches for 0.18 and trunk.

Sorry, I got confused and said '17' above when I meant '18'. What I meant to say was: "If we want it in 0.18, then we should do it now, not at the last minute, no?" In general, we want changes in trunk longer, so more folks have a chance to work with them and identify problems. This may require documentation changes, script changes, etc., which will not be caught by unit testing but only by use.

+1 for doing it now rather than later. It is true that we might have to convert fewer patches if we do it later, but I think the other advantages of committing earlier weigh more. Also, I will be on vacation for a month after this week.
Let's go ahead and do this on monday next week. Does that sound reasonable to people?
> Let's go ahead and do this on monday next week. Does that sound reasonable to people?
Next Monday is 5 days before a planned release branch, right? I'd still rather see this done at the same time as HADOOP-2885, earlier in a release cycle. I see no point in breaking patches twice: if we're going to re-organize the codebase, we should do it all at once, rather than dragging out the pain. But, as it appears I am alone in this belief, I will not veto this.

BTW, the description of this issue proposes to change the jar files, but the attached patch does not do that: all that it does is re-arrange source code. Given the late point in the release cycle, I think this is a wise choice. Restructuring the jar files may have substantial impacts, and ought to be done in trunk earlier in a release cycle.

I have updated the description to say that the jar files are NOT split.
I just committed this. Thanks Raghu!
It seems to me that there are still references to the old paths. At least in the eclipse templates (see
I am unsure whether reopening the bug was the right thing to do. Please accept my apologies if it was not. Does
Sorry. The problems I was talking about are mostly fixed by
I really don't think the local, hdfs, kfs clients belong in core at all. It doesn't make sense to have Map-Reduce depend on the kfs client libraries...
We really should be having {hdfs|kfs}_server.jar and {hdfs|kfs|local}_client.jar. Thoughts?
Maybe I can be persuaded to let local-fs_client in hadoop-core.jar! :)
The one other view is to have a fs_core.jar which contains only the fs interface and other file-system generics. This way hadoop-core is really just the 'core' infrastructure.. and projects such as pig/zookeeper could then only pull-in hadoop-core.jar.
Yeah, yeah, this is just to prove that it's weird to have hdfs in 'core' and not map-reduce! ;)
+1