Issue Details (XML | Word | Printable)

Key: MAPREDUCE-558
Type: Sub-task Sub-task
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Sanjay Radia
Reporter: Sanjay Radia
Votes: 0
Watchers: 5
Operations

If you were logged in you would be able to see more operations.
Hadoop Map/Reduce
MAPREDUCE-556

Refactor src structure, but leave package structure alone

Created: 29/Feb/08 02:02 AM   Updated: 08/Jul/09 04:42 PM
Return to search
Component/s: None
Affects Version/s: None
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
  Size
File convert-patch.sed 2008-05-14 11:09 PM Raghu Angadi 0.4 kB
File convert-patch.sed 2008-05-14 10:11 PM Raghu Angadi 0.4 kB
Text File Licensed for inclusion in ASF works HADOOP-2916.patch 2008-03-12 01:08 AM Raghu Angadi 4 kB
File Licensed for inclusion in ASF works svn-commands.sh 2008-04-16 10:10 PM Raghu Angadi 0.8 kB
File Licensed for inclusion in ASF works svn-commands.sh 2008-03-12 12:59 AM Raghu Angadi 0.6 kB

Resolution Date: 20/Jun/08 03:35 PM


 Description  « Hide
This Jira proposes that the src structure be split as below.
The package structure remains the same for this Jira. (Package renaming is part of other JIras such as HADOOP-2885).

The idea is that the src will be split BEFORE the package restructuring

The new proposed src structure is

src/test - unchanged

src/java - will no longer exist , its content will be move to one of core, hdfs, or mapred

src/core - this will contain the core classes that hadoop applications need to link against.
It will contain client side libraries of all fs file systems: local, hdfs, kfs, etc

src/core/org.apache.hadoop.{conf, fs, filechache, io, ipc, log, metrics, net, record, security, tools, util)
src/core/org.apache.hadoop.dfs - this will contain only the client side parts of dfs.
HADOOP-2885 will rename package dfs to package fs.hdfs

src/hdfs/org.apache.hadoop.dfs - this will contain only the server side of hdfs.
HADOOP-2885 will rename package dfs to package fs.hdfs later; a compatible dfs.DistributedFileSystem will be left for compatibility/

src/mapred/org.apache.hadoop.mapred.*

This Jira will not split the jar files.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Arun C Murthy added a comment - 29/Feb/08 02:17 AM

src/core - this will contain the core classes that hadoop applications need to link against.
It will contain client side libraries of all fs file systems: local, hdfs, kfs, etc
jar name hadoop_core.jar

I really don't think the local, hdfs, kfs clients belong in core at all. It doesn't make sense to have Map-Reduce depend on the kfs client libraries...
We really should be having {hdfs|kfs}_server.jar and {hdfs|kfs|local}_client.jar. Thoughts?
Maybe I can be persuaded to let local-fs_client in hadoop-core.jar! smile

The one other view is to have a fs_core.jar which contains only the fs interface and other file-system generics. This way hadoop-core is really just the 'core' infrastructure.. and projects such as pig/zookeeper could then only pull-in hadoop-core.jar.

Yeah, yeah, this is just to prove that it's weird to have hdfs in 'core' and not map-reduce! wink

Initially one jar: hadoop_mapred.jar

+1


Doug Cutting added a comment - 29/Feb/08 05:34 PM
> src/core/org.apache.hadoop.dfs

Do you really mean to use directories with dots in their names, or do you mean slashes? That's an unusual naming convention for java source... What would the full path of an HDFS source file be?

Also, we should stop using the term 'dfs' and consistently use 'hdfs' instead. Yes, it is a little redundant, but that's okay: better redundant than inconsistent. Long-ago, when we added URIs, we decided that 'dfs' was too ambiguous, and that 'hdfs' was the preferred name. We did not rename packages or classes then, to avoid the disruption, but that was always the intent.


Sanjay Radia added a comment - 04/Mar/08 12:18 AM
>> src/core/org.apache.hadoop.dfs

>Do you really mean to use directories with dots in their names, or do you mean slashes? That's an unusual naming convention for java source... >What would the full path of an HDFS source file be?

No it would be slashes - the standard convention - I was trying to clarify where the package structure starts.
Sorry for the confusion.

>Also, we should stop using the term 'dfs' and consistently use 'hdfs' instead. .....
The package will be renamed to hdfs as I noted in the description.

The purpose of this Jira is to break down the hdfs package restructuring (HADOOP 2885) so that part of the work can be done separately.

The first phase - (this JIRA) would have NO package renames, only src tree changes.
The second phase (Jira HADOOP-2885) will do package renames -ie hadoop.dfs will be renamed to hadoop.fs.hdfs in this 2nd phase.

SO at the end of phased 2 (HADOOP-2885), the structure will look as you have proposed.


Raghu Angadi added a comment - 10/Mar/08 11:45 PM
I am trying to figure out what exactly need to be done in this jira. My understanding :
  • No changes in files should be required. They should only be moved.
    • So dfs package is still called dfs. It will be under src/hdfs/o/a/h/dfs
  • Anything under src/java that does not go into src/hdfs and src/mapred goes into src/core
  • There is no new jar. still just hadoop-version-core.jar.

Sanjay Radia added a comment - 11/Mar/08 12:45 AM
Yes. your understanding correct Ragu.
The only undecided is the location of the client-side of the hdfs - does it go in src/core or src/hdfs.
HADOOP-2885 is discussing this issue. As soon as that decision is made you can finish this jira.

Thanks
sanjay


Doug Cutting added a comment - 11/Mar/08 06:34 PM
Why make this change independent of HADOOP-2885? In either case, HADOOP-2885 will not be implementable as a patch file, but will require direct svn operations to rename files. Both issues will break other existing patch files. So why not combine them?

Perhaps what should be submitted for things like this is a shell script that's run before any patches are applied. This can do all of the 'svn mkdir' and 'svn mv' commands. Ideally hudson would even run such scripts, but, in the meantime, it will at least let others preview what's intended. So the procedure would be:

sh svn-commands.sh
patch -p 0 < patch.txt

Raghu Angadi added a comment - 11/Mar/08 06:43 PM
> sh svn-commands.sh
> patch -p 0 < patch.txt

This sounds good.

Regd this being separate issue from HADOOP-2885, I am not sure. May be this will be good test run before more complicated changes in HADOOP-2885.


Raghu Angadi added a comment - 11/Mar/08 06:54 PM
> The only undecided is the location of the client-side of the hdfs - does it go in src/core or src/hdfs.
I will leave it as it is now (which agrees with Proposal 1 in HADOOP-2885).

Sanjay Radia added a comment - 11/Mar/08 09:21 PM
>Why make this change independent of HADOOP-2885? In either case, HADOOP-2885 will not be implementable as a patch file, but will require direct svn >operations to rename files. Both issues will break other existing patch files. So why not combine them?

2885 will require that we block other patches from being submitted commits while 2885 is being completed.
Hence, we would like to finish 2885 in under 2 days.
The idea was break down HADOOP-2885 into smaller parts; this jira (2916) was an attempt todo that.
Note that this jira has the property that it has no binary changes; in contrast 2885 changes binary due to package renames.

After this jira is completed, HADOOP-2885

  • will not require any renames in src/mapred or src/core;
  • will change only imports in src/mapred and src/core
    • Patches on src/mapred and src/core will mostly be auto merged
  • will require renames within src/hdfs due to the change in package structure within hdfs.
    • Doug is right in that 2885 will require renames within src/hdfs and hence we are duplicating those renames.

After this jira, javadoc cleanup of core/mapred can start independent of 2885 (required for 0.17).
Similarly, work on separate jars can also start if desired (not required for 0.17)

Furthermore, there is a small possibility that 2885 might not make in 0.17 while this jira is likely to be completed in 0.17.


Raghu Angadi added a comment - 12/Mar/08 01:08 AM - edited
svn-commands.sh and patch for build.xml is attached.
to apply the changes, start with a clean check out of the trunk and run the following
command at top level directory :
$ sh svn-commands.sh
$ patch -p0 < HADOOP-2916.patch

If this fails for any reason, try :

$ svn revert -R src build.xml
$ rm -rf src/core src/mapred src/hdfs
and try to run script again.

There are no changes made to any file under src/.

The changes to build.xml could be better if I had more experience with ant. Essentially each instance of 'src/java' need be replaced by three directories : src/core, src/mapred, and src/hdfs. But this can not be done using just a simple property that has these three values concatinated. Please suggest if there is a better way.


Raghu Angadi added a comment - 12/Mar/08 01:11 AM

If this patch looks good, I will provide a script that can covert patches for current trunk to the new trunk.


Konstantin Shvachko added a comment - 12/Mar/08 02:30 AM
I thought we also wanted to add src/benchmarks directory in order to move all benchmarks from src/test retaining the package structure.

Doug Cutting added a comment - 12/Mar/08 04:03 PM
The patch looks fine. It does what is proposed. But I still question whether it is worthwhile to do this independently of HADOOP-2885. Performed as two steps we break DFS patches twice. This first step is completely automated (in a script), so it requires no freeze to trunk, and the script will not get stale. It would be least disruptive if source paths are only changed once. What is the downside of combining these?

Raghu Angadi added a comment - 16/Apr/08 10:10 PM
Updated svn-commands.sh creates src/benchmarks directory and moves src/test/gridmix there.
There are more files that belong in src/benchmarks but those might be more involved changes.

Arun C Murthy added a comment - 14/May/08 09:12 AM
+1, looks reasonable.

Sanjay Radia added a comment - 14/May/08 08:37 PM
>The patch looks fine. It does what is proposed. But I still question whether it is worthwhile to do this independently of HADOOP-2885.
> Performed as two steps we break DFS patches twice.
> This first step is completely automated (in a script), so it requires no freeze to trunk, and the script will not get stale.
> It would be least disruptive if source paths are only changed once. What is the downside of combining these?
Since the second one requires freeze, it just seems easier to cut down the freeze time by getting as much done before as possible.
Given that there is a scricp, most folks patches are managed automatically so in a sense the patches do not really "break"

Doug Cutting added a comment - 14/May/08 09:17 PM
> Given that there is a scricp, most folks patches are managed automatically so in a sense the patches do not really "break"

I'm not sure what you're arguing.

This patch will break every src/java patch file, no? Any patch generated before this is comitted will no longer apply after this is committed. HADOOP-2885 will similarly break all src/java/org/apache/hadoop/dfs patches.

This patch will not get stale, since it is a simple script. So shouldn't we bundle this together with HADOOP-2885, so that we only break patches once, rather than break them twice?

Am I missing something? Are you arguing that this won't break patches? Are you arguing that breaking patches twice isn't worse than breaking them once?


Raghu Angadi added a comment - 14/May/08 10:11 PM
> This patch will break every src/java patch file, no?
Any such re-org will break most patches irrespective of whether its a small or big reorg. We have to deal with it.

Attached covert-patch.sed script converts old patch to new one (usage: {sed -f covert-patch.sed < old.patch > new.patch}).

With the larger re-org where individual files move and probably get renamed, the conversion script needs to be lot more complex

I am going to test the covertion a little bit. It should cover 99% of the patches.


Raghu Angadi added a comment - 14/May/08 10:14 PM
As far as I am concerned, if this smaller reorg makes sense, then it could go irrespective of HADOOP-2885.

Raghu Angadi added a comment - 15/May/08 06:37 PM
What is the current status of this jira? How does this jira look if assume (the most likely case) that HADOOP-2885 will not be committed in near future (not in 3-6 months)?

I would like to either resolve this or unassign to me.


Doug Cutting added a comment - 15/May/08 07:09 PM
> What is the current status of this jira?

What are the benefits of these changes alone? Some related goals I have are:

  • eventually split hdfs and mapred into separate subprojects
  • remove non-user APIs from javadocs

Does this as it stand help these? Splitting the tree doesn't address the first much, since the real issue there is dependencies, not directory structure. But I guess it's a start.

I can see some progress towards the second here too. Moving the HDFS server code into a separate tree means we can more easily exclude it from javadocs. But this does not address moving non-user mapred classes from the javadoc, does it?

I think I'd prefer issues that more directly address these goals, or to more precisely state the goals and benefits of this issue.


Raghu Angadi added a comment - 15/May/08 07:22 PM

This makes sense, Doug.

I will let you guys discuss if this patch has any benefits on its own or when it should go. Since this is assigned to be me it is natural think I should have the full context (w.r.t HADOOP-2885 etc). Thats why I would prefer to be unassigned. I am not involved with actual restructuring goals, plans, schedule, etc.


Sanjay Radia added a comment - 15/May/08 09:10 PM
As you noted Doug, it is a start towards the two goals.

In the past I considered doing this as a single big bang and but the issue was finding the right time for it.
When is the best time to do such a disruptive patch? - it is best to do it around release boundaries when there are very few outstanding patches.

Splitting the Jira was proposed because I felt that it increased the chances of getting this work through by allowing incremental progress.
This Jira (2916) affects everyone; however, the MR folks are not effected by the second larger patch (2885).
It also means that this Jira can get through quickly (or so I thought .. see my next paragraph).
My feeling is the 2885 will not happen for a while - because these has been and will continue to be push back to find a better time to do it.

In a discussion in a corridor earlier today, Arun and Owen suggested delaying this and 2885 to a release boundary.
(Yes we should have had that discussion in the jira but it it was impromptu; I have requested them to comment on this Jira. )
They feel that now is a very bad time because there are two many outstanding patches and that this Jira is too disruptive in spite of ragu's script!
So even with this smaller patch which has a convenient script there is very strong push back.
I expect a much bigger push back for the 2885 (Unless I do it over Christmas

So the suggestion is to apply this patch just before the 0.18 feature freeze (ie the last patch before the feature freeze).
Are folks okay with this?


Doug Cutting added a comment - 16/May/08 03:35 AM
> So the suggestion is to apply this patch just before the 0.18 feature freeze

Hmm. My tendency would be to do it just after a branch, i.e., shortly after a freeze. That's when trunk has the longest time to stabilize before it's next branched.


Sanjay Radia added a comment - 16/May/08 10:44 PM

The reason Owen suggested this was that the patch would be in the truck
and hence any 0.18 bug fixes would apply to
the 0.18 and to trunk.
BTW it is possible to do 2885 right after the branch if we are willing
to hold the release for an extra few days.
(This would be effectively combining the two).


Doug Cutting added a comment - 19/May/08 06:12 PM
Shouldn't we generally avoid big changes right before a branch? If we want it in 0.17, then we should do it now, not at the last minute, no? But I still don't see any advantage of doing this independently of HADOOP2885...

Jim Kellerman added a comment - 19/May/08 07:22 PM
> Doug Cutting - 19/May/08 11:12 AM
> Shouldn't we generally avoid big changes right before a branch? If we want it in 0.17, then we
> should do it now, not at the last minute, no? But I still don't see any advantage of doing this
> independently of HADOOP2885...

-1 on including this in 0.17. It is too radical of a change for a 'patch' release.
+1 on including this in 0.18. Hiding implementation is a good idea.


Raghu Angadi added a comment - 19/May/08 07:44 PM
It is not meant for 0.17. The main question (once we have decided to commit) is whether this should be committed just before branching 0.18 or just after.

+1 for 'just before', so that we don't need make two patches for 0.18 and trunk.


Doug Cutting added a comment - 19/May/08 07:54 PM
Sorry, I got confused and said '17' above when I meant '18'. What I meant to say was: "If we want it in 0.18, then we should do it now, not at the last minute, no?" In general, we want changes in trunk longer, so more folks have a chance to work with them and identify problems. This may require documentation changes, script changes, etc., which will not be caught by unit testing but only by use.

Raghu Angadi added a comment - 19/May/08 09:15 PM - edited
+1 for doing it now rather than later. It is true that we might have to covert fewer patcher if we do it later but I think other advantages of committing earlier weigh more. Also, I will be on vacation for a month after this week.

Owen O'Malley added a comment - 28/May/08 11:51 PM
Let's go ahead and do this on monday next week. Does that sound reasonable to people?

Doug Cutting added a comment - 29/May/08 05:16 PM
> Let's go ahead and do this on monday next week. Does that sound reasonable to people?

Next Monday is 5 days before a planned release branch, right?

I'd still rather see this done at the same time as HADOOP-2885, earlier in a release cycle. I see no point in breaking patches twice: if we're going to re-organize the codebase, we should do it all at once, rather than dragging out the pain. But, as it appears I am alone in this belief, I will not veto this.

BTW, the description of this issue proposes to change the jar files, but the attached patch does not do that: all that it does is re-arrange source code. Given the late point in the release cycle, I think this is a wise choice. Restructuring the jar files may have substantial impacts, and ought to be done in trunk earlier in a release cycle.


Sanjay Radia added a comment - 29/May/08 07:15 PM
I have updated the description to say that the jar files are NOT split.

Mukund Madhugiri added a comment - 07/Jun/08 12:30 AM
I just committed this. Thanks Raghu!

Brice Arnould added a comment - 07/Jun/08 02:43 PM
It seems to me that their are still references to the old paths. At least in the eclipse templates (see HADOOP-3480), the clover.setup target of build.xml and the documentation (in the form of links to WebSVN from src/docs/src/documentation/content/xdocs/streaming.xml).

I am unsure about whether reopening the bug was the right way to do. Please accept my apologizes if it is not.


Owen O'Malley added a comment - 19/Jun/08 11:13 PM
Does HADOOP-3480 reflect all of the problems that you've seen? If so, we should reclose this patch. (In general, it is better to open new jiras so that we don't confuse the change log too much).

Brice Arnould added a comment - 20/Jun/08 07:42 AM
Sorry. The problems I was talking about are mostly fixed by HADOOP-3575 and HADOOP-3480. I opened HADOOP-3607 to fix a wrong URL, but appart from that I don't there's still references to the old structure.