Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: scripts
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      We need to split bin/hadoop into three parts, for core, mapred and hdfs. This will enable us to distribute the individual scripts with each component.
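
      For orientation, the end state discussed in the comments below looks roughly like the following (the command names come from this issue; the specific options shown are only illustrative):

        bin/hadoop fs -ls /          # common/core usage stays with bin/hadoop
        bin/hdfs dfsadmin -report    # HDFS commands move to bin/hdfs
        bin/mapred job -list         # MapReduce commands move to bin/mapred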

      Attachments

      1. namenode-1.log (34 kB) - Vinod Kumar Vavilapalli
      2. namenode.log (0.7 kB) - Vinod Kumar Vavilapalli
      3. 4868_v4.patch (35 kB) - Sharad Agarwal
      4. 4868_v3.patch (78 kB) - Sharad Agarwal
      5. 4868_v2.patch (31 kB) - Sharad Agarwal
      6. 4868_v1.patch (26 kB) - Sharad Agarwal

        Issue Links

          Activity

          Robert Chansler added a comment -

          Editorial pass over all release notes prior to publication of 0.21. Subtask.

          Hudson added a comment -

          Integrated in Hadoop-trunk #778 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/778/ )

          Sharad Agarwal added a comment -

          I think this should not be an incompatible change.

          Tsz Wo Nicholas Sze added a comment -

          On second thought, this may not be an incompatible change. Please correct me if I am wrong.

          However, I think it would be good to add a release note, since it is not clear how to use the new scripts. The error message "Instead use the hdfs command for it." seems to say that we should use "./bin/hadoop hdfs ..."

          bash-3.2$ ./bin/hadoop dfsadmin   
          DEPRECATED: Use of this script to execute hdfs command is deprecated.
          Instead use the hdfs command for it.
          
          Usage: java DFSAdmin
                     [-report]
                     [-safemode enter | leave | get | wait]
                     [-saveNamespace]
                     [-restoreFailedStorage true|false|check]
                     [-refreshNodes]
                     [-finalizeUpgrade]
                     [-upgradeProgress status | details | force]
                     [-metasave filename]
                     [-refreshServiceAcl]
                     [-setQuota <quota> <dirname>...<dirname>]
                     [-clrQuota <dirname>...<dirname>]
                     [-setSpaceQuota <quota> <dirname>...<dirname>]
                     [-clrSpaceQuota <dirname>...<dirname>]
                     [-help [cmd]]
          
          Generic options supported are
          -conf <configuration file>     specify an application configuration file
          -D <property=value>            use value for given property
          -fs <local|namenode:port>      specify a namenode
          -jt <local|jobtracker:port>    specify a job tracker
          -files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
          -libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
          -archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.
          
          The general command line syntax is
          bin/hadoop command [genericOptions] [commandOptions]
          
          bash-3.2$ ./bin/hadoop hdfs    
          java.lang.NoClassDefFoundError: hdfs
          Caused by: java.lang.ClassNotFoundException: hdfs
                  at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
                  at java.security.AccessController.doPrivileged(Native Method)
                  at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
                  at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
                  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276)
                  at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
                  at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
          Exception in thread "main" bash-3.2$ 
          
          Tsz Wo Nicholas Sze added a comment -

          This is an incompatible change. Please add a release note.

          Hudson added a comment -

          Integrated in Hadoop-trunk #756 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/756/ )

          Sharad Agarwal added a comment -

          > This is an incompatible change.

          Though we print the deprecation warning for hdfs and mapred commands, all commands are still available through bin/hadoop. So I think that keeps this a compatible change, no?

          Konstantin Shvachko added a comment -
          1. This is an incompatible change. It should be marked as such and committed under the right section.
          2. Yes, on Windows the svn update goes rather strangely. I cannot reproduce exactly what happened, but I had to first remove the hdfs and mapred directories before the update went through, and I have the bin/hdfs script working now.

          Sharad Agarwal added a comment -

          I think the problem is only on Windows. Will submit a patch. Filed HADOOP-5212.

          Vinod Kumar Vavilapalli added a comment -

          Forgot to mention in the comments that I've attached the logs for both scenarios.

          Vinod Kumar Vavilapalli added a comment -
          • Using bin/hadoop, the namenode doesn't start at all, failing to resolve org.apache.hadoop.hdfs.server.namenode.NameNode.

          I haven't tested it on Linux yet.

          Vinod Kumar Vavilapalli added a comment -

          There seem to be some problems running the scripts on Cygwin.

          • Using bin/hdfs, the namenode starts, but I found an exception related to class resolution.

          Devaraj Das added a comment -

          I just committed this. Thanks, Sharad!

          Owen O'Malley added a comment -

          +1 on the patch, assuming the test failures are unrelated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12399451/4868_v4.patch
          against trunk revision 742409.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 8 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3816/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3816/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3816/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3816/console

          This message is automatically generated.

          Sharad Agarwal added a comment -

          OK, I have kept the scripts in bin/ itself. This keeps things simple, and full backward compatibility is maintained (see the sketch below). The scripts have been refactored in such a fashion that when the split happens, they can go into their respective subprojects, and very few environment variable flips are required (only in the *-config.sh files).
          Tested by emulating the case where the bin/* scripts are run as if the project split has happened.
          Also tested in local and distributed mode.

          > Should we name this "core-config.sh" to be consistent?

          I have kept the config name the same, "hadoop-config.sh". I think we can retain that name since the core script is also named "hadoop".
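
          As an illustration of how bin/hadoop can stay backward compatible while steering users to the new scripts, the delegation might look roughly like the fragment below. This is a sketch only; apart from the deprecation message Nicholas quotes above, the command lists, variable names and mapred wording are assumptions, not the committed patch.

            # hypothetical fragment of bin/hadoop
            COMMAND=$1
            case $COMMAND in
              namenode|datanode|dfsadmin|fsck)
                echo "DEPRECATED: Use of this script to execute hdfs command is deprecated."
                echo "Instead use the hdfs command for it."
                exec "$HADOOP_HOME"/bin/hdfs "$@"
                ;;
              job|queue|pipes)
                echo "DEPRECATED: Use of this script to execute mapred command is deprecated."
                echo "Instead use the mapred command for it."
                exec "$HADOOP_HOME"/bin/mapred "$@"
                ;;
            esac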

          Devaraj Das added a comment -

          The patch looks good overall. Some things that should be verified before it gets committed:
          1) It would be good to test/emulate the case where the bin/* scripts are run as if the project split has happened.
          2) TODOs for things that should be done when the actual split happens should be minimized.
          3) Does it make sense to have symlinks in the current $HADOOP_HOME/bin/ directory that point to the new $HADOOP_HOME/core/bin/ files (see the sketch below)? That way, applications like HOD wouldn't break, and the symlinks can be removed when the actual split happens.
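
          A sketch of suggestion 3, assuming the post-split layout Owen proposes below; the exact script names and locations are assumptions:

            # create compatibility symlinks in the old bin/ location
            ln -s "$HADOOP_HOME"/core/bin/hadoop   "$HADOOP_HOME"/bin/hadoop
            ln -s "$HADOOP_HOME"/hdfs/bin/hdfs     "$HADOOP_HOME"/bin/hdfs
            ln -s "$HADOOP_HOME"/mapred/bin/mapred "$HADOOP_HOME"/bin/mapred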

          Doug Cutting added a comment -

          > These invoke the common-config.sh present in the core.

          Should we name this "core-config.sh" to be consistent?

          Sharad Agarwal added a comment -

          This patch splits the scripts based on the proposal by Owen.

          • I have moved the core, mapred and hdfs specific scripts into the bin/core, bin/mapred and bin/hdfs folders. They would move into the respective $HADOOP_HOME/{core,hdfs,mapred}/bin folders when the repository structure is changed.
          • Added mapred-config.sh and hdfs-config.sh. These invoke the common-config.sh present in the core (see the sketch below). Note I renamed hadoop-config.sh to common-config.sh and moved some of the stuff from the old hadoop script into it. The idea is that common-config.sh will contain things which are common to all the subprojects.
          • common-config.sh loads the hdfs libraries if they are present (so that the fs command works for hdfs).
          • conf is in $HADOOP_HOME.
          • The hadoop script continues to work for all the commands, but its usage for mapred or hdfs commands has been deprecated.
          • I have put a couple of TODO markers at places where the value of an environment variable needs to be changed when the change in structure happens.

          Review?
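
          A minimal sketch of how hdfs-config.sh might pull in the shared common-config.sh under this layout; the path resolution and defaulting idiom are assumptions, not the patch itself:

            # hypothetical bin/hdfs/hdfs-config.sh
            bin=$(cd "$(dirname "${BASH_SOURCE-$0}")" && pwd)
            # TODO: after the repository split, point this at the core installation instead
            HADOOP_CORE_HOME=${HADOOP_CORE_HOME:-"$bin"/../core}
            . "$HADOOP_CORE_HOME"/common-config.sh
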
          Doug Cutting added a comment -

          > Do you mean build dependencies?

          I think Owen's plan is to use Ivy for build-time dependencies on hadoop-core, but not to bundle the core jar into releases like other jars we get from Ivy, and rather to force folks to explicitly install a compatible version of core. The idea is to make it easier for folks to separately upgrade hdfs, core and mapred.

          However, will we run into library conflicts? It would be bad if core and hdfs bundled different versions of dependent jars. So I wonder if we should use Ivy at install time to get a single version of dependent libraries, or whether we should just be very careful not to, e.g., bundle libraries in hdfs that are also bundled in core. If we used Ivy for other libraries at install time, that begs the question of whether we should also use it to get core at install time. If we did this, then one would configure Ivy in order to upgrade to a newer version of core. But Owen's goal is to make Hadoop play well with Linux package managers, which probably means not using Ivy at install time.

          Sharad Agarwal added a comment -

          > Should we not use Ivy for the dependency of mapred and hdfs on core?

          Do you mean build dependencies? There is a separate jira for it: HADOOP-5102.

          Doug Cutting added a comment -

          Should we not use Ivy for the dependency of mapred and hdfs on core?

          Doug Cutting added a comment -

          > I think that core, mapred, and hdfs need to be downloadable and installable separately.

          This sounds reasonable and is probably a better approach long-term than what I advocated above.

          Owen O'Malley added a comment -

          I think we should take a modified option 1.

          I think that core, mapred, and hdfs need to be downloadable and installable separately.

          I would propose a structure that looks like:

          $HADOOP_HOME/{core,hdfs,mapred}

          and a separated out conf dir:

          $HADOOP_HOME/conf

          Of course, we should make it configurable like:

          HADOOP_CORE_HOME = $HADOOP_HOME/core
          HADOOP_MAPRED_HOME = $HADOOP_HOME/mapred
          HADOOP_HDFS_HOME= $HADOOP_HOME/hdfs
          HADOOP_CONF_DIR = $HADOOP_HOME/conf

          Each of the subproject subdirectories would contain:
          bin, lib, and doc

          So to upgrade, you just untar the new release of the subproject into the corresponding spot and you are done.
          It will also be easier to map this into rpms where you want to install core, mapred, and hdfs from individual rpms.

          The scripts should probably be $HADOOP_CORE_HOME/bin/hadoop, $HADOOP_MAPRED_HOME/bin/mapred, and $HADOOP_HDFS_HOME/bin/hdfs. The mapred and hdfs scripts will handle all of the work for their respective projects. The hadoop script will delegate to the mapred and hdfs scripts as appropriate.

          One note is that the hadoop script will need to include the hdfs jars on the classpath. Otherwise, the fs commands won't work for HDFS and that would be bad.
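
          In shell terms, the overridable layout Owen describes might be expressed as a set of defaults along these lines (the variable names are from the comment; the defaulting idiom is an assumption):

            # hypothetical defaults in a shared config script
            HADOOP_CORE_HOME=${HADOOP_CORE_HOME:-"$HADOOP_HOME"/core}
            HADOOP_MAPRED_HOME=${HADOOP_MAPRED_HOME:-"$HADOOP_HOME"/mapred}
            HADOOP_HDFS_HOME=${HADOOP_HDFS_HOME:-"$HADOOP_HOME"/hdfs}
            HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"$HADOOP_HOME"/conf}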

          Doug Cutting added a comment -

          > Option 3 looks to be simpler.

          I agree. That's the sort of thing I had in mind. Nutch does something similar to include Hadoop in its builds.

          > One way could be by some variable CMD_DISPATCHER_CLASS which gets overridden in the mapred and hdfs.

          I was thinking this would be done based on the name of the primary sub-command (fs, job, pipes, etc.). We're already proposing to break back-compatibility, changing 'bin/hadoop job' to 'bin/hadoop-mapred job'. Under my proposal this might instead become 'bin/hadoop mapred job' (effectively just removing the dash). Note that 'bin/hadoop fs' could remain unchanged, since we have a package named 'fs'.

          We could preserve 100% compatibility by placing all command dispatchers under org.apache.hadoop.command. So the 'job' command dispatcher could be org.apache.hadoop.command.job.Command or some such.

          > Just having this won't be sufficient as we need to print help messages listing all the available commands.

          To list all available commands we can scan the classpath (java.class.path), and, for each file or directory, scan it for org.apache.hadoop.command sub-packages.
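
          Expressed as a shell fragment, the dispatch Doug outlines might look like this (purely illustrative; the org.apache.hadoop.command.*.Command classes are a proposal in this comment, not shipped code):

            # hypothetical dispatch in bin/hadoop
            COMMAND=$1
            shift
            CLASS=org.apache.hadoop.command.${COMMAND}.Command
            exec java -classpath "$CLASSPATH" "$CLASS" "$@"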

          Sharad Agarwal added a comment -

          > I would prefer that the included scripts were not directly in the bin/ directory, but rather in lib/ or a subdirectory. The bin/ directory should ideally only contain end-user commands.

          +1. We can have a bin/includes directory.

          > Also, once we split the projects, we'd like the combination of core & mapred and core & hdfs to be as simple as possible. Copying multiple scripts into directories seems fragile. Ideally we'd have a single shell script to bootstrap things and then get everything else from jars on the classpath, since we need to combine libraries (core, hdfs, & mapred) together on the classpath anyway.

          For combining I see these options:
          1. Install core separately before installing mapred or hdfs, and refer to it via an environment variable, say HADOOP_CORE_HOME.
          2. Bundle the core jar in mapred's and hdfs' lib. There could be a target in the build file, say setup, which would unpack the core jar into a subdirectory of mapred and hdfs, say core-release. Refer to it via an environment variable, say HADOOP_CORE_RELEASE. By default it would point to mapred/core-release.
          3. Bundle the core jar in mapred's and hdfs' lib. There could be a target in the build file, say setup, which would unpack the core jar in such a fashion that the contents of lib, conf and bin are copied to the respective directories of mapred and hdfs.

          Option 1 is clearly not preferable, as users would have to download and install two releases.
          For option 2, we would need to explicitly invoke scripts from the core and also explicitly add libraries to the classpath. There would be multiple conf folders, one for core and another for mapred/hdfs, which would need to be handled.
          Option 3 looks to be simpler. The hadoop script can add all the libraries present in the lib folder to the classpath, so it doesn't need to care where they came from (see the sketch at the end of this comment). We have a single conf folder, so most things remain as they are in terms of passing a different conf folder. This looks to be a good option; the only constraint is that the folder structure must remain the same for core, mapred and hdfs, and that there are no filename clashes.

          > Might it be simpler if the command dispatch were in Java? We might have a CoreCommand, plus MapredCommand and HdfsCommand subclasses.

          Do you mean we would have only one hadoop script and wouldn't need hadoop-mapred and hadoop-hdfs? In that case the bin/hadoop script would need to know which dispatcher to call: CoreCmdDispatcher, MapredCmdDispatcher or HDFSCmdDispatcher. One way could be some variable CMD_DISPATCHER_CLASS which gets overridden in mapred and hdfs. Not sure how; perhaps this variable can be set by the unpack script itself.
          Another way could be for CoreCmdDispatcher itself to look for the presence of MapredCmdDispatcher and HDFSCmdDispatcher on the classpath, and delegate if found. But this would mean a reverse dependency, although not at compile time.

          > The bin/hadoop script (from core) might, when invoked with 'bin/hadoop foo ...' run something like org.apache.hadoop.foo.FooCommand. Then we wouldn't need the core.sh, mapred.sh and hdfs.sh include scripts.

          This is similar to the current functionality of bin/hadoop <CLASSNAME>, no? Just having this won't be sufficient, as we need to print help messages listing all the available commands.
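
          A minimal sketch of option 3's classpath assembly, assuming the core, hdfs and mapred jars (and their dependencies) have all been unpacked into a single $HADOOP_HOME/lib:

            # build the classpath from everything in lib/, regardless of which subproject it came from
            CLASSPATH=$HADOOP_CONF_DIR
            for jar in "$HADOOP_HOME"/lib/*.jar; do
              CLASSPATH=$CLASSPATH:$jar
            done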

          Doug Cutting added a comment -

          I would prefer that the included scripts were not directly in the bin/ directory, but rather in lib/ or a subdirectory. The bin/ directory should ideally only contain end-user commands.

          Also, once we split the projects, we'd like the combination of core & mapred and core & hdfs to be as simple as possible. Copying multiple scripts into directories seems fragile. Ideally we'd have a single shell script to bootstrap things and then get everything else from jars on the classpath, since we need to combine libraries (core, hdfs, & mapred) together on the classpath anyway.

          Might it be simpler if the command dispatch were in Java? We might have a CoreCommand, plus MapredCommand and HdfsCommand subclasses. The bin/hadoop script (from core) might, when invoked with 'bin/hadoop foo ...' run something like org.apache.hadoop.foo.FooCommand. Then we wouldn't need the core.sh, mapred.sh and hdfs.sh include scripts.

          BTW, a perhaps little-known feature of hadoop is that it bundles the contents of bin/ into the jar, so that the jar contains (with a little unpacking) the tools needed to use it. We could continue this after the project split, so that, e.g., all that hdfs should need from a core release is its jar. When we build an hdfs release we can unpack bin/hadoop from the core jar.

          Sharad Agarwal added a comment -

          Attaching a patch for review.
          Changes from the last patch:
          Removed the dependency of hadoop-daemon.sh on the hadoop script. Now hadoop-daemon.sh takes an argument naming the specific script to invoke: hadoop-core, hadoop-mapred or hadoop-hdfs (see the sketch below).
          Deprecated start-all.sh and stop-all.sh.
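
          The change to hadoop-daemon.sh described here might reduce to something like the following fragment (hypothetical; the real patch's argument handling and logging may differ):

            # hypothetical fragment of hadoop-daemon.sh: the first argument names the
            # script to delegate to (hadoop-core, hadoop-mapred or hadoop-hdfs)
            script=$1
            shift
            # daemonize the requested script with the remaining arguments
            nohup "$HADOOP_HOME"/bin/"$script" "$@" >/dev/null 2>&1 &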

          Sharad Agarwal added a comment -

          This patch:
          creates hadoop-core, hadoop-hdfs and hadoop-mapred scripts. These can be invoked independently.
          creates the include scripts core.sh, hdfs.sh and mapred.sh.
          hadoop-mapred and hadoop-hdfs include core.sh (see the sketch below).
          the bin/hadoop script is deprecated; it includes core.sh, hdfs.sh and mapred.sh.
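
          Under this scheme, bin/hadoop-hdfs might be little more than a thin wrapper over the include scripts (a sketch; the entry-point function name is hypothetical):

            # hypothetical bin/hadoop-hdfs
            bin=$(cd "$(dirname "$0")" && pwd)
            . "$bin"/core.sh        # shared environment, classpath and option parsing
            . "$bin"/hdfs.sh        # hdfs-specific command handling
            run_hdfs_command "$@"   # hypothetical entry point defined in hdfs.sh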


            People

             • Assignee: Sharad Agarwal
             • Reporter: Sharad Agarwal
             • Votes: 0
             • Watchers: 8
