[HADOOP-9902] Shell script rewrite - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.0.0-alpha1
Fix Version/s: 3.0.0-alpha1
Component/s: scripts
Labels: releasenotes
Target Version/s:
Hadoop Flags: Incompatible change
Release Note:

The Hadoop shell scripts have been rewritten to fix many long-standing bugs and include some new features. While an eye has been kept towards compatibility, some changes may break existing installations.

INCOMPATIBLE CHANGES:

* The pid and out files for secure daemons have been renamed to include the appropriate ${HADOOP\_IDENT\_STRING}. This should allow, with proper configurations in place, for multiple versions of the same secure daemon to run on a host. Additionally, pid files are now created when daemons are run in interactive mode. This will also prevent the accidental starting of two daemons with the same configuration prior to launching java (i.e., "fast fail" without having to wait for socket opening).
* All Hadoop shell script subsystems now execute hadoop-env.sh, which allows for all of the environment variables to be in one location. This was not the case previously.
* The default content of *-env.sh has been significantly altered, with the majority of defaults moved into more protected areas inside the code. Additionally, these files no longer auto-append; a variable set on the command line prior to calling a shell command must contain the entire content, not just any extra settings. This brings Hadoop more in line with the vast majority of other software packages.
* All HDFS\_\*, YARN\_\*, and MAPRED\_\* environment variables act as overrides to their equivalent HADOOP\_\* environment variables when 'hdfs', 'yarn', 'mapred', and related commands are executed. Previously, these were separated out, which meant a significant amount of duplication of common settings.
* hdfs-config.sh and hdfs-config.cmd were inadvertently duplicated into libexec and sbin. The sbin versions have been removed.
* The log4j settings forcibly set by some *-daemon.sh commands have been removed. These settings are now configurable in the \*-env.sh files via \*\_OPTS.
* Support for various undocumented YARN log4j.properties files has been removed.
* Support for ${HADOOP\_MASTER} and the related rsync code have been removed.
* The undocumented and unused yarn.id.str Java property has been removed.
* The unused yarn.policy.file Java property has been removed.
* We now require bash v3 (released July 27, 2004) or better in order to take advantage of better regex handling and ${BASH\_SOURCE}. POSIX sh will not work.
* Support for --script has been removed. We now use ${HADOOP\_\*\_PATH} or ${HADOOP\_PREFIX} to find the necessary binaries. (See other note regarding ${HADOOP\_PREFIX} auto discovery.)
* Non-existent classpaths, ld.so library paths, JNI library paths, etc., will be ignored and stripped from their respective environment settings.
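
The override rule for subproject variables can be pictured with a short bash sketch. This only illustrates the precedence described above and is not the code the scripts use: the helper name `effective_setting` is hypothetical, and the directory values are made-up examples.

```shell
#!/usr/bin/env bash
# Illustration only: prefer a subproject-specific variable over its
# HADOOP_* equivalent. The helper name is hypothetical.
effective_setting() {
  local specific_name="$1" generic_name="$2"
  # ${!name} is bash indirect expansion: the value of the variable
  # whose name is stored in "name".
  if [[ -n "${!specific_name}" ]]; then
    printf '%s\n' "${!specific_name}"
  else
    printf '%s\n' "${!generic_name}"
  fi
}

# Example values (assumptions, not shipped defaults).
HADOOP_LOG_DIR="/var/log/hadoop"
YARN_LOG_DIR="/var/log/yarn"

effective_setting YARN_LOG_DIR HADOOP_LOG_DIR   # prints /var/log/yarn

unset YARN_LOG_DIR
effective_setting YARN_LOG_DIR HADOOP_LOG_DIR   # prints /var/log/hadoop
```

With this model, common settings need to be written only once under the HADOOP\_\* name, and a subproject name is set only where a command needs something different.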

NEW FEATURES:

* Daemonization has been moved from *-daemon.sh to the bin commands via the --daemon option. Simply use --daemon start to start a daemon, --daemon stop to stop a daemon, and --daemon status to set $? to the daemon's status. The return code for status is LSB-compatible. For example, 'hdfs --daemon start namenode'.
* It is now possible to override some of the shell code capabilities to provide site-specific functionality without replacing the shipped versions. Replacement functions should go into the new hadoop-user-functions.sh file.
* A new option called --buildpaths will attempt to add developer build directories to the classpath to allow for in-source-tree testing.
* Operations which trigger ssh connections can now use pdsh if installed. ${HADOOP\_SSH\_OPTS} still gets applied.
* Added distch and jnipath subcommands to the hadoop command.
* Shell scripts now support a --debug option which will report basic information on the construction of various environment variables, java options, classpath, etc. to help in configuration debugging.
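
Because the status return code is LSB-compatible, a caller can translate `$?` into a readable message. The following is a minimal sketch: `describe_daemon_status` is a hypothetical name, the code meanings are those of the LSB init-script convention, and the release note does not say which of them the Hadoop scripts actually return.

```shell
#!/usr/bin/env bash
# Translate an LSB init-script status code into a short description.
# The function name is hypothetical; the meanings follow the LSB convention.
describe_daemon_status() {
  case "$1" in
    0) echo "running" ;;
    1) echo "dead, but pid file exists" ;;
    2) echo "dead, but lock file exists" ;;
    3) echo "not running" ;;
    *) echo "unknown (code $1)" ;;
  esac
}

# Intended use on a host with Hadoop installed (not executed here):
#   hdfs --daemon status namenode
#   describe_daemon_status $?
describe_daemon_status 3   # prints "not running"
```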

BUG FIXES:

* ${HADOOP\_CONF\_DIR} is now properly honored everywhere, without requiring symlinking and other such tricks.
* ${HADOOP\_CONF\_DIR}/hadoop-layout.sh is now documented with a provided hadoop-layout.sh.example file.
* Shell commands should now work properly when called as a relative path, without ${HADOOP\_PREFIX} being defined, and as the target of bash -x for debugging. If ${HADOOP\_PREFIX} is not set, it will be automatically determined based upon the current location of the shell library. Note that other parts of the extended Hadoop ecosystem may still require this environment variable to be configured.
* Operations which trigger ssh will now limit the number of connections to run in parallel to ${HADOOP\_SSH\_PARALLEL} to prevent memory and network exhaustion. By default, this is set to 10.
* ${HADOOP\_CLIENT\_OPTS} support has been added to a few more commands.
* Some subcommands were not listed in the usage.
* Various options on hadoop command lines were supported inconsistently. These have been unified into hadoop-config.sh. --config is still required to be first, however.
* ulimit logging for secure daemons no longer assumes /bin/bash but does assume bash is on the command line path.
* Removed references to some Yahoo! specific paths.
* Removed unused slaves.sh from YARN build tree.
* Many exit states have been changed to reflect reality.
* Shell-level errors now go to STDERR. Before, many of them went incorrectly to STDOUT.
* CDPATH with a period (.) should no longer break the scripts.
* The scripts no longer try to chown directories.
* If ${JAVA\_HOME} is not set on OS X, it now properly detects it instead of throwing an error.
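
The automatic ${HADOOP\_PREFIX} discovery relies on a script knowing its own location, which is what ${BASH\_SOURCE} provides. Below is a rough sketch of the idea, not the shipped implementation: `guess_prefix` is a hypothetical helper, and it assumes the file lives one directory below the prefix (for example in libexec/).

```shell
#!/usr/bin/env bash
# Hypothetical helper: guess an install prefix from this file's own location,
# assuming the file sits one directory below the prefix (e.g. in libexec/).
guess_prefix() {
  local self="${BASH_SOURCE[0]:-$0}"
  local here
  # "CDPATH=" keeps cd from searching other directories and echoing the
  # resolved path, so a CDPATH setting in the caller cannot interfere.
  here=$(CDPATH= cd -P -- "$(dirname -- "${self}")" >/dev/null 2>&1 && pwd -P) || return 1
  (CDPATH= cd -P -- "${here}/.." >/dev/null 2>&1 && pwd -P)
}

# Only fill in the variable when the caller has not already set it.
HADOOP_PREFIX="${HADOOP_PREFIX:-$(guess_prefix)}"
echo "HADOOP_PREFIX=${HADOOP_PREFIX}"
```

Resolving the directory with `cd -P` and `pwd -P` yields a physical absolute path, so the result is the same whether the script was invoked through a relative path or a symlinked directory.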

IMPROVEMENTS:

* The *.out files are now appended instead of overwritten to allow for external log rotation.
* The style and layout of the scripts are much more consistent across subprojects.
* More of the shell code is now commented.
* Significant amounts of redundant code have been moved into a new file called hadoop-functions.sh.
* The various *-env.sh files have been massively changed to include documentation and examples on what can be set, ramifications of setting, etc. for all variables that are expected to be set by a user.
* There is now some trivial de-duplication and sanitization of the classpath and JVM options. This allows, amongst other things, for custom settings in \*\_OPTS for Hadoop daemons to override defaults and other generic settings (i.e., ${HADOOP\_OPTS}). This is particularly relevant for Xmx settings, as one can now set them in \*\_OPTS and ignore the heap-specific options for daemons which force the size in megabytes.
* Subcommands have been alphabetized in both usage and in the code.
* All/most of the functionality provided by the sbin/* commands has been moved to either their bin/ equivalents or made into functions. The rewritten versions of these commands are now wrappers to maintain backward compatibility.
* Usage information is given with the following options/subcommands for all scripts using the common framework: --? -? ? --help -help -h help
* Several generic environment variables have been added to provide a common configuration for pids, logs, and their security equivalents. The older versions still act as overrides to these generic versions.
* Groundwork has been laid to allow for custom secure daemon setup using something other than jsvc (e.g., pfexec on Solaris).
* Scripts now test and report better error messages for various states of the log and pid dirs on daemon startup. Before, unprotected shell errors would be displayed to the user.
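
As an illustration of the classpath de-duplication mentioned above, the sketch below removes repeated entries from a colon-separated path while keeping the first occurrence of each. The function name `dedupe_path` is hypothetical, and the real scripts may go about this differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: drop repeated entries from a colon-separated path,
# keeping the first occurrence of each and preserving order.
dedupe_path() {
  local input="$1" entry result=""
  local -a entries
  # Split on ":" without glob expansion, so wildcard entries such as
  # /lib/* stay literal.
  IFS=':' read -r -a entries <<< "${input}"
  for entry in "${entries[@]}"; do
    [[ -z "${entry}" ]] && continue
    case ":${result}:" in
      *":${entry}:"*) ;;                               # already present
      *) result="${result:+${result}:}${entry}" ;;     # first time seen
    esac
  done
  printf '%s\n' "${result}"
}

dedupe_path '/etc/hadoop:/opt/lib/*:/etc/hadoop:/opt/extra'
# prints /etc/hadoop:/opt/lib/*:/opt/extra
```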

Description

Umbrella JIRA for shell script rewrite. See more-info.txt for more details.

Attachments

- HADOOP-9902-16.patch, 18/Aug/14 22:53, 193 kB, Allen Wittenauer
- HADOOP-9902-15.patch, 18/Aug/14 22:01, 193 kB, Allen Wittenauer
- HADOOP-9902-14.patch, 08/Aug/14 06:43, 193 kB, Allen Wittenauer
- HADOOP-9902-13-branch-2.patch, 05/Aug/14 18:44, 190 kB, Allen Wittenauer
- HADOOP-9902-13.patch, 05/Aug/14 18:35, 191 kB, Allen Wittenauer
- HADOOP-9902-12.patch, 04/Aug/14 22:40, 191 kB, Allen Wittenauer
- HADOOP-9902-11.patch, 30/Jul/14 22:58, 191 kB, Allen Wittenauer
- HADOOP-9902-10.patch, 28/Jul/14 06:50, 190 kB, Allen Wittenauer
- HADOOP-9902-9.patch, 26/Jul/14 03:19, 190 kB, Allen Wittenauer
- HADOOP-9902-8.patch, 23/Jul/14 23:10, 187 kB, Allen Wittenauer
- HADOOP-9902-7.patch, 22/Jul/14 17:33, 187 kB, Allen Wittenauer
- HADOOP-9902-6.patch, 16/Jul/14 18:26, 186 kB, Allen Wittenauer
- HADOOP-9902-5.patch, 14/Jul/14 22:35, 186 kB, Allen Wittenauer
- HADOOP-9902-4.patch, 14/Jul/14 19:12, 185 kB, Allen Wittenauer
- HADOOP-9902-3.patch, 06/Jul/14 17:42, 165 kB, Allen Wittenauer
- HADOOP-9902-2.patch, 02/Jul/14 00:19, 151 kB, Allen Wittenauer
- HADOOP-9902.patch, 30/May/14 22:45, 142 kB, Allen Wittenauer
- HADOOP-9902.txt, 11/Feb/14 04:48, 86 kB, Allen Wittenauer
- hadoop-9902-1.patch, 16/Sep/13 22:41, 101 kB, Allen Wittenauer
- more-info.txt, 25/Aug/13 00:47, 3 kB, Allen Wittenauer

Issue Links

blocks

HADOOP-11010 Post-9902 "Umbrella" JIRA

Resolved

contains

HADOOP-6135 Fix hadoop-config.sh to work with symlinked bin directory

Resolved

HADOOP-7825 Hadoop wrapper script not picking up native libs correctly

Resolved

HADOOP-7906 haoop-daemon.sh unconditionnally try to chown its log directory

Resolved

HADOOP-8448 Java options being duplicated several times

Resolved

HADOOP-8464 hadoop-env.sh is included twice: once via hadoop-config.sh the again explicitly via scripts

Resolved

HADOOP-9351 Hadoop daemon startup scripts cause duplication of command line arguments

Resolved

HADOOP-9870 Mixed configurations for JVM -Xmx in hadoop command

Resolved

HADOOP-13364 Variable HADOOP_LIBEXEC_DIR must be quoted in bin/hadoop line 26

Resolved

HDFS-2715 start-dfs.sh falsely warns about processes already running

Resolved

HADOOP-7203 Folder Paths

Resolved

HADOOP-8476 Remove duplicate VM arguments for hadoop deamon

Resolved

HADOOP-9873 hadoop-env.sh got called multiple times

Resolved

HADOOP-10978 HADOOP_IDENT_STRING is overriden in hadoop-env.sh

Resolved

HDFS-1492 Secondary NameNode starting issue

Resolved

MAPREDUCE-3051 HADOOP_CONF_DIR exported twice in the classpath

Resolved

MAPREDUCE-3432 Yarn doesn't work if JAVA_HOME isn't set

Resolved

MAPREDUCE-5621 mr-jobhistory-daemon.sh doesn't have to execute mkdir and chown all the time

Resolved

YARN-3693 Duplicate parameters on service start for NM and RM

Resolved

HDFS-1326 Provide pluggable mechanism for securing datanodes

Resolved

HADOOP-1947 the hadoop-daemon.sh should allow the admin to configure the log4j appender for the servers

Resolved

HADOOP-7572 Ability to run the daemons from source trees

Resolved

HDFS-7745 HDFS should have its own daemon command and not rely on the one in common

Resolved

YARN-356 Add YARN_NODEMANAGER_OPTS and YARN_RESOURCEMANAGER_OPTS to yarn.env

Resolved

HADOOP-7586 Hadoop daemon does not clean up pid file upon normal shutdown

Resolved

HADOOP-8505 hadoop scripts to support user native lib dirs

Resolved

HADOOP-9109 Support remote shell comands other than ssh in startup scripts

Resolved

HDFS-5087 Allowing specific JAVA heap max setting for HDFS related services

Resolved

HADOOP-8797 automatically detect JAVA_HOME on Linux, report native lib path similar to class path

Resolved

HADOOP-9979 HADOOP_IDENT_STRING should not be changed in hadoop-env.sh for hadoop daemons running as services

Resolved

duplicates

HADOOP-6167 bin/hadoop script doesn't allow for different memory settings for each daemon type

Resolved

HADOOP-6179 Modify hadoop scripts to pick correct jar from hadoop-mapred jar from HADOOP_HOME

Resolved

HADOOP-6746 hadoop-daemon.sh does not append to log files

Resolved

HADOOP-8026 various shell script fixes

Resolved

HADOOP-8792 hadoop-daemon doesn't handle chown failures

Resolved

HADOOP-10245 Hadoop command line always appends "-Xmx" option twice

Resolved

HDFS-1281 Fix '$bin' path duplication in setup scripts

Resolved

HADOOP-6851 Fix '$bin' path duplication in setup scripts

Resolved

HADOOP-1947 the hadoop-daemon.sh should allow the admin to configure the log4j appender for the servers

Resolved

YARN-115 yarn commands shouldn't add "m" to the heapsize

Resolved

HADOOP-5617 make chukwa log4j configuration more transparent from hadoop

Resolved

incorporates

HADOOP-6368 hadoop classpath is too long

Resolved

HADOOP-5787 Allow HADOOP_ROOT_LOGGER to be configured via conf/hadoop-env.sh

Resolved

YARN-2346 Add a 'status' command to yarn-daemon.sh

Resolved

is depended upon by

HADOOP-15009 hadoop-resourceestimator's shell scripts are a mess

Patch Available

is duplicated by

HADOOP-12574 Multiple Xmx Parameters to Java Process

Resolved

is related to

HDFS-4763 Add script changes/utility for starting NFS gateway

Closed

MAPREDUCE-3954 Clean up passing HEAPSIZE to yarn and mapred commands.

Resolved

HADOOP-12364 Deleting pid file after stop is causing the daemons to keep restarting

Resolved

HDFS-272 Update startup scripts to start Checkpoint node instead of SecondaryNameNode

Resolved

HDFS-5087 Allowing specific JAVA heap max setting for HDFS related services

Resolved

relates to

HADOOP-14921 Conflicts when starting daemons with the same name

Open

HADOOP-7518 Unable to start HDFS cluster on trunk (after the common mavenisation)

Resolved

HADOOP-7838 sbin/start-balancer doesnt

Resolved

HADOOP-8092 Hadoop DataNode cannot start up in Pseudo-Distributed mode using start-all.sh if it is run as root

Resolved

HDFS-11245 HDFS ignores HADOOP_CONF_DIR

Resolved

MAPREDUCE-727 Move the bin/hadoop jar command over to bin/mapred

Resolved

HADOOP-10177 Create CLI tools for managing keys via the KeyProvider API

Closed

MAPREDUCE-4649 mr-jobhistory-daemon.sh needs to be updated post YARN-1

Closed

YARN-1429 *nix: Allow a way for users to augment classpath of YARN daemons

Closed

HBASE-13231 shell script rewrite

Closed

HADOOP-2689 RegEx support for expressing datanodes in the slaves conf files

Open

HADOOP-10912 Modify scripts to use relative paths

Open

HADOOP-8222 bin/hadoop should allow callers to set jsvc pidfile even when not-detached

Resolved

HDFS-383 Modify datanode configs to specify minimum JVM heapsize

Resolved

HDFS-2256 we should add a wait for non-safe mode and call dfsadmin -report in start-dfs

Resolved

YARN-1118 Improve help message for $ yarn node

Resolved

YARN-1117 Improve help message for $ yarn applications and $yarn node

Closed

links to

UnixShellScriptProgrammingGuide

(25 contains, 11 duplicates, 3 incorporates, 1 is depended upon by, 1 is duplicated by, 5 is related to, 17 relates to, 1 links to)

People

Assignee: Allen Wittenauer

Reporter: Allen Wittenauer

Votes: 0

Watchers: 55

Dates

Created: 25/Aug/13 00:32

Updated: 02/Nov/17 21:35

Resolved: 19/Aug/14 12:11