I would prefer that the included scripts were not directly in the bin/ directory, but rather in lib/ or a subdirectory. The bin/ directory should ideally only contain end-user commands.
+1. We could have a bin/includes directory.
Also, once we split the projects, we'd like the combination of core & mapred and core & hdfs to be as simple as possible. Copying multiple scripts into directories seems fragile. Ideally we'd have a single shell script to bootstrap things and then get everything else from jars on the classpath, since we need to combine libraries (core, hdfs, & mapred) together on the classpath anyway.
For combining I see these options:
1. Install core separately before installing mapred or hdfs, and refer to it via an environment variable, say HADOOP_CORE_HOME.
2. Bundle the core jar in mapred's and hdfs' lib. There could be a target in the build file, say setup, which would unpack the core jar into a subdirectory of mapred and hdfs, say core-release. Refer to it via an environment variable, say HADOOP_CORE_RELEASE. By default it would point to mapred/core-release.
3. Bundle the core jar in mapred's and hdfs' lib. There could be a target in the build file, say setup, which would unpack the core jar in such a fashion that the contents of lib, conf and bin are copied into the corresponding directories of mapred and hdfs.
Option 1 is clearly not preferable, as users would have to download and install two releases.
For option 2, we would have to explicitly invoke scripts from core and also explicitly add its libraries to the classpath. There would be multiple conf folders, one for core and another for mapred/hdfs, which would need to be handled.
Option 3 looks simpler. The hadoop script can add all the libraries present in the lib folder to the classpath, so it doesn't need to care where each jar came from. With a single conf folder, most things remain as they are in terms of passing a different conf folder. This looks like a good option; the only constraints are that the folder structure must remain the same for all of core, mapred and hdfs, and that there are no filename clashes.
Might it be simpler if the command dispatch were in Java? We might have a CoreCommand, plus MapredCommand and HdfsCommand subclasses.
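Such a hierarchy might look like the sketch below. This is purely illustrative: none of these class names exist in Hadoop, and the concrete JobCommand is a hypothetical stand-in.

```java
/** Hypothetical base class for commands shipped by core. */
abstract class CoreCommand {
    abstract String name();          // the word typed after "bin/hadoop"
    abstract int run(String[] args); // returns the process exit code
}

/** mapred and hdfs would ship their own subclasses in their jars. */
abstract class MapredCommand extends CoreCommand { }
abstract class HdfsCommand extends CoreCommand { }

/** An illustrative concrete command from the mapred jar. */
class JobCommand extends MapredCommand {
    String name() { return "job"; }
    int run(String[] args) {
        // a real implementation would submit/query jobs; stubbed here
        return 0;
    }
}
```

Since subclasses live in the mapred and hdfs jars, adding those jars to the classpath is all it takes to make their commands available.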
Do you mean we would have only one hadoop script and wouldn't need hadoop-mapred and hadoop-hdfs? In that case the bin/hadoop script would need to know which of CoreCmdDispatcher, MapredCmdDispatcher or HDFSCmdDispatcher to call. One way could be a variable, say CMD_DISPATCHER_CLASS, which gets overridden in mapred and hdfs. Not sure how; perhaps this variable could be set by the unpack script itself.
Another way could be for CoreCmdDispatcher itself to look for the presence of MapredCmdDispatcher and HDFSCmdDispatcher on the classpath and, if found, delegate to them. But this would mean a reverse dependency, although not a compile-time one.
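The classpath probe avoids the compile-time dependency because it only needs the class name as a string. A minimal sketch, using the hypothetical dispatcher names from this thread:

```java
/** Sketch: core probes the classpath for the mapred/hdfs dispatchers
 *  by name, so core never references their classes at compile time. */
class CoreCmdDispatcher {
    /** True if the named class can be loaded from the classpath. */
    static boolean isPresent(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical fully-qualified names; delegation would use
        // reflection on the loaded class.
        if (isPresent("org.apache.hadoop.mapred.MapredCmdDispatcher")) {
            // delegate mapred commands here
        }
        if (isPresent("org.apache.hadoop.hdfs.HDFSCmdDispatcher")) {
            // delegate hdfs commands here
        }
    }
}
```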
The bin/hadoop script (from core) might, when invoked as 'bin/hadoop foo ...', run something like org.apache.hadoop.foo.FooCommand. Then we wouldn't need the core.sh, mapred.sh and hdfs.sh include scripts.
This is similar to the current functionality of bin/hadoop <CLASSNAME>, no? Just having this won't be sufficient, since we also need to print help messages listing all the available commands.
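The naming-convention dispatch is easy to sketch; the help problem is real, since Class.forName can load a named class but cannot enumerate what's on the classpath (that would need an explicit registry or something like a services file). A hypothetical sketch, assuming the org.apache.hadoop.<cmd>.<Cmd>Command convention suggested above:

```java
import java.lang.reflect.Method;
import java.util.Arrays;

/** Hypothetical Java entry point for "bin/hadoop <command> ...". */
class Hadoop {
    /** Map "foo" to "org.apache.hadoop.foo.FooCommand". */
    static String classFor(String cmd) {
        String cap = Character.toUpperCase(cmd.charAt(0)) + cmd.substring(1);
        return "org.apache.hadoop." + cmd + "." + cap + "Command";
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0 || args[0].equals("help")) {
            // Listing all commands would require a registry contributed
            // by each jar; reflection alone cannot discover them.
            System.err.println("Usage: hadoop <command> [args]");
            return;
        }
        Class<?> c = Class.forName(classFor(args[0]));
        // Assumes each command class has an int run(String[]) method.
        Method run = c.getMethod("run", String[].class);
        String[] rest = Arrays.copyOfRange(args, 1, args.length);
        Object exit = run.invoke(c.getDeclaredConstructor().newInstance(),
                                 (Object) rest);
        System.exit((Integer) exit);
    }
}
```

With this, dropping the hdfs or mapred jar onto the classpath makes its commands runnable without any script changes, but the help listing still needs each jar to register its command names somewhere.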