Hadoop Common / HADOOP-9082

Simplify scripting usages so that parallel platform-dependent scripts are simple to maintain

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      This issue was formerly titled "Select and document a platform-independent scripting language for use in Hadoop environment" and was discussed in the common-dev@ threads "[PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack" and "[VOTE] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack".

      In that discussion the community consensus rejected the idea of using Python as a cross-platform scripting language.

      It is now proposed to follow up Allen's suggestions below about simplifying the use of scripts in Hadoop, bring more functionality into core code, and try to leave us with only trivial usages of scripting that can readily be maintained in as many platform-specific scripting languages as necessary.

          Activity

          Matt Foley added a comment -

          This discussion started in HADOOP-8924, where it was proposed to replace the build-time utility "saveVersion.sh" with a Python script. This would require Python as a build-time dependency. Here's the background:

          Those of us involved in the branch-1-win port of Hadoop to Windows without use of Cygwin have faced the issue of frequent use of shell scripts throughout the system, both at build time (e.g., the utility "saveVersion.sh") and at run time (config files like "hadoop-env.sh" and the start/stop scripts in "bin/*"). Similar usages exist throughout the Hadoop stack, in all projects.

          The vast majority of these shell scripts do not do anything platform specific; they can be expressed in a POSIX-conforming way. Therefore, it seems to us that it makes sense to start using a cross-platform scripting language, such as Python, in place of shell for these purposes. For those rare occasions where platform-specific functionality really is needed, Python also supports quite a lot of platform-specific functionality on both Linux and Windows; but where that is inadequate, one could still conditionally invoke a platform-specific module written in shell (for Linux/*nix) or powershell or bat (for Windows).

          The primary motive for moving to a cross-platform scripting language is maintainability. The alternative would be to maintain two complete suites of scripts, one for Linux and one for Windows (and perhaps others in the future). We want to avoid the need to update dual modules in two different languages when functionality changes, especially given that many Linux developers are not familiar with powershell or bat, and many Windows developers are not familiar with shell or bash.
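The proposal above (one portable script, with a platform-specific helper invoked only where needed) could look roughly like the following. This is a minimal sketch, not code from Hadoop: the helper file names `platform-setup.cmd` and `platform-setup.sh` and the function names are hypothetical, chosen only for illustration.

```python
import platform

# Hypothetical platform-specific helpers; these file names are assumptions
# made for this sketch and do not exist in Hadoop.
PLATFORM_HELPERS = {
    "Windows": ["cmd", "/c", "platform-setup.cmd"],
    "Linux": ["bash", "platform-setup.sh"],
}


def helper_command(system=None):
    """Return the argv of the platform-specific helper, or None when the
    portable code path needs no helper on this operating system."""
    system = system or platform.system()
    return PLATFORM_HELPERS.get(system)


def classpath(jars, system=None):
    """Join jar paths with the separator the target platform expects."""
    system = system or platform.system()
    separator = ";" if system == "Windows" else ":"
    return separator.join(jars)
```

Passing `system` explicitly keeps the portable logic testable on a single machine; a real script would normally omit it and rely on `platform.system()`.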

          Matt Foley added a comment -

          Why Python:

          • There are already a few instances of Python usage in Hadoop, such as the (currently broken) utility "relnotes.py", and heavy use of Python in the examples/ and contrib/ directories.
          • Python is also used in Bigtop at build time.
          • The Python language is available for free on essentially all platforms, under an Apache-compatible license.
          • It is supported in Eclipse and similar IDEs.
          • Most importantly, it is widely accepted as a reasonably good OO scripting language, and it is easily learned by anyone who already knows shell, Perl, or other common scripting languages.
          • On the Tiobe index of programming language popularity, which seeks to measure the relative number of software engineers who know and use each language, Python far exceeds Perl and Ruby. The only better-known scripting languages are PHP and Visual Basic, neither of which seems a prime candidate for this use.
          Radim Kolar added a comment -

          Groovy would be best. It requires no user installation. You can run Groovy scripts directly from Maven via the Groovy Maven plugin, and at runtime you just need to add the groovy-all jar to the classpath.

          Doug Cutting added a comment -

          So on Windows this replaces the dependency on Cygwin with one on Python. Why is Cygwin unacceptable?

          Allen Wittenauer added a comment -

          (I know this is mostly going to get ignored because a) it's from me, b) it's more than 3 lines, and c) we've already proven that we only care about Linux despite people wanting support for other platforms, but here we go anyway.)

          While I can understand the build-time issues, I'm not sure I understand the run-time issues. If you are running on a system that doesn't have libhadoop or want to launch a task, you're going to hit a fork() and that's going to call bash (or potentially sh). Or are we planning on replacing taskjvm.sh as well? So the bash requirement doesn't go away.

          At run-time, the whole purpose of these scripts is to launch Java. That's it. The problem that we have is that our current scripts are extremely convoluted, wrap into themselves, and fundamentally aren't written very well. Arguing that we can make our launcher scripts object oriented or use an IDE to debug them seems like we're expecting to raise the complexity to even more ludicrous levels.

          One thing I'm very curious about is if we'll lose the ${BASH_SOURCE} functionality, something I consider absolutely critical, by moving to Python. (It allows one to run without setting any environment variables. I think I submitted that as a patch years ago, but well...)

          Let's say we pick Python. Which version are we going to target? From a support perspective, we could very easily end up asking about not only the Java version but the Python version. Do we really want that?

          "The alternative would be to maintain two complete suites of scripts, one for Linux and one for Windows (and perhaps others in the future)."

          This is what most projects do that have Windows and UNIX functionality, from what I've seen. This is because file locations, delimiters, and so on differ, and if you merge them, you end up with a lot of "if this then that, or if this2, then that2", to the point that you essentially have two different suites of scripts, just stored in one.

          "We want to avoid the need to update dual modules in two different languages when functionality changes, especially given that many Linux developers are not familiar with powershell or bat, and many Windows developers are not familiar with shell or bash."

          I think this is the real message: the "Linux developers" (which should be read as "Java developers who work on Hadoop") don't know bash and fundamentally ignore most attempts from outside to improve the scripts. Switching to something else isn't going to change this problem. Instead, it'll just allow them to continue ignoring the community in favor of their own changes.

          Perhaps the fundamental problem is this: Why are so many launcher changes even necessary? Why isn't Hadoop smart enough to figure out some of these things after Java is launched? Have we even seriously attempted a simplification of the scripts? (I suspect just using functions instead of the craziness around exported variables would make a world of difference.) Has there been any thought about actually creating real configuration files built by installers so we don't have to recompute a half-dozen things at every run time?

          Side-note: it would be interesting to see the memory footprint requirement differences on something like one of Yahoo!'s gateways. Sure, individually it isn't much. But at scale...

          Anyway, I've given my $0.02. Do what you want, I won't stop you. But I do question the thinking behind it.
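Two ideas in the comment above are a launcher that locates its own install directory without any environment variables (what ${BASH_SOURCE} provides in bash) and the use of functions in place of exported variables. The sketch below illustrates both in Python purely for comparison; it is not a proposal from the thread. The `<home>/bin/<script>` layout, the `conf` and `lib` directory names, and all function names are assumptions made for the example.

```python
import os
from pathlib import Path


def install_home(script_path):
    """Return the install root for a launcher stored at <home>/bin/<script>.

    resolve() follows symlinks, so the result does not depend on the
    caller's working directory or on any environment variable.
    """
    return Path(script_path).resolve().parent.parent


def classpath(home):
    """Build the classpath from the install root (conf/ and lib/ are assumed names)."""
    return os.pathsep.join([str(home / "conf"), str(home / "lib" / "*")])


def java_command(home, main_class, args):
    """Assemble the java argv from function results instead of exported variables."""
    return ["java", "-cp", classpath(home), main_class, *args]
```

A launcher script would typically call `install_home(__file__)` and hand the resulting list to `subprocess.run`. The same shape (a self-locating entry point plus small functions that return values) carries over to bash using `${BASH_SOURCE}` and shell functions.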

          Doug Cutting added a comment -

          I agree with Allen. We should minimize scripting. I fear that Python would encourage more scripting when what we need is less. Scripts should be limited to things that cannot be done in Java. Parallel versions (bash & bat) of a few simple scripts may be a necessary evil if Cygwin is unacceptable.

          Eli Collins added a comment -

          I agree with Allen and Doug as well. IMO we should make the existing bash scripts simpler such that maintaining a parallel set of BAT files (if necessary) isn't a big deal.

          Matt Foley added a comment -

          I agree too. Repurposing this jira to focus discussion on that direction. Will also reopen HADOOP-8924 for a non-python solution to the specific problem of saveVersion.sh.


            People

            • Assignee:
              Unassigned
              Reporter:
              Matt Foley
            • Votes:
              0
              Watchers:
              15
