Hadoop Common / HADOOP-12363

Hadoop binary distributions contain many copies of the same jars

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate

    Description

      I noticed this two years ago, but it is bugging me again, so I'm finally filing a bug.

      The Hadoop binary distribution is insanely redundant. Over 80% of the size of the ~200MB tarballs distributed both by Apache upstream and by Cloudera is made up of duplicate files.

      Back when I was complaining about CDH 4.4.0, the Hadoop tarball contained 3477 duplicate files, some of which had 98 copies in the tarball!
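Counts like these can be reproduced without unpacking anything. The following is a minimal sketch, not the method used for the numbers in this report: it assumes "duplicate" means byte-identical content (compared by SHA-256), and the names `duplicate_groups` and `summarize` are hypothetical helpers introduced only for illustration.

```python
import hashlib
import tarfile
from collections import defaultdict


def duplicate_groups(tar_path):
    """Map content digest -> [(member name, size), ...] for contents stored more than once."""
    by_digest = defaultdict(list)
    with tarfile.open(tar_path, "r:*") as tar:  # "r:*" auto-detects gzip/bzip2/xz
        for member in tar:
            # Skip directories, symlinks, hard links, and empty files.
            if not member.isfile() or member.size == 0:
                continue
            digest = hashlib.sha256()
            stream = tar.extractfile(member)
            for chunk in iter(lambda: stream.read(1 << 20), b""):
                digest.update(chunk)
            by_digest[digest.hexdigest()].append((member.name, member.size))
    return {d: files for d, files in by_digest.items() if len(files) > 1}


def summarize(groups):
    """Return (distinct duplicated contents, highest copy count, bytes used by extra copies)."""
    most_copies = max((len(files) for files in groups.values()), default=0)
    # Every copy beyond the first is counted as redundant.
    redundant = sum(files[0][1] * (len(files) - 1) for files in groups.values())
    return len(groups), most_copies, redundant
```

Calling `summarize(duplicate_groups("hadoop-2.7.1.tar.gz"))` on a local copy of the tarball would give the number of duplicated files, the highest copy count, and the bytes taken by redundant copies.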

      Now I'm looking at the official hadoop-2.7.1.tar.gz and I see 7 copies each of jackson-mapper-asl-1.9.13.jar, jersey-server-1.9.jar, protobuf-java-2.5.0.jar, etc.; 6 copies each of guava-11.0.2.jar, xz-1.0.jar, commons-logging-1.1.3.jar, etc.; 5 copies of snappy-java-1.0.4.1.jar; and so on. All in all, well over 200 files appear at least twice in the tarball, and they account for 118MB worth of files that could simply be replaced with symlinks (assuming you don't want to change the structure of the tarball at all).
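To illustrate the symlink idea on an already extracted tree, here is a rough sketch. It is not part of Hadoop's build or packaging tooling; `symlink_duplicates` and `file_digest` are hypothetical names, and the sketch assumes the target filesystem supports symlinks and that byte-identical files are interchangeable. It defaults to a dry run that only reports how many bytes would be saved.

```python
import hashlib
import os


def file_digest(path):
    """SHA-256 of a file's content, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def symlink_duplicates(root, dry_run=True):
    """Replace repeated files under root with relative symlinks to the first copy seen.

    Returns the number of bytes saved (or that would be saved in a dry run).
    """
    first_seen = {}  # content digest -> path of the copy that is kept
    saved = 0
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Leave existing symlinks and non-regular files alone.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            size = os.path.getsize(path)
            if size == 0:
                continue
            original = first_seen.setdefault(file_digest(path), path)
            if original == path:
                continue  # first occurrence: keep the real file
            saved += size
            if not dry_run:
                # A relative target keeps the tree relocatable.
                target = os.path.relpath(original, start=dirpath)
                os.remove(path)
                os.symlink(target, path)
    return saved
```

Running `symlink_duplicates("hadoop-2.7.1")` first and checking the reported figure before passing `dry_run=False` is the safer order, since the replacement deletes files in place.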

      This is really not necessary.

      Can we fix the distribution? I'm sure Cloudera and others will fix their distributions as well once this is fixed upstream (their distros exhibit a substantially more acute version of this problem).

          People

            Assignee: Unassigned
            Reporter: Benoit Sigoure (tsuna)
            Votes: 0
            Watchers: 8
