Hadoop Map/Reduce
MAPREDUCE-1901

Jobs should not submit the same jar files over and over again


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate

    Description

      Currently, each Hadoop job uploads its required resources (jars/files/archives) to a new location in HDFS. The map-reduce nodes involved in executing the job then download these resources to local disk.

      In an environment where most users share a standard set of jars and files (because they use a framework like Hive or Pig), the same jars keep getting uploaded and downloaded repeatedly. The overhead of this protocol (primarily in terms of end-user latency) is significant when:

      • the jobs are small (and, correspondingly, large in number)
      • the Namenode is under load (meaning HDFS latencies are already high and are made worse, in part, by this protocol)

      Hadoop should provide a way for jobs in a cooperative environment to avoid submitting the same files over and over again. Identifying and caching execution resources by a content signature (md5/sha) would be a good option to have available; a rough sketch of the idea appears below.
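
      The idea can be illustrated with a small, hypothetical sketch (this is not the attached patch): before a client uploads a job jar, it computes a content signature of the jar and reuses a copy already present under a shared HDFS cache directory. The cache root path, class name, and helper methods below are illustrative assumptions.

      import java.io.InputStream;
      import java.security.MessageDigest;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class SharedJarCache {

        // Hypothetical shared, world-readable cache root in HDFS.
        private static final Path CACHE_ROOT = new Path("/tmp/job-jar-cache");

        /** Returns the HDFS path of the jar, uploading it only if no copy
         *  with the same content signature is already cached. */
        public static Path getOrUpload(Configuration conf, Path localJar) throws Exception {
          FileSystem localFs = FileSystem.getLocal(conf);
          FileSystem dfs = FileSystem.get(conf);

          // Content signature of the local jar (SHA-256 here; md5/sha1 would also work).
          String signature = sha256(localFs, localJar);
          Path cached = new Path(CACHE_ROOT, signature + "/" + localJar.getName());

          if (dfs.exists(cached)) {
            return cached;                            // reuse: no upload needed
          }
          dfs.mkdirs(cached.getParent());
          dfs.copyFromLocalFile(localJar, cached);    // first submitter pays the upload cost
          return cached;
        }

        // Hex-encoded SHA-256 digest of a file's contents.
        private static String sha256(FileSystem fs, Path file) throws Exception {
          MessageDigest digest = MessageDigest.getInstance("SHA-256");
          try (InputStream in = fs.open(file)) {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = in.read(buf)) != -1) {
              digest.update(buf, 0, n);
            }
          }
          StringBuilder hex = new StringBuilder();
          for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
          }
          return hex.toString();
        }
      }

      In this sketch the first job to reference a given jar pays the upload cost, and later jobs only perform an existence check, so N submissions cost one upload plus N-1 lookups against the Namenode instead of N uploads. Concurrency, permissions, and cache eviction are left out, and the attached patches may take a different approach.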

      Attachments

        1. 1901.PATCH (65 kB) - Junjie Liang
        2. 1901.PATCH (55 kB) - Junjie Liang

        Issue Links

        Activity


          People

            Assignee: Unassigned
            Reporter: Joydeep Sen Sarma (jsensarma)
            Votes: 0
            Watchers: 30

            Dates

              Created:
              Updated:
              Resolved:
