Hive / HIVE-17574

Avoid multiple copies of HDFS-based jars when localizing job-jars

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • 2.2.0, 2.4.0, 3.0.0
    • None
    • None
    • None

    Description

      Raising this on behalf of selinazh. (For my own reference: YHIVE-1035.)

      This has to do with the classpaths of Hive actions run from Oozie, and affects scripts that add jars/resources from HDFS locations.

      As part of Oozie's "sharelib" deploys, foundation jars (such as Hive jars) tend to be stored in HDFS paths, as are any custom user-libraries used in workflows. An ADD JAR|FILE|ARCHIVE statement in a Hive script causes the following steps to occur:

      1. Files are downloaded from HDFS to local temp dir.
      2. UDFs are resolved/validated.
      3. All jars/files, including those just downloaded from HDFS, are shipped right back to HDFS-based scratch-directories, for job submission.

      For HDFS-based files, this is wasteful and time-consuming. Step 3 above should skip shipping HDFS-based resources and instead add them directly to the Tez session.
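
      The proposed change can be illustrated with a minimal sketch. This is not the attached patch: the class ResourcePartitioner, its Plan type, and the method names are hypothetical, and it assumes that checking for the "hdfs" URI scheme is enough to decide that a resource already lives on HDFS. A real fix would presumably also need to handle the cluster's default filesystem and scheme-less paths. The sketch only separates resources that could be referenced in place from those that still need to be uploaded to the scratch directory.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Hive: splits job resources by where they already live. */
final class ResourcePartitioner {

    /** Result of the split: reference-in-place resources vs. resources needing an upload. */
    static final class Plan {
        final List<URI> referenceInPlace = new ArrayList<>();
        final List<URI> uploadToScratch = new ArrayList<>();
    }

    /** Assumption: a resource counts as HDFS-based only if its URI scheme is "hdfs". */
    static boolean isOnHdfs(URI uri) {
        return "hdfs".equalsIgnoreCase(uri.getScheme());
    }

    /** Classifies each resource path or URI without touching any filesystem. */
    static Plan partition(List<String> resources) {
        Plan plan = new Plan();
        for (String resource : resources) {
            URI uri = URI.create(resource);
            if (isOnHdfs(uri)) {
                // Already on HDFS: no need to ship it back to a scratch directory.
                plan.referenceInPlace.add(uri);
            } else {
                // Local or other filesystem: still has to be uploaded for job submission.
                plan.uploadToScratch.add(uri);
            }
        }
        return plan;
    }
}
```

      Given a list containing an hdfs:// jar from the sharelib and a jar under a local /tmp path, partition() would place the first in referenceInPlace and the second in uploadToScratch, so only the local jar would be copied in step 3.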

      We have a patch that's being used internally at Yahoo.

      Attachments

        1. HIVE-17574.2.patch
          17 kB
          Mithun Radhakrishnan
        2. HIVE-17574.1.patch
          17 kB
          Mithun Radhakrishnan
        3. HIVE-17574.1-branch-2.patch
          18 kB
          Mithun Radhakrishnan
        4. HIVE-17574.1-branch-2.2.patch
          18 kB
          Mithun Radhakrishnan

            People

              cdrome Chris Drome
              mithun Mithun Radhakrishnan