PIG-1857

Create a package integration project

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 0.9.1, 0.10.0
    • Component/s: build
    • Labels:
      None
    • Environment:

      RHEL 5.5/Ubuntu 10.10

    • Hadoop Flags:
      Reviewed
    • Release Note:
      Create PIG rpm and deb packages.

      Description

      The goal of this ticket is to generate a set of RPM/Debian packages that integrate well with the RPM sets created by HADOOP-6255.

      1. PIG-1857-7.patch
        32 kB
        Daniel Dai
      2. PIG-1857-6.patch
        31 kB
        Eric Yang
      3. PIG-1857-5.patch
        31 kB
        Eric Yang
      4. PIG-1857-4.patch
        31 kB
        Eric Yang
      5. PIG-1857-3.patch
        31 kB
        Eric Yang
      6. PIG-1857-2.patch
        31 kB
        Eric Yang
      7. PIG-1857-1.patch
        23 kB
        Eric Yang
      8. PIG-1857.patch
        23 kB
        Eric Yang
      9. PIG-1857-draft.patch
        25 kB
        Eric Yang

          Activity

          Eric Yang added a comment -

          RPM deployment should use the Hadoop jars from the Hadoop rpm file. It should be possible to create a default /etc/defaults/pig-env.sh which explicitly defines PIG_CLASSPATH=${HADOOP_CORE_JAR}:...:${HBASE_JAR}:${HADOOP_CONF_DIR}

          This will ensure that the Pig environment is set up properly by the RPM packages, without burdening either Pig developers or application developers with setting up the environment.
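
          A minimal sketch of what such a pig-env.sh could contain. The three default locations are placeholders chosen for illustration, not values taken from the patch, and the entries elided by "..." in the comment are left out.

```shell
# Hypothetical pig-env.sh sketch. The default paths below are illustrative
# assumptions; a real package would point at wherever the Hadoop and HBase
# packages install their files.
HADOOP_CORE_JAR="${HADOOP_CORE_JAR:-/usr/share/hadoop/hadoop-core.jar}"
HBASE_JAR="${HBASE_JAR:-/usr/share/hbase/hbase.jar}"
HADOOP_CONF_DIR="${HADOOP_CONF_DIR:-/etc/hadoop}"

# Join the entries with ':' so the JVM receives them as one classpath.
PIG_CLASSPATH="${HADOOP_CORE_JAR}:${HBASE_JAR}:${HADOOP_CONF_DIR}"
export PIG_CLASSPATH
```

          Using the ${VAR:-default} form keeps any value already set in the caller's environment, so the file only supplies defaults.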

          Eric Yang added a comment -

          Preview of the Pig rpm/deb package. /etc/default/pig-env.sh contains the proper setup of the Hadoop and HBase environment, provided that software is deployed using the package management system.

          The outstanding issue is pig.jar bundling hadoop classes. It would be ideal to have hadoop jars decoupled from pig.jar.

          Eric Yang added a comment -

          Usage:

          Run "ant rpm" to create the rpm package.
          Run "ant deb" to create the deb package.

          Eric Yang added a comment -

          Revised patch to bundle pig-[version]withouthadoop.jar as pig[version]-core.jar for rpm/deb packages.

          Eric Yang added a comment -

          There was a patch generation mistake in PIG-1857.patch; resubmitting as PIG-1857-1.patch.

          Eric Yang added a comment -

          Updated the RPM/Debian packaging to lay out files according to the file structure layout proposed in HADOOP-6255. The Pig RPM/Debian package currently depends on the 0.20-security branch version of Hadoop.

          Alan Gates added a comment -

          I am reviewing this patch.

          Alan Gates added a comment -

          A few initial comments:

          In rpm/spec/pig.spec the dependencies are set to "Requires: hadoop, sh-utils, textutils, jdk >= 1.6". But in deb/pig.control/control they are set to "Depends: openjdk-6-jre-headless". Shouldn't these match?

          nit: the build target "source" should be called source-distribution or something. "source" sounds like it checks out the source.

          I still need to test this and think more about how it arranges files.

          Eric Yang added a comment -

          The meta packages do not match between Debian and Red Hat. On Debian, the package depends on openjdk-6-jre-headless because Oracle does not offer a Debian Java package on its website. Hence, the dependency is set to the default JDK provided by Debian (same for Ubuntu). I will add "hadoop" to the Debian control file. sh-utils and textutils are part of the base OS, so no package dependency needs to be declared for them.
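
          One way to compare the declared dependencies of the two package formats after a build is sketched below. The function name show_deps is made up for illustration; it only wraps the standard rpm and dpkg-deb query options and assumes nothing about the package file names the build produces.

```shell
#!/bin/sh
# Print the declared dependencies of a built package so the rpm and deb
# metadata can be compared side by side.
show_deps() {
    pkg="$1"
    case "$pkg" in
        *.rpm) rpm -qp --requires "$pkg" ;;    # "Requires:" entries from the spec
        *.deb) dpkg-deb -f "$pkg" Depends ;;   # "Depends:" field from the control file
        *)     echo "unsupported package type: $pkg" >&2; return 2 ;;
    esac
}
```

          Calling show_deps once on the built rpm and once on the built deb makes a mismatch such as the missing "hadoop" entry easy to spot.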

          "source" change to "source-distribution" works for me too.

          Alan Gates added a comment -

          I've tested the following with this:

          1. I can build rpms on Linux
          2. I can build debs on Linux
          3. I tried to build rpms on Mac. This failed. I have the current version of the rpm package from MacPorts. Maybe I need something more recent. I don't see this as a show stopper, but it would be nice. See below for the error message.
          4. I can build debs on Mac
          5. When I build a tarball distribution (the one we've done up until now) with this it works as before with an installed cluster and in local mode. So this seems backward compatible. That's good.

          What I haven't tested yet that still needs testing:

          1. That I can install the resulting rpm and that Pig runs ok when I do. In particular I am interested to see how users run Pig when the RPM is installed. Do they still need to set PIG_CLASSPATH to point to their cluster or does the RPM installation take care of all that.
          2. That I can install the resulting deb and that Pig runs ok when I do. I don't have any debian based systems handy to test this.

          A general question as well. My initial impression was that there would be separate rpms and debs for binary and source distributions, as there are for maven packages. But looking at the resulting rpms and debs, it looks instead like source and executables have all been placed in one rpm/deb. Is that intentional? Is it what users expect? These things usually seem to be divided in rpms.

          The error message I get when building rpms on my mac:

          rpm:
              [mkdir] Created dir: /tmp/pig_package_build_gates/BUILD
              [mkdir] Created dir: /tmp/pig_package_build_gates/RPMS
              [mkdir] Created dir: /tmp/pig_package_build_gates/SRPMS
              [mkdir] Created dir: /tmp/pig_package_build_gates/SOURCES
              [mkdir] Created dir: /tmp/pig_package_build_gates/SPECS
               [copy] Copying 2 files to /tmp/pig_package_build_gates/SOURCES
               [copy] Copying 1 file to /tmp/pig_package_build_gates/SPECS
                [rpm] Building the RPM based on the pig.spec file
                [rpm] error: Unable to open temp file.
                [rpm]     Unable to open temp file.
                [rpm] 
                [rpm] 
                [rpm] RPM build errors:
          
          BUILD FAILED
          /Users/gates/src/pig/top/1857/trunk/build.xml:885: '/opt/local/bin/rpmbuild' failed with exit code 1
          
          Eric Yang added a comment -

          I have not tried to build the rpm with MacPorts. From the error, it looks like the system was unable to create files in /tmp. There are three possible causes.

          1. /tmp is full.
          2. Disabling java-repack did not work on MacPorts. The rpmbuild step that repackages Java jar files to find dependencies has bugs. This is particularly an issue with software using AspectJ. Hence, I defined a new macro __os_install_post to override the jar file repackaging. This macro was designed for RHEL/CentOS. I don't think this macro works on MacPorts.
          3. The source code is in a location where the directory path contains multiple dash '-' characters. rpmbuild has a bug when using long file names with dash characters. This is the reason the build system is designed to build the RPM in /tmp rather than in $src_prefix/build/rpm, where src_prefix has long file names.

          I think the probable cause is either 1 or 2. If you would like to debug 2, edit src/packages/rpm/spec/pig.spec and remove the __os_install_post macro.
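
          Causes 1 and 3 can be checked quickly from a shell before touching the spec file. The helper names below are made up for this sketch; cause 2 still requires editing pig.spec as described.

```shell
#!/bin/sh
# Cause 1: can a temporary file actually be created in the given directory?
can_write_tmp() {
    dir="${1:-/tmp}"
    f=$(mktemp "$dir/pig_rpm_check.XXXXXX" 2>/dev/null) || return 1
    rm -f "$f"
}

# Cause 3: count the '-' characters in a path.
count_dashes() {
    # Delete every character that is not a dash, then measure what is left.
    only_dashes=$(printf '%s' "$1" | sed 's/[^-]//g')
    echo "${#only_dashes}"
}
```

          Running "df -h /tmp" together with "can_write_tmp /tmp" covers the full-disk case, and "count_dashes "$PWD"" run from the source checkout shows how many dashes the build path contains.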

          For the untested part:

          1. PIG_CLASSPATH should be properly set up by the script. The user doesn't need to define the classpath.
          2. I would recommend using VirtualBox and installing Ubuntu. I could share my VM image for testing on Monday.

          For the Apache Hadoop rpm/deb packages, there is no deb source package because the source tarball is built by a different top-level target.

          Here are the reasons why there is no source package:

          For Debian, software is usually packaged as follows:

          1. Write a control file, and pre/post install/remove/upgrade scripts
          2. Patch source code
          3. Build source code
          4. Create deb package from binary
          5. Create source deb package with source file list and binary deb package

          For rpm, software is usually packaged as follows:

          1. Write a spec file
          2. Prepare source code and patches
          3. Build from patched source
          4. Discover dependency
          5. Generate source rpm and binary rpm

          In both systems, the packaging metadata files are created outside the scope of the source tarball, and patches can be applied to create patch-level packages. In Apache, we have minor releases but not patch-level releases. Hence, the patching mechanism does not apply. The Apache-released source tarball is the source of truth. There is no need to create a separate source package to represent the source.
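
          The rpm steps above map roughly onto the sketch below. The function names are made up for illustration, and the actual Ant target may invoke rpmbuild differently; the five directory names are the ones shown in the build log earlier in this thread.

```shell
#!/bin/sh
# Create the directory tree that rpmbuild expects under a top directory.
make_rpm_tree() {
    topdir="$1"
    for d in BUILD RPMS SRPMS SOURCES SPECS; do
        mkdir -p "$topdir/$d" || return 1
    done
}

# Build from a spec file that has already been copied into SPECS.
# -ba produces both the source rpm and the binary rpm;
# -bb would produce the binary rpm only.
build_rpm() {
    topdir="$1"
    spec="$2"
    rpmbuild --define "_topdir $topdir" -ba "$topdir/SPECS/$spec"
}
```

          Pointing _topdir at a short path such as a directory under /tmp is what keeps the build away from long source paths.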

          Eric Yang added a comment -

          PIG-1857-3.patch: Update debian dependency on Sun Java and Hadoop.

          Eric Yang added a comment -

          Fix path for pig jar file in Debian package.

          Eric Yang added a comment -

          Remove the template-defined PIG_HOME.

          Daniel Dai added a comment -

          Tried PIG-1857-5.patch; it works well on both Ubuntu (deb) and Red Hat (rpm). Here are some notes about how it works:
          1. Installs contrib, lib, scripts, test, and tutorial into /usr/share/pig, installs the Pig config into /etc/pig, and installs bin/pig into /usr/bin
          2. The package bundles pig-withouthadoop.jar
          3. The package depends on the hadoop package
          4. bin/pig puts /etc/hadoop, which should be the Hadoop config, into the classpath; it also links all Hadoop jars
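
          A small check of the install layout described in these notes could look like the following. The function name and the optional root prefix are additions for this sketch, so that the check can be pointed at a staging directory instead of the live system.

```shell
#!/bin/sh
# List which of the expected install locations are missing under a root.
# The three paths come from the notes above.
missing_pig_paths() {
    root="${1:-}"
    for p in /usr/share/pig /etc/pig /usr/bin/pig; do
        [ -e "$root$p" ] || echo "$p"
    done
}
```

          With no argument the function inspects the real filesystem; empty output means all three locations exist.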

          For Alan's concern:

          That I can install the resulting rpm and that Pig runs ok when I do. In particular I am interested to see how users run Pig when the RPM is installed. Do they still need to set PIG_CLASSPATH to point to their cluster or does the RPM installation take care of all that.

          Yes, verified.

          That I can install the resulting deb and that Pig runs ok when I do. I don't have any debian based systems handy to test this.

          Yes, verified.

          A general question as well. My initial impression was that there would be separate rpms and debs for binary and source distributions, as there are for maven packages. But looking at the resulting rpms and debs, it looks instead like source and executables have all been placed in one rpm/deb. Is that intentional? Is it what users expect? These things usually seem to be divided in rpms.

          I talked with Eric: we only build binary packages. The rpm source package is not used, and Eric will remove it.

          Daniel Dai added a comment -

          Some additional changes we want to make:
          1. pig-withouthadoop.jar is renamed to pig-0.10.0-SNAPSHOT-core.jar in the package, which could be confusing
          2. Eventually we want to use pig-only.jar and link all libraries in lib dynamically

          All these changes require changing other parts of the Pig code, which is outside the scope of this Jira. We will open separate tickets to address them.

          Eric Yang added a comment -

          PIG-1857-6.patch: Build RPM binary package only.

          Daniel Dai added a comment -

          PIG-1857-7.patch includes a minor change in the classpath sequence, allowing the user to override the Hadoop conf with the environment variable PIG_CLASSPATH.
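
          The effect of classpath ordering can be sketched as follows. This is an assumption about how such an override could work, not the code from the patch: the JVM searches classpath entries from left to right, so placing the user's PIG_CLASSPATH ahead of the packaged Hadoop conf directory lets the user's configuration win. The function name is made up for illustration.

```shell
#!/bin/sh
# Assumed ordering: user entries first, then the packaged Hadoop conf directory.
build_classpath() {
    user_cp="$1"        # value of PIG_CLASSPATH, may be empty
    default_conf="$2"   # for example /etc/hadoop, per the notes earlier in this thread
    if [ -n "$user_cp" ]; then
        printf '%s:%s\n' "$user_cp" "$default_conf"
    else
        printf '%s\n' "$default_conf"
    fi
}

build_classpath "${PIG_CLASSPATH:-}" /etc/hadoop
```

          When PIG_CLASSPATH is unset, only the default conf directory is emitted, so no empty classpath entry is produced.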

          Daniel Dai added a comment -

          Patch committed to trunk. Thanks Eric!


            People

            • Assignee:
              Eric Yang
              Reporter:
              Eric Yang
            • Votes:
              0
              Watchers:
              4
