Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.3.1, 0.5.0
    • Fix Version/s: 0.7.0
    • Component/s: debian, rpm

      Description

      It would be great to add Spark native packages to Bigtop.
      Spark is a fast, in-memory data analytics project from UC Berkeley (www.spark-project.org).

      1. BIGTOP-715.patch
        19 kB
        MTG dev
      2. BIGTOP-715.spark7.patch
        21 kB
        Plamen Jeliazkov
      3. BIGTOP-715.spark7.patch
        20 kB
        Konstantin Boudnik
      4. BIGTOP-715.spark7.patch
        20 kB
        Konstantin Boudnik
      5. BIGTOP-715.master-0.8.patch
        29 kB
        Konstantin Boudnik
      6. BIGTOP-715.master-0.8.patch
        37 kB
        Konstantin Boudnik
      7. BIGTOP-715.master-0.8.patch
        35 kB
        Konstantin Boudnik
      8. BIGTOP-715.master-0.8.patch
        34 kB
        Konstantin Boudnik
      9. BIGTOP-715.master-0.8.patch
        34 kB
        Konstantin Boudnik
      10. BIGTOP-715.master-0.8.patch
        34 kB
        Konstantin Boudnik

          Activity

          Konstantin Boudnik added a comment -

          Erich, I have just fixed the file suffix issues on BIGTOP-1051; the commit is attributed to you (although the JIRA isn't). Thanks for the good catch.

          Konstantin Boudnik added a comment - - edited

          Could you open a JIRA ticket for the suffix problem and provide a patch (it seems to be working in a quick test I did)? I will commit it. Thanks.

          The Debian package needs more work - see the discussion above.

          Spark is version tolerant - Scala isn't ;( Odersky and company need to learn how to make production-ready software.

          It builds with 2.9.2, I think, but it won't run - there are new classes added in 2.9.3 (for a feature officially introduced in 2.10) that aren't available in 2.9.2.

          Erich Schubert added a comment - - edited

          With the changes reported above (this wasn't an actual patch file), the build still fails with

          Missing param: SOURCE_DIR
          usage: debian/install_spark.sh

          i.e. install_spark.sh expects --source-dir to be set, which the deb build script doesn't provide (`pwd` probably works, i.e. --source-dir = --build-dir).
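          For illustration, a minimal sketch of what such a fix might look like in the deb build script, assuming install_spark.sh follows the usual Bigtop install-script flag conventions (the exact flag set and the --prefix value are assumptions):

          # hypothetical debian/rules fragment: supply the missing --source-dir,
          # reusing the build directory as the source tree
          sh -x debian/install_spark.sh \
            --build-dir=`pwd` \
            --source-dir=`pwd` \
            --prefix=debian/tmp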

          Thanks for the scala version pointer. That means I'll postpone any spark experiments for now. I hope that at some point it will become more version tolerant and easier to install.

          Due to maven, parts of scala 2.9.3 were installed - enough to compile, actually.

          It's one of the things I really hate about Maven: that it clutters your system this easily. I really wanted to build using my system-installed libraries. There is no use in having a mixed 2.9.2 + 2.9.3 Scala.

          Konstantin Boudnik added a comment - - edited

          Good spotting on the tarball issue - it was working in my case, but I have quite a bit of custom configuration in my environment, so I am not surprised it isn't universal.

          WRT Scala: Spark requires Scala 2.9.3 - no earlier, no later - and since it isn't certain which version is packaged by default, the build requires it to be set explicitly. Yeah, Scala is weird, I know.

          Erich Schubert added a comment -

          Spark packaging is still busted.

          It downloads

          https://github.com/mesos/spark/archive/master.zip

          and saves this .zip file as spark-0.8.0-SNAPSHOT.tar.gz

          but that doesn't make it a valid .tar.gz archive. Unless your tar automatically falls back to zip (mine doesn't), the build will fail.

          Bug fix is simple:

          -SPARK_TARBALL_SRC=master.zip
          +SPARK_TARBALL_SRC=master.tar.gz
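
          A cheap guard in the download step would catch this class of problem early; a minimal sketch, assuming a SPARK_TARBALL_DEST variable holding the saved file name (the variable name is an assumption):

          # fail fast if the downloaded file is not actually a gzipped tarball
          if ! tar tzf "$SPARK_TARBALL_DEST" >/dev/null 2>&1; then
            echo "$SPARK_TARBALL_DEST is not a valid .tar.gz" >&2
            exit 1
          fi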

          It would also be nice if there was a default
          SCALA_HOME=/usr/share/java
          for Ubuntu and Debian users (or any other distribution which ships scala in standard paths)
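
          A minimal sketch of such a default, assuming the build reads SCALA_HOME from the environment (where exactly this belongs in do-component-build is an assumption):

          # default to the distro location unless the caller overrides it
          SCALA_HOME=${SCALA_HOME:-/usr/share/java}
          export SCALA_HOME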

          Konstantin Boudnik added a comment -

          Pushed to the trunk. Thanks for the review, Peter!

          Peter Linnell added a comment -

          Looks great. Agreed on the attributes. +1 on pushing it. Thanks.

          Konstantin Boudnik added a comment -

          Peter,

          I have addressed most of your comments except the one about attributes in the spec file.
          The current way works just fine, and I ran into some clumsy issues trying to set up the attribute for every single file. So, if you don't feel strongly about this, let's leave it for now, perhaps?

          I am ready to commit if there's no more feedback from anyone.

          Peter Linnell added a comment -

          +1 LGTM

          Nits:

          We should use /bin/bash, not /bin/sh; this has come up in other discussions.
          Line 1040: do we need

          %attr(0755,root,root)

          so that

          +%{bin}/spark-shell
          +%{bin}/spark-executor

          have the right permissions? Better to set them explicitly IMO.
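
          A minimal sketch of the %files entries being asked about, assuming %{bin} is defined as in the patch (the explicit attrs are illustrative, not the actual spec):

          %files
          %attr(0755,root,root) %{bin}/spark-shell
          %attr(0755,root,root) %{bin}/spark-executor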

          Not clear about the Scala version comment. Can we depend on a system-installed version? If so, we could list it as BuildRequires or Requires in the spec file.

          Konstantin Boudnik added a comment -

          The patch is finally ready for review - appreciate the input.

          Konstantin Boudnik added a comment -

          Examples had to be removed from the assembly.

          Konstantin Boudnik added a comment - - edited

          A new MLlib component has been added, and this new version of the patch addresses that change.
          Also, this version now works against the HEAD of the master branch.

          Konstantin Boudnik added a comment -

          I think I got most if not all things right this time. I have built
          the package, installed it, ran the -master and -worker services, connected spark-shell to the cluster, and ran some analysis using HDFS files; then I uninstalled the package and made sure that the services were shut down.

          Further work will include splitting the package into a big 'master' package and a much smaller 'worker' package depending on it. That, as well as DEB package development, can be done later.

          Konstantin Boudnik added a comment -

          Or perhaps it really makes sense to push it - thanks for the review, BTW - and follow up with some improvement JIRAs.

          Konstantin Boudnik added a comment -

          Agreed. And at this particular moment I don't even use that redistributed version - I am using the Java runner. I guess I kept it there just in case... but it seems to be the wrong case. I will remove it in the following version of the patch, where I will also add service control scripts.

          Peter Linnell added a comment -

          +1 It looks good enough to push.

          Nit: /bin/sh should be /bin/bash

          Nuking the Scala redistribution would be a plus but not a blocker IMO. It should eventually be done, though, as it is a cleaner way to have Scala.

          Konstantin Boudnik added a comment -

          Attaching a patch that creates a correct RPM package with the latest Spark project master. The package really can be executed without a Scala runtime being set up on the target machine, so we might consider removing the Scala redistribution from this package.

          A couple of TODOs:

          • currently the build only works off my Spark branch here: https://github.com/c0s/spark/tree/assembly. The master branch will be usable once this PR https://github.com/mesos/spark/pull/675 is merged
          • the package requires hadoop-hdfs, hadoop-yarn, and hadoop-mapreduce to be installed in order to reuse Hadoop-specific libraries
          • a couple of improvements are needed:
            • start scripts to bring up master and worker nodes need to be added (a minimal sketch follows below)
            • DEB-specific files need to be fixed a little (most of the work is done in the install script anyway)
            • a man page needs to be created
            • component jars don't need to be unpacked into separate directories and can be put into the root dir of the package.

          Comments are very welcome.
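
          A minimal sketch of the start scripts mentioned in the TODO list above, assuming a conventional init-style wrapper and that the tarball ships bin/start-master.sh and bin/stop-master.sh (all paths, the service user, and the script names are assumptions):

          #!/bin/bash
          # hypothetical service control script for the Spark master
          SPARK_HOME=${SPARK_HOME:-/usr/lib/spark}

          case "$1" in
            start)
              su -s /bin/bash spark -c "$SPARK_HOME/bin/start-master.sh"
              ;;
            stop)
              su -s /bin/bash spark -c "$SPARK_HOME/bin/stop-master.sh"
              ;;
            *)
              echo "Usage: $0 {start|stop}" >&2
              exit 1
              ;;
          esac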

          Konstantin Boudnik added a comment - - edited

          bq. Stupid question
          Not really stupid, because it might be a way to go for Spark's mavenization. However, I am not sure if the needed version of Scala - 2.9.2 - is supported, etc. I will look into it on the Spark side, though.

          I will address your other comments and re-upload the patch. Thanks for reviewing!

          Bruno Mahé added a comment -
          • Stupid question: would making Spark use the Maven Scala plugin help? That would remove the need to have Scala pre-installed.
          • I believe the JAVA_HOME autodetection needs to be updated.
          • The Apache Bigtop URLs listed in the packaging point to the incubator. They are out of date.

          Other than that, it looks pretty nice.

          Konstantin Boudnik added a comment -

          This version of the patch takes advantage of Spark's Maven build.
          A fat problem: Spark's Maven build produces a fat jar using the Maven Shade plugin. Apparently, that is a pretty sub-optimal way to make an "assembly", and it essentially has to be fixed in the Spark land.

          For now, I can successfully produce an RPM (DEB still to be worked out).

          Konstantin Boudnik added a comment -

          A slightly improved version of the patch that uses hacks in the project file and still uses sbt to build.

          Konstantin Boudnik added a comment - - edited

          So, in other words, Spark hasn't integrated the YARN work into the mainline yet. Hmm... I guess they might need some help here!

          Plamen Jeliazkov added a comment -

          do-component-build was missing setting HADOOP_MAJOR_VERSION to "2" instead of "1" (which is what the mvn -Phadoop2 profile does for you). However, this doesn't really change the issue of Spark not building under Apache Hadoop 2.
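
          A minimal sketch of the missing setting, assuming do-component-build exports environment for the sbt build as described (placement within the script is an assumption):

          # mirror what the mvn -Phadoop2 profile sets for you
          export HADOOP_MAJOR_VERSION=2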

          Plamen Jeliazkov added a comment -

          Some issues: Spark seems to have added "Hadoop 2" support, but this is really only CDH4 with MR1 support. Spark also has a Maven build that you can trigger with a -Phadoop2 profile; however, this does not solve the issue below.
          Spark depends on hadoop-core.jar and hadoop-client.jar. Apache Hadoop 2 does not have hadoop-core anymore, nor hadoop-client I believe.

          Not sure how to fix that without fixing it from within Spark itself.

          Plamen Jeliazkov added a comment -

          Agreed – I'll pick up more of this tonight.

          Peter Linnell added a comment -

          I've looked at the RPM side of the spec and overall it looks OK. Nits:

          Please sed/nuke:

          Index: bigtop-packages/src/rpm/spark/SPECS/spark.spec
          IDEA additional info:
          Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
          <+>UTF

          RPM might rightfully whine about such unknown parameters, and they really do not belong anyway.

          Thanks for working on this! Spark is a nice addition.

          Konstantin Boudnik added a comment - - edited

          The patch for trunk should build Spark against Hadoop 2.0.4.
          Right now do-component-build is building it against 1.0.3.
          The correct versions of all the components in the stack can be imported by putting

          . `dirname $0`/bigtop.bom

          at the top of do-component-build.

          There are some commented-out lines that can be safely removed as well.
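
          For illustration, a minimal sketch of that import at the top of do-component-build; the HADOOP_VERSION variable consumed afterwards is an assumption about what bigtop.bom exports:

          # pick up the stack-wide component versions, e.g. HADOOP_VERSION
          . `dirname $0`/bigtop.bom
          BUILD_OPTS="-Dhadoop.version=$HADOOP_VERSION"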

          Konstantin Boudnik added a comment -

          Plamen, Mesos is required when Spark runs on a real cluster. For a single node, Spark can be run as a standalone application or on top of HDFS. Adding Mesos would be great, but I think it can be done separately.

          Plamen Jeliazkov added a comment -

          I've attached a patch similar to Cos's, but it updates certain values for using Spark 0.7.

          Plamen Jeliazkov added a comment -

          I put the patch on top of trunk and was able to get Spark RPM creation working on CentOS 6.3. Cos, is this package including Mesos? I see some lines in the RPM creation that look like "mesos-spark", but I don't see in the code that we are downloading Mesos from anywhere.

          This will also currently require Scala to already be installed - which is fine, I believe (not difficult to do anyway).

          Plamen Jeliazkov added a comment -

          I believe it should require running both Mesos and HDFS. Was Mesos going to be an addition to this or separate? Spark requires it.

          Konstantin Boudnik added a comment -

          No Plamen, I haven't started anything yet. I think for the first round we don't need anything fancy, perhaps a simple word-count would do?

          val file = sc.textFile("file:/etc/passwd")
          val rootUser = file.filter(line => line.contains("root"))
          rootUser.count()
          

          or something a bit more advanced that would require a running HDFS?

          Plamen Jeliazkov added a comment -

          I would like to assist in this if I can. Did you already have some integration tests coming along, Cos?

          Roman Shaposhnik added a comment -

          0.7 looks interesting!

          Konstantin Boudnik added a comment -

          I will add some integration tests to support the validation of the Spark software. Also, do we want to go all the way up to Spark 0.7?

          Konstantin Boudnik added a comment -

          Looks like the commit to the branch was missed during the transition from SVN to Git. Just pushed it again to branch-0.3.1.

          Konstantin Boudnik added a comment -

          Just committed it to 0.3.1 branch. Trunk is coming shortly.

          Konstantin Boudnik added a comment - - edited

          I think long-term we should rely on the system-provided package, but apparently there's not much urge in the Linux community to support this environment (I can't blame them for it).

          Roman Shaposhnik added a comment -

          Great stuff! I guess the #1 question I have is what to do with the Scala dependency long-term. Thoughts?

          MTG dev added a comment -

          Here's the patch that works both for 0.3.1 branch and trunk.

          Konstantin Boudnik added a comment -

          I want to include it in the upcoming 0.3.1.

          MTG dev added a comment -

          any reason not to support trunk

          No, not particularly - we just happen to have the 1.x version of the stack, so that was naturally an easy choice for us to build Spark alongside.

          Roman Shaposhnik added a comment -

          It seems that your current effort is focused on the Bigtop 0.3 branch – any reason not to support trunk (essentially Hadoop 2.X HDFS, since I believe Spark has no dependency on YARN/MR)?


            People

            • Assignee:
              Konstantin Boudnik
              Reporter:
              MTG dev
            • Votes:
              0
              Watchers:
              13

              Dates

              • Created:
                Updated:
                Resolved:
