Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.5.0
    • Component/s: general
    • Labels:
      None

      Description

      Crunch is a Java library for creating data pipelines on Hadoop that recently entered the Apache Incubator.

      1. 0001-BIGTOP-612.-Add-Crunch-to-Bigtop.patch
        19 kB
        Roman Shaposhnik
      2. BIGTOP-612.patch.txt
        18 kB
        Roman Shaposhnik

        Activity

        Roman Shaposhnik added a comment -

        Josh, I'd love to add Crunch to Bigtop! Now, given that Bigtop 0.4.0 and beyond will be based on the Hadoop 2.X code line, it seems that Crunch is not quite ready, since it doesn't yet
        support YARN out of the box. Is there a JIRA on the Crunch side that tracks providing support for it?

        Josh Wills added a comment -

        Thanks, Roman! We don't actually have JIRA set up yet for Apache Crunch, but as soon as we do, I will add a link to it here. Rest assured, it is one of our top priorities.

        Josh Wills added a comment -

        Hey Roman-- it turns out that support for Hadoop 2.0.0 in Crunch got done before our JIRA was ready to go. All of Crunch's unit tests pass when run against Hadoop 2.0.0-alpha.

        Roman Shaposhnik added a comment -

        Perfect! One last question – do you guys have any timeline for your first Apache Incubator release?

        Josh Wills added a comment -

        Nah, we're still getting infrastructure in place (for example, JIRA). I'll keep you posted.

        Josh Wills added a comment -

        @Roman first release is here:

        http://www.apache.org/dist/incubator/crunch/crunch-0.3.0-incubating/

        and it has support for building and running against hadoop-2.0.0.

        Roman Shaposhnik added a comment -

        Josh, any chance you can also publish the sources as tarballs? It is a silly thing, but it means one less dependency on a build machine for us (and it also may be an ASF requirement, but I'm not positive there).

        And here's another question: every project in Bigtop needs to be tested at least at the smoke test level. I see that Crunch has 4 examples that we can use for that purpose, but is there any chance to utilize the other tests? Sort of like what we do for Pig?

        Josh Wills added a comment -

        Re: #1, that is silly-- I'm assuming it's b/c the build machines don't come with zip/unzip installed by default?

        Re: #2, what do you do for Pig?

        Roman Shaposhnik added a comment -

        #1, that is silly-- I'm assuming it's b/c the build machines don't come with zip/unzip installed by default?

        Yup, that's the default config. But then again, if Crunch doesn't publish tarballs that would be THE only project I know of in ASF that doesn't provide them. And I know quite a few (I know, I know – foolish consistency, but still).

        #2 what do you do for Pig?

        Pig unit tests have a nice capability of being able to run against real clusters (not just MiniMR). Will something like that be possible with Crunch's unit tests?

        Josh Wills added a comment -

        Oh, those pesky hobgoblins. I don't see where tar.gz is a hard requirement for a release, although I empathize with the needs of build machines.

        I could conceive of running Crunch unit tests against a real cluster, although it would take some work-- we have a bunch of hooks in the unit tests so that they play nicely w/the jenkins machines (clean up after themselves, etc.) and we would need to adapt it for running on a real cluster.

        Before I take this stuff back to crunch-dev so we can discuss, are there any other needs from bigtop that we should be cognizant of?

        Andrew Bayer added a comment -

        tar.gz is a hard requirement 'cos everything else is a tar.gz. =) It's really that simple, basically. The build infrastructure expects there to be tarballs to blow up.

        Josh Wills added a comment -

        Apologies-- to clarify, I meant that tar.gz was not a hard requirement for an Apache release.

        Roman Shaposhnik added a comment -

        Josh, there's also an issue of binary convenience artifacts. Again, a question of consistency, perhaps, but all of the other Hadoop ecosystem Apache projects are publishing binary convenience artifacts. It would be very nice if Crunch had a Maven assembly configured to take care of that during the release cycle so that the jars end up not only in Maven, but also in the tarballs/zip files published.

        Other than that it seems fine so far. One last question: when do you plan to have the next release, where perhaps some of these issues could be taken care of?

        Josh Wills added a comment -

        Probably a month or so. I'll add some hooks to make more tests runnable on a real cluster as well. Thanks!

        Roman Shaposhnik added a comment -

        I'm attaching a patch for those who'd like to review it. Things that are still blocking us from including it into Bigtop:

        1. lack of a publicly available Crunch release tarball
        2. lack of any kind of integration tests for Crunch (currently blocked on CRUNCH-68)

        Finally, it would be very nice if you could help us with CRUNCH-69

        Bruno Mahé added a comment -

        Took a look at that patch and here are my comments:

        • +mv $BUILD_DIR/crunch-examples/target/apache-crunch-*-job.jar $PREFIX/$LIB_DIR/crunch-examples-job.jar
          +for subproject in crunch-hbase crunch-scrunch crunch-examples crunch ; do
          +  cp $BUILD_DIR/$subproject/target/apache-crunch*.jar $PREFIX/$LIB_DIR/$subproject.jar
          +done

          => Shouldn't crunch-examples be removed from the for loop, since it is already taken care of by the line just before?

        • crunch-hbase should probably be in its own subpackage (depending on hbase client jar)
        • +Maintainer: Bigtop <bigtop-dev@incubator.apache.org>

          => Should be Apache Bigtop

        • bigtop-packages/src/deb/crunch/copyright

          => It's copy/pasted from datafu and should be redone completely

        • +#CRUNCH_SITE=https://github.com/downloads/linkedin/datafu

          => to be removed. Remove also the other commented lines. We have source control for that stuff

        Other than that, it looks pretty clean. I can't wait to be able to +1 it!
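
If the first review comment is adopted, the install step might look roughly like the sketch below. It is not the committed patch: the function name `install_crunch_jars` and its two arguments are hypothetical, standing in for `$BUILD_DIR` and `$PREFIX/$LIB_DIR` from the quoted lines, and it assumes each `target` directory holds exactly one jar matching the glob.

```shell
#!/bin/sh
# Sketch only: install_crunch_jars is a hypothetical helper, not part of the patch.
# $1 plays the role of $BUILD_DIR, $2 the role of $PREFIX/$LIB_DIR.
install_crunch_jars() {
  build_dir=$1
  dest=$2
  mkdir -p "$dest" || return 1

  # The examples job jar is moved once, under a fixed name.
  mv "$build_dir"/crunch-examples/target/apache-crunch-*-job.jar \
     "$dest/crunch-examples-job.jar" || return 1

  # crunch-examples is deliberately left out of this list.
  # Assumes one matching jar per subproject; cp fails if the glob matches several.
  for subproject in crunch-hbase crunch-scrunch crunch; do
    cp "$build_dir/$subproject"/target/apache-crunch*.jar \
       "$dest/$subproject.jar" || return 1
  done
}
```

Whether dropping crunch-examples from the loop is right depends on whether a separate, non-job examples jar is still wanted in the package; the review raises it as a question rather than a settled change.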

        Josh Wills added a comment -

        Hey Roman, if I actually want to run your patch and build an RPM, how do I go about doing it? Are there getting started docs somewhere that I'm missing?

        Roman Shaposhnik added a comment -

        You'll have to set up a Bigtop dev. environment, check out the Bigtop trunk, apply the patch, and run make crunch-rpm. More details on how to set things up on Ubuntu (but it should translate somewhat to RedHat): https://cwiki.apache.org/confluence/display/BIGTOP/Building+Bigtop+on+Ubuntu

        We're actually working on making the dev. environment easier to set up, but for now – you're kind of stuck with figuring a few things out on your own: http://mail-archives.apache.org/mod_mbox/incubator-bigtop-dev/201209.mbox/%3C1348102836.54059.YahooMailNeo%40web161305.mail.bf1.yahoo.com%3E
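
The steps described above could be strung together roughly as follows. This is a sketch under assumptions, not documented Bigtop procedure: `BIGTOP_SRC` and `PATCH_FILE` are hypothetical placeholders, the trunk is assumed to be checked out already, and `-p1` assumes the attachment is a git-style patch with `a/` and `b/` path prefixes. By default the script only prints the commands; setting `DRY_RUN=0` would execute them.

```shell
#!/bin/sh
# Sketch only. BIGTOP_SRC and PATCH_FILE are hypothetical placeholders.
BIGTOP_SRC=${BIGTOP_SRC:-"$HOME/bigtop-trunk"}
PATCH_FILE=${PATCH_FILE:-"$HOME/0001-BIGTOP-612.-Add-Crunch-to-Bigtop.patch"}

# Print the command in dry-run mode (the default); otherwise run it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

build_crunch_rpm() {
  run cd "$BIGTOP_SRC" || return 1
  # -p1 is an assumption about how the patch paths are prefixed.
  run patch -p1 -i "$PATCH_FILE" || return 1
  # The make target named in the comment above.
  run make crunch-rpm
}

build_crunch_rpm
```

Keeping the default as a dry run means the patch step can be inspected before anything in the working copy is modified.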

        Andrew Purtell added a comment -

        Josh Wills, I use a VM running CentOS 6 to build Bigtop RPM packages. Email me if you get stuck.

        Roman Shaposhnik added a comment -

        Bruno, I'm attaching a changed patch that:

        1. is now based on an officially released Crunch 0.4.0
        2. should take care of pretty much all of your concerns

          People

          • Assignee:
            Roman Shaposhnik
            Reporter:
            Josh Wills
          • Votes:
            0
            Watchers:
            6
