Crunch (Retired) / CRUNCH-557

Fix file distribution from HDFS in Crunch-on-Spark

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      From the user list:

      I was trying to determine the effect of changing the JoinStrategy on a Spark pipeline. My pipeline works fine with DefaultJoinStrategy, but I could not get it to work with MapSideJoinStrategy or BloomFilterJoinStrategy. With MapSideJoinStrategy I get an exception [1] on the driver itself, and with BloomFilterJoinStrategy I get an exception [2] in one of the stages. I have not made any configuration changes, but I did run tests with datasets of different sizes to ensure that my PCollection is small enough to fit in memory. I am running Spark in yarn-client mode with Crunch 0.11.0-cdh5.4.2.

      [1] https://gist.github.com/anonymous/15d6c691b743ad392d42
      [2] https://gist.github.com/anonymous/b02a82401a30a69f1cff

      The bug is in the SparkRuntime.distributeFiles method, which needs to include a scheme for the URI it's handing to Spark.
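
      To illustrate what "including a scheme" means, here is a minimal sketch using only java.net.URI. It is not the attached patch and not the actual SparkRuntime code: the class name UriQualifier, the method qualify, and the host namenode:8020 are all hypothetical, chosen for illustration. The idea is that a bare path such as /tmp/crunch/part-0 carries no filesystem information, so the default filesystem's scheme and authority are added before the URI is passed on.

      ```java
      import java.net.URI;
      import java.net.URISyntaxException;

      // Hypothetical helper for illustration only; not part of Crunch or Spark.
      final class UriQualifier {
          private UriQualifier() {}

          /**
           * Returns the path unchanged if it already has a scheme; otherwise
           * borrows the scheme and authority from defaultFs.
           * Assumes a schemeless path is absolute (starts with "/").
           */
          static URI qualify(String path, URI defaultFs) {
              URI uri = URI.create(path);
              if (uri.getScheme() != null) {
                  return uri; // already qualified, e.g. file:///... or hdfs://...
              }
              try {
                  return new URI(defaultFs.getScheme(), defaultFs.getAuthority(),
                                 uri.getPath(), null, null);
              } catch (URISyntaxException e) {
                  throw new IllegalArgumentException("Cannot qualify path: " + path, e);
              }
          }

          public static void main(String[] args) {
              // Assumed default filesystem; a real job would read it from its configuration.
              URI defaultFs = URI.create("hdfs://namenode:8020");
              System.out.println(qualify("/tmp/crunch/part-0", defaultFs));
              System.out.println(qualify("file:///tmp/local.bin", defaultFs));
          }
      }
      ```

      In Hadoop code the same qualification is normally done with FileSystem.makeQualified(Path) rather than by hand; the sketch above only shows the before-and-after shape of the URI.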

      Attachments

        1. CRUNCH-557.patch
          1 kB
          Josh Wills
        2. CRUNCH-557a.patch
          1 kB
          Surbhi Mungre
        3. CRUNCH-557b.patch
          1 kB
          Josh Wills

          People

            Assignee: Unassigned
            Reporter: Josh Wills (jwills)
            Votes: 0
            Watchers: 3
