Uploaded image for project: 'Beam'
  1. Beam
  2. BEAM-1145

Remove classifier from shaded spark runner artifact

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • P2
    • Resolution: Fixed
    • None
    • 0.5.0
    • runner-spark
    • None

    Description

      Shade plugin configured in spark runner's pom adds a classifier to spark runner shaded jar

      <shadedArtifactAttached>true</shadedArtifactAttached>
      <shadedClassifierName>spark-app</shadedClassifierName>
      

      This means, that in order for a user application that is dependent on spark-runner to work in cluster mode, they have to add the classifier in their dependency declaration, like so:

              <dependency>
                  <groupId>org.apache.beam</groupId>
                  <artifactId>beam-runners-spark</artifactId>
                  <version>0.4.0-incubating-SNAPSHOT</version>
                  <classifier>spark-app</classifier>
              </dependency>
      

      Otherwise, if they do not specify classifier, the jar they get is unshaded, which in cluster mode, causes collisions between different guava versions.
      Example exception in cluster mode when adding the dependency without classifier:

      16/12/12 06:58:56 WARN TaskSetManager: Lost task 4.0 in stage 8.0 (TID 153, ***********.com): java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.createStarted()Lcom/google/common/base/Stopwatch;
      	at org.apache.beam.runners.spark.stateful.StateSpecFunctions$1.apply(StateSpecFunctions.java:137)
      	at org.apache.beam.runners.spark.stateful.StateSpecFunctions$1.apply(StateSpecFunctions.java:98)
      	at org.apache.spark.streaming.StateSpec$$anonfun$1.apply(StateSpec.scala:180)
      	at org.apache.spark.streaming.StateSpec$$anonfun$1.apply(StateSpec.scala:179)
      	at org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:57)
      	at org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
      	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
      	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
      	at org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
      	at org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:155)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
      	at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:275)
      	at org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:149)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
      	at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:275)
      	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
      	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
      	at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:275)
      	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
      	at org.apache.spark.scheduler.Task.run(Task.scala:89)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      

      I would suggest that the classifier be removed from the shaded jar, to avoid confusion among users, and have a better user experience.

      P.S. Looks like Dataflow runner does not add a classifier to its shaded jar.

      Attachments

        Activity

          People

            aviemzur Aviem Zur
            aviemzur Aviem Zur
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: