Details

    • Type: Wish
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: sdk-ideas
    • Labels: None

      Activity

      yuchaoran2011 Chaoran Yu added a comment -

      Thanks for the update. Beam is also updating its Spark runner to Spark 2.1 built with Scala 2.11: https://issues.apache.org/jira/browse/BEAM-1920 I'll test scio integration with the Spark runner once that JIRA issue is done.

      sinisa_lyh Neville Li added a comment -

      Yes, that ecosystem has too many build parameters: Scala version, Spark version, Hadoop version, etc.
      Scala 2.10 is outdated, quite different from 2.11/2.12, and hard to maintain, which is why we stopped supporting it. 2.12 support should be available soon, once some compiler lambda serialization issues are addressed.

      yuchaoran2011 Chaoran Yu added a comment -

      Thanks Neville for the information! I tried Spark 1.6.3 but it had Scala incompatibilities with Scio. The version of Spark 1.6.3 included in the Beam Spark runner is compiled with Scala 2.10, but Scio is compiled with Scala 2.11. I had to change a few other dependencies to their 2.10-compiled versions, such as https://mvnrepository.com/artifact/me.lyh/protobuf-generic_2.10, but still got errors. My team and I will consider contributing to get the Spark runner fully working with scio when we can spend more time on the project.

      sinisa_lyh Neville Li added a comment -

      Looks like Spark runner still depends on 1.6.3. Can you give Spark 1.6 a shot instead?
      https://mvnrepository.com/artifact/org.apache.beam/beam-runners-spark/0.6.0

      We'd love to support all runners, but we only use the Dataflow runner and vanilla Spark. Contributions would be awesome and are definitely welcome. Feel free to submit issues or PRs on our GitHub repo. There's also a Gitter room and a Google group for discussions.
      https://github.com/spotify/scio

      yuchaoran2011 Chaoran Yu added a comment -

      After including the SparkRunner dependency, I got the following exception: "java.lang.NoClassDefFoundError: org/apache/spark/streaming/api/java/JavaStreamingContextFactory". That class no longer exists in Spark 2.0+. I'm sure that with deeper digging the word count example can be run with SparkRunner, but my point is that it's currently not as simple as it should be. The Spark runner, and most other runners supported by Beam, should receive the same support from scio as the Google Cloud Dataflow (GCD) runner does. Is this an area where Spotify would like to see contributions from the community?
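
A quick way to tell apart the two classpath failures reported in this thread (a runner that cannot be found, and a Spark class that the runner expects but the installed Spark no longer ships) is to probe for the classes involved. The sketch below is not part of Scio or Beam: `ClasspathProbe` and `isOnClasspath` are hypothetical names, and the fully qualified name `org.apache.beam.runners.spark.SparkRunner` is an assumption based on the Beam Spark runner artifact. Only the `JavaStreamingContextFactory` name is taken from the exception quoted above.

```java
public class ClasspathProbe {

    /**
     * Returns true if the named class can be loaded by this class's loader.
     * The class is not initialized, so no static initializers run.
     */
    public static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing class, or a class whose own dependencies cannot be linked.
            return false;
        }
    }

    public static void main(String[] args) {
        String[] names = {
            // Assumed class name of the Beam Spark runner.
            "org.apache.beam.runners.spark.SparkRunner",
            // Class named in the NoClassDefFoundError; reported absent in Spark 2.0+.
            "org.apache.spark.streaming.api.java.JavaStreamingContextFactory"
        };
        for (String name : names) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "missing"));
        }
    }
}
```

Run with the same classpath as the failing job: if the runner class is missing, the runner dependency was not added; if the runner is found but the streaming factory class is missing, the Spark version on the classpath is probably newer than the one the runner was built against.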

      sinisa_lyh Neville Li added a comment -

      You need the Spark runner dependency, which is not included by default.

      yuchaoran2011 Chaoran Yu added a comment -

      Neville Li
      I tried the latest SNAPSHOT version of scio but it's still not working with Spark runner out of the box. For example, running sbt "scio-examples/run-main com.spotify.scio.examples.WordCount --runner=SparkRunner --input=README.md --output=wc" gave me the following exception:
      java.lang.IllegalArgumentException: Unknown 'runner' specified 'SparkRunner', supported pipeline runners [DataflowRunner, DirectRunner]
      Caused by: java.lang.ClassNotFoundException: SparkRunner
      The same thing happened in the scio REPL. Looking at the code, more work is needed to integrate with Spark, Flink, etc.

      sinisa_lyh Neville Li added a comment -

      We prefer to keep it separate for now, mainly for logistical reasons:

      • we use SBT with lots of custom logic
      • we release very often, once every 1-2 weeks
      • we monkey-patch Beam bugs and test the fixes in our production jobs before the upstream Beam release
      • we use a lightweight collaboration model, mainly just GitHub issues & PRs
      • there are only 3 Scio developers at Spotify supporting 150+ internal users and many external ones, all running on Dataflow

      However, I also want to point out that nothing should stop those interested from trying it out or contributing:

      • we decoupled the Dataflow runner as much as possible
      • Scio should run on other runners without modification; it is just a matter of changing dependencies and arguments
      • there are still parts coupled with GCP and the Dataflow runner, but hopefully we can gradually decouple them as the file system and other related APIs improve
      • it'd be great to see bug reports and PRs from the community
      jbonofre Jean-Baptiste Onofré added a comment -

      Neville Li updated the branch to Beam 0.6.0, so I think we can discuss a merge into the Apache codebase after a little cleanup. Thoughts?

      nehalecky Nicholaus E Halecky added a comment -

      Hi all! Wonderful to see the progress made here so far. What is the current status of this effort?

      amitsela Amit Sela added a comment -

      Oh, got it, thanks!

      sinisa_lyh Neville Li added a comment -

      WIP branch here, using 0.4.0: https://github.com/spotify/scio/tree/apache-beam
      Ticket: https://github.com/spotify/scio/issues/279

      amitsela Amit Sela added a comment -

      You mean 0.5.0?

      jbonofre Jean-Baptiste Onofré added a comment -

      I updated to the 0.4.0 release and will work with Neville on the merge.

      amitsela Amit Sela added a comment -

      Davor Bonaci, where are we with the Scio integration?

      yuchaoran2011 Chaoran Yu added a comment -

      Thanks Amit for the clarification. Any idea in which Beam release the scio integration might be finished?

      amitsela Amit Sela added a comment -

      Scio currently supports the Dataflow SDK (more or less Beam's predecessor). Once it supports Beam, it could work with any runner that supports the Java SDK, since Scio is a Scala DSL running on top of the Java SDK.

      yuchaoran2011 Chaoran Yu added a comment -

      Looks like scio currently only supports Google Cloud Dataflow as the underlying runner. Now that the project is donated to Beam, are there any plans to support Spark, Flink and other runners?

      jbonofre Jean-Baptiste Onofré added a comment -

      Resuming tests and changes on Scio.

      jbonofre Jean-Baptiste Onofré added a comment -

      Awesome! Thanks. As discussed by e-mail, I have started testing it.

      sinisa_lyh Neville Li added a comment -

      I ported 2 modules over so far:

      • scio-core into sdks/scala/core, this is the core Scala DSL
      • scio-test into sdks/scala/core, this includes utilities for writing idiomatic Scala tests and tests for scio-core

      The question is: is sdks/scala the right place, or should we move it to another top-level module, e.g. dsls/scio?

      kenn Kenneth Knowles added a comment -

      Here you go! I will go ahead and make the name a little more precise, since right now it is more of a vague wish.

      sinisa_lyh Neville Li added a comment - edited

      I'm working on porting Scio to Beam. Can this be assigned to me?
      https://github.com/spotify/scio/tree/apache-beam
      https://github.com/nevillelyh/incubator-beam/tree/scio

        People

        • Assignee: Unassigned
        • Reporter: jbonofre Jean-Baptiste Onofré
        • Votes: 1
        • Watchers: 14
