Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4.0
    • Component/s: SparkR
    • Labels:
      None
    • Target Version/s:

      Description

      The SparkR project [1] provides a lightweight frontend to launch Spark jobs from R. The project was started at the AMPLab around a year ago and has been incubated as its own project to make sure it can be easily merged into upstream Spark, i.e. without introducing any external dependencies. SparkR's goals are similar to PySpark's, and it shares a similar design pattern, as described in our meetup talk [2] and Spark Summit presentation [3].

      Integrating SparkR into the Apache project will enable R users to use Spark out of the box and, given R's large user base, will help the Spark project reach more users. Additionally, work-in-progress features such as R integration with ML Pipelines and DataFrames can be better achieved by development in a unified code base.

      SparkR is available under the Apache 2.0 License and does not have any external dependencies other than requiring users to have R and Java installed on their machines. SparkR's developers come from many organizations, including UC Berkeley, Alteryx, and Intel, and we will support future development and maintenance after the integration.

      [1] https://github.com/amplab-extras/SparkR-pkg
      [2] http://files.meetup.com/3138542/SparkR-meetup.pdf
      [3] http://spark-summit.org/2014/talk/sparkr-interactive-r-programs-at-scale-2
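
      Purely for illustration, here is a minimal sketch of what launching a Spark job from R looks like with SparkR, in the style of the word-count example from the SparkR-pkg repository [1]. The function names (sparkR.init, textFile, flatMap, lapply, reduceByKey, collect) are taken from that package's documentation at the time and are assumptions here; they may differ in the version merged into Spark.

          # Minimal SparkR sketch; API names assumed from the amplab-extras SparkR-pkg [1]
          # and may differ in the version integrated into Spark 1.4.0.
          library(SparkR)

          # Connect to a local Spark instance.
          sc <- sparkR.init(master = "local")

          # Distributed word count over a text file (the HDFS path is illustrative).
          lines  <- textFile(sc, "hdfs:///data/README.md")
          words  <- flatMap(lines, function(line) strsplit(line, " ")[[1]])
          pairs  <- lapply(words, function(word) list(word, 1L))
          counts <- reduceByKey(pairs, "+", 2L)   # sum the 1L counts per word

          # Bring the results back into the local R session.
          head(collect(counts))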

        Issue Links

          Activity

          shivaram Shivaram Venkataraman added a comment -

          Issue resolved by pull request 5096
          https://github.com/apache/spark/pull/5096

          apachespark Apache Spark added a comment -

          User 'shivaram' has created a pull request for this issue:
          https://github.com/apache/spark/pull/5096

          apachespark Apache Spark added a comment -

          User 'davies' has created a pull request for this issue:
          https://github.com/apache/spark/pull/5077

          pwendell Patrick Wendell added a comment -

          I see the decision here as somewhat orthogonal to vendors and vendor packaging. Vendors can choose whether to package this component or not, and some may leave it out until it gets more mature. Of course, they are more encouraged/pressured to package things that end up inside the project itself, but that could be used to justify merging all kinds of random stuff into Spark, so I don't think it's a sufficient justification.

          The main argument, as I said before, is just that non-JVM language APIs are really not possible to maintain outside of the project, because they are not built on any even remotely "public" API. Imagine if we tried to have PySpark as its own project: it is so tightly coupled that it wouldn't work.

          I have argued in the past for things to exist outside the project when they can, and I still promote that strongly.

          harisekhon Hari Sekhon added a comment - edited

          OK, replace the word "packaging" with upstream integration and support, similar to HCatalog going into Hive because it makes sense. This way it's standardized across all platforms, not left to the whim of a particular vendor's packaging strategy to bolt it on for you, or to DIY. I agree with Matei and Jason that it seems like a logical extension of the major-language support that makes Spark so accessible. A lot of people know R and feel more comfortable sticking to their RStudio; this would surely benefit the Apache Spark project's popularity and accessibility even more, and help Databricks etc.

          srowen Sean Owen added a comment -

          Hari Sekhon Yes, we work for banks, as you know. I don't follow why SparkR has to be part of Apache Spark for, let's suppose, Cloudera to package SparkR in CDH if desired. These are different ideas. Packaging and support do have value, and take work; we agree. You're saying you don't want to do that packaging work, and who would? It doesn't follow that it's the Spark project that should do it for you. It shifts vendor-ish work to the volunteers in the open source project, IMHO. You say support is important, and it is, but putting SparkR in Spark does not create any production support for SparkR by itself – the Apache project is not something that people get support contracts from.

          You're saying there is market demand, and I don't doubt it. I think what you're arguing is that vendors like Cloudera should package and support SparkR, and that may be, though I do not see any demand on this end yet. But then I'd merely suggest we agree, and that this answers a different question than the one here, which is whether it should be a "mandatory" part of the Apache project. Not vendor distros.

          harisekhon Hari Sekhon added a comment -

          Sean - ever worked for a bank?

          What you've said is tantamount to saying Cloudera has zero value because people can download Apache Hadoop for free from the Apache website and carefully select compatible component versions (remember the Pig vs Hadoop version mismatches, anyone?), then hand-write all the XML and build all the automation and packaging themselves, then self-support it based on documentation and code diving (those days before CDH were good for learning and bad for productivity, btw).

          Commercial support and professional pre-packaged integration are very important to financials and other large traditional enterprises (e.g. Experian, another former employer) - exactly the environments where the vendors need to make their bread - those compile-it-yourself, self-supporting web-scale companies like the ones I worked for before Cloudera rarely pay vendors!

          Btw, I did build SparkR a few times - quite frankly, I'm sick of dealing with it for every cluster, every release, and differing versions of stuff that need to line up to avoid serial ID mismatch exceptions, etc.

          Nobody wants to give this to quants as a production tool without any support; because of the nature of these large environments, the buck has to stop with somebody - and nobody wants to put their own head on the chopping block for supplying unsupported technology - that's one of the reasons vendors like Databricks, Cloudera, Hortonworks, etc. exist.

          I know Alteryx are also eager for it - another tool we use, and another problem area of scale it would solve for all their customers (technically they could rewrite it in one of the other API languages, but given they already have modules in R, SparkR would make a bit more sense to port to) - as well as other data scientists I used to work with who were talking about wanting this early last year... we thought it would have happened by now... I even asked people a few months ago, such as one of the SparkR guys and vendors who I was told had spoken to Databricks about it, but I've just realized I should have also raised a jira like this directly here myself, as I usually do.

          Now that Revolution R has been bought by Microsoft, the timing for Databricks to add this is good too.

          srowen Sean Owen added a comment -

          Hari Sekhon I think that's an orthogonal concern. It can be packaged and aligned and all that by anyone, including you for your customer, without it being in the Spark project. Why do you tell them they can't use SparkR?

          The upside to integrating it is of course that it's forced to stay more aligned and probably more readily accessible. The downside is more maintenance burden passed on to the project, even for the majority of downstream consumers that don't care about R or SparkR. This is really not trivial. (I personally do not think this is something that belongs in the Spark project but would not strongly object.)

          FWIW, there are some customers I've talked to that are interested in R and Spark, but nobody has requested SparkR. This is probably mostly due to it all being still new to the mainstream. This colors my perception of the tradeoff, though; the view of demand from where I sit is a useful data point.

          harisekhon Hari Sekhon added a comment - edited

          SparkR absolutely must go into mainline and be shipped in core Spark by all the vendors... having to deal with it separately and compile it, with no support because it's an add-on, are all major barriers to enterprise adoption, and it hurts Spark's offering too.

          The stakeholders at my current banking client are literally crying out for SparkR over and over, and when we tell them to go use PySpark instead, they still insist that R is too important a language to them.

          Also some vendors that want to build on Spark would benefit from SparkR to replace existing product workflows where standard R integration is currently used (eg. Alteryx).

          jason.dai Jason Dai added a comment -

          I agree with this proposal. Given all ongoing efforts around data analytics in Spark (e.g., DataFrame, ml, etc.), an R frontend for Spark seems to be very well aligned with the project's future plans.

          srowen Sean Owen added a comment -

          (SGTM, that's good authoritative reasoning. I was mostly prompting the question.)

          matei Matei Zaharia added a comment -

          Yup, there's a tradeoff, but given that this is a language API and not an algorithm, input source or anything like that, I think it's important to support it along with the core engine. R is extremely popular for data science, more so than Python, and it fits well with many existing concepts in Spark.

          pwendell Patrick Wendell added a comment -

          It's a fair point to ask about something like this, and I am a huge supporter of having more decoupled, community-driven projects around Spark. The main criteria I think are worth evaluating are the benefit to the project, the long-term maintenance responsibility, and how easy it would be for the project to exist outside of the Spark codebase. In this case, for a language API like this, it's hard for me to see it succeeding outside of the project. If you look at PySpark, there are a lot of internal optimizations and modifications we end up doing for it, because it doesn't build cleanly on top of other APIs. In fairness, that reasoning alone could justify having a million random language APIs in Spark, which I don't think we want. So for me there is also a sense of deciding whether R is worth elevating to a first-class language in the Spark community. My feeling is that this is worth doing given what I've perceived of the demand for R. I wouldn't anticipate adding any additional language APIs in the future. However, let's continue to discuss. It's by no means black-and-white.

          shivaram Shivaram Venkataraman added a comment -

          Thanks Sean Owen for your comment. As with anything, there is a cost-benefit trade-off here, and I think in this case the benefits are significant. In the short term, the integration will help with stable, coordinated releases for users and make it easier for downstream packaging efforts, etc. In the longer term, given R's popularity as a data science language, I think it's crucial for the Spark project to have a well-supported interface for R users – whether that looks like RDDs or DataFrames etc. is a different question – and I think from the project's perspective this is a great opportunity to reach more users.

          In terms of complexity, the SparkR code base has only around 1000 lines of Scala code and 4000 lines of R code (a third of which are test cases) and is pretty small compared to most of the other components.

          Anyway, that's the trade-off I see, and as this JIRA is more of an RFC, it'll be great to hear other viewpoints as well.

          srowen Sean Owen added a comment -

          Predictably, I'll ask: is this not something that could simply remain a stand-alone project? It can be given visibility at http://spark-packages.org/. The argument for putting everything into one code base so as to keep it synchronized has its limits, and I sense there is push-back on adding any more to an already complex project. JMHO


            People

            • Assignee:
              shivaram Shivaram Venkataraman
            • Reporter:
              shivaram Shivaram Venkataraman
            • Votes:
              4
            • Watchers:
              26

              Dates

              • Created:
                Updated:
                Resolved:

                Development