IGNITE-3084: Spark Data Frames Support in Apache Ignite

    Details

    • Type: Task
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.5.0.final
    • Fix Version/s: 2.4
    • Component/s: spark
    • Labels:

      Description

      Apache Spark already benefits from integration with Apache Ignite: the latter provides shared RDDs, an implementation of the Spark RDD abstraction that lets Spark workers share state and execute SQL queries much faster. The next logical step is to support the modern Spark Data Frames API in a similar way.

      As a contributor, you will be fully in charge of integrating the Spark Data Frame API with Apache Ignite.

        Issue Links

          Activity

          vkulichenko Valentin Kulichenko added a comment -

          I did some investigation, and here is what, in my view, needs to be done to support integration between Ignite and Spark DataFrames.

          1. Provide an implementation of BaseRelation mixed with PrunedFilteredScan. It should be able to execute a query based on the provided filters and selected fields and return an RDD that iterates through the results. Since RDDs work at the per-partition level, we will most likely need to add the ability to run a SQL query on a particular partition.
          2. Provide an implementation of Catalog to properly look up Ignite relations.
          3. Create an IgniteSQLContext that overrides the catalog.
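
          Step 1 can be sketched roughly as follows. This is a minimal outline, not the actual implementation: the Spark traits BaseRelation, PrunedFilteredScan and the Filter classes are real Spark SQL API, but IgniteSqlRelation, igniteSchema and igniteRddForQuery are hypothetical names standing in for the Ignite-side code that would have to be written.

          ```scala
          import org.apache.spark.rdd.RDD
          import org.apache.spark.sql.{Row, SQLContext}
          import org.apache.spark.sql.sources._
          import org.apache.spark.sql.types.StructType

          // A relation that pushes column pruning and filter push-down into an
          // Ignite SQL query. `igniteSchema` and `igniteRddForQuery` are
          // hypothetical helpers, not real Ignite API.
          class IgniteSqlRelation(override val sqlContext: SQLContext, table: String)
            extends BaseRelation with PrunedFilteredScan {

            // The schema would be derived from the Ignite SQL table metadata.
            override def schema: StructType = igniteSchema(table)

            override def buildScan(requiredColumns: Array[String],
                                   filters: Array[Filter]): RDD[Row] = {
              // Translate the filters we know how to handle into a WHERE clause;
              // anything we cannot translate is re-evaluated by Spark anyway.
              val where = filters.collect {
                case EqualTo(attr, v)     => s"$attr = '$v'"
                case GreaterThan(attr, v) => s"$attr > '$v'"
                case LessThan(attr, v)    => s"$attr < '$v'"
              }
              val sql = s"SELECT ${requiredColumns.mkString(", ")} FROM $table" +
                (if (where.isEmpty) "" else where.mkString(" WHERE ", " AND ", ""))

              // Ideally each RDD partition would run this query against a single
              // Ignite partition (hence the need for per-partition SQL execution).
              igniteRddForQuery(sqlContext.sparkContext, sql)
            }
          }
          ```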

          The steps above will add a new data source to Spark. However, in general, when Spark executes a query it first fetches data from the source into its own memory to create RDDs. That is not enough for Ignite, because the data is already in memory. When only Ignite data participates in a query, we want Spark to issue the query directly to Ignite.

          To accomplish this, we can provide our own implementation of Strategy, which Spark uses to convert a logical plan into a physical plan. For any type of LogicalPlan, this custom strategy should be able to generate a SQL query for Ignite based on the whole plan tree. If there are non-Ignite relations in the plan, we should fall back to the native Spark strategies (return Nil as the physical plan).

          IgniteSQLContext should append the custom strategy to the collection of Spark strategies. Here is a good example of how a custom strategy can be created and injected: https://gist.github.com/marmbrus/f3d121a1bc5b6d6b57b9
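
          The custom strategy described above could be sketched like this. `Strategy` and `experimental.extraStrategies` are real Spark SQL API; `isPureIgnitePlan`, `compileToIgniteSql` and `IgniteQueryExec` are hypothetical placeholders for the logic this ticket would add.

          ```scala
          import org.apache.spark.sql.Strategy
          import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
          import org.apache.spark.sql.execution.SparkPlan

          // Claim the whole plan only when every relation in it is an Ignite
          // relation; otherwise return Nil so Spark falls back to its native
          // strategies.
          object IgniteStrategy extends Strategy {
            override def apply(plan: LogicalPlan): Seq[SparkPlan] =
              if (isPureIgnitePlan(plan))                       // hypothetical check
                IgniteQueryExec(compileToIgniteSql(plan)) :: Nil // hypothetical exec node
              else
                Nil
          }

          // Injection point: Spark exposes experimental.extraStrategies for this.
          // sqlContext.experimental.extraStrategies = IgniteStrategy :: Nil
          ```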

          vozerov Vladimir Ozerov added a comment -

          Val,

          Cool analysis! I would say that executing a query on a particular partition is a very useful feature. Not only will it help us with Spark, it will also allow us to perform certain useful SQL optimizations (e.g. IGNITE-4509 and IGNITE-4510).

          I am not quite sure I understand how we would work with plans and strategies. Does it mean that we will have to analyze the SQL somehow (e.g. build an AST) to give correct hints to Spark?

          vkulichenko Valentin Kulichenko added a comment -

          The logical plan (which is essentially an AST) is built by Spark from the API calls you make. It supports both SQL (Spark parses it itself in this case) and chained methods like filter(..), join(..), etc. The logical plan is then converted into a physical plan, which defines how the query is actually executed. So basically we need a strategy that generates a SQL query for Ignite based on the AST provided by Spark.
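
          For example, assuming a SparkSession named `spark` with a registered table `person` (names chosen here for illustration), both of the following produce the same Catalyst logical plan, which a custom strategy could then compile into a single Ignite SQL query:

          ```scala
          // SQL form: Spark parses the string into a logical plan itself.
          val viaSql = spark.sql("SELECT name FROM person WHERE age > 30")

          // Chained-method form: the same logical plan, built from API calls.
          val viaApi = spark.table("person").filter("age > 30").select("name")

          // explain(true) prints the parsed/analyzed/optimized logical plans
          // and the physical plan that the registered strategies produced.
          viaApi.explain(true)
          ```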

          In addition to this, MemSQL provides an option to execute a SQL query as-is when the SQLContext.sql(..) method is called (i.e. it bypasses the Spark query parser/planner). I am not sure this is really useful, because it implies adding another method on top of the standard API, but it's fairly easy to add, so it makes sense to do the same.
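
          Such a pass-through might look like the sketch below. The method name `igniteSql` and the helpers `igniteRddForQuery` and `igniteSchemaOf` are hypothetical; only SQLContext and createDataFrame are real Spark API.

          ```scala
          import org.apache.spark.SparkContext
          import org.apache.spark.sql.{DataFrame, SQLContext}

          // Hypothetical: send the query string straight to Ignite, bypassing
          // the Spark parser/planner, and wrap the result rows in a DataFrame.
          class IgniteSQLContext(sc: SparkContext) extends SQLContext(sc) {
            def igniteSql(query: String): DataFrame =
              createDataFrame(igniteRddForQuery(sparkContext, query),
                              igniteSchemaOf(query))
          }
          ```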

          polinank Polina Koleva added a comment - - edited

          Hi,
          I am Polina, a second-year Computer Science master's student at the University of Freiburg, Germany. I want to participate in GSoC 2017, and I find this Jira task interesting. I have already worked on a master's project with Spark (mainly Spark SQL), Hadoop, and Hive. Moreover, I have been working as a part-time Java developer since 2013. Unfortunately, I do not have any experience with Ignite. Can you provide any advice or a good starting point for understanding this issue?

          githubbot ASF GitHub Bot added a comment -

          GitHub user nizhikov opened a pull request:

          https://github.com/apache/ignite/pull/2742

          IGNITE-3084: Prototype of Ignite DataFrame support.

          • DataSource and DataFrame implementation for Ignite SQL tables.
          • IgniteCatalog implementation.
          • Some examples provided.
          • There is only a placeholder for the Ignite extra optimizations and strategies.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/nizhikov/ignite IGNITE-3084

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/ignite/pull/2742.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #2742


          commit f54c51e863665b170a6c59d6c57a9dd52caf3a13
          Author: Nikolay Izhikov <nizhikov.dev@gmail.com>
          Date: 2017-09-05T14:02:46Z

          IGNITE-3084: Prototype of Ignite DataFrame support. IgniteCatalog implementation. Some examples provided. There is only placeholer for an Ignite extra optimization and strategies.


          NIzhikov Nikolay Izhikov added a comment -

          Valentin Kulichenko, Anton Vinogradov, hello.

          Please review my changes.


            People

            • Assignee: NIzhikov Nikolay Izhikov
            • Reporter: vozerov Vladimir Ozerov
            • Votes: 3
            • Watchers: 9
