I did some investigation, and here is what, in my view, needs to be done to support integration between Ignite and Spark DataFrames.
- Provide an implementation of BaseRelation mixed with PrunedFilteredScan. It should be able to execute a query based on the provided filters and selected fields, and return an RDD that iterates through the results. Since an RDD works on a per-partition level, most likely we will need to add the ability to run a SQL query against a particular partition (see the sketch after this list).
- Provide an implementation of Catalog to properly look up Ignite relations.
- Create an IgniteSQLContext that will override the catalog.
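
To make the first step more concrete, here is a minimal sketch of such a relation, assuming the Spark 1.3+ data source API (org.apache.spark.sql.sources). The names IgniteRelation, cacheName and igniteSqlRdd are hypothetical, the value quoting is naive, and the actual execution of the query against Ignite (as well as the Catalog and IgniteSQLContext pieces) is left out:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, GreaterThan, LessThan, PrunedFilteredScan}
    import org.apache.spark.sql.types.StructType

    // Hypothetical relation backed by an Ignite cache. IgniteRelation, cacheName
    // and igniteSqlRdd are assumptions made for this sketch only.
    case class IgniteRelation(cacheName: String, override val schema: StructType)
                             (@transient override val sqlContext: SQLContext)
      extends BaseRelation with PrunedFilteredScan {

      // Push the requested columns and filters down into a single Ignite SQL query
      // and expose its result set as an RDD[Row]. A real implementation would map
      // each RDD partition onto a subset of Ignite partitions.
      override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
        val select = if (requiredColumns.isEmpty) "*" else requiredColumns.mkString(", ")
        val where  = filters.flatMap(compileFilter)
        val sql    = s"SELECT $select FROM $cacheName" +
          (if (where.isEmpty) "" else where.mkString(" WHERE ", " AND ", ""))
        igniteSqlRdd(sqlContext.sparkContext, sql)
      }

      // Translate the push-down filters into SQL predicates. Naive quoting, for
      // illustration only; unsupported filters are skipped here and re-evaluated
      // by Spark on top of the returned rows.
      private def compileFilter(f: Filter): Option[String] = f match {
        case EqualTo(attr, v)     => Some(s"$attr = '$v'")
        case GreaterThan(attr, v) => Some(s"$attr > '$v'")
        case LessThan(attr, v)    => Some(s"$attr < '$v'")
        case _                    => None
      }

      // Running the query against Ignite is out of scope for this sketch; it would
      // use the Ignite SQL API and, ideally, execute per Ignite partition.
      private def igniteSqlRdd(sc: SparkContext, sql: String): RDD[Row] = ???
    }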
The steps above will add a new data source to Spark. Generally, however, when Spark executes a query, it first fetches data from the source into its own memory to create RDDs. This is not enough for Ignite, because the data is already in memory. When only Ignite data participates in the query, we want Spark to issue the query directly to Ignite.
To accomplish this, we can provide our own implementation of Strategy, which Spark uses to convert a logical plan into a physical plan. For any type of LogicalPlan, this custom strategy should be able to generate a SQL query for Ignite based on the whole plan tree. If there are non-Ignite relations in the plan, we should fall back to the native Spark strategies (return Nil instead of a physical plan).
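
A rough sketch of such a strategy follows. IgniteStrategy, IgniteQueryPlan and planToIgniteSql are hypothetical names, and the SparkPlan contract shown (execute() returning RDD[Row]) assumes the Spark 1.3-era API, which differs in later versions:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, Strategy}
    import org.apache.spark.sql.catalyst.expressions.Attribute
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.execution.SparkPlan

    // Hypothetical leaf physical node that runs the generated SQL in Ignite and
    // exposes the result as an RDD[Row] (execution details omitted).
    case class IgniteQueryPlan(sql: String, output: Seq[Attribute]) extends SparkPlan {
      override def children: Seq[SparkPlan] = Nil
      override def execute(): RDD[Row] = ???
    }

    // Custom strategy: if the whole logical plan refers only to Ignite relations,
    // compile it into a single Ignite SQL query; otherwise return Nil so that
    // Spark falls back to its native strategies.
    object IgniteStrategy extends Strategy {
      override def apply(plan: LogicalPlan): Seq[SparkPlan] =
        planToIgniteSql(plan) match {
          case Some((sql, output)) => IgniteQueryPlan(sql, output) :: Nil
          case None                => Nil
        }

      // Walks the plan tree (projections, filters, joins, relations) and produces
      // an Ignite SQL query plus its output attributes, if and only if every leaf
      // is an Ignite relation. Left abstract in this sketch.
      private def planToIgniteSql(plan: LogicalPlan): Option[(String, Seq[Attribute])] = ???
    }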
IgniteSQLContext should append the custom strategy to the collection of Spark strategies. Here is a good example of how a custom strategy can be created and injected: https://gist.github.com/marmbrus/f3d121a1bc5b6d6b57b9
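
Registration could then look roughly like this (a sketch assuming the hypothetical IgniteStrategy object above and the public SQLContext.experimental.extraStrategies extension point available since Spark 1.3):

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    // Sketch: an SQLContext that injects the custom Ignite strategy. A complete
    // IgniteSQLContext would also override the catalog so that Ignite caches can
    // be referenced by name in SQL (see the steps above).
    class IgniteSQLContext(sc: SparkContext) extends SQLContext(sc) {
      experimental.extraStrategies = IgniteStrategy +: experimental.extraStrategies
    }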