Details
-
Improvement
-
Status: Resolved
-
Minor
-
Resolution: Duplicate
-
1.6.2, 2.1.0
-
None
-
None
Description
Given two dataframes loaded from the same JDBC connection:
UnoptimizedJDBCJoin.scala
val ordersDF = spark.read .format("jdbc") .option("url", "jdbc:postgresql:dbserver") .option("dbtable", "northwind.orders") .option("user", "username") .option("password", "password") .load().toDS val productDF = spark.read .format("jdbc") .option("url", "jdbc:postgresql:dbserver") .option("dbtable", "northwind.product") .option("user", "username") .option("password", "password") .load().toDS ordersDF.createOrReplaceTempView("orders") productDF.createOrReplaceTempView("product") // Followed by a join between them: val ordersByProduct = sql("SELECT p.name, SUM(o.qty) AS qty FROM orders AS o INNER JOIN product AS p ON o.product_id = p.product_id GROUP BY p.name")
Catalyst should optimize the query to be:
SELECT northwind.product.name, SUM(northwind.orders.qty)
FROM northwind.orders
INNER JOIN northwind.product ON
northwind.orders.product_id = northwind.product.product_id
GROUP BY p.name
Attachments
Issue Links
- duplicates
-
SPARK-12449 Pushing down arbitrary logical plans to data sources
- Resolved