[SPARK-20259] Support push down join optimizations in DataFrameReader when loading from JDBC - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Duplicate
Affects Version/s: 1.6.2, 2.1.0
Fix Version/s: None
Component/s: SQL
Labels:
None

Description

Given two dataframes loaded from the same JDBC connection:

UnoptimizedJDBCJoin.scala

val ordersDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "northwind.orders")
  .option("user", "username")
  .option("password", "password")
  .load().toDS
  
val productDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "northwind.product")
  .option("user", "username")
  .option("password", "password")
  .load().toDS
  
ordersDF.createOrReplaceTempView("orders")
productDF.createOrReplaceTempView("product")

// Followed by a join between them:
val ordersByProduct = sql("SELECT p.name, SUM(o.qty) AS qty FROM orders AS o INNER JOIN product AS p ON o.product_id = p.product_id GROUP BY p.name")

Catalyst should optimize the query to be:
SELECT northwind.product.name, SUM(northwind.orders.qty)
FROM northwind.orders
INNER JOIN northwind.product ON
northwind.orders.product_id = northwind.product.product_id
GROUP BY p.name

Attachments

Issue Links

duplicates

SPARK-12449 Pushing down arbitrary logical plans to data sources

Resolved

Activity

People

Assignee:: Unassigned

Reporter:: John Muller

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 07/Apr/17 19:11

Updated:: 12/Dec/22 18:10

Resolved:: 10/Apr/17 03:50