[SPARK-22573] SQL Planner is including unnecessary columns in the projection - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 2.2.0
Fix Version/s: None
Component/s: SQL
Labels: None

Description

While running TPC-H query 18 for benchmarking, I observed that the query plan produced by Apache Spark 2.2.0 is less efficient than the plans produced by other versions. Spark 2.0.2 and 2.1.2 include only the required columns in their projections, but the Spark 2.2.0 planner includes unnecessary columns in the projections for some queries, which needlessly increases I/O. As a result, Spark 2.2.0 takes longer to run these queries.

Spark 2.1.2 TPC-H Query 18 Plan
Spark 2.2.0 TPC-H Query 18 Plan

TPC-H Query 18

select C_NAME, C_CUSTKEY, O_ORDERKEY, O_ORDERDATE, O_TOTALPRICE, sum(L_QUANTITY)
from CUSTOMER, ORDERS, LINEITEM
where O_ORDERKEY in (
    select L_ORDERKEY
    from LINEITEM
    group by L_ORDERKEY
    having sum(L_QUANTITY) > 300
  )
  and C_CUSTKEY = O_CUSTKEY
  and O_ORDERKEY = L_ORDERKEY
group by C_NAME, C_CUSTKEY, O_ORDERKEY, O_ORDERDATE, O_TOTALPRICE
order by O_TOTALPRICE desc, O_ORDERDATE
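
To make "required columns" concrete, here is a rough sketch that lists the columns the query text itself references, grouped by table. It is not part of the original report and does not use Spark. It assumes only the TPC-H naming convention visible in the query (the C_, O_ and L_ prefixes for CUSTOMER, ORDERS and LINEITEM), and the helper name `referenced_columns` is hypothetical. Any column outside these sets that shows up in a scan or projection of a plan would be a candidate for the unnecessary columns described above.

```python
import re
from collections import defaultdict

# TPC-H Query 18, as quoted in the report.
QUERY_18 = """
select C_NAME, C_CUSTKEY, O_ORDERKEY, O_ORDERDATE, O_TOTALPRICE, sum(L_QUANTITY)
from CUSTOMER, ORDERS, LINEITEM
where O_ORDERKEY in (
    select L_ORDERKEY from LINEITEM
    group by L_ORDERKEY having sum(L_QUANTITY) > 300
  )
  and C_CUSTKEY = O_CUSTKEY
  and O_ORDERKEY = L_ORDERKEY
group by C_NAME, C_CUSTKEY, O_ORDERKEY, O_ORDERDATE, O_TOTALPRICE
order by O_TOTALPRICE desc, O_ORDERDATE
"""

# Assumption: column prefixes identify the owning table, as in the query above.
PREFIX_TO_TABLE = {"C": "CUSTOMER", "O": "ORDERS", "L": "LINEITEM"}


def referenced_columns(sql):
    """Return {table: sorted column names} for prefixed columns found in sql."""
    columns = defaultdict(set)
    # Match identifiers such as C_NAME or L_QUANTITY; bare table names
    # (CUSTOMER, ORDERS, LINEITEM) have no underscore and are skipped.
    for prefix, name in re.findall(r"\b([COL])_([A-Z]+)\b", sql):
        columns[PREFIX_TO_TABLE[prefix]].add(f"{prefix}_{name}")
    return {table: sorted(cols) for table, cols in columns.items()}


if __name__ == "__main__":
    for table, cols in referenced_columns(QUERY_18).items():
        print(table, cols)
```

This is a textual approximation only: it does not parse SQL, so it would miscount queries that use aliases or unprefixed column names.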

Issue Links

duplicates: SPARK-19712 EXISTS and Left Semi join do not produce the same plan (Resolved)
links to: [Github] Pull Request #19804 (wangyum)

People

Assignee: Unassigned
Reporter: Rajkishore Hembram
Votes: 0
Watchers: 4

Dates

Created: 21/Nov/17 11:24
Updated: 26/Nov/17 13:49
Resolved: 26/Nov/17 13:49