[SPARK-21520] Improvement a special case for non-deterministic projects in optimizer - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.3.0
Fix Version/s: None
Component/s: SQL
Labels:
- bulk-closed

Description

Currently, the optimizer does a lot of special handling for non-deterministic projects and filters, but it is not good enough. This patch adds a new special case for non-deterministic projects, so that the optimizer reads only the fields the user needs.
For example, when the project list contains a non-deterministic function (the rand function), the optimizer generates the following executedPlan:

*HashAggregate(keys=k#403L, functions=partial_sum(cast(id#402 as bigint)), output=k#403L, sum#800L)
+- Project d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 10000.0)) AS k#403L
   +- HiveTableScan c030#606L, d004#607, d005#608, d025#609, c002#610, d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields, MetastoreRelation XXX_database, XXX_table

HiveTableScan reads all of the fields from the table, but only ‘d004’ is needed. Reading the unused fields hurts the performance of the task.
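The pruning this issue asks for can be illustrated with a small toy model. This is only a sketch of the idea, not Spark's optimizer code: `Alias`, `references`, and `pruned_scan_columns` are hypothetical names, the table is shortened to five of the fields from the plan above, and column references are found by simple identifier matching. It shows that the scan needs only the columns the project list references, whether or not an expression such as rand is deterministic.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Alias:
    """One entry of a project list (hypothetical toy type, not a Spark class)."""
    expr: str                   # expression text, e.g. "d004"
    name: str                   # output name, e.g. "id"
    deterministic: bool = True  # False for expressions such as rand(...)


def references(expr, table_columns):
    """Return the table columns that appear as whole identifiers in expr."""
    tokens = set(re.findall(r"[A-Za-z_][A-Za-z_0-9]*", expr))
    return [c for c in table_columns if c in tokens]


def pruned_scan_columns(project_list, table_columns):
    """Columns the scan must read so that every projected expression can be evaluated.

    The deterministic flag is deliberately ignored here: in this toy model a
    non-deterministic expression does not change which columns are referenced.
    """
    needed = set()
    for alias in project_list:
        needed.update(references(alias.expr, table_columns))
    # Keep the table's own column order for a stable result.
    return [c for c in table_columns if c in needed]


# Shortened stand-in for the wide Hive table in the plan above.
table = ["c030", "d004", "d005", "d025", "c002"]
project = [
    Alias("d004", "id"),
    Alias("FLOOR(rand(8828525941469309371) * 10000.0)", "k", deterministic=False),
]

scan = pruned_scan_columns(project, table)
print(scan)  # ['d004']
```

In this model the rand expression references no table column, so the scan is reduced to ‘d004’ alone, which is the behavior the description asks for.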

Issue Links

is duplicated by

SPARK-14172 Hive table partition predicate not passed down correctly

Resolved

SPARK-27969 Non-deterministic expressions in filters or projects can unnecessarily prevent all scan-time column pruning, harming performance

Resolved

links to

[Github] Pull Request #18969 (heary-cao)

https://github.com/apache/spark/pull/18725

https://github.com/apache/spark/pull/18892

People

Assignee: Unassigned

Reporter: caoxuewen

Votes: 0

Watchers: 3

Dates

Created: 24/Jul/17 10:34

Updated: 25/May/21 01:49

Resolved: 25/May/21 01:42