[SPARK-18209] More robust view canonicalization without full SQL expansion - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: None
Component/s: SQL
Labels:
None

Target Version/s:

2.2.0

Description

Spark SQL currently stores views by analyzing the provided SQL and then generating fully expanded SQL out of the analyzed logical plan. This is actually a very error prone way of doing it, because:

1. It is non-trivial to guarantee that the generated SQL is correct without being extremely verbose, given the current set of operators.

2. We need extensive testing for all combination of operators.

3. Whenever we introduce a new logical plan operator, we need to be super careful because it might break SQL generation. This is the main reason broadcast join hint has taken forever to be merged because it is very difficult to guarantee correctness.

Given the two primary reasons to do view canonicalization is to provide the context for the database as well as star expansion, I think we can this through a simpler approach, by taking the user given SQL, analyze it, and just wrap the original SQL with a SELECT clause at the outer and store the database as a hint.

For example, given the following view creation SQL:

USE DATABASE my_db;
CREATE TABLE my_table (id int, name string);
CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;

We store the following SQL instead:

SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE id > 10);

During parsing time, we expand the view along using the provided database context.

(We don't need to follow exactly the same hint, as I'm merely illustrating the high level approach here.)

Note that there is a chance that the underlying base table(s)' schema change and the stored schema of the view might differ from the actual SQL schema. In that case, I think we should throw an exception at runtime to warn users. This exception can be controlled by a flag.

Update 1: based on the discussion below, we don't even need to put the view definition in a sub query. We can just add it via a logical plan at the end.

Update 2: we should make sure permanent views do not depend on temporary objects (views, tables, or functions).

Attachments

Issue Links

is related to

SPARK-16756 Add `sql` function to LogicalPlan and `NonSQLPlan` trait

Closed

SPARK-25797 Views created via 2.1 cannot be read via 2.2+

Resolved

SPARK-16475 Broadcast Hint for SQL Queries

Resolved

SPARK-16576 Move plan SQL generation code from SQLBuilder into logical operators

Closed

relates to

SPARK-17982 SQLBuilder should wrap the generated SQL with parenthesis for LIMIT

Resolved

SPARK-11012 Canonicalize view definitions

Resolved

links to

[Github] Pull Request #16168 (jiangxb1987)

[Github] Pull Request #16561 (jiangxb1987)

Design Doc

(1 relates to, 3 links to)

Sub-Tasks

1.	Disallow creating permanent views based on temporary views or UDFs	Resolved	Xiao Li
2.	Disallow cyclic view reference	Resolved	Xingbo Jiang
3.	Support resolve a nested view	Resolved	Xingbo Jiang
4.	Implement new approach to write a permanent view	Resolved	Xingbo Jiang
5.	Remove SQL builder for operators	Resolved	Xingbo Jiang
6.	Explicitly prevent Insert into View or Create View As Insert	Resolved	Xingbo Jiang
7.	Parse the view query in HiveSessionCatalog	Resolved	Wenchen Fan
8.	Remove the param `viewOriginalText` from `CatalogTable`	Resolved	Xingbo Jiang
9.	Improve the test coverage of SQLViewSuite	Resolved	Xingbo Jiang
10.	remove the view identifier in SubqueryAlias	Resolved	Xingbo Jiang
11.	Restrict the nested level of a view	Resolved	Xingbo Jiang

Activity

People

Assignee:: Unassigned

Reporter:: Reynold Xin

Votes:: 2 Vote for this issue

Watchers:: 22 Start watching this issue

Dates

Created:: 01/Nov/16 21:38

Updated:: 24/Oct/18 07:11

Resolved:: 08/Mar/17 18:25