Informational Referential Integrity Constraints Support in Spark
This work proposes support for informational primary key and foreign key (referential integrity) constraints in Spark. The main goal is to open up a class of query optimization techniques that rely on referential-integrity semantics.
An informational (or statistical) constraint is a constraint, such as a unique, primary key, foreign key, or check constraint, that Spark can use to improve query performance. Informational constraints are not enforced by the Spark SQL engine; rather, Catalyst uses them to optimize query processing. They provide semantic information that allows Catalyst to rewrite queries to eliminate joins, push down aggregates, remove unnecessary Distinct operations, and perform a number of other optimizations. Informational constraints are primarily aimed at applications that load and analyze data originating from a data warehouse. For such applications, the conditions for a given constraint are known to be true, so the constraint does not need to be enforced during data load operations.
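To illustrate the join-elimination case, here is a minimal sketch in plain Python standing in for Catalyst's rewrite; the table names, columns, and data are hypothetical. When a query joins a fact table to a dimension table on a declared foreign key but projects only fact-side columns, the foreign key guarantees every fact row finds a match, and the primary key guarantees it finds exactly one, so the join can be dropped entirely:

```python
# Hypothetical tables: every facts.cust_id (FK) matches a customers.id (PK),
# and PK values are unique -- exactly what an informational RI constraint asserts.
customers = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]     # PK: id
facts = [{"cust_id": 1, "amt": 10}, {"cust_id": 1, "amt": 20},
         {"cust_id": 2, "amt": 30}]                              # FK: cust_id -> customers.id

def join_then_project(facts, customers):
    """SELECT f.amt FROM facts f JOIN customers c ON f.cust_id = c.id"""
    return [f["amt"] for f in facts for c in customers if f["cust_id"] == c["id"]]

def project_only(facts):
    """SELECT f.amt FROM facts f  -- join eliminated using the FK/PK constraint"""
    return [f["amt"] for f in facts]

# Under the RI constraint the join neither drops rows (the FK always matches)
# nor duplicates them (the PK is unique), so the two plans are equivalent.
assert join_then_project(facts, customers) == project_only(facts)
```

In Catalyst this equivalence would be applied at the logical-plan level, turning the join into a simple scan of the fact table.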
The attached document covers constraint definition, metastore storage, constraint validation, and maintenance. It includes many examples of query performance improvements that exploit referential integrity constraints and could be implemented in Spark.
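One of the simpler rewrites in this family is Distinct removal. The sketch below, again in plain Python with a hypothetical table, shows why a Distinct over a declared primary key column is a no-op: the informational PK already asserts the values are unique, so the deduplication step can be removed from the plan.

```python
# Hypothetical table with a declared (informational) primary key on "id".
orders = [{"id": 100, "item": "x"}, {"id": 101, "item": "y"}, {"id": 102, "item": "x"}]

def distinct_ids(orders):
    """SELECT DISTINCT id FROM orders  -- the naive plan deduplicates"""
    seen, out = set(), []
    for row in orders:
        if row["id"] not in seen:
            seen.add(row["id"])
            out.append(row["id"])
    return out

def ids_no_distinct(orders):
    """SELECT id FROM orders  -- Distinct removed: a PK column is already unique"""
    return [row["id"] for row in orders]

# Because "id" is a primary key, deduplication cannot change the result.
assert distinct_ids(orders) == ids_no_distinct(orders)
```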
Link to the Google doc: InformationalRIConstraints