[HUDI-3594] Support standard Spark functions in Filter Exprs in Data Skipping - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.11.0
Component/s: None
Labels:
- pull-request-available

Story Points: 2
Epic Link: RFC-27 Multi Modal Indexing

Description

As part of this effort, we plan to support (at the very least) a suite of standard Spark functions when evaluating data filtering expressions within the Data Skipping flow. For example, when a user issues the following query:

SELECT ... WHERE date_format(ts, 'yyyy-MM-dd') > '2022-01-01'

We are able to relate such a query to our Column Stats Index appropriately, and can therefore perform Data Skipping not only on the "raw" columns but also on simple derived expressions on top of them (such as standard function calls).
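The idea can be illustrated with a minimal, self-contained sketch. This is not Hudi's implementation: the `column_stats` layout, the file names, and the helpers `fmt` and `files_to_read` are hypothetical, and Python's `strftime("%Y-%m-%d")` stands in for an order-preserving call such as `date_format(ts, 'yyyy-MM-dd')`. Under those assumptions, the function is applied to each file's recorded maximum and the file is skipped when even the transformed maximum cannot satisfy the predicate.

```python
from datetime import datetime

# Hypothetical per-file (min, max) statistics for column `ts`.
# This is an illustration only, not Hudi's actual Column Stats Index layout.
column_stats = {
    "file_a.parquet": (datetime(2021, 11, 1), datetime(2021, 12, 31)),
    "file_b.parquet": (datetime(2021, 12, 15), datetime(2022, 2, 1)),
    "file_c.parquet": (datetime(2022, 3, 1), datetime(2022, 3, 31)),
}


def fmt(ts):
    # Stand-in for date_format(ts, 'yyyy-MM-dd'): later timestamps never
    # produce a lexicographically smaller string (for four-digit years).
    return ts.strftime("%Y-%m-%d")


def files_to_read(stats, transform, literal):
    # Predicate: transform(ts) > literal.
    # If transform is order-preserving, transform(max) bounds every
    # transformed value in the file from above, so a file whose
    # transformed max is <= literal cannot contain a matching row.
    return sorted(
        name
        for name, (_col_min, col_max) in stats.items()
        if transform(col_max) > literal
    )


candidates = files_to_read(column_stats, fmt, "2022-01-01")
print(candidates)
```

With these made-up statistics, `file_a.parquet` is skipped because its transformed maximum is `2021-12-31`, while the other two files remain candidates and still have to be read and filtered row by row.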

It is important to note that only transformations that preserve the ordering of the source column can be applied. Transformations that do not preserve the ordering render the Column Stats Index practically irrelevant, since no assumption can be made that the values derived by such transformations are ordered the same way as the source values.
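A small counterexample shows why the ordering requirement matters. The sketch below uses made-up rows and a hypothetical `day_of_month` helper (a stand-in for a non-order-preserving call such as formatting a timestamp as just its day); none of the names come from Hudi. Applying the transformation to the recorded maximum would lead to skipping a file that actually contains a matching row.

```python
from datetime import datetime

# Made-up contents of a single file; min/max are what a column
# statistics entry for `ts` would record for it.
rows = [datetime(2022, 1, 20), datetime(2022, 1, 31), datetime(2022, 2, 5)]
col_min, col_max = min(rows), max(rows)


def day_of_month(ts):
    # NOT order-preserving: 2022-02-05 is later than 2022-01-31,
    # yet "05" sorts before "31".
    return ts.strftime("%d")


literal = "25"

# Naive pruning decision based only on the transformed maximum.
may_match_per_stats = day_of_month(col_max) > literal

# Ground truth obtained by checking every row.
actually_matches = any(day_of_month(ts) > literal for ts in rows)

print(may_match_per_stats, actually_matches)
```

Here the statistics-based check says the file cannot match, while the row for 31 January does satisfy the predicate, so skipping the file would silently drop a result.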

Issue Links

relates to

HUDI-512 Support Logical Partitioning with Expression Index

Closed

links to

GitHub Pull Request #4996

People

Assignee: Alexey Kudinkin

Reporter: Alexey Kudinkin

Votes: 0

Watchers: 2

Dates

Created: 09/Mar/22 20:58

Updated: 25/Mar/22 15:36

Resolved: 25/Mar/22 15:36