Details
-
Improvement
-
Status: Resolved
-
Minor
-
Resolution: Fixed
-
None
-
Description
Query: select <column-list> from <table> where colA like '%a%' or colA like '%xyz%';
Improvement Opportunities
- Avoid isAscii computation (full access of the input string) since we're dealing with the same column twice
- Optimize the "contains" for-loop
Implementation Details
1)
- Added a new integer variable "asciiMode" to the VarCharHolder class
- The default value is -1 which indicates this info is not known
- Otherwise this value will be set to either 1 or 0 based on the string being in ASCII mode or Unicode
- The execution plan already shares the same VarCharHolder instance for all evaluations of the same column value
- The asciiMode will be correctly set during the first LIKE evaluation and will be reused across other LIKE evaluations
2)
- The "Contains" LIKE operation is quite expensive as the code needs to access the input string to perform character based comparisons
- Created 4 versions of the same for-loop to a) make the loop simpler to optimize (Vectorization) and b) minimize comparisons
Benchmarks
- Lineitem table 100GB
- Query: select l_returnflag, count from dfs.`<source>` where l_comment not like '%a%' or l_comment like '%the%' group by l_returnflag
- Before changes: 33sec
- After changes : 27sec