Details
-
Bug
-
Status: In Progress
-
Major
-
Resolution: Unresolved
-
1.34.0
-
None
Description
There are 2 problems.
1. 4-char fixed length & mysql non-fixed length
current in calcite:
SELECT SOUNDEX('Quadratically');
-> 'Q363'
mysql
mysql> SELECT SOUNDEX('Quadratically');
-> 'Q36324'
As docs illustrated below, A standard soundex string is four characters long, but the MySQL SOUNDEX() function returns an arbitrarily long string.
https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_soundex
https://dev.mysql.com/doc/refman/8.0/en/string-functions.html#function_soundex
2. when current calcite mysql soundex deal with multi bytes string such as UTF8, it will cause IllegalArgumentException because we use Apache Commons Codec to get the result.
however, MySQL is not this behavior. e.g.
mysql> select soundex('字节'); +-------------------+ | soundex('字节') | +-------------------+ | 字000 | +-------------------+ 1 row in set (0.00 sec) mysql> select soundex('字节字节'); +-------------------------+ | soundex('字节字节') | +-------------------------+ | 字000 | +-------------------------+ 1 row in set (0.10 sec) mysql> select soundex('字节字节字节'); +-------------------------------+ | soundex('字节字节字节') | +-------------------------------+ | 字000 | +-------------------------------+ 1 row in set (0.03 sec) mysql> select soundex('バイト'); +----------------------+ | soundex('バイト') | +----------------------+ | バ000 | +----------------------+ 1 row in set (0.04 sec) mysql> select soundex('байт'); +---------------------+ | soundex('байт') | +---------------------+ | б000 | +---------------------+ 1 row in set (0.00 sec)
the different string parameter above means 'BYTE'.
I think we need to correct the behavior of this function in MySQL library.
Attachments
Issue Links
- is related to
-
CALCITE-2965 Implement string functions: REPEAT, SPACE, SOUNDEX, DIFFERENCE
-
- Closed
-
- links to