Uploaded image for project: 'Calcite'
  1. Calcite
  2. CALCITE-5840

Incorrect SOUNDEX function semantics in MySQL library

    XMLWordPrintableJSON

Details

    • Bug
    • Status: In Progress
    • Major
    • Resolution: Unresolved
    • 1.34.0
    • None
    • core

    Description

      There are 2 problems.

      1. 4-char fixed length & mysql non-fixed length
      current in calcite:
      SELECT SOUNDEX('Quadratically');
          -> 'Q363'

      mysql
      mysql> SELECT SOUNDEX('Quadratically');
          -> 'Q36324'

      As docs illustrated below, A standard soundex string is four characters long, but the MySQL SOUNDEX() function returns an arbitrarily long string.

      https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_soundex
      https://dev.mysql.com/doc/refman/8.0/en/string-functions.html#function_soundex

      2. when current calcite mysql soundex deal with multi bytes string such as UTF8, it will cause IllegalArgumentException because we use Apache Commons Codec to get the result.

      however, MySQL is not this behavior. e.g.

      mysql> select soundex('字节');
      +-------------------+
      | soundex('字节')   |
      +-------------------+
      | 字000             |
      +-------------------+
      1 row in set (0.00 sec)
      
      mysql> select soundex('字节字节');
      +-------------------------+
      | soundex('字节字节')     |
      +-------------------------+
      | 字000                   |
      +-------------------------+
      1 row in set (0.10 sec)
      
      mysql> select soundex('字节字节字节');
      +-------------------------------+
      | soundex('字节字节字节')       |
      +-------------------------------+
      | 字000                         |
      +-------------------------------+
      1 row in set (0.03 sec)
      
      mysql> select soundex('バイト');
      +----------------------+
      | soundex('バイト')    |
      +----------------------+
      | バ000                |
      +----------------------+
      1 row in set (0.04 sec)
      
      mysql> select soundex('байт');
      +---------------------+
      | soundex('байт')     |
      +---------------------+
      | б000                |
      +---------------------+
      1 row in set (0.00 sec) 

      the different string parameter above means 'BYTE'.

      I think we need to correct the behavior of this function in MySQL library.

      Attachments

        Issue Links

          Activity

            People

              taoran Ran Tao
              taoran Ran Tao
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated: