Hive
  1. Hive
  2. HIVE-6843

INSTR for UTF-8 returns incorrect position

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 0.11.0, 0.12.0
    • Fix Version/s: 0.14.0
    • Component/s: UDF
    • Labels:
      None
    1. HIVE-6843.patch
      3 kB
      Szehon Ho
    2. HIVE-6843.2.patch
      4 kB
      Szehon Ho

      Issue Links

        Activity

        Clif Kranish created issue -
        Hide
        Clif Kranish added a comment -

        Using the INSTR fuction to find the posistion of a substring for a UTF-8 returns zero

        select INSTR (‘НАСТРОЕние’, ‘P’) from foo-bar

        Show
        Clif Kranish added a comment - Using the INSTR fuction to find the posistion of a substring for a UTF-8 returns zero select INSTR (‘НАСТРОЕние’, ‘P’) from foo-bar
        Hide
        Szehon Ho added a comment -

        Hi, I was going to look at this , but when I tried it looks like that your first P is a Cyrillic P (d0,a0), while the second is a English P (50). Can you verify? If you make the second a Cyrlilic P, than it works.

        Show
        Szehon Ho added a comment - Hi, I was going to look at this , but when I tried it looks like that your first P is a Cyrillic P (d0,a0), while the second is a English P (50). Can you verify? If you make the second a Cyrlilic P, than it works.
        Szehon Ho made changes -
        Field Original Value New Value
        Assignee Szehon Ho [ szehon ]
        Hide
        Clif Kranish added a comment -

        Sorry, copy/paste got me. They look the same. And sorry about the curly quotes, I don't know where they came from.

        The real issue is that for UTF-8 INSTR returns the position in bytes instead of characters. So this reutrns a 9 where by my count it should be a 5. Thank you for your support.

        select INSTR ('НАСТРОЕние', 'Р') from ....

        Show
        Clif Kranish added a comment - Sorry, copy/paste got me. They look the same. And sorry about the curly quotes, I don't know where they came from. The real issue is that for UTF-8 INSTR returns the position in bytes instead of characters. So this reutrns a 9 where by my count it should be a 5. Thank you for your support. select INSTR ('НАСТРОЕние', 'Р') from ....
        Hide
        Szehon Ho added a comment -

        This seems to work, lets see what folks think.

        Original code was trying to avoid encoding the bytes and just doing byte-counting, but not sure if that is possible when doing unicode char calculations.

        Show
        Szehon Ho added a comment - This seems to work, lets see what folks think. Original code was trying to avoid encoding the bytes and just doing byte-counting, but not sure if that is possible when doing unicode char calculations.
        Szehon Ho made changes -
        Attachment HIVE-6843.patch [ 12639052 ]
        Szehon Ho made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Szehon Ho made changes -
        Remote Link This issue links to "review board (Web Link)" [ 14827 ]
        Hide
        Hive QA added a comment -

        Overall: -1 at least one tests failed

        Here are the results of testing the latest attachment:
        https://issues.apache.org/jira/secure/attachment/12639052/HIVE-6843.patch

        ERROR: -1 due to 1 failed/errored test(s), 5549 tests executed
        Failed tests:

        org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_bucketizedhiveinputformat
        

        Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/2170/testReport
        Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/2170/console

        Messages:

        Executing org.apache.hive.ptest.execution.PrepPhase
        Executing org.apache.hive.ptest.execution.ExecutionPhase
        Executing org.apache.hive.ptest.execution.ReportingPhase
        Tests exited with: TestsFailedException: 1 tests failed
        

        This message is automatically generated.

        ATTACHMENT ID: 12639052

        Show
        Hive QA added a comment - Overall : -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12639052/HIVE-6843.patch ERROR: -1 due to 1 failed/errored test(s), 5549 tests executed Failed tests: org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_bucketizedhiveinputformat Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/2170/testReport Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/2170/console Messages: Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 1 tests failed This message is automatically generated. ATTACHMENT ID: 12639052
        Hide
        Jason Dere added a comment -

        Should this also work for unicode characters which require more than one Java character? If you add these checks to TestGenericUDFUtils, the 2nd check fails:

            Assert.assertEquals(3, GenericUDFUtils.findText(new Text("123\uD801\uDC00456"), new Text("\uD801\uDC00"), 0));
            Assert.assertEquals(4, GenericUDFUtils.findText(new Text("123\uD801\uDC00456"), new Text("4"), 0));
        

        This would require using String.codePointCount() on the indexOf() result.

        Show
        Jason Dere added a comment - Should this also work for unicode characters which require more than one Java character? If you add these checks to TestGenericUDFUtils, the 2nd check fails: Assert.assertEquals(3, GenericUDFUtils.findText( new Text( "123\uD801\uDC00456" ), new Text( "\uD801\uDC00" ), 0)); Assert.assertEquals(4, GenericUDFUtils.findText( new Text( "123\uD801\uDC00456" ), new Text( "4" ), 0)); This would require using String.codePointCount() on the indexOf() result.
        Hide
        Szehon Ho added a comment -

        Thanks for the review. As I understand, you are passing in a string literal to Text constructor, so it is not interpreting \uD801 as one char, so there is actually 5 chars there: '\', 'u', 'D', '8', '0', '1'.

        I tried the following test and it seemed to work:

        char[] chararray = new char[]

        {'1', '2', '3', '\uD801', '\uDC00', '4', '5', '6'}

        ;
        String str = new String(chararray);
        Assert.assertEquals(5, GenericUDFUtils.findText(new Text(str), new Text("4"), 0));

        I guess the second check was supposed to be 5, not 4.

        Show
        Szehon Ho added a comment - Thanks for the review. As I understand, you are passing in a string literal to Text constructor, so it is not interpreting \uD801 as one char, so there is actually 5 chars there: '\', 'u', 'D', '8', '0', '1'. I tried the following test and it seemed to work: char[] chararray = new char[] {'1', '2', '3', '\uD801', '\uDC00', '4', '5', '6'} ; String str = new String(chararray); Assert.assertEquals(5, GenericUDFUtils.findText(new Text(str), new Text("4"), 0)); I guess the second check was supposed to be 5, not 4.
        Hide
        Jason Dere added a comment -

        The string literal does interpret "\uD801" as a single character, and "\uD801\uDC00" as a single code point (got the example character from http://www.oracle.com/technetwork/articles/javase/supplementary-142654.html):

            String str1 = "123\uD801\uDC00456";
            System.out.println("str1 length=" + str1.length() + ", codePointCount=" + str1.codePointCount(0, str1.length()));
        
        str1 length=8, codePointCount=7
        

        So if we count things by unicode code points, the "4" would be at index 4 (for 0-based index).

        Show
        Jason Dere added a comment - The string literal does interpret "\uD801" as a single character, and "\uD801\uDC00" as a single code point (got the example character from http://www.oracle.com/technetwork/articles/javase/supplementary-142654.html): String str1 = "123\uD801\uDC00456"; System.out.println("str1 length=" + str1.length() + ", codePointCount=" + str1.codePointCount(0, str1.length())); str1 length=8, codePointCount=7 So if we count things by unicode code points, the "4" would be at index 4 (for 0-based index).
        Hide
        Jason Dere added a comment -

        From that link, "\uD801\uDC00" would be the representation for U+10400

        Show
        Jason Dere added a comment - From that link, "\uD801\uDC00" would be the representation for U+10400
        Hide
        Szehon Ho added a comment -

        Sorry Jason I missed that. Made the fix to handle supplementary characters (surrogate pairs), and updated the review board.

        Show
        Szehon Ho added a comment - Sorry Jason I missed that. Made the fix to handle supplementary characters (surrogate pairs), and updated the review board.
        Szehon Ho made changes -
        Attachment HIVE-6843.2.patch [ 12640922 ]
        Hide
        Jason Dere added a comment -

        I think it looks good, +1

        Show
        Jason Dere added a comment - I think it looks good, +1
        Hide
        Hive QA added a comment -

        Overall: -1 at least one tests failed

        Here are the results of testing the latest attachment:
        https://issues.apache.org/jira/secure/attachment/12640922/HIVE-6843.2.patch

        ERROR: -1 due to 1 failed/errored test(s), 5406 tests executed
        Failed tests:

        org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_auto_sortmerge_join_16
        

        Test results: http://bigtop01.cloudera.org:8080/job/precommit-hive/25/testReport
        Console output: http://bigtop01.cloudera.org:8080/job/precommit-hive/25/console

        Messages:

        Executing org.apache.hive.ptest.execution.PrepPhase
        Executing org.apache.hive.ptest.execution.ExecutionPhase
        Executing org.apache.hive.ptest.execution.ReportingPhase
        Tests exited with: TestsFailedException: 1 tests failed
        

        This message is automatically generated.

        ATTACHMENT ID: 12640922

        Show
        Hive QA added a comment - Overall : -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12640922/HIVE-6843.2.patch ERROR: -1 due to 1 failed/errored test(s), 5406 tests executed Failed tests: org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_auto_sortmerge_join_16 Test results: http://bigtop01.cloudera.org:8080/job/precommit-hive/25/testReport Console output: http://bigtop01.cloudera.org:8080/job/precommit-hive/25/console Messages: Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 1 tests failed This message is automatically generated. ATTACHMENT ID: 12640922
        Hide
        Jason Dere added a comment -

        Committed to trunk, thanks Szehon!

        Show
        Jason Dere added a comment - Committed to trunk, thanks Szehon!
        Jason Dere made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Fix Version/s 0.14.0 [ 12326450 ]
        Resolution Fixed [ 1 ]
        Hide
        Thejas M Nair added a comment -

        This has been fixed in 0.14 release. Please open new jira if you see any issues.

        Show
        Thejas M Nair added a comment - This has been fixed in 0.14 release. Please open new jira if you see any issues.
        Thejas M Nair made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Szehon Ho
            Reporter:
            Clif Kranish
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development