Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-4436

StringValue::StringCompare bug for non-ascii string

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Impala 2.8.0
    • Fix Version/s: Impala 2.8.0
    • Component/s: Backend
    • Labels:

      Description

      StringCompare has a huge bug when string contains non-ascii char (>127,for example: Chinese )。

      Here is my test code:

        std::string a = "\xE9 00000000000000000";                                                                            
        std::string b = "\x32 00000000000000000";                                                                            
        std::string c = "\xE5 00";
        // return -73, a < b
        printf("%d\n", StringCompare(a.c_str(), a.size(), b.c_str(), b.size(), std::min(a.size(), b.size()))); 
       // return -179, b < c
        printf("%d\n", StringCompare(b.c_str(), b.size(), c.c_str(), c.size(), std::min(b.size(), c.size())));        
        // return 4, a > c ??  
        printf("%d\n", StringCompare(a.c_str(), a.size(), c.c_str(), c.size(), std::min(a.size(), c.size())));  
      

      Here is the function StringCompare in string-value.inline.h:

      static inline int StringCompare(const char* s1, int n1, const char* s2, int n2, int len) {
        DCHECK_EQ(len, std::min(n1, n2));
        if (CpuInfo::IsSupported(CpuInfo::SSE4_2)) {
          while (len >= SSEUtil::CHARS_PER_128_BIT_REGISTER) {
            __m128i xmm0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s1));
            __m128i xmm1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s2));
            int chars_match = SSE4_cmpestri<SSEUtil::STRCMP_MODE>(xmm0,
                SSEUtil::CHARS_PER_128_BIT_REGISTER, xmm1,
                SSEUtil::CHARS_PER_128_BIT_REGISTER);
            if (chars_match != SSEUtil::CHARS_PER_128_BIT_REGISTER) {
              // BUG: use char perform subtraction
              return s1[chars_match] - s2[chars_match]
            }
            len -= SSEUtil::CHARS_PER_128_BIT_REGISTER;
            s1 += SSEUtil::CHARS_PER_128_BIT_REGISTER;
            s2 += SSEUtil::CHARS_PER_128_BIT_REGISTER;
          }
        }
        // TODO: for some reason memcmp is way slower than strncmp (2.5x)  why?
        int result = strncmp(s1, s2, len);
        if (result != 0) return result;
        return n1 - n2;
      }
      

      but in the code of strncmp:

      int
      strncmp(const char *s1, const char *s2, register size_t n)
      {
        register unsigned char u1, u2;
      
        while (n-- > 0)
          {
            u1 = (unsigned char) *s1++;
            u2 = (unsigned char) *s2++;
            if (u1 != u2)
              // here use unsigned char to perform subtraction
      	return u1 - u2;
            if (u1 == '\0')
      	return 0;
          }
        return 0;
      }
      

        Issue Links

          Activity

          Hide
          fish0515_impala_49b1 fishing added a comment -

          same cases with 4290

          Show
          fish0515_impala_49b1 fishing added a comment - same cases with 4290
          Hide
          dhecht Dan Hecht added a comment -

          commit 6937fa9a4caa9039a95d661ddf209777caa805f5
          Author: Dan Hecht <dhecht@cloudera.com>
          Date: Mon Nov 14 16:47:39 2016 -0800

          IMPALA-4436: StringValue::StringCompare() should match strncmp()

          According to the C standard, strncmp() interprets characters as
          unsigned, whereas StringCompare() uses char (which happens to be
          signed). This means that for values greater than 127, they don't give
          the same result (which is especially bad considering StringCompare()
          falls back to strncmp(), and so the answer depends on the mismatched
          position).

          Fix StringCompare() to interpret as unsigned char.

          Change-Id: Ic0750f98d8c5ef7d0c0ea279cd1f80b4acbad1be
          Reviewed-on: http://gerrit.cloudera.org:8080/5083
          Reviewed-by: Dan Hecht <dhecht@cloudera.com>
          Tested-by: Internal Jenkins

          Show
          dhecht Dan Hecht added a comment - commit 6937fa9a4caa9039a95d661ddf209777caa805f5 Author: Dan Hecht <dhecht@cloudera.com> Date: Mon Nov 14 16:47:39 2016 -0800 IMPALA-4436 : StringValue::StringCompare() should match strncmp() According to the C standard, strncmp() interprets characters as unsigned, whereas StringCompare() uses char (which happens to be signed). This means that for values greater than 127, they don't give the same result (which is especially bad considering StringCompare() falls back to strncmp(), and so the answer depends on the mismatched position). Fix StringCompare() to interpret as unsigned char. Change-Id: Ic0750f98d8c5ef7d0c0ea279cd1f80b4acbad1be Reviewed-on: http://gerrit.cloudera.org:8080/5083 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins

            People

            • Assignee:
              dhecht Dan Hecht
              Reporter:
              cfreely Fu Lili
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development