THRIFT-5303

Unicode decode errors in _fast_decode


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.11.0
    • Fix Version/s: 0.14.0
    • Component/s: Python - Library
    • Labels: None
    • Environment: Ubuntu 16.04.6 LTS

    Description

      Impala currently uses thrift-0.11.0 on the client side and thrift-0.9.3 on the server side (the server-side upgrade is blocked by other issues). We hit a failure on the client side when decoding UTF-8 bytes: the result ends in a partial UTF-8 code point, and Thrift does not handle the resulting error gracefully. The stack trace:

      Traceback (most recent call last):
        File "/home/quanlong/workspace/Impala/shell/impala_client.py", line 1210, in _do_beeswax_rpc
          ret = rpc()
        File "/home/quanlong/workspace/Impala/shell/impala_client.py", line 1113, in <lambda>
          self.fetch_size))
        File "/home/quanlong/workspace/Impala/shell/build/thrift-11-gen/gen-py/beeswaxd/BeeswaxService.py", line 254, in fetch
          return self.recv_fetch()
        File "/home/quanlong/workspace/Impala/shell/build/thrift-11-gen/gen-py/beeswaxd/BeeswaxService.py", line 275, in recv_fetch
          result.read(iprot)
        File "/home/quanlong/workspace/Impala/shell/build/thrift-11-gen/gen-py/beeswaxd/BeeswaxService.py", line 1410, in read
          iprot._fast_decode(self, iprot, [self.__class__, self.thrift_spec])
      UnicodeDecodeError: 'utf8' codec can't decode byte 0xe6 in position 3: unexpected end of data 
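
The last line of the traceback can be reproduced in plain Python, without Thrift. The sketch below is only an illustration: the payload is invented (the bytes actually returned by the server are not shown in this report), chosen so that a 3-byte character is cut off after its first byte, 0xe6, at offset 3.

```python
# Illustrative payload: "abc" followed by only the first byte of a 3-byte character.
payload = b"abc\xe6"

try:
    payload.decode("utf-8")  # strict error handling is the default
    error = None
except UnicodeDecodeError as exc:
    error = exc

if error is not None:
    # Offset of the offending byte and the codec's explanation.
    print(error.start, error.reason)
```

The exception carries the offset of the offending byte and the reason, which match the "position 3" and "unexpected end of data" parts of the message above.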

      This issue is similar to THRIFT-2087, but here the error occurs at the boundary between the Python and C++ code. As in THRIFT-2087, TBinaryProtocolAccelerated._fast_decode needs a defined error-handling behavior for decoding UTF-8 bytes. The relevant code is at https://github.com/apache/thrift/blob/0.11.0/lib/py/src/ext/protocol.tcc#L708:

        case T_STRING: {
          char* buf = NULL;
          int len = impl()->readString(&buf);
          if (len < 0) {
            return NULL;
          }
          if (isUtf8(typeargs)) {
            // Passing 0 (NULL) as `errors` selects strict decoding, so invalid
            // UTF-8 raises UnicodeDecodeError. An error handling mode is needed here.
            return PyUnicode_DecodeUTF8(buf, len, 0);
          } else {
            return PyBytes_FromStringAndSize(buf, len);
          }
        }
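
The third argument of PyUnicode_DecodeUTF8 names a codec error handler, the same set of handlers that the `errors` parameter of Python's `bytes.decode` accepts. As a rough sketch of what the candidate handlers would do with a truncated value, the snippet below compares three of them in pure Python. The input bytes and the `decode_with` helper are hypothetical, made up for this illustration, and it does not show which handler (if any) Thrift ended up using.

```python
# Illustrative value: "abc" plus a 3-byte character, cut off after its first byte.
truncated = "abc\u6c49".encode("utf-8")[:4]  # b"abc\xe6"


def decode_with(errors):
    """Decode `truncated` with the given handler; return the text or the exception name."""
    try:
        return truncated.decode("utf-8", errors=errors)
    except UnicodeDecodeError as exc:
        return type(exc).__name__


# One entry per error handler: the decoded text, or the name of the exception raised.
results = {mode: decode_with(mode) for mode in ("strict", "replace", "ignore")}
for mode, outcome in results.items():
    print(mode, repr(outcome))
```

With "strict" the decode fails as in the traceback; "replace" substitutes U+FFFD for the incomplete sequence; "ignore" silently drops it. Which trade-off is acceptable depends on whether callers prefer a visible marker or a hard failure for malformed data.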
      

          People

            Assignee: Quanlong Huang (stigahuang)
            Reporter: Quanlong Huang (stigahuang)