IMPALA-1618 (sub-task of IMPALA-4268: Rework coordinator buffering to buffer more data)

Impala server should always try to fulfill requested fetch size

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: Impala 2.0.1
    • Fix Version/s: Impala 3.4.0
    • Component/s: Backend

    Description

      The Thrift fetch request specifies the number of rows the client would like, but the Impala server may return fewer rows even though more results are available.

      For example, with the default row batch size (batch_size) of 1024, if the client requests 1023 rows per fetch, the first response contains 1023 rows but the second contains only 1 row. This is because the server internally produces a row batch of 1024 rows, returns the requested 1023 and caches the remaining row; the next fetch is then served from the cache alone.

      In general the end user should set both the row batch size and the Thrift fetch size. In practice, the query writer who sets the batch size and the driver author or programmer who sets the fetch size are often different people.

      One case already works fine, though: setting the batch size to less than the Thrift fetch size. In that case each Thrift response always contains exactly batch size rows.
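The behavior described above can be sketched as a small simulation. This is only a model built from the description in this issue, not Impala's actual server code; the function name `simulate_fetches` and its parameters are hypothetical and exist only for illustration.

```python
def simulate_fetches(total_rows, batch_size, fetch_size):
    """Return the row count of each fetch response under the described behavior.

    Assumed model (taken from this issue's description, not from Impala's source):
    - if rows are cached from a previous batch, the fetch is served from the
      cache only, even when more results are available;
    - otherwise the server produces one row batch, returns up to fetch_size
      rows from it, and caches the rest.
    """
    remaining = total_rows  # rows the query has not produced yet
    cached = 0              # rows left over from the previous batch
    responses = []
    while remaining > 0 or cached > 0:
        if cached > 0:
            # Serve from the cache alone; no new batch is pulled.
            returned = min(cached, fetch_size)
            cached -= returned
        else:
            # Pull one batch and return as much of it as was requested.
            batch = min(batch_size, remaining)
            remaining -= batch
            returned = min(batch, fetch_size)
            cached = batch - returned
        responses.append(returned)
    return responses


if __name__ == "__main__":
    # Fetch size one row smaller than the batch size: responses alternate.
    print(simulate_fetches(4096, 1024, 1023))
    # Batch size smaller than the fetch size: every response is one batch.
    print(simulate_fetches(2048, 512, 1024))
```

Under this model, `len(simulate_fetches(...))` is the number of fetch round trips, so a fetch size just below the batch size roughly doubles the number of RPCs compared with a server that fills each response.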

      Code example:

      dev@localhost:~/impyla$ git diff
      diff --git a/impala/_rpc/hiveserver2.py b/impala/_rpc/hiveserver2.py
      index 6139002..31fdab7 100644
      --- a/impala/_rpc/hiveserver2.py
      +++ b/impala/_rpc/hiveserver2.py
      @@ -265,6 +265,7 @@ def fetch_results(service, operation_handle, hs2_protocol_version, schema=None,
           req = TFetchResultsReq(operationHandle=operation_handle,
                                  orientation=orientation,
                                  maxRows=max_rows)
      +    print("req: " + str(max_rows))
           resp = service.FetchResults(req)
           err_if_rpc_not_ok(resp)
       
      @@ -273,6 +274,7 @@ def fetch_results(service, operation_handle, hs2_protocol_version, schema=None,
                        for (i, col) in enumerate(resp.results.columns)]
               num_cols = len(tcols)
               num_rows = len(tcols[0].values)
      +        print("rec: " + str(num_rows))
               rows = []
               for i in xrange(num_rows):
                   row = []
      
      
      dev@localhost:~/impyla$ cat test.py 
      from impala.dbapi import connect
      
      conn = connect()
      cur = conn.cursor()
      cur.set_arraysize(1024)
      cur.execute("set batch_size=1025")
      cur.execute("select * from tpch.lineitem")
      while True:
          rows = cur.fetchmany()
          if not rows:
              break
      
      cur.close()
      conn.close()
      
      
      dev@localhost:~/impyla$ python test.py | head
      Failed to import pandas
      req: 1024
      rec: 1024
      req: 1024
      rec: 1
      req: 1024
      rec: 1024
      req: 1024
      rec: 1
      req: 1024
      

            People

              Assignee: Unassigned
              Reporter: casey (caseyc)