[IMPALA-1618] Impala server should always try to fulfill requested fetch size

Details

Type: Sub-task
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: Impala 2.0.1
Fix Version/s: Impala 3.4.0
Component/s: Backend
Labels: usability
Target Version: Product Backlog

Description

The Thrift fetch request specifies how many rows the client would like, but the Impala server may return fewer even though more results are available.

For example, with the default row batch size of 1024, if the client requests 1023 rows per fetch, the first response contains 1023 rows but the second response contains only 1 row. This happens because the server internally produces one row batch (1024 rows), returns the requested count (1023), and caches the remaining row. The next fetch is then served from the cache alone, without pulling another batch.
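The pattern above can be reproduced with a small model. This is only a sketch of the behaviour as described in this report, not Impala's backend code; the function name simulate_fetches and its parameters are made up for illustration.

```python
def simulate_fetches(total_rows, batch_size, fetch_size):
    """Return the size of each fetch response under the behaviour described above.

    Hypothetical model: the server pulls one row batch only when its cache is
    empty, and otherwise answers from the cache without topping it up.
    """
    remaining = total_rows  # rows the query has not produced yet
    cached = 0              # rows left over from the last row batch
    sizes = []
    while remaining > 0 or cached > 0:
        if cached == 0:
            # Cache is empty: pull exactly one row batch from the query.
            cached = min(batch_size, remaining)
            remaining -= cached
        # Answer from the cache only, even if it holds fewer rows than requested.
        returned = min(fetch_size, cached)
        cached -= returned
        sizes.append(returned)
    return sizes


if __name__ == "__main__":
    print(simulate_fetches(4096, 1024, 1023))
```

With 4096 rows, a batch size of 1024 and a fetch size of 1023, the model alternates between responses of 1023 rows and 1 row, which is the pattern reported here.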

In general, avoiding this requires the end user to set both the row batch size and the Thrift request size. In practice, the query writer who sets the batch size and the driver author or programmer who sets the fetch size are often different people.

One case already works today: setting the batch size lower than the Thrift request size. In that case the number of rows in each Thrift response always equals the batch size.
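For contrast, the behaviour requested in this issue could look roughly like the sketch below: keep pulling row batches until the requested fetch size is filled or the results run out. This is an illustration only and does not reflect how the fix was implemented in the Impala backend; fetch_filling and make_batches are hypothetical helpers, and fetch_size is assumed to be positive.

```python
def make_batches(total_rows, batch_size):
    """Split row ids 0..total_rows-1 into row batches (hypothetical helper)."""
    rows = list(range(total_rows))
    return [rows[i:i + batch_size] for i in range(0, total_rows, batch_size)]


def fetch_filling(batches, fetch_size):
    """Yield responses of exactly fetch_size rows until the input is exhausted.

    Leftover rows are carried over into the next response instead of being
    returned on their own. Assumes fetch_size > 0.
    """
    buffer = []
    for batch in batches:
        buffer.extend(batch)
        # Emit as many full responses as the buffered rows allow.
        while len(buffer) >= fetch_size:
            yield buffer[:fetch_size]
            buffer = buffer[fetch_size:]
    if buffer:
        # Only the final response may be short.
        yield buffer


if __name__ == "__main__":
    responses = fetch_filling(make_batches(4100, 1025), 1024)
    print([len(r) for r in responses])
```

With a batch size of 1025 and a fetch size of 1024, every response except the last holds 1024 rows, instead of alternating between 1024 rows and 1 row.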

Code example:

dev@localhost:~/impyla$ git diff
diff --git a/impala/_rpc/hiveserver2.py b/impala/_rpc/hiveserver2.py
index 6139002..31fdab7 100644
--- a/impala/_rpc/hiveserver2.py
+++ b/impala/_rpc/hiveserver2.py
@@ -265,6 +265,7 @@ def fetch_results(service, operation_handle, hs2_protocol_version, schema=None,
     req = TFetchResultsReq(operationHandle=operation_handle,
                            orientation=orientation,
                            maxRows=max_rows)
+    print("req: " + str(max_rows))
     resp = service.FetchResults(req)
     err_if_rpc_not_ok(resp)
 
@@ -273,6 +274,7 @@ def fetch_results(service, operation_handle, hs2_protocol_version, schema=None,
                  for (i, col) in enumerate(resp.results.columns)]
         num_cols = len(tcols)
         num_rows = len(tcols[0].values)
+        print("rec: " + str(num_rows))
         rows = []
         for i in xrange(num_rows):
             row = []


dev@localhost:~/impyla$ cat test.py 
from impala.dbapi import connect

conn = connect()
cur = conn.cursor()
cur.set_arraysize(1024)
cur.execute("set batch_size=1025")
cur.execute("select * from tpch.lineitem")
while True:
    rows = cur.fetchmany()
    if not rows:
        break

cur.close()
conn.close()


dev@localhost:~/impyla$ python test.py | head
Failed to import pandas
req: 1024
rec: 1024
req: 1024
rec: 1
req: 1024
rec: 1024
req: 1024
rec: 1
req: 1024

Issue Links

is duplicated by

IMPALA-1790 FetchResults() sometimes returns very few results (Resolved)

IMPALA-3015 Thrift buffer size not honored when retrieving data from Impala (Resolved)

is related to

IMPALA-4268 Rework coordinator buffering to buffer more data (Resolved)

relates to

IMPALA-8819 BufferedPlanRootSink should handle non-default fetch sizes (Resolved)

People

Assignee: Unassigned

Reporter: casey

Votes: 0

Watchers: 6

Dates

Created: 17/Dec/14 21:44

Updated: 14/May/20 17:44

Resolved: 29/Aug/19 20:19