[IMPALA-9436] impala-shell is very slow for large query text sizes - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Critical
Resolution: Unresolved
Affects Version/s: Impala 3.4.0
Fix Version/s: None
Component/s: Clients
Labels: None

Description

While working on better support for large SQL queries in IMPALA-9414, I found that impala-shell is very slow at processing large query text.

To test this, I generated a 1MB SQL file that refers to a non-existent table (so that the time to run the query itself would be negligible). Running this query file with impala-shell on my local machine takes about 20s, of which about 13s are spent in parse_query_text(), which uses some sqlparse functions to try to split the query text into multiple queries.

This seems like an unreasonable overhead and could definitely be improved. Some ideas for how to do that:
1. Be more clever with our use of sqlparse to get better perf. This probably has limited value (e.g. strip_comments() already tries to be very clever but is still pretty slow).
2. Find a different Python library for SQL parsing that is faster (this may not exist).
3. Add some C++ to the shell instead of always doing everything in pure Python (not sure how easy/convenient this is to integrate with the shell packaging).
4. Try to write our own SQL parsing code, which could be optimized for the small number of things we actually need, e.g. we don't need full tokenization, just splitting of multiple queries (likely to be bug-prone).
5. Do some simple hacks, such as skipping the query splitting entirely if there isn't a ';' in the query text (this would leave some unfortunate perf cliffs, e.g. add a ';' to a string literal in your query and suddenly everything gets a lot slower).
6. Add an interface in Impala that allows submitting multiple queries at once, e.g. ExecuteStatements(), which returns a list of query_ids (might be a lot of work to modify impala-server, the parser, etc. to support this).
7. Add an interface in Impala that accepts query text, parses it, and returns it in split form without actually executing it, which would limit the amount of changes needed vs. option 6.
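
To make ideas 4 and 5 more concrete, here is a rough sketch of what a purpose-built splitter might look like. It is an illustration only, not impala-shell code: `split_statements` is a hypothetical name, and the handling of quotes (single, double, backtick, with backslash escapes) and comments (`--` and `/* */`) is an assumption about what would need to be covered, not a complete description of Impala's lexical rules.

```python
def split_statements(text):
    """Split SQL text on top-level ';' characters in a single pass.

    Semicolons inside quoted literals, `--` line comments and
    /* */ block comments are not treated as separators.
    """
    # Fast path (idea 5): nothing to split if there is no ';' at all.
    if ';' not in text:
        stripped = text.strip()
        return [stripped] if stripped else []

    statements = []
    start = 0  # index where the current statement begins
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if ch in ("'", '"', '`'):
            # Skip a quoted region; a backslash escapes the next character.
            quote = ch
            i += 1
            while i < n and text[i] != quote:
                i += 2 if text[i] == '\\' else 1
            i += 1  # step past the closing quote
        elif text.startswith('--', i):
            # A line comment runs to the end of the line.
            end = text.find('\n', i)
            i = n if end == -1 else end + 1
        elif text.startswith('/*', i):
            # A block comment runs to the matching '*/'.
            end = text.find('*/', i + 2)
            i = n if end == -1 else end + 2
        elif ch == ';':
            stmt = text[start:i].strip()
            if stmt:
                statements.append(stmt)
            i += 1
            start = i
        else:
            i += 1
    tail = text[start:].strip()
    if tail:
        statements.append(tail)
    return statements
```

Since the slow path only scans the text once, a ';' inside a string literal no longer switches to a fundamentally different code path, which softens the perf cliff mentioned in idea 5. The bug-proneness concern from idea 4 still applies: every quoting or comment rule missed here would be a correctness bug in the shell.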

Issue Links

relates to

IMPALA-7939 Impala shell not displaying results for a CTE query. (Resolved)

IMPALA-7259 impala-shell is weirdly slow with some large queries (Resolved)

IMPALA-9501 Upgrade sqlparse to a version that supports python 3.0 (Resolved)

People

Assignee: Unassigned

Reporter: Thomas Tauber-Marshall

Votes: 0

Watchers: 4

Dates

Created: 27/Feb/20 21:34

Updated: 13/Mar/20 16:26