Affects Version/s: Impala 3.4.0
Fix Version/s: None
While working on better support for large SQL queries in IMPALA-9414, I found that impala-shell is very slow at processing large queries.
To test this, I generated a sql file of 1MB that refers to a non-existent table (so that the time to run the query would be negligible). Running this query file with impala-shell on my local machine takes about 20s, of which about 13s are spent in parse_query_text(), which uses some sqlparse functions to try to split the query text into multiple queries.
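A test file along these lines can be generated with a few lines of Python. This is only a sketch: the helper name `make_large_query`, the table name `nonexistent_table`, the column naming scheme, and the output file name are made up for illustration, and this is not necessarily how the original file was produced.

```python
def make_large_query(target_bytes=1024 * 1024, table="nonexistent_table"):
    """Build one SELECT statement of roughly target_bytes characters.

    The table is assumed not to exist, so the server should reject the
    query quickly and most of the wall-clock time is spent in the shell.
    """
    prefix = "select "
    suffix = " from {0}".format(table)
    columns = []
    size = len(prefix) + len(suffix)
    i = 0
    while size < target_bytes:
        col = "col_{0}".format(i)
        columns.append(col)
        size += len(col) + 2  # +2 for the ", " separator
        i += 1
    return prefix + ", ".join(columns) + suffix


if __name__ == "__main__":
    # Hypothetical output file name.
    with open("large_query.sql", "w") as f:
        f.write(make_large_query())
```

The resulting file can then be run with something like `time impala-shell -f large_query.sql` to observe the overhead.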
This seems like an unreasonable overhead and could definitely be improved. Some ideas for how to do that:
1. Be more clever with our use of sqlparse to get better perf. This probably has limited value (eg. strip_comments() already tries to be very clever but is still pretty slow)
2. Find a different python library for sql parsing that is faster (this may not exist).
3. Add some C++ into the shell instead of always doing everything in pure python (not sure how easy/convenient this is to integrate with the shell packaging)
4. Try to write our own sql parsing code, which could be optimized for the small number of things we actually need, eg. we don't need full tokenization, just splitting of multiple queries (likely to be bug-prone)
5. Do some simple hacks, such as skipping the query splitting entirely if there isn't a ';' in the query text (this would leave some unfortunate perf cliffs, eg. add a ';' to a string literal in your query and suddenly everything gets a lot slower)
6. Add an interface in Impala that allows submitting of multiple queries at once, eg ExecuteStatements(), which returns a list of query_ids. (might be a lot of work to modify impala-server, the parser, etc. to support this)
7. Add an interface in Impala that allows submitting of query text, then parses it and returns it in split form without actually executing it, which would limit the amount of changes needed vs. option 6
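To make options 4 and 5 more concrete, here is a rough sketch of a hand-rolled splitter. It is not impala-shell code, and `split_statements` and `_skip_quoted` are hypothetical names. It assumes that only string literals (single or double quoted, with backslash escapes), backtick-quoted identifiers, `--` line comments and `/* */` block comments can hide a ';'. It has not been checked against Impala's actual grammar and has not been benchmarked. It takes the option 5 fast path when there is no ';' at all, and otherwise uses a regex to jump between the few characters that matter, so a ';' inside a literal should not trigger a full tokenization.

```python
import re

# Characters/sequences that can change how a ';' must be interpreted.
_SPECIAL = re.compile(r"""['"`;]|--|/\*""")


def _skip_quoted(text, pos, quote):
    """Return the index just past the closing quote, or len(text) if unterminated."""
    n = len(text)
    while pos < n:
        ch = text[pos]
        if ch == "\\":
            pos += 2  # skip the escaped character
        elif ch == quote:
            return pos + 1
        else:
            pos += 1
    return n


def split_statements(text):
    """Split text on ';' outside of quotes and comments (sketch only)."""
    # Fast path (option 5): nothing to split.
    if ";" not in text:
        stripped = text.strip()
        return [stripped] if stripped else []

    statements = []
    start = 0  # start of the current statement
    pos = 0    # current scan position
    n = len(text)
    while True:
        m = _SPECIAL.search(text, pos)
        if m is None:
            break
        tok = m.group()
        if tok == ";":
            stmt = text[start:m.start()].strip()
            if stmt:
                statements.append(stmt)
            start = pos = m.end()
        elif tok == "--":
            end = text.find("\n", m.end())
            pos = n if end == -1 else end + 1
        elif tok == "/*":
            end = text.find("*/", m.end())
            pos = n if end == -1 else end + 2
        else:
            pos = _skip_quoted(text, m.end(), tok)

    tail = text[start:].strip()
    if tail:
        statements.append(tail)
    return statements
```

For example, `split_statements("select 'a;b'; select 2")` yields two statements, with the ';' inside the literal left alone. Comments are kept in the statement text rather than stripped, so this covers only the splitting part of what parse_query_text() does, and the bug-proneness concern from option 4 still applies.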