Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
Impala 2.12.0
-
None
Description
Currently, bin/load-data.py parallelizes parts of the Impala DDL and the Impala inserts (for Parquet), but it does not parallelize the Hive portion of dataload. testdata/bin/generate-schema-statements.py generates only a single Hive SQL file, which load-data.py executes serially.
generate-schema-statements.py should generate multiple SQL files for the Hive load portion and load-data.py should execute them in parallel (once the base text tables have been loaded).
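A minimal sketch of the proposed flow, for illustration only. It is not the actual code of either script: split_statements, write_sql_files, run_files_in_parallel, and the load-hive file prefix are hypothetical names, and run_file stands in for whatever load-data.py uses to execute a Hive SQL file. It assumes each entry in statements is a self-contained unit of work (for example, all statements for one table) that can run independently once the base text tables are loaded.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split_statements(statements, num_files):
    """Distribute independent statement blocks round-robin into groups.

    Hypothetical helper: assumes the blocks do not depend on each other.
    """
    num_files = max(1, min(num_files, len(statements)))
    groups = [[] for _ in range(num_files)]
    for i, stmt in enumerate(statements):
        groups[i % num_files].append(stmt)
    return groups


def write_sql_files(groups, out_dir, prefix="load-hive"):
    """Write one SQL file per group and return the file paths in order."""
    paths = []
    for i, group in enumerate(groups):
        path = os.path.join(out_dir, "%s-%d.sql" % (prefix, i))
        with open(path, "w") as f:
            # Terminate every statement block with a semicolon.
            f.write(";\n".join(group) + ";\n")
        paths.append(path)
    return paths


def run_files_in_parallel(paths, run_file, num_workers=4):
    """Call run_file(path) for every path concurrently.

    run_file is a caller-supplied callable (an assumption of this sketch).
    Results come back in the same order as paths; an exception raised by
    any run_file call is re-raised here.
    """
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(run_file, paths))
```

In this sketch the caller would first load the base text tables serially, then pass the generated file paths to run_files_in_parallel together with a run_file callable that invokes the Hive client on one file.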
Even with parallel execution of TPC-H, functional, and TPC-DS, this will still deliver speedups for dataload (and GVO).