[IMPALA-6372] Dataload should execute Hive loads in parallel - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: Impala 2.12.0
Fix Version/s: Impala 3.0
Component/s: Infrastructure
Labels: None

Description

Currently, bin/load-data.py parallelizes parts of the Impala DDLs and the Impala inserts (for Parquet), but it does not parallelize the Hive portion of dataload. testdata/bin/generate-schema-statements.py generates only a single Hive SQL file, which load-data.py executes serially.

generate-schema-statements.py should generate multiple SQL files for the Hive load portion and load-data.py should execute them in parallel (once the base text tables have been loaded).
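
As a rough illustration of the proposal above, the following Python sketch splits per-table groups of Hive statements into several chunks (one chunk per generated SQL file) and runs the chunks concurrently. It is not the code in load-data.py or generate-schema-statements.py: the function names, the round-robin split, and the injected run_chunk callable are all assumptions made for illustration. It also assumes the groups are independent of each other and that the base text tables they read from are already loaded.

```python
from concurrent.futures import ThreadPoolExecutor


def split_round_robin(groups, num_chunks):
    """Distribute statement groups across up to num_chunks lists.

    Hypothetical helper: each group is assumed to hold the Hive statements
    for one table, so that groups can safely run independently.
    Empty chunks are dropped, so fewer than num_chunks lists may be returned.
    """
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    chunks = [[] for _ in range(num_chunks)]
    for i, group in enumerate(groups):
        chunks[i % num_chunks].append(group)
    return [chunk for chunk in chunks if chunk]


def run_chunks_in_parallel(chunks, run_chunk, max_workers=4):
    """Call run_chunk(chunk) for every chunk on a thread pool.

    run_chunk is supplied by the caller (for example, a function that writes
    the chunk to a SQL file and hands it to a Hive client). Results come back
    in chunk order; an exception raised by any chunk propagates to the caller.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_chunk, chunks))


if __name__ == "__main__":
    # Placeholder group names; a real run would carry SQL text per table.
    table_groups = ["table_a", "table_b", "table_c", "table_d", "table_e"]
    chunks = split_round_robin(table_groups, 2)
    # Stub runner that only reports how many groups each chunk holds.
    print(run_chunks_in_parallel(chunks, len, max_workers=2))
```

Threads are sufficient in this sketch because the workers would spend their time waiting on an external Hive client rather than doing CPU work in Python.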

Even when TPC-H, functional, and TPC-DS are loaded in parallel, this will still speed up dataload (and GVO).

People

Assignee: Joe McDonnell

Reporter: Joe McDonnell

Votes: 0

Watchers: 3

Dates

Created: 06/Jan/18 01:27

Updated: 10/May/18 21:48

Resolved: 16/Apr/18 18:30