IMPALA / IMPALA-6372

Dataload should execute Hive loads in parallel

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Impala 2.12.0
    • Fix Version/s: Impala 3.0
    • Component/s: Infrastructure
    • Labels:
      None

      Description

      Currently, bin/load-data.py parallelizes parts of the Impala DDLs and the Impala inserts (for Parquet), but not the Hive portion of dataload. testdata/bin/generate-schema-statements.py generates only a single Hive SQL file, which load-data.py executes serially.

      generate-schema-statements.py should generate multiple SQL files for the Hive load portion, and load-data.py should execute them in parallel once the base text tables have been loaded.

      Even with the TPC-H, functional, and TPC-DS workloads already loading in parallel with one another, this will still speed up dataload (and GVO).
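
      The proposal above could be sketched roughly as follows. This is only an illustration of the split-then-run-in-parallel idea, not the actual change made to generate-schema-statements.py or load-data.py: the function names, the file naming scheme, the round-robin split, and the run_file callback (which stands in for however a Hive SQL file gets executed) are all assumptions. It also assumes the statements are independent of each other once the base text tables exist.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def write_sql_files(statements, out_dir, num_files):
    """Distribute statements round-robin across up to num_files SQL files.

    Hypothetical helper: returns the list of file paths that were written.
    """
    if not statements:
        return []
    # Never create more files than there are statements.
    num_files = max(1, min(num_files, len(statements)))
    buckets = [[] for _ in range(num_files)]
    for i, stmt in enumerate(statements):
        buckets[i % num_files].append(stmt)

    paths = []
    for i, bucket in enumerate(buckets):
        # The file name pattern is an assumption made for this sketch.
        path = os.path.join(out_dir, "load-hive-%d.sql" % i)
        with open(path, "w") as f:
            f.write(";\n".join(bucket) + ";\n")
        paths.append(path)
    return paths


def run_in_parallel(paths, run_file, max_workers=4):
    """Call run_file(path) for each path using a pool of worker threads.

    run_file is a caller-supplied callable (hypothetical) that executes one
    SQL file. Results come back in the same order as paths; if any call
    raises, the first exception propagates to the caller.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_file, paths))
```

      Threads are a reasonable fit here because each worker would mostly wait on an external process or service rather than compute in Python. The base text tables would still need to be loaded before run_in_parallel is called.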


            People

            • Assignee:
              Joe McDonnell (joemcdonnell)
            • Reporter:
              Joe McDonnell (joemcdonnell)
            • Votes:
              0
            • Watchers:
              3
