IMPALA-6372

Dataload should execute Hive loads in parallel

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Impala 2.12.0
    • Impala 3.0
    • Component/s: Infrastructure
    • None

    Description

      Currently, bin/load-data.py parallelizes parts of the Impala DDL and the Impala inserts (for Parquet), but not the Hive portion of dataload. testdata/bin/generate-schema-statements.py generates only a single Hive SQL file, which load-data.py executes serially.

      generate-schema-statements.py should instead generate multiple SQL files for the Hive load portion, and load-data.py should execute them in parallel once the base text tables have been loaded.
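
      A rough sketch of the proposed flow, not the actual load-data.py code: split the Hive statements across several files, run the base text table loads first, then run the remaining Hive files concurrently. The names split_statements, load_in_parallel, and run_with_beeline are hypothetical, and invoking Hive through `beeline -u ... -f ...` is an assumption about how a file would be executed.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def split_statements(statements, num_files):
    """Distribute independent SQL statements round-robin into up to num_files groups."""
    n = min(num_files, len(statements))
    if n <= 0:
        return []
    groups = [[] for _ in range(n)]
    for i, stmt in enumerate(statements):
        groups[i % n].append(stmt)
    return groups


def run_with_beeline(path, jdbc_url):
    """Hypothetical runner: execute one SQL file with the beeline CLI."""
    # check_call raises CalledProcessError if beeline exits with a non-zero status.
    subprocess.check_call(["beeline", "-u", jdbc_url, "-f", path])


def load_in_parallel(base_files, hive_files, run_file, max_workers=4):
    """Run base_files serially, then run hive_files concurrently."""
    # The base text tables must exist before the dependent Hive loads start.
    for path in base_files:
        run_file(path)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for every file and re-raises the first failure, if any.
        list(pool.map(run_file, hive_files))
```

      The runner is passed in as a parameter so the ordering logic can be exercised without a Hive cluster. Round-robin splitting assumes the statements placed in different files do not depend on each other.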

      Even though the TPC-H, functional, and TPC-DS loads already run in parallel with one another, this will still speed up dataload (and GVO).

          People

            Assignee: Joe McDonnell (joemcdonnell)
            Reporter: Joe McDonnell (joemcdonnell)
            Votes: 0
            Watchers: 3
