[IMPALA-9777] Reduce the diskspace requirements of loading the text version of tpcds.store_sales - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: Impala 4.0.0
Fix Version/s: Impala 4.0.0
Component/s: Infrastructure
Labels:
None

Target Version:

Impala 4.0.0
Epic Color:
ghx-label-10

Description

Currently, dataload for the Impala development environment uses Hive to populate tpcds.store_sales. We use several insert statements that select from tpcds.stores_sales_unpartitioned, which is loaded from text files. The inserts have this form:

insert overwrite table {table_name} partition(ss_sold_date_sk)
select ss_sold_time_sk,
  ss_item_sk,
  ss_customer_sk,
  ss_cdemo_sk,
  ss_hdemo_sk,
  ss_addr_sk,
  ss_store_sk,
  ss_promo_sk,
  ss_ticket_number,
  ss_quantity,
  ss_wholesale_cost,
  ss_list_price,
  ss_sales_price,
  ss_ext_discount_amt,
  ss_ext_sales_price,
  ss_ext_wholesale_cost,
  ss_ext_list_price,
  ss_ext_tax,
  ss_coupon_amt,
  ss_net_paid,
  ss_net_paid_inc_tax,
  ss_net_profit,
  ss_sold_date_sk
from store_sales_unpartitioned
WHERE ss_sold_date_sk < 2451272
distribute by ss_sold_date_sk;

Since this is inserting into a partitioned table, it is creating a file per partition. Each statement manipulates hundreds of partitions. With the current settings, the Hive implementation of this insert opens several hundred files simultaneously (by my measurement, ~450). HDFS reserves a whole block for each file (even though the resulting files are not large), and if there isn't enough disk space for all of the reservations, then these inserts can fail. This is a common problem on development environments. This is currently failing for erasure coding tests.

Impala uses clustered inserts where the input is sorted and files are written one at a time (per backend). This limits the number of simultaneously open files, eliminating the corresponding disk space reservation. Switching populating tpcds.store_sales to use Impala would reduce the diskspace requirement for an Impala developer environment. Alternatively, there is likely equivalent Hive functionality for doing an initial sort so that only one partition needs to be written at a time.

This only applies to the text version of store_sales, which is created from store_sales_unpartitioned. All other formats are created from the text version of store_sales. Since the text store_sales is already partitioned in the same way as the destination store_sales, Hive can be more efficient, processing a small number of partitions at a time.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

namenodeparse.py
27/May/20 00:33
2 kB
Joe McDonnell

Issue Links

Blocked

IMPALA-10507 Workaround for sporadic TPC-DS testdata insert error

Closed

is duplicated by

IMPALA-9806 Multiple data load failures on HDFS errors for erasure coding builds

Resolved

is related to

IMPALA-9794 OutOfMemoryError when loading tpcds text data via Hive

Resolved

Activity

People

Assignee:: Sahil Takiar

Reporter:: Joe McDonnell

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 22/May/20 22:20

Updated:: 17/Feb/21 18:20

Resolved:: 02/Jun/20 00:50