Details
- Type: Improvement
- Status: Closed
- Priority: Minor
- Resolution: Won't Fix
- Affects Version: Impala 2.5.0
Description
Loading test data from a snapshot takes a significant amount of time (~20-30 min). Given the amount of data loaded (~4GB), loading it into a local 3-node mini-HDFS cluster should be significantly faster. The process currently works as follows:
1. Download the latest snapshot
2. Unzip
3. Use the hdfs dfs -put command to copy from the local file system into HDFS
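The three steps above amount to something like the following sketch. The snapshot URL and paths are placeholders, not the real Impala infrastructure, and the commands are echoed rather than executed:

```shell
set -eu
# Placeholder values -- the real snapshot URL and warehouse path differ.
SNAPSHOT_URL="https://example.com/impala-testdata-snapshot.tar.gz"
run() { echo "+ $*"; }   # dry-run helper; replace the body with "$@" to execute

run curl -fsSL -o snapshot.tar.gz "$SNAPSHOT_URL"   # 1. download latest snapshot
run tar -xzf snapshot.tar.gz -C /tmp/testdata       # 2. unzip locally
run hdfs dfs -put /tmp/testdata /test-warehouse     # 3. per-file copy through the NameNode (slow)
```

Step 3 issues one NameNode RPC per file created, which is why it dominates the wall-clock time.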
We believe the bulk of the time goes to step #3 and is attributable to NameNode overhead. Below are a few ideas we could try to improve this:
1. Use a backup-and-restore approach for HDFS metadata/data that doesn't go through the NameNode. For example, once data is loaded to an HDFS cluster using the old approach, create two snapshots: one for metadata and one for data. Loading the test data is then just a matter of unzipping the snapshots into the appropriate directories. A similar approach is used to back up and restore HDFS clusters (http://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_hdfs_metadata_backup.html). A Jenkins job would still be responsible for checking for changes in test data, doing the slow data load, and creating the new snapshots.
2. Other ideas include using EC2 AMIs, Docker, and/or HDFS checkpointing.
3. Use faster compression/decompression tools.
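Idea #1 might look roughly like the sketch below. The storage directories are assumptions about the mini-cluster's configuration (check dfs.namenode.name.dir and dfs.datanode.data.dir in hdfs-site.xml for the real locations), and the commands are echoed rather than executed:

```shell
set -eu
# Assumed mini-cluster storage locations -- these paths are placeholders.
NN_DIR="/var/lib/hdfs/name"   # NameNode metadata dir (dfs.namenode.name.dir)
DN_DIR="/var/lib/hdfs/data"   # DataNode block dir   (dfs.datanode.data.dir)
run() { echo "+ $*"; }        # dry-run helper; replace the body with "$@" to execute

# Restore a previously captured cluster image without going through the NameNode:
run stop-dfs.sh                                       # cluster must be down
run tar -xzf metadata-snapshot.tar.gz -C "$NN_DIR"    # fsimage + edit logs
run tar -xzf data-snapshot.tar.gz -C "$DN_DIR"        # raw block files
run start-dfs.sh                                      # NameNode boots from the restored fsimage
```

The key property is that no per-file RPCs are made; restoring N files costs the same as restoring one large archive.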
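For idea #3, any replacement compressor should at least round-trip the data byte-for-byte; a minimal check using gzip as the baseline (zstd or pigz, where installed, are the usual faster drop-ins):

```shell
set -eu
tmp=$(mktemp -d)
head -c 1048576 /dev/urandom > "$tmp/data.bin"      # 1 MiB of sample data
gzip -c "$tmp/data.bin" > "$tmp/data.bin.gz"        # baseline compressor
gzip -dc "$tmp/data.bin.gz" > "$tmp/roundtrip.bin"  # decompress
cmp "$tmp/data.bin" "$tmp/roundtrip.bin"            # verify byte-identical
# Candidates to benchmark: zstd (fast decompression) or pigz (parallel gzip).
```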
Attachments
Issue Links
- is related to: IMPALA-3227 Ensure that tests and dataloading can be run efficiently w/o Cloudera infra (Resolved)