Details
- Type: Improvement
- Status: Closed
- Priority: Minor
- Resolution: Won't Fix
- Affects Version: Impala 2.5.0
Description
Loading test data from a snapshot takes a significant amount of time (~20-30 min). Given the amount of data loaded (~4GB), loading it into a local 3-node mini-HDFS cluster should be significantly faster. The process currently works as follows:
1. Download the latest snapshot
2. Unzip
3. Use the hdfs dfs -put command to copy from the local file system into HDFS
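The three steps above amount to something like the following sketch. The snapshot URL and paths are placeholders, not the real Impala infrastructure, and the commands are echoed rather than executed:

```shell
set -eu
# Placeholder values -- the real snapshot URL and warehouse path differ.
SNAPSHOT_URL="https://example.com/impala-testdata-snapshot.tar.gz"
run() { echo "+ $*"; }   # dry-run helper; replace the body with "$@" to execute

run curl -fsSL -o snapshot.tar.gz "$SNAPSHOT_URL"   # 1. download latest snapshot
run tar -xzf snapshot.tar.gz -C /tmp/testdata       # 2. unzip locally
run hdfs dfs -put /tmp/testdata /test-warehouse     # 3. per-file copy through the NameNode (slow)
```

Step 3 issues one NameNode RPC per file created, which is why it dominates the wall-clock time.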
We believe the bulk of the time goes to step #3 and is attributable to NameNode overhead. Below are a few ideas we could try to improve this:
1. Use a backup-and-restore approach for HDFS metadata/data that doesn't go through the NameNode. For example, once data is loaded to an HDFS cluster using the old approach, create two snapshots: one for metadata and one for data. Loading the test data is then just a matter of unzipping the snapshots into the appropriate directories. A similar approach is used to back up and restore HDFS clusters (http://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_hdfs_metadata_backup.html). A Jenkins job would still be responsible for checking for changes in test data, doing the slow data load, and creating the new snapshots.
2. Other ideas include using EC2 AMIs, Docker, and/or HDFS checkpointing.
3. Use faster compression/decompression tools.
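Idea #1 might look roughly like the sketch below. The storage directories are assumptions about the mini-cluster's configuration (check dfs.namenode.name.dir and dfs.datanode.data.dir in hdfs-site.xml for the real locations), and the commands are echoed rather than executed:

```shell
set -eu
# Assumed mini-cluster storage locations -- these paths are placeholders.
NN_DIR="/var/lib/hdfs/name"   # NameNode metadata dir (dfs.namenode.name.dir)
DN_DIR="/var/lib/hdfs/data"   # DataNode block dir   (dfs.datanode.data.dir)
run() { echo "+ $*"; }        # dry-run helper; replace the body with "$@" to execute

# Restore a previously captured cluster image without going through the NameNode:
run stop-dfs.sh                                       # cluster must be down
run tar -xzf metadata-snapshot.tar.gz -C "$NN_DIR"    # fsimage + edit logs
run tar -xzf data-snapshot.tar.gz -C "$DN_DIR"        # raw block files
run start-dfs.sh                                      # NameNode boots from the restored fsimage
```

The key property is that no per-file RPCs are made; restoring N files costs the same as restoring one large archive.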
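For idea #3, any replacement compressor should at least round-trip the data byte-for-byte; a minimal check using gzip as the baseline (zstd or pigz, where installed, are the usual faster drop-ins):

```shell
set -eu
tmp=$(mktemp -d)
head -c 1048576 /dev/urandom > "$tmp/data.bin"      # 1 MiB of sample data
gzip -c "$tmp/data.bin" > "$tmp/data.bin.gz"        # baseline compressor
gzip -dc "$tmp/data.bin.gz" > "$tmp/roundtrip.bin"  # decompress
cmp "$tmp/data.bin" "$tmp/roundtrip.bin"            # verify byte-identical
# Candidates to benchmark: zstd (fast decompression) or pigz (parallel gzip).
```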
Attachments
Issue Links
- is related to: IMPALA-3227 Ensure that tests and dataloading can be run efficiently w/o Cloudera infra (Resolved)