IMPALA / IMPALA-3607

Reduce test data loading time from snapshot


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: Impala 2.5.0
    • Fix Version/s: None
    • Component/s: Infrastructure

    Description

      Loading test data from a snapshot takes a significant amount of time (~20-30 min). Given the amount of data loaded (~4 GB), loading test data into a local 3-node HDFS minicluster should be significantly faster. The process currently works as follows:
      1. Download the latest snapshot.
      2. Unpack it on the local file system.
      3. Use the hdfs dfs -put command to copy the files from the local file system into HDFS.
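      The three steps above can be sketched roughly as follows. The snapshot URL and the /test-warehouse target are placeholders, not the actual values used by the build; the sketch skips itself when no hdfs CLI is available:

```shell
#!/usr/bin/env bash
# Sketch of the current three-step load path. SNAPSHOT_URL is a
# placeholder; the real snapshot location is not given in this issue.
set -euo pipefail

SNAPSHOT_URL="${SNAPSHOT_URL:-https://example.com/testdata-snapshot.tar.gz}"
WORKDIR="$(mktemp -d)"

if command -v hdfs >/dev/null 2>&1; then
    # 1. Download the latest snapshot.
    curl -fSL "$SNAPSHOT_URL" -o "$WORKDIR/snapshot.tar.gz"

    # 2. Unpack it on the local file system.
    tar -xzf "$WORKDIR/snapshot.tar.gz" -C "$WORKDIR"

    # 3. Copy into HDFS. Each file created here is a separate NameNode
    #    metadata operation, which is where we believe most of the
    #    20-30 min goes.
    hdfs dfs -mkdir -p /test-warehouse
    hdfs dfs -put -f "$WORKDIR"/* /test-warehouse/
else
    echo "hdfs CLI not found; skipping the illustrative steps"
fi
```

      Note that steps 1 and 2 are bounded by network and local disk throughput, while step 3 is dominated by per-file NameNode round trips, which is why it scales with file count rather than data volume.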

      We believe the bulk of the time goes to step #3 and is attributable to NameNode overhead, since every file copied via hdfs dfs -put requires NameNode metadata operations. Below are a few ideas we can try to improve this:
      1. Use a backup-and-restore approach for HDFS metadata and data that doesn't go through the NameNode. For example, once data has been loaded to an HDFS cluster using the old approach, create two snapshots: one for metadata and one for data. Loading the test data is then just a matter of unpacking the snapshots into the appropriate directories. A similar approach is used to back up and restore HDFS clusters (http://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_hdfs_metadata_backup.html). A Jenkins job would still be responsible for checking for changes in the test data, doing the slow data load, and creating the new snapshots.
      2. Other ideas include using EC2 AMIs, Docker, and/or HDFS checkpointing.
      3. Use faster compression/decompression tools.
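      Idea #1 might look roughly like the sketch below. The archive names and the target directories (i.e. the locations backing dfs.namenode.name.dir and dfs.datanode.data.dir) are assumptions, since the actual minicluster configuration is not given in this issue:

```shell
#!/usr/bin/env bash
# Sketch of idea #1: restore a minicluster from pre-built metadata and
# data archives instead of replaying hdfs dfs -put. Archive names and
# directory paths are hypothetical.
set -euo pipefail

NN_NAME_DIR="${NN_NAME_DIR:-$(mktemp -d)}"   # backs dfs.namenode.name.dir
DN_DATA_DIR="${DN_DATA_DIR:-$(mktemp -d)}"   # backs dfs.datanode.data.dir

restore() {
    local archive="$1" target="$2"
    if [ -f "$archive" ]; then
        # Unpacking writes the NameNode fsimage/edits (or DataNode block
        # files) straight to disk; no per-file NameNode RPCs are involved.
        tar -xzf "$archive" -C "$target"
    else
        echo "archive $archive not present; skipping (illustrative sketch)"
    fi
}

# One archive for NameNode metadata and one for DataNode block data, both
# produced by the Jenkins job after a conventional (slow) data load.
restore metadata-snapshot.tar.gz "$NN_NAME_DIR"
restore data-snapshot.tar.gz     "$DN_DATA_DIR"

# After restoring, (re)start the minicluster with dfs.namenode.name.dir
# and dfs.datanode.data.dir pointing at these directories.
```

      The restore cost here is plain archive extraction, so it scales with data volume and local disk speed rather than with the number of files in the test warehouse.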


    People

      Assignee: Unassigned
      Reporter: Dimitris Tsirogiannis (dtsirogiannis)
      Votes: 0
      Watchers: 5
