Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Description
Building Impala requires downloading a large number of dependencies.
joemcdonnell helped trace where all the jars come from.
Number of artifacts downloaded from each repo:
16 cdh.rcs.releases.repo
2067 central
203 impala.cdp.repo
2 impala.toolchain.kudu.repo
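A tally like the one above can be reproduced from a Maven build log. The following is a minimal sketch, assuming the log contains Maven's usual "Downloaded from <repo-id>: <url>" lines. The function name `count_artifacts_by_repo` and the sample log lines are illustrative only and are not part of the Impala build scripts.

```python
import re
from collections import Counter

# Matches completed downloads such as
#   "[INFO] Downloaded from central: https://... (12 kB at 34 kB/s)"
# "Downloading from ..." lines (download started) are deliberately not matched.
DOWNLOADED = re.compile(r"Downloaded from ([^:\s]+):")


def count_artifacts_by_repo(log_lines):
    """Return a Counter mapping repository id -> number of downloaded artifacts."""
    counts = Counter()
    for line in log_lines:
        match = DOWNLOADED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical log excerpt; the URLs are placeholders.
    sample_log = [
        "[INFO] Downloading from central: https://example.invalid/a.pom",
        "[INFO] Downloaded from central: https://example.invalid/a.pom (2 kB at 10 kB/s)",
        "[INFO] Downloaded from impala.cdp.repo: https://example.invalid/b.jar (5 MB at 1 MB/s)",
    ]
    for repo, n in sorted(count_artifacts_by_repo(sample_log).items()):
        print(n, repo)
```

To run it against a real build, pass the lines of a saved log file (for example `open("build.log")`) instead of `sample_log`.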
In my local environment, most of the build time is spent downloading artifacts from Cloudera's S3 bucket. Some of the files are large, for example:
458.2 MiB llvm-5.0.1-asserts-p3-gcc-7.5.0-ec2-package-ubuntu-16-04.tar.gz
373.4 MiB llvm-5.0.1-p3-gcc-7.5.0-ec2-package-ubuntu-16-04.tar.gz
1.1 GiB kudu-6a7cadc7e-gcc-7.5.0-ec2-package-ubuntu-16-04.tar.gz
333.0 MiB apache-hive-3.1.3000.7.2.7.0-44-bin.tar.gz
377.2 MiB hadoop-3.1.1.7.2.7.0-44.tar.gz
370.4 MiB hbase-2.2.6.7.2.7.0-44-bin.tar.gz
258.3 MiB ranger-2.1.0.7.2.7.0-44-admin.tar.gz
63.4 MiB tez-0.9.1.7.2.7.0-44-minimal.tar.gz
Downloading from S3 is very slow in China, and possibly in other regions as well. One solution is to refactor our dependencies onto Apache released versions (IMPALA-10408) so that they can be downloaded from Apache mirrors.
Another solution is to provide alternative download sources, such as Alibaba Cloud or qcloud (Tencent Cloud). Developers could then choose among them or set up their own sources.
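One way the alternative-source idea could work is an ordered list of base URLs that the download step tries in turn. The sketch below is only an illustration under stated assumptions: the environment variable `DOWNLOAD_MIRRORS`, the placeholder `DEFAULT_SOURCE`, and both function names are hypothetical and do not correspond to existing Impala build options.

```python
import os
import shutil
import urllib.error
import urllib.request

# Placeholder for the default download location (hypothetical URL).
DEFAULT_SOURCE = "https://default-bucket.example.invalid"


def candidate_urls(relative_path, env=None):
    """Build the ordered list of URLs to try for one artifact.

    Mirrors come from the hypothetical DOWNLOAD_MIRRORS variable
    (comma-separated base URLs); the default source is always tried last.
    """
    env = os.environ if env is None else env
    raw = env.get("DOWNLOAD_MIRRORS", "")
    mirrors = [m.strip().rstrip("/") for m in raw.split(",") if m.strip()]
    bases = mirrors + [DEFAULT_SOURCE]
    path = relative_path.lstrip("/")
    return [f"{base}/{path}" for base in bases]


def download_with_fallback(relative_path, dest, env=None, timeout=30):
    """Try each candidate URL in order; return the URL that succeeded."""
    last_error = None
    for url in candidate_urls(relative_path, env):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp, \
                    open(dest, "wb") as out:
                shutil.copyfileobj(resp, out)
            return url
        except (urllib.error.URLError, OSError) as exc:
            # Remember the failure and move on to the next source.
            last_error = exc
    raise RuntimeError(f"all sources failed for {relative_path}") from last_error
```

With this layout, a developer in a region where S3 is slow would list a nearby mirror first, and the build would still fall back to the default source if the mirror lacks a file. A real implementation would also need to verify checksums, since mirrors are not necessarily trusted.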