Details
- Type: Improvement
- Status: In Progress
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: None
- Fix Version/s: None
- Component/s: None
- Labels: None
Description
Why compilation is slow: the Hadoop project has many modules, and because they cannot be compiled in parallel, the build takes a long time. A first compilation of Hadoop can take several hours, and even with the Maven dependencies already cached locally, a subsequent compilation can still take close to 40 minutes.
How to solve it: use mvn dependency:tree and maven-to-plantuml to find the dependency issues that prevent parallel compilation.
- Investigate the dependencies between the modules of the multi-module Maven project.
- Download maven-to-plantuml:
wget https://github.com/phxql/maven-to-plantuml/releases/download/v1.0/maven-to-plantuml-1.0.jar
- Generate a dependency tree:
mvn dependency:tree > dep.txt
- Generate a UML diagram from the dependency tree:
java -jar maven-to-plantuml-1.0.jar --input dep.txt --output dep.puml
For more information, see the maven-to-plantuml GitHub repository: https://github.com/phxql/maven-to-plantuml
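Before rendering the diagram, it can help to check which Hadoop artifacts actually appear in dep.txt. The following is a minimal sketch, assuming the tree lines carry the usual groupId:artifactId:type:version:scope coordinates; the function name hadoop_artifacts is hypothetical and not part of Maven or maven-to-plantuml.

```shell
# Hypothetical helper: list the distinct org.apache.hadoop artifacts
# mentioned in a dependency:tree dump such as dep.txt.
hadoop_artifacts() {
  # -o prints only the matched "groupId:artifactId" prefix of each coordinate;
  # sort -u removes the duplicates that appear across modules.
  grep -o 'org\.apache\.hadoop:[A-Za-z0-9_.-]*' "$1" | sort -u
}

# Usage (assumes dep.txt was produced by the step above):
#   hadoop_artifacts dep.txt
```

The output is one groupId:artifactId pair per line, which gives a quick view of which Hadoop modules the tree references before the full diagram is generated.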
Hadoop Parallel Compilation Submission Logic
- Reasons for Parallel Compilation Failure
- In sequential compilation, modules are compiled one at a time in the order they are listed, so no errors occur.
- In parallel compilation, modules are compiled concurrently, and the order is determined only by the declared inter-module dependencies. If Module A depends on Module B, Module B is compiled before Module A; modules with no declared dependency between them may be compiled at the same time.
When Hadoop is compiled in parallel, for example when compiling hadoop-yarn-project, the declared dependencies between modules are correct. The issue arises in the dist packaging stage, which packages all the other compiled modules.
Behavior of hadoop-yarn-project in Serial Compilation:
- In serial compilation, the modules listed in the pom are compiled one by one in sequence, and hadoop-yarn-project is compiled after all of them. During the prepare-package phase, maven-assembly-plugin runs and repackages all module outputs as described in hadoop-assemblies/src/main/resources/assemblies/hadoop-yarn-dist.xml.
Behavior of hadoop-yarn-project in Parallel Compilation:
- Parallel compilation orders modules by the dependencies declared among them. Modules that do not declare a dependency on each other through a dependency element are compiled in parallel. Based on the dependencies defined in the pom of hadoop-yarn-project, those dependencies are compiled first, and then hadoop-yarn-project is built and its maven-assembly-plugin executes.
- However, not all of the modules whose files are packaged by hadoop-assemblies/src/main/resources/assemblies/hadoop-yarn-dist.xml are declared as dependencies of hadoop-yarn-project. As a result, when hadoop-yarn-project is built and maven-assembly-plugin executes, some required modules may not have been built yet, which causes the parallel build to fail.
Solution:
- The solution is relatively straightforward: collect all modules referenced in hadoop-assemblies/src/main/resources/assemblies/hadoop-yarn-dist.xml and declare them as dependencies in the pom of hadoop-yarn-project.
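One way to find the missing declarations is to compare two plain-text lists with one artifactId per line: the modules the assembly descriptor packages, and the modules already declared as dependencies in the pom of hadoop-yarn-project. The sketch below assumes both lists have been collected by hand; the function name and the file names in the usage comment are hypothetical.

```shell
# Hypothetical helper: print every module listed in the first file
# that does not appear as a whole line in the second file.
missing_dependencies() {
  needed=$1     # modules the assembly descriptor packages, one per line
  declared=$2   # modules declared as dependencies in the pom, one per line
  # -F: fixed strings, -x: whole-line match, -v: invert the match,
  # -f: read the patterns (the declared modules) from a file.
  grep -Fxv -f "$declared" "$needed"
}

# Usage (both file names are assumptions):
#   missing_dependencies assembly-modules.txt declared-modules.txt
```

Whole-line matching matters here because many Hadoop module names are prefixes or substrings of one another, so a plain substring match could hide a module that is actually missing.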
Issue Links
- Blocked: HADOOP-19019 Parallel Maven Build Support for Apache Hadoop (Resolved)
- depends upon: BIGTOP-4044 Enhance Bigtop with Concurrent Compilation Support for Additional Components (Resolved)