Details
- Type: Bug
- Status: Closed
- Priority: Blocker
- Resolution: Fixed
Description
Follow up to comments in TEZ-771 for cleaning up the tarballs to simplify maintenance.
Attachments
- TEZ-788.patch (6 kB, Jonathan Turner Eagles)
- tez-dist-after-sorted.txt (2 kB, Jonathan Turner Eagles)
- tez-dist-before-sorted.txt (3 kB, Jonathan Turner Eagles)
- tez-dist-full-after-sorted.txt (8 kB, Jonathan Turner Eagles)
- tez-dist-full-before-sorted.txt (8 kB, Jonathan Turner Eagles)
Issue Links
- is related to TEZ-771 Dist build broken after TEZ-749 (Closed)
Activity
My take (sseth and bikassaha may have differing opinions):
- For the non-full distribution:
- include only direct dependencies of hadoop, i.e. all client api modules, common and some parts of mapreduce that are used.
- slf4j was something seen recently when compiling against hadoop-2.4.0, I believe, which was reported by gkesavan.
- The general assumption is that, given that Tez will run on a YARN cluster, the yarn classpath should have most of the runtime dependencies. Then again, my assumption is that, given backward compatibility, tez clients could be using an older version of the yarn client libs that should be able to communicate with a cluster running a higher version. In that respect, the non-full tarball should have all the necessary run-time dependencies.
- Most of the earlier filtering was to reduce the total number of jars that need to be localized, with the assumption that a YARN cluster would have the rest of the bits.
- From a Bigtop packaging point of view, where there is a 1:1 mapping, i.e. a single full stack with only one version of each component, it may be necessary to strip all hadoop dependencies at some point. For now, I believe this should not be a concern.
For the non-full distribution
include only direct dependencies of hadoop, i.e. all client api modules, common and some parts of mapreduce that are used.
Removed all the transitive dependencies from the non-full distro
slf4j was something seen recently when compiling against hadoop-2.4.0 I believe which was reported by Giridharan Kesavan.
Removed the slf4j dependencies from the full distro
I had removed the hadoop dependencies (including) before. Is the advantage of keeping the hadoop client jars there for the sole purpose of simplifying the classpath?
Not really - the main reason was to ensure that whatever client libs Tez compiles against are in the localized set as there is no guarantee that a binary compatible client jar will be available on the cluster. The servers on the cluster need to be backward compatible - however, the client jars ( if present on the cluster ) need not be.
Before committing, can a summary of the current state be posted, please? Maybe also add something to the documentation.
Differences with the current patch
- No Hadoop jars (not even client) exist in either tez-dist or tez-dist-full
- tez jars are now pulled up into top level from lib
- tez-dist now has direct dependencies of all tez jars in lib instead of direct dependencies of only tez-api
- there may be a difference in a jar missing or added based on the above depending on whether tez lists the dependency directly or was using it transitively. Further audit of this will be needed.
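The audit mentioned in the last item could start from a plain set difference of the jar listings. This is a minimal sketch, assuming each listing has one jar path per line (the format of the attached `*-sorted.txt` files is an assumption). The function name and the sample jar names are hypothetical. Comparing by base name keeps the move of the tez jars from lib to the top level from showing up as a removal plus an addition.

```python
import posixpath


def diff_jar_lists(before, after):
    """Return (removed, added) jar file names between two listings.

    Paths are reduced to their base name so that a jar which only moved
    between directories is not reported as a change.
    """
    before_names = {posixpath.basename(line.strip()) for line in before if line.strip()}
    after_names = {posixpath.basename(line.strip()) for line in after if line.strip()}
    removed = sorted(before_names - after_names)
    added = sorted(after_names - before_names)
    return removed, added


# Hypothetical listings, NOT the contents of the attached files.
before_listing = ["lib/tez-api.jar", "lib/dep-a.jar", "lib/dep-b.jar"]
after_listing = ["tez-api.jar", "lib/dep-a.jar"]

removed, added = diff_jar_lists(before_listing, after_listing)
print("removed:", removed)
print("added:", added)
```

Open file objects can be passed in place of the lists, since the function only iterates over lines.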
Hmm. Actually, I have found that not having the mapreduce client jars in the tez jar location in hdfs causes tez to have class not found issues. Perhaps I should leave these in until a later time. Is this use case supported currently, or is there more work to be done?
Jon
jeagles At the moment, all the jars in the minimal tarball are augmented with the yarn classpath ( i.e. HDFS, COMMON and YARN jars only ) to create the runtime env.
Just to clarify so I can wrap this up soon. Tez augments only HDFS, COMMON, and YARN jars but NOT MAPREDUCE jars. If this is the case, then MAPREDUCE client jars will need to be added like they were before.
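The point above can be modelled with a small check. This is a minimal sketch, assuming the runtime environment is the union of the jars in the minimal tarball and the cluster's HDFS, COMMON and YARN jars, as described in the thread. It is not Tez's actual classpath code, and all function names and jar names are hypothetical.

```python
# Components whose cluster jars are added to the runtime env, per the thread.
# MAPREDUCE is deliberately absent.
AUGMENTED_COMPONENTS = ("hdfs", "common", "yarn")


def runtime_jars(tarball_jars, cluster_jars_by_component):
    """Union of the tarball jars and the cluster jars of the augmented components."""
    jars = set(tarball_jars)
    for component in AUGMENTED_COMPONENTS:
        jars.update(cluster_jars_by_component.get(component, ()))
    return jars


def missing_jars(required, tarball_jars, cluster_jars_by_component):
    """Jars that are required at runtime but absent from the modelled runtime env."""
    available = runtime_jars(tarball_jars, cluster_jars_by_component)
    return sorted(set(required) - available)


# Hypothetical jar names for illustration only.
cluster = {
    "hdfs": {"hadoop-hdfs.jar"},
    "common": {"hadoop-common.jar"},
    "yarn": {"hadoop-yarn-client.jar"},
    "mapreduce": {"hadoop-mapreduce-client-core.jar"},
}
required = ["tez-api.jar", "hadoop-common.jar", "hadoop-mapreduce-client-core.jar"]

without_mr = missing_jars(required, {"tez-api.jar"}, cluster)
with_mr = missing_jars(
    required, {"tez-api.jar", "hadoop-mapreduce-client-core.jar"}, cluster
)
print(without_mr)
print(with_mr)
```

In this model the mapreduce client jar is reported missing unless the tarball ships it, which matches the class not found issues reported earlier in the thread.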
jeagles Yes - the mapreduce user-facing jars should be included. I don't believe there is anything from the MR AM related jars.
sseth any comments?
I have made the changes so that the mapreduce-client jars are made available. Any other thoughts?
jeagles Sorry for the delay - was waiting on bikassaha or sseth to chime in. Could you shed some light on the removal of the client dependencies? For now, it should not be an issue assuming everything is compatible but at some point, there is a need to have the client jars also be part of the distribution.
Initially I thought to remove the map reduce client libs as part of this jira, but then I thought to leave that decision until later. Hope that answers your question.
Sorry - should have been more clear. For mapreduce, I believe the current patch is retaining the required jars. I was wondering about the yarn-client dependency being removed?
Jonathan Eagles Yes - the mapreduce user-facing jars should be included. I don't believe there is anything from the MR AM related jars.
Right. The main JAR from MapReduce should be included in the tez distribution (client-core). We're pulling in Shuffle and Common as well currently. Need to investigate if these are mandatory.
On the YARN client dependencies - is relying on the versions deployed on the cluster sufficient? YARN, at least within the 2.x line, is supposed to be backward compatible? This control should likely be left to the cluster administrators. If Tez were to ship YARN client libraries, and were to be deployed on a cluster with an older version of YARN (let's say with some missing methods) - that just doesn't work. At that point, we either ensure the correct version or rebuild Tez for specific versions of YARN. I'd really like to avoid going down the route of shimming in Tez to handle such situations though.
The remaining libraries are also interesting. Tez depends on certain libraries which are also used by Hadoop (Guava, HTTPClient etc). Should Tez be including a copy of these libraries in its build, or should it rely on the copy included with Hadoop? I have seen cases in the recent past where some cleanup of unnecessary dependencies in Hadoop has caused Tez dist installs to fail. We may just be better off shipping all direct dependencies, and supporting a post-install localization step for all nodes.
There are a lot of great ideas and forward thinking here. As far as the 0.3.0 release is concerned, what aspects (or all) of this do we want to accomplish with this JIRA? I want to make sure we are all in agreement to ease this in.
I'd say - the required MapReduce libraries, and other libraries for which we know that Tez has direct dependencies. For now I guess it's OK to depend on Hadoop for some of the dependencies (Hadoop 2.2, 2.3 and 2.4 should work imo though). Thoughts?
jeagles, Looked at the before and after diff. There's a bunch of jars like commons-compress, jackson, stax, xz which were in the old build, but are missing from the new one. Do you have any idea why these were included previously and not now? I don't know if anything in Tez directly depends on these - but MapReduce dependencies may need to be pulled in (if that's where they were from).
Attaching a much better dependency list. The thing to look at here is that tez-dist only includes direct dependencies (and no transitive ones), whereas tez-dist-full includes both direct and transitive dependencies. The reason there is a difference from before is that tez-dist previously included only direct dependencies for items in the top level, but transitive dependencies for items in lib. Hence, due to the move of the tez jars to the top level, some transitive dependencies are no longer included. In this new version I have also made sure to compare identical versions of hadoop.
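The direct versus transitive distinction can be illustrated on a toy dependency graph. This is a minimal sketch, assuming the graph is a plain mapping from a module to the modules it declares. The graph below and the function names are hypothetical and do not reflect the real Tez or Hadoop dependency trees.

```python
def direct_dependencies(graph, modules):
    """Dependencies declared directly by the given modules (tez-dist style)."""
    deps = set()
    for module in modules:
        deps.update(graph.get(module, ()))
    return deps - set(modules)


def transitive_dependencies(graph, modules):
    """Everything reachable from the given modules (tez-dist-full style)."""
    seen = set()
    stack = list(modules)
    while stack:
        node = stack.pop()
        for dep in graph.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen - set(modules)


# Hypothetical graph: dep-c is reachable only through dep-b.
graph = {
    "tez-api": ["dep-a"],
    "tez-mapreduce": ["tez-api", "dep-b"],
    "dep-b": ["dep-c"],
}
tez_modules = ["tez-api", "tez-mapreduce"]

direct = direct_dependencies(graph, tez_modules)
full = transitive_dependencies(graph, tez_modules)
print(sorted(direct))
print(sorted(full - direct))
```

The second line printed lists the jars that would appear only in the full distribution, which is the kind of difference the before and after listings are meant to expose.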
Committing this. Thanks for the effort and patience jeagles.
Between the current state and the post patch state - I think the only jar which will be missing will be netty (considering yarn / hdfs / common classpath). Afaik, netty is not used by Tez or the components of MapReduce that tez depends on. (can likely be removed from the hadoop build itself)
I'm going to open yet another follow up jira to try listing direct dependencies etc.
Starting with a dead simple distribution with few exceptions and going from there.
Questions