[TEZ-788] Clean up dist tarballs - ASF JIRA

Jonathan Turner Eagles added a comment - 13/Feb/14 04:25

Starting with a dead simple distribution with few exceptions, and going from there.

Questions

Which hadoop jars should we include in full and non-full distributions?
Do we need the special exclusion for slf4j?
The non-full distribution has many more transitive dependencies listed than before. Is this needed?

Hitesh Shah added a comment - 13/Feb/14 05:58

My take (sseth and bikassaha may have differing opinions):

For the non-full distribution:
include only direct dependencies of hadoop i.e. all client api modules, common and some parts of mapreduce that are used.
The slf4j issue was seen recently when compiling against hadoop-2.4.0, I believe; it was reported by gkesavan.
The general assumption is that given that Tez will run on a Yarn Cluster, for the most part, the yarn classpath should have most of the runtime dependencies. Then again, my assumption is that given backward compatibility, tez clients could be using an older version of yarn client libs that should be able to communicate with a cluster running a higher version. In that respect, the non-full tarball should have all the necessary run-time dependencies.
Most of the earlier filtering was to reduce the total no. of jars that need to be localized with the assumption that a YARN cluster would have the rest of the bits.

From a big-top packaging point of view, where there is a 1:1 mapping i.e. a single full stack with only one version of each component, it may be necessary to strip all hadoop dependencies at some point. For now, I believe this should not be a concern.

Jonathan Turner Eagles added a comment - 13/Feb/14 16:35

For the non - full distribution

include only direct dependencies of hadoop i.e. all client api modules, common and some parts of mapreduce that are used.

Removed all the transitive dependencies from the non-full distro

slf4j was something seen recently when compiling against hadoop-2.4.0 I believe which was reported by Giridharan Kesavan.

Removed the slf4j dependencies from the full distro

The general assumption is that given that Tez will run on a Yarn Cluster, for the most part, the yarn classpath should have most of the runtime dependencies. Then again, my assumption is that given backward compatibility, tez clients could be using an older version of yarn client libs that should be able to communicate with a cluster running a higher version. In that respect, the non-full tarball should have all the necessary run-time dependencies.

Most of the earlier filtering was to reduce the total no. of jars that need to be localized with the assumption that a YARN cluster would have the rest of the bits.

From a big-top packaging point of view, where there is a 1:1 mapping i.e. a single full stack with only one version of each component, it may be necessary to strip all hadoop dependencies at some point. For now, I believe this should not be a concern.

I had removed the hadoop dependencies (including) before. Is the advantage of keeping the hadoop client jars there for the sole purpose of simplifying the classpath?

Hitesh Shah added a comment - 13/Feb/14 19:01

I had removed the hadoop dependencies (including) before. Is the advantage of keeping the hadoop client jars there for the sole purpose of simplifying the classpath?

Not really - the main reason was to ensure that whatever client libs Tez compiles against are in the localized set as there is no guarantee that a binary compatible client jar will be available on the cluster. The servers on the cluster need to be backward compatible - however, the client jars ( if present on the cluster ) need not be.

Bikas Saha added a comment - 13/Feb/14 19:45

Before committing, can a summary of the current state be posted, please? Maybe add something to the documentation as well.

Jonathan Turner Eagles added a comment - 14/Feb/14 02:43

Differences with the current patch

No Hadoop jars (not even client) exist in either tez-dist or tez-dist-full
tez jars are now pulled up into top level from lib
tez-dist now has direct dependencies of all tez jars in lib instead of direct dependencies of only tez-api
As a result, a jar may be missing or added depending on whether tez lists the dependency directly or was using it transitively. Further audit of this will be needed.

Jonathan Turner Eagles added a comment - 14/Feb/14 21:02

Hmm. Actually, I have found that not having the mapreduce client jars in the tez jar location in hdfs causes tez to hit class-not-found issues. Perhaps I should leave these in until a later time. Is this use case supported currently, or is there more work to be done?

Jon

Hitesh Shah added a comment - 14/Feb/14 21:16

jeagles At the moment, all the jars in the minimal tarball are augmented with the yarn classpath ( i.e. HDFS, COMMON and YARN jars only ) to create the runtime env.

Jonathan Turner Eagles added a comment - 14/Feb/14 21:31

Just to clarify so I can wrap this up soon. Tez augments only HDFS, COMMON, and YARN jars but NOT MAPREDUCE jars. If this is the case, then MAPREDUCE client jars will need to be added like they were before.
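
The classpath reasoning in this exchange can be sketched in a few lines. This is only an illustration, not Tez code: the jar names and the `CLUSTER_JARS` table are hypothetical, and the only fact taken from the thread is that the runtime environment is the tarball jars plus the cluster's HDFS, COMMON and YARN jars, without MAPREDUCE.

```python
# Hypothetical jar names, used only for illustration.
CLUSTER_JARS = {
    "HDFS": {"hadoop-hdfs.jar"},
    "COMMON": {"hadoop-common.jar"},
    "YARN": {"hadoop-yarn-api.jar", "hadoop-yarn-common.jar"},
    "MAPREDUCE": {"hadoop-mapreduce-client-core.jar"},
}

# Components that, per the thread, are added to the runtime environment.
AUGMENTED_COMPONENTS = ("HDFS", "COMMON", "YARN")


def runtime_classpath(tarball_jars, cluster_jars=CLUSTER_JARS,
                      components=AUGMENTED_COMPONENTS):
    """Return the jars visible at runtime: the tarball jars plus the
    cluster jars of the augmented components only."""
    visible = set(tarball_jars)
    for component in components:
        visible |= cluster_jars.get(component, set())
    return visible


def missing_jars(required, tarball_jars):
    """Return the required jars that the runtime classpath cannot supply."""
    return set(required) - runtime_classpath(tarball_jars)
```

With a tarball that omits the MapReduce client jar, `missing_jars` reports it as unavailable, which matches the class-not-found symptom described above; adding the jar to the tarball makes the set empty.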

Hitesh Shah added a comment - 14/Feb/14 21:38

jeagles Yes - the mapreduce user-facing jars should be included. I don't believe there is anything from the MR AM related jars.

sseth any comments?

Jonathan Turner Eagles added a comment - 14/Feb/14 22:58

I have made the changes so that the mapreduce-client jars are made available. Any other thoughts?

Hitesh Shah added a comment - 16/Feb/14 05:16

jeagles Sorry for the delay - was waiting on bikassaha or sseth to chime in. Could you shed some light on the removal of the client dependencies? For now, it should not be an issue assuming everything is compatible but at some point, there is a need to have the client jars also be part of the distribution.

Jonathan Turner Eagles added a comment - 16/Feb/14 23:35

Initially I thought to remove the map reduce client libs as part of this jira, but then I thought to leave that decision until later. Hope that answers your question.

Hitesh Shah added a comment - 17/Feb/14 04:41

Sorry - I should have been clearer. For mapreduce, I believe the current patch is retaining the required jars. I was wondering about the yarn-client dependency being removed?

Siddharth Seth added a comment - 17/Feb/14 08:06

Jonathan Eagles Yes - the mapreduce user-facing jars should be included. I don't believe there is anything from the MR AM related jars.

Right. The main JAR from MapReduce should be included in the tez distribution (client-core). We're pulling in Shuffle and Common as well currently. Need to investigate if these are mandatory.

On the YARN client dependencies - is relying on the versions deployed on the cluster sufficient? YARN, at least within the 2.x line, is supposed to be backward compatible? This control should likely be left to the cluster administrators. If Tez were to ship YARN client libraries and were deployed on a cluster with an older version of YARN (let's say with some missing methods) - that just doesn't work. At that point, we either ensure the correct version or rebuild Tez for specific versions of YARN. I'd really like to avoid going down the route of shimming in Tez to handle such situations though.

The remaining libraries are also interesting. Tez depends on certain libraries which are also used by Hadoop (Guava, HTTPClient etc). Should Tez include a copy of these libraries in its build, or should it rely on the copy included with Hadoop? I have seen cases in the recent past where some cleanup of unnecessary dependencies in Hadoop has caused Tez dist installs to fail. We may just be better off shipping all direct dependencies, and supporting a post-install localization step for all nodes.

Jonathan Turner Eagles added a comment - 18/Feb/14 23:07

There are a lot of great ideas and forward thinking here. As far as 0.3.0 release is concerned, what aspects (or all) of this do we want to accomplish with this JIRA? I want to make sure we are all in agreement to ease this in.

Siddharth Seth added a comment - 18/Feb/14 23:20

I'd say - the required MapReduce libraries, and other libraries for which we know that Tez has direct dependencies. For now I guess it's OK to depend on Hadoop for some of the dependencies ( Hadoop 2.2, 2.3 and 2.4 should work imo though). Thoughts ?

Siddharth Seth added a comment - 19/Feb/14 00:25

jeagles, I looked at the before and after diff. There's a bunch of jars like commons-compress, jackson, stax, xz which were in the old build, but are missing from the new one. Do you have any idea why these were included previously and not now? I don't know if anything in Tez directly depends on these - but MapReduce dependencies may need to be pulled in (if that's where they were from).

Jonathan Turner Eagles added a comment - 19/Feb/14 22:47

Attaching a much better dependency list. The thing to look at here is that tez-dist only includes direct dependencies (and no transitive ones), whereas tez-dist-full includes both direct and transitive dependencies. The reason there is a difference from before is that tez-dist previously included only direct dependencies for items in the top level, but transitive dependencies for items in lib. Hence, due to the move of tez jars to the top level, some transitive dependencies are no longer included. In this new version I have also made sure to compare identical versions of hadoop.
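
The direct versus transitive distinction can be sketched as two small graph functions. This is a hedged illustration only: the dependency graph passed in is hypothetical, and the functions model the idea described above rather than how the Tez build actually resolves dependencies.

```python
def direct_dependencies(graph, modules):
    """Dependencies listed directly by the given modules (tez-dist style)."""
    deps = set()
    for module in modules:
        deps |= set(graph.get(module, ()))
    return deps - set(modules)


def transitive_dependencies(graph, modules):
    """Everything reachable from the given modules (tez-dist-full style)."""
    seen = set()
    stack = list(modules)
    while stack:
        node = stack.pop()
        for dep in graph.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen - set(modules)


# Hypothetical graph: each module maps to the dependencies it declares.
EXAMPLE_GRAPH = {
    "tez-api": ["guava", "hadoop-common"],
    "tez-dag": ["tez-api", "commons-io"],
    "hadoop-common": ["jackson"],
}
```

In this made-up graph, `jackson` is reachable only through `hadoop-common`, so it appears in the transitive set but not in the direct one. That is the kind of jar that drops out of a listing when a module stops contributing its transitive dependencies.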

Siddharth Seth added a comment - 20/Feb/14 22:27

Committing this. Thanks for the effort and patience jeagles.

Between the current state and the post patch state - I think the only jar which will be missing will be netty (considering yarn / hdfs / common classpath). Afaik, netty is not used by Tez or the components of MapReduce that tez depends on. (can likely be removed from the hadoop build itself)

I'm going to open yet another follow up jira to try listing direct dependencies etc.
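
The before/after comparison described above (which jars go missing once the yarn / hdfs / common classpath is taken into account) could be checked with a small script along these lines. It is a sketch under stated assumptions: the jar file names and versions in any input are hypothetical, and the version-stripping regex is a simplification that will not handle every naming scheme.

```python
import re


def artifact_name(jar_file):
    """Strip the .jar suffix and a trailing version,
    e.g. 'commons-compress-1.4.1.jar' -> 'commons-compress'."""
    stem = jar_file[:-4] if jar_file.endswith(".jar") else jar_file
    return re.sub(r"-\d[\w.]*(-SNAPSHOT)?$", "", stem)


def missing_after_patch(before, after, cluster):
    """Artifacts present in the old tarball that are supplied neither by
    the new tarball nor by the cluster classpath."""
    before_names = {artifact_name(j) for j in before}
    available = {artifact_name(j) for j in after}
    available |= {artifact_name(j) for j in cluster}
    return before_names - available
```

Comparing by artifact name rather than by file name is a deliberate choice here: a jar that is still present at a different version is not reported as missing.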

Siddharth Seth added a comment - 20/Feb/14 22:32

Committed to master.
Created TEZ-867 for follow up.

Hitesh Shah added a comment - 01/Mar/14 04:19

Closing issue as 0.3.0 released.
