Can we keep a single bundle for the Hive internal pieces? I think that's orthogonal to the caching effort; it seems more efficient to me than breaking it all into smaller bits, and it also lets us shade what needs shading. It also doesn't change how we handle these things as drastically.
The issue, as I understand it, is that Hadoop unjars the jar it ships to the cluster by default; in Hive's case that is the hive-exec jar. I got this from here: https://issues.apache.org/jira/browse/PIG-2672?focusedCommentId=13263874&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13263874 and here: https://issues.apache.org/jira/browse/HCATALOG-385
This means that a large hive-exec jar imposes a large penalty on every query, which is one of the reasons I undid the uber hive-exec jar. The other reason is that building an uber jar as the main artifact of a module is bad practice: users cannot substitute their own versions of the libraries packed into it, and it's often a frustrating ordeal to figure out why your version of a library is not taking effect.
For these reasons I think we should keep the uber jar as small as possible and include only the things we are specifically shading.
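To make that concrete, here is a sketch (not Hive's actual build configuration) of how the maven-shade-plugin can be restricted so the jar bundles and relocates only the libraries that need shading; Guava and the relocation pattern below are examples only, and everything not listed stays an ordinary external dependency:

```xml
<!-- Illustrative maven-shade-plugin configuration: bundle and relocate
     only the artifacts that must be shaded; all other dependencies
     remain normal external jars. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <artifactSet>
          <includes>
            <!-- Only the libraries being relocated go into the jar. -->
            <include>com.google.guava:guava</include>
          </includes>
        </artifactSet>
        <relocations>
          <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>org.apache.hive.com.google.common</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

With an include list like this, users can still bring their own versions of everything that is not relocated.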
It seems that in Pig they made the caching optional. Can we do that too, in case someone has issues with caching the jar in their user directory?
This is a good idea; I will update the patch to do this.
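For concreteness, a minimal sketch of such a toggle, assuming a Hadoop Configuration is in scope; the property name, method names, and cache layout here are illustrative, not what the patch actually does:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class JarCacheToggle {

  /** Illustrative property name; not an existing Hive setting. */
  static final String JAR_CACHE_ENABLED = "hive.exec.jar.cache.enabled";

  /**
   * Decide where the hive-exec jar is staged for a query. When caching is
   * disabled, fall back to the old behavior of shipping the jar each time.
   */
  static Path resolveExecJar(Configuration conf, Path execJar, Path userCacheDir) {
    if (conf.getBoolean(JAR_CACHE_ENABLED, true)) {
      // Reuse a previously uploaded copy under the user's cache directory.
      // A real implementation would key this on something stable, e.g. the
      // jar's checksum, and upload the jar only if it is missing.
      return new Path(userCacheDir, execJar.getName());
    }
    // Caching disabled: localize the jar fresh for every query.
    return execJar;
  }
}
```

Setting the flag to false would simply restore the per-query upload for anyone who has trouble with the user-directory cache.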
Finally, a thought on file formats: it would be nice to pull in their dependencies only when they are actually needed, not on every query. That way you're not penalized for adding as many as you want, and external SerDes can play too. We could extend the SerDe API with an optional call to retrieve additional jars to be localized.
I agree, this would be ideal, but I think it's future work. This change speeds up queries by reusing the majority of the old hive-exec jar's contents across queries, so I don't want to hold up "good" waiting for "best".
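For what it's worth, a minimal sketch of the optional SerDe hook proposed above; the interface and method names are hypothetical, not part of Hive's SerDe API:

```java
import java.util.List;

/**
 * Hypothetical extension point (not part of Hive today): a SerDe or file
 * format implementing this interface reports extra jars to ship to the
 * cluster, so those jars are localized only when the format is actually
 * used in a query rather than on every query.
 */
public interface LocalizableResources {

  /** URIs of additional jars to add to the job's distributed cache. */
  List<String> getJarsToLocalize();
}
```

At plan time, Hive could check whether a query's SerDes implement this interface and add the returned jars to the distributed cache alongside the cached hive-exec pieces.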