Description
One of my very complex queries takes about 14 minutes to compile with PPD on. Profiling it I saw a lot of time spent in this stack which is called many many thousands of times.
java.lang.Throwable.getStackTraceElement(-2) java.lang.Throwable.getOurStackTrace(827) java.lang.Throwable.getStackTrace(816) sun.reflect.GeneratedMethodAccessor5.invoke(-1) sun.reflect.DelegatingMethodAccessorImpl.invoke(43) java.lang.reflect.Method.invoke(497) org.apache.log4j.spi.LocationInfo.<init>(139) org.apache.log4j.spi.LoggingEvent.getLocationInformation(253) org.apache.log4j.helpers.PatternParser$LocationPatternConverter.convert(500) org.apache.log4j.helpers.PatternConverter.format(65) org.apache.log4j.PatternLayout.format(506) org.apache.log4j.WriterAppender.subAppend(310) org.apache.log4j.DailyRollingFileAppender.subAppend(369) org.apache.log4j.WriterAppender.append(162) org.apache.log4j.AppenderSkeleton.doAppend(251) org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(66) org.apache.log4j.Category.callAppenders(206) org.apache.log4j.Category.forcedLog(391) org.apache.log4j.Category.log(856) org.apache.commons.logging.impl.Log4JLogger.info(176) org.apache.hadoop.hive.ql.ppd.OpProcFactory$DefaultPPD.logExpr(707) org.apache.hadoop.hive.ql.ppd.OpProcFactory$DefaultPPD.mergeWithChildrenPred(752) org.apache.hadoop.hive.ql.ppd.OpProcFactory$FilterPPD.process(437)
logExpr is set to log at INFO level, but I think DEBUG is more appropriate. When I set log level to debug I see > 20% speedup in compile time:
Before:
real 14m47.972s user 15m25.609s sys 0m20.282s
After:
real 11m30.946s user 12m10.870s sys 0m7.320s
It looks like there's a lot of stuff in the PPD code that could be optimized, when I turn PPD off the query compiles in 2m 30s. But this seems like an easy and low risk win.