Not too many code paths.
Sure there are. Both Pig and HBase are replicating the behavior of ToolRunner's libjars argument for including jars with a job. They do so in slightly different ways, which leaves us with three different code paths. I'd prefer to consolidate on a single code path.
Filters out Pig and Hadoop classes from the list of classes so that the Pig and Hadoop jars are not included.
We can add a method, something like addHBaseDependencyJars(Job), which will add only HBase and its dependency jars (currently: ZooKeeper, protobuf, Guava), nothing else. That way, we're not including any redundant Pig or Hadoop jars, and HBase is managing its own dependencies (meaning Pig won't have to change every time we change something). This is effectively the same as doing what you say above, "Also ensure that you add HTable.class apart from Zookeeper, inputformat, input/output key/value, partitioner and combiner classes," that is, omitting the inputformat, key, value, partitioner, and combiner classes. Does that sound like it'll accomplish what this filter intends?
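To make the proposal concrete, here's a rough sketch of what an addHBaseDependencyJars-style helper could do: resolve the jar containing each dependency class and merge those jars into the job's tmpjars list. The findContainingJar helper and the Map standing in for Hadoop's Configuration are illustrative assumptions so the sketch runs standalone; the real method would take a Job and call job.getConfiguration().

```java
import java.security.CodeSource;
import java.util.*;

public class DependencyJars {
    // Sketch of jar resolution, in the spirit of ToolRunner's libjars
    // handling: find the jar a class was loaded from. Returns null for
    // classes on the bootstrap classpath (e.g. java.lang.String).
    static String findContainingJar(Class<?> clazz) {
        CodeSource src = clazz.getProtectionDomain().getCodeSource();
        return src == null ? null : src.getLocation().toString();
    }

    // Hypothetical addHBaseDependencyJars body: add only the jars of the
    // given dependency classes (e.g. HBase, ZooKeeper, protobuf, Guava)
    // to "tmpjars", preserving and deduplicating existing entries. A Map
    // stands in for Hadoop's Configuration so this compiles without Hadoop.
    static void addDependencyJars(Map<String, String> conf, Class<?>... classes) {
        Set<String> jars = new LinkedHashSet<>();
        String existing = conf.get("tmpjars");
        if (existing != null && !existing.isEmpty()) {
            jars.addAll(Arrays.asList(existing.split(",")));
        }
        for (Class<?> c : classes) {
            String jar = findContainingJar(c);
            if (jar != null) {
                jars.add(jar);
            }
        }
        conf.put("tmpjars", String.join(",", jars));
    }
}
```

Because the dependency list lives inside this one method, callers like Pig never need to know which jars HBase ships.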
Find the jars for the other classes, filter out any jars already present in PigContext.extrajars, and add only the rest to tmpjars.
How do we access the PigContext? Is it in the jobConf or some such? I'd rather not put Pig-specific code in the bowels of HBase mapreduce code; my preference is to build generic APIs that can be used across the board.
HBase APIs are designed to assist people writing raw MR jobs against HBase (i.e., including key/value classes, input/output format classes, etc.). The slightly different requirements of Pig and Hive need to be addressed as well.