Thanks for doing this benchmark and providing the analysis of the factors that affect the overall performance. This is very useful.
One of the reasons for creating a new language for Pig was to enable users to express a more optimal query plan in the query itself. It lets you express the optimizations mentioned in 2, 3, 4 and 5 in the language itself. This is a very useful feature of Pig, because even if the optimizer is very good, there will be cases where it does the wrong thing. Also, it will be some time before a good cost-based optimizer is available for Pig.
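For example, here is a made-up sketch (the 'users' and 'clicks' relations are hypothetical, not from the benchmark) of a user applying a filter and a projection ahead of a join by hand, instead of relying on the optimizer to push them down:

    users  = LOAD 'users'  AS (uid:int, age:int, name:chararray);
    clicks = LOAD 'clicks' AS (uid:int, url:chararray);
    -- filter and project early, so less data flows into the join
    young  = FILTER users BY age < 30;
    slim   = FOREACH young GENERATE uid;
    joined = JOIN slim BY uid, clicks BY uid;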
As you mention, Pig currently has only a rule-based optimizer, which applies rules that should improve performance in almost all cases. The optimizations 2-5 that you mention fall into that category, so it makes sense to implement them as rules in Pig.
Regarding 5, the work on lazy deserialization done in PIG-2359 is going to be useful.
Regarding the optimization of a join followed by a group-by: even when the join and group keys are different, hash-based aggregation can be used to reduce the size of the output written to HDFS by the join MR job, by doing partial aggregation in the reduce phase.
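For example, a script shaped like this sketch (hypothetical relations again) joins on uid but groups on url, so the two steps cannot be collapsed into one job; partially aggregating the counts in the join job's reduce phase, inside Pig's execution plan rather than in the script, would shrink the intermediate output that the group-by job reads:

    joined = JOIN users BY uid, clicks BY uid;
    -- the group key (url) differs from the join key (uid)
    by_url = GROUP joined BY clicks::url;
    counts = FOREACH by_url GENERATE group, COUNT(joined);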
For the case where the join and the group-by use the same keys, having the Pig optimizer rewrite the query into a COGROUP operation might be the easiest thing to do.
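As a rough sketch of that rewrite (same hypothetical relations; a count over the join output is just one example aggregate):

    -- two MR jobs today: one for the join, one for the group-by
    joined  = JOIN users BY uid, clicks BY uid;
    grouped = GROUP joined BY users::uid;
    counts  = FOREACH grouped GENERATE group, COUNT(joined);

    -- one MR job: COGROUP already collects both inputs by uid
    cg      = COGROUP users BY uid, clicks BY uid;
    -- per key, the join output is the cross product of the two bags,
    -- so its count is the product of the bag sizes
    counts2 = FOREACH cg GENERATE group, COUNT(users) * COUNT(clicks);

One caveat: COGROUP keeps keys that appear in only one input (with an empty bag on the other side), so a FILTER on non-empty bags would be needed to match the inner-join semantics exactly.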
I don't think optimization tips 2 and 3 are in the Pig documentation; it makes sense to document them. I will open another JIRA to address that.
I am really looking forward to seeing the results with the Pig 0.10 branch (http://svn.apache.org/repos/asf/pig/branches/branch-0.10), with hash-based aggregation enabled.