A few comments and questions on this:
1) We should make this work against the load/store branch instead of trunk. We're hoping to merge load/store into trunk in a week or two, so it makes more sense to put it there. This will also have implications for load/store. One, the overwrite option will need to be communicated to the new validate function so that it knows it's ok if the file (or whatever is being overwritten) exists. Two, store implementations will need to handle removing the file (or whatever) if necessary. For example, PigStorage will need to handle removing the file so MR doesn't complain that the output already exists.
2) Should overwrite be a keyword (as originally proposed and as implemented in the patch), or should it be a string, like the hints in join? I don't have a strong opinion one way or the other, but I think it's worth considering which we want.
3) Are the semantics of overwrite that it stores whether the file is there or not, or that it's an error if there is no existing file to overwrite? Writing whether the file is there or not makes more sense to me, but I wanted to make sure we all agree on it.
4) What happens when a user requests overwrite and the job fails before it runs? In the current implementation the file is removed up front, so any planning error still results in the file being removed. The file is also removed up front even if the job then sits in Hadoop's queue for a long time waiting to run. At the very least, I think Pig should delay removing the file until it is ready to launch the job, so that type checking errors or whatever don't result in the file being removed when the job is never run.
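
To make the ordering in 4) concrete, here is a rough sketch of what I have in mind. It is not taken from the patch or from the load/store branch: the class name, the Planner and Launcher interfaces, and the store method are all made up for illustration, and it uses a local file via java.nio.file rather than HDFS. It also assumes the "write whether the file is there or not" semantics from 3).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch only; none of these names come from Pig or the patch. */
final class DeferredOverwriteSketch {

    /** Stand-in for parsing, type checking, and planning; throws on any planning error. */
    interface Planner {
        void plan();
    }

    /** Stand-in for the step that submits the job and writes the output. */
    interface Launcher {
        void launch(Path output);
    }

    static void store(Path output, boolean overwrite, Planner planner, Launcher launcher) {
        // 1. Plan first. If this throws, the existing output has not been touched.
        planner.plan();

        // 2. Validate: an existing output is only an error when overwrite was not requested.
        if (Files.exists(output) && !overwrite) {
            throw new IllegalStateException("Output already exists: " + output);
        }

        // 3. Remove the old output only now, immediately before launch.
        //    A missing file is not an error (assumed semantics from point 3).
        //    This sketch handles a single file, not a directory of part files.
        if (overwrite) {
            try {
                Files.deleteIfExists(output);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        // 4. Launch the job.
        launcher.launch(output);
    }
}
```

This only covers the planning-error case. It does nothing about the job sitting in Hadoop's queue after the file has been removed; that would need the removal to happen even later, closer to when the output is actually written.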