I want to clarify a bit more about what I think, and I really need your opinion on this. Temp file creation due to DataBag spill can happen in 2 places:
- In the Hadoop MapReduce execution engine
- In the local execution engine
I agree with you that the working dir mechanism in Hadoop is already good and that you're trying to adopt it, but what about the local execution engine?
I think most people pay more attention to the Hadoop backend, since that's where Pig started, but the local engine still has its uses.
A sample use case: I have a big data file on my hard disk (so it cannot be too big), and I just download Pig and quickly write a Pig script to process it on my local machine using the local execution engine (without running Hadoop).
A good local engine implementation will help improve the usability of Pig!
Can we handle this issue in 2 different ways, one for the Hadoop backend and one for the local engine? I'm willing to implement what I proposed in my last comment for the local engine.