From a first look at the code, I guess this is what happens.
The data is stored successfully on disk the first time you call store, so POStore adds an entry to materializedResults.
This is basically a hashmap that maps an OperatorKey (just a name) to a LocalResult (a pointer to the file you just wrote).
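To make the caching step above concrete, here is a minimal sketch of that mapping. `OperatorKey` and `LocalResult` are simplified stand-ins for the real Pig classes (the real ones carry more fields), just to show what POStore records and what a second store looks up:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the materializedResults cache described above.
public class MaterializedResultsSketch {

    // simplified stand-in for org.apache.pig.impl.plan.OperatorKey
    record OperatorKey(String scope, long id) {}

    // simplified stand-in for Pig's LocalResult: a pointer to the stored file
    record LocalResult(String outFileSpec) {}

    static final Map<OperatorKey, LocalResult> materializedResults = new HashMap<>();

    // what POStore conceptually does after a successful store
    static void recordStore(OperatorKey key, String path) {
        materializedResults.put(key, new LocalResult(path));
    }

    // what a second store on the same alias conceptually checks first
    static LocalResult lookup(OperatorKey key) {
        return materializedResults.get(key);
    }

    public static void main(String[] args) {
        OperatorKey key = new OperatorKey("scope", 1L);
        recordStore(key, "/tmp/testPigOutput");
        // the entry is found, so Pig tries to reuse the stored file
        System.out.println(lookup(key).outFileSpec());
    }
}
```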
If you now trigger store again for the same alias, Pig tries to optimize performance by reusing the output file you just stored.
It first checks whether there is already a materializedResults entry.
There is one, so in theory the result could be reused: just read it back and write it out again to the new path.
Now there are a couple of problems. First, in your test case you delete the output file (/tmp/testPigOutput), but Pig tries to read this file back in order to write it out again, which means you read from and write to the same file at the same time. Another problem: in your test you delete this file between the store calls, so it can't be read back at all.
Here the actual Pig problem comes in: Pig tries to read this file back using the same FuncSpec object you used for storing it.
So the class needs to implement both LoadFunc and StoreFunc, which is not the case in your test; you only implement StoreFunc, which makes sense from my point of view. See POLoad, line 57:
lf = (LoadFunc) PigContext.instantiateFuncFromSpec(fileSpec.getFuncSpec()); // the instantiated object may be a StoreFunc only, so this cast can fail
This has worked so far because most LoadFunc and StoreFunc implementations live in one class, but relying on that is not a good idea.
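The failure mode above can be reproduced in isolation. This is a hedged, self-contained sketch, not Pig's real code: `LoadFunc` and `StoreFunc` are stand-in interfaces, and `StoreOnlyFunc` plays the role of the test's store-only UDF. Instantiating the class by name (as `PigContext.instantiateFuncFromSpec` does via reflection) and then casting to `LoadFunc`, like POLoad does, throws a `ClassCastException`:

```java
// Minimal illustration of the cast in POLoad failing for a store-only func.
public class CastFailureSketch {

    // simplified stand-ins for Pig's LoadFunc / StoreFunc interfaces
    interface LoadFunc { }
    interface StoreFunc { }

    // like the UDF in the test: implements StoreFunc but NOT LoadFunc
    public static class StoreOnlyFunc implements StoreFunc { }

    public static void main(String[] args) throws Exception {
        // instantiate by class name, roughly what instantiateFuncFromSpec does
        Object func = Class.forName("CastFailureSketch$StoreOnlyFunc")
                           .getDeclaredConstructor().newInstance();
        try {
            LoadFunc lf = (LoadFunc) func; // the cast from POLoad, line 57
            System.out.println("cast succeeded");
        } catch (ClassCastException e) {
            System.out.println("store-only func cannot be read back: " + e.getMessage());
        }
    }
}
```

So any StoreFunc that does not also implement LoadFunc breaks the materialized-result reuse path, exactly as seen in the test.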
So now the question to the Pig developers: how can we solve this problem?
Only cache materialized files when both a LoadFunc and a StoreFunc are available?
Or re-process all required plans when a materialized result cannot be loaded?