A few thoughts:
In a job that is going to process a billion rows and run for three hours, one bad row should not cause the whole job to fail.
This invalid access should certainly cause a warning. Users can look at the warnings at the end of the query and decide whether to discard the output because of them. But failure should not be the default (see the previous point). Perhaps we should have a warnings-as-errors option, as compilers do, so that users who are very worried about warnings can make sure their jobs fail. But that's a different proposal for a different JIRA.
Third, doing further operations on these columns down the pipeline may produce unpredictable results in other operators.
I don't follow. Nulls in the pipeline shouldn't cause a problem. UDFs and operators need to be able to handle null values whether those nulls come from processing or from the data itself.
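To illustrate the point that operators downstream must already tolerate nulls, here is a minimal sketch of the null-handling discipline a UDF-style function should follow. This is a hypothetical standalone helper, not actual Pig API; a real Pig UDF would do the same check inside its exec() method:

```java
// Hypothetical sketch: a UDF-style helper that tolerates null input,
// propagating null instead of throwing and failing the whole job.
public class NullSafeUpper {
    public static String exec(String input) {
        if (input == null) {
            return null; // null in, null out -- downstream operators handle it
        }
        return input.toUpperCase();
    }
}
```

With this discipline, a null produced by a missing field behaves no differently from a null already present in the data.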
Second, it can't be assumed that the user wants those non-existent fields to be treated as null. If he wants them treated that way, he should implement the LoadFunc interface to do so.
One could argue that it can't be assumed the user wants his query to fail when a field is missing. We have to assume one way or the other, and null is a better assumption than failure, since a user who doesn't want that behavior can detect the nulls and deal with them. As it stands, the user has to modify his data or write a new load function that pads his data.
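To make the null-padding behavior concrete, here is a minimal sketch of what a tolerant parser would do with a short record. The class and method names are hypothetical, not Pig's actual LoadFunc API; a real load function would wrap this logic when building a tuple:

```java
import java.util.regex.Pattern;

// Hypothetical sketch of null-padding: split a delimited record into
// exactly `arity` fields, leaving missing trailing fields as null
// rather than rejecting the record.
public class TolerantParser {
    public static String[] parse(String line, String delim, int arity) {
        String[] out = new String[arity]; // unfilled slots remain null
        // Pattern.quote so delimiters like "|" are taken literally;
        // -1 keeps trailing empty fields.
        String[] parts = line.split(Pattern.quote(delim), -1);
        for (int i = 0; i < arity && i < parts.length; i++) {
            out[i] = parts[i];
        }
        return out;
    }
}
```

A record "a,b" parsed with arity 3 yields the fields a, b, and null, so the user can later filter or handle the nulls explicitly instead of the job failing.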
I agree with you that in the schema case, it would be ideal if a missing field were an error. However, given the architecture this is difficult, and stipulating that load functions test every record against the schema is too much of a performance penalty. But in the non-schema case I don't agree. Pig's philosophy of "Pigs eat anything" doesn't mean much if Pig gags as soon as it gets a record that doesn't match its expectations.