I agree with Dmitriy that this will be very useful to most users at pretty minimal cost. My concern is the glob case, where we could be doing thousands of stats on the NameNode. I would suggest adding a cap on the number of directories it can read, and providing a variable users can set to raise that cap if they need to. For example, if a glob tried to access more than 100 directories, it would fail with a message like:
Error: PigStorage exceeded max number of input directories. To avoid this, you can turn off auto schema detection by setting what.ever.the.variable.is to false, or you can increase the maximum allowed directories by setting what.ever.that.variable.is (warning: this will increase the load on your NameNode).
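To make the proposal concrete, here is a rough sketch of the check I have in mind. It is plain Java with no Hadoop or Pig calls; the class name, method name, and default of 100 are all made up for illustration, and the property names are the same placeholders as in the message above. The idea is that the check runs on the already-expanded list of directories, before any per-directory stat for a schema file is issued.

```java
import java.util.List;

// Hypothetical helper, not existing Pig code: enforces a cap on how many
// input directories auto schema detection is allowed to look at.
class InputDirectoryCap {

    // Assumed default, taken from the "more than 100 directories" example.
    static final int DEFAULT_MAX_DIRS = 100;

    // Throws if the expanded glob covers more directories than the cap allows.
    // Intended to be called before any schema file lookups hit the NameNode.
    static void checkDirectoryCount(List<String> expandedDirs, int maxDirs) {
        if (expandedDirs.size() > maxDirs) {
            throw new IllegalStateException(
                "PigStorage exceeded max number of input directories ("
                + expandedDirs.size() + " > " + maxDirs + "). "
                + "To avoid this, you can turn off auto schema detection by setting "
                + "what.ever.the.variable.is to false, or you can increase the maximum "
                + "allowed directories by setting what.ever.that.variable.is "
                + "(warning: this will increase the load on your NameNode).");
        }
    }
}
```

Failing up front like this means a too-wide glob costs no extra NameNode calls at all, rather than failing partway through the stats.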
Olga, I don't understand your concern about backward compatibility. If the user has both a stored schema and an as clause, we try to massage the schema into the as clause. The only issue will be if they store the data with a schema and then give an as clause that is not compatible under our casting rules (e.g. the schema says a field is a long and they declare it as a string in the as clause). Do you think that case is common?