YARN-4696 contains my current logic to handle failures to parse things:
If the JSON parser fails then an info message is printed if we know the file is non-empty (i.e. either length > 0 or offset > 0).
I think there are some possible race conditions in the code as is; certainly FNFEs ought to be downgraded to info.
For other IOEs, I think they should be caught & logged per file, rather than stop the entire scan loop. Otherwise bad permissions on one file would be enough to break the scanning.
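A rough sketch of what I mean by per-file handling. This uses plain java.nio rather than the Hadoop FileSystem API, and the class name, method names and the trivial parse step are all hypothetical stand-ins, not the YARN-4696 code:

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

class ScanLoopSketch {
  private static final Logger LOG = Logger.getLogger("ScanLoopSketch");

  /** Hypothetical stand-in for the real per-file parse step: just reads the file. */
  static void parse(Path file) throws IOException {
    // Throws NoSuchFileException if the file vanished after it was listed.
    Files.readAllBytes(file);
  }

  /** Scan every file; a failure on one file never aborts the loop. Returns the files that parsed. */
  static List<Path> scan(List<Path> files) {
    List<Path> parsed = new ArrayList<>();
    for (Path file : files) {
      try {
        parse(file);
        parsed.add(file);
      } catch (FileNotFoundException | NoSuchFileException e) {
        // Deleted between listing and open: an expected race, so info only.
        LOG.info("File no longer exists, skipping: " + file);
      } catch (IOException e) {
        // Anything else (e.g. bad permissions): log for this file and carry on.
        LOG.warning("Failed to process " + file + ": " + e);
      }
    }
    return parsed;
  }
}
```

The point of the design is that the try/catch sits inside the loop body, so one unreadable file costs one log line rather than the whole scan.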
Regarding trying to work with Raw vs HDFS...I've not been able to get at raw, am trying to disable caching in file://, but am close to accepting defeat and spinning up a single mini yarn cluster across all my test cases. That or add a config option to turn off checksumming in localFS. The logic is there, but you can only set it in an FS instance which must be used directly or propagated to the code-under-test via the FS cache.
The local FS does work for picking up completed work; the problem is that because flush() doesn't actually flush, the reader doesn't reliably see the updates of incomplete jobs. And when it does, unless the JSON is aligned on a buffer boundary, the parser is going to fail, which is going to lead to lots and lots of info messages unless the logging is tuned further to only log if the last operation was not a failure.
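The "only log if the last operation was not a failure" tuning could look something like this. It is only a sketch: the class and method names are made up for illustration, and it tracks state per file name in memory, which may not be how the real scanner would want to hold it:

```java
import java.util.HashMap;
import java.util.Map;

class ParseFailureLogFilter {
  // Per-file flag: true if the most recent parse attempt on that file failed.
  private final Map<String, Boolean> lastFailed = new HashMap<>();

  /**
   * Record the outcome of a parse attempt.
   * Returns true only when this attempt failed and the previous one did not,
   * i.e. when the failure is new and worth an info message.
   */
  boolean recordAndShouldLog(String file, boolean failed) {
    Boolean previous = lastFailed.put(file, failed);
    boolean previouslyFailed = previous != null && previous;
    return failed && !previouslyFailed;
  }
}
```

A file that keeps failing on a partial JSON record then produces one message per run of failures instead of one per scan; a successful parse resets the state so the next failure is reported again.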
We only really need to worry about other cross-cluster filesystems for production use here. Single node with local FS? Use the 1.0 APIs. Production: a distributed FS, which is required to implement flush() (even a delayed/async flush) if you want to see incomplete applications. I believe GlusterFS supports that, as does any POSIX FS if the checksum FS doesn't get in the way. What does Jay Vyas have to say about his filesystem's consistency model?
It will mean that the object stores, S3 and Swift, can't work as destinations for logs. They are dangerous anyway: if the app crashes before out.close() is called, all data is lost. If we care about that, then you'd really want to write to an FS (local or HDFS) and then copy to the blobstore for long-term histories.