Details

    • Type: Sub-task
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.8.0
    • Fix Version/s: None
    • Component/s: timelineserver
    • Labels:
      None

      Description

      During one of my own timeline test runs, I've been seeing a stack trace warning that the CRC check failed in Filesystem.open() file; something the FS was ignoring.

      Even though its swallowed (and probably not the cause of my test failure), looking at the code in LogInfo.parsePath() that it considers a failure to open a file as unrecoverable.

      on some filesystems, this may not be the case, i.e. if its open for writing it may not be available for reading; checksums maybe a similar issue.

      Perhaps a failure at open() should be viewed as recoverable while the app is still running?

        Issue Links

          Activity

          Hide
          stevel@apache.org Steve Loughran added a comment -

          stack attached.

          Looking at the code, I'm not sure what the retry policy here should be. FNFE, auth failures? recoverable or not? it's hard to predict.

          2016-02-19 18:46:32,641 [EntityLogPluginWorker #1] WARN  org.apache.hadoop.fs.FSInputChecker (ChecksumFileSystem.java:<init>(157)) - Problem opening checksum file: file:/Users/stevel/spark-timeline-integration/target/tmp/ats/active/application_1111_0000/appattempt_1111_0000_000000/summarylog-appattempt_1111_0000_000000.  Ignoring exception: 
          java.io.EOFException
          	at java.io.DataInputStream.readFully(DataInputStream.java:197)
          	at java.io.DataInputStream.readFully(DataInputStream.java:169)
          	at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:149)
          	at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
          	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:775)
          	at org.apache.hadoop.yarn.server.timeline.LogInfo.parsePath(LogInfo.java:130)
          	at org.apache.hadoop.yarn.server.timeline.LogInfo.parseForStore(LogInfo.java:110)
          	at org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:584)
          	at org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:560)
          	at org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$ActiveLogParser.run(EntityGroupFSTimelineStore.java:747)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
          	at java.lang.Thread.run(Thread.java:745)
          2016-02-19 18:46:32,688 [EntityLogPluginWorker #1] INFO  org.apache.hadoop.yarn.server.timeline.LogInfo (LogInfo.java:parseForStore(112)) - Parsed 0 entities from file:/Users/stevel/Projects/spark-timeline-integration/target/tmp/ats/active/application_1111_0000/appattempt_1111_0000_000000/summarylog-appattempt_1111_0000_000000 in 50 msec
          2016-02-19 18:46:32,689 [EntityLogPluginWorker #1] DEBUG org.apache.hadoop.yarn.server.timel
          
          Show
          stevel@apache.org Steve Loughran added a comment - stack attached. Looking at the code, I'm not sure what the retry policy here should be. FNFE, auth failures? recoverable or not? it's hard to predict. 2016-02-19 18:46:32,641 [EntityLogPluginWorker #1] WARN org.apache.hadoop.fs.FSInputChecker (ChecksumFileSystem.java:<init>(157)) - Problem opening checksum file: file:/Users/stevel/spark-timeline-integration/target/tmp/ats/active/application_1111_0000/appattempt_1111_0000_000000/summarylog-appattempt_1111_0000_000000. Ignoring exception: java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:197) at java.io.DataInputStream.readFully(DataInputStream.java:169) at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:149) at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:775) at org.apache.hadoop.yarn.server.timeline.LogInfo.parsePath(LogInfo.java:130) at org.apache.hadoop.yarn.server.timeline.LogInfo.parseForStore(LogInfo.java:110) at org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:584) at org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:560) at org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$ActiveLogParser.run(EntityGroupFSTimelineStore.java:747) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang. Thread .run( Thread .java:745) 2016-02-19 18:46:32,688 [EntityLogPluginWorker #1] INFO org.apache.hadoop.yarn.server.timeline.LogInfo (LogInfo.java:parseForStore(112)) - Parsed 0 entities from file:/Users/stevel/Projects/spark-timeline-integration/target/tmp/ats/active/application_1111_0000/appattempt_1111_0000_000000/summarylog-appattempt_1111_0000_000000 in 50 msec 2016-02-19 18:46:32,689 [EntityLogPluginWorker #1] DEBUG org.apache.hadoop.yarn.server.timel
          Hide
          stevel@apache.org Steve Loughran added a comment -

          .. adding a check to skip 0-byte files stops the stack I've been seeing from coming back...

          Show
          stevel@apache.org Steve Loughran added a comment - .. adding a check to skip 0-byte files stops the stack I've been seeing from coming back...
          Hide
          jlowe Jason Lowe added a comment -

          ATS 1.5 relies on reading files that are actively being written, and there are inherent errors that can occur during those scenarios. I believe the error above is aborting the current attempt to read the summary file, but it should be trying again shortly later. For active applications it essentially is polling for more events and will keep trying to read both the summary file and detailed file(s), seeking past the amount of data successfully read so far.

          Show
          jlowe Jason Lowe added a comment - ATS 1.5 relies on reading files that are actively being written, and there are inherent errors that can occur during those scenarios. I believe the error above is aborting the current attempt to read the summary file, but it should be trying again shortly later. For active applications it essentially is polling for more events and will keep trying to read both the summary file and detailed file(s), seeking past the amount of data successfully read so far.
          Hide
          jlowe Jason Lowe added a comment -

          adding a check to skip 0-byte files stops the stack I've been seeing from coming back...

          The issue with checking the file size first is that the ATS could refuse to open a file that has data in it, adding extra delay between the application writing the data and the data appearing in the ATS. Since we're trying to read a file that is still being written the file sizes may not reflect reality, especially in file systems like HDFS where the file size is only updated when new blocks are allocated.

          Show
          jlowe Jason Lowe added a comment - adding a check to skip 0-byte files stops the stack I've been seeing from coming back... The issue with checking the file size first is that the ATS could refuse to open a file that has data in it, adding extra delay between the application writing the data and the data appearing in the ATS. Since we're trying to read a file that is still being written the file sizes may not reflect reality, especially in file systems like HDFS where the file size is only updated when new blocks are allocated.
          Hide
          gtCarrera9 Li Lu added a comment -

          Seems we need to somewhat handle this? This will happen when a scan happens after an empty file is created but nothing inside. Once we hit this problem we will throw this exception. To me preventing this from happening looks more appealing than having a WARN for this problem? One RPC call to check the file size shouldn't be a big problem in general.

          Show
          gtCarrera9 Li Lu added a comment - Seems we need to somewhat handle this? This will happen when a scan happens after an empty file is created but nothing inside. Once we hit this problem we will throw this exception. To me preventing this from happening looks more appealing than having a WARN for this problem? One RPC call to check the file size shouldn't be a big problem in general.
          Hide
          stevel@apache.org Steve Loughran added a comment -

          That's what confuses me. After a scan of an empty file/failed parse, it gets loaded again, next scan round? Or is it removed from the scan list?. Really a failure to parse the JSON or an empty file should be treated the same: try later if the file size increases (I found that to be a better metric when dealing with cached filesystem data in the spark history server). If the FS is buffering after a flush, then it may only save the last block, so the JSON won't parse fully at first —which is why that needs to retry too

          Show
          stevel@apache.org Steve Loughran added a comment - That's what confuses me. After a scan of an empty file/failed parse, it gets loaded again, next scan round? Or is it removed from the scan list?. Really a failure to parse the JSON or an empty file should be treated the same: try later if the file size increases (I found that to be a better metric when dealing with cached filesystem data in the spark history server). If the FS is buffering after a flush, then it may only save the last block, so the JSON won't parse fully at first —which is why that needs to retry too
          Hide
          jlowe Jason Lowe added a comment -

          One RPC call to check the file size shouldn't be a big problem in general.

          As I mentioned above, we cannot rely on the file size to be accurate. The file is being actively written, and there's no guarantee the file size will be updated in a timely manner after data is written. There can be data in the file for hours and the file size could still be zero. In HDFS it will only be updated when the next block is allocated, so it could sit at filesize of 0 for a very long time (depending upon how fast the writer is going) until the filesize suddenly jumps to the blocksize when the writer passes the first block boundary. The only real way to know how much data is in the file is to read it – we cannot rely on what the namenode reports.

          After a scan of an empty file/failed parse, it gets loaded again, next scan round? Or is it removed from the scan list?

          The file is always revisted, errors or not, on the next scan round as long as the application is active. It opens the file then seeks to the last successfully read byte offset and tries to read more. If data is successfully read then it updates the byte offset for the next round, rinse, repeat.

          Really a failure to parse the JSON or an empty file should be treated the same: try later if the file size increases

          Again, we cannot rely on the file size to be updated. To reduce load on the namenode, the writer is simply pushing the data out to the datanode – it's not also making an RPC call to the namenode to update the filesize. The only actors involved are the writer, the datanode, and the reader. The namenode is oblivious to what's happening until the next block is allocated, which could take a really long time if the writer is writing slowly. Note that for these files a slow writer is not a rare case, as it only writes when tasks change state.

          I agree we need to handle this better, probably by making the error a bit less scary in the log.

          Show
          jlowe Jason Lowe added a comment - One RPC call to check the file size shouldn't be a big problem in general. As I mentioned above, we cannot rely on the file size to be accurate. The file is being actively written, and there's no guarantee the file size will be updated in a timely manner after data is written. There can be data in the file for hours and the file size could still be zero. In HDFS it will only be updated when the next block is allocated, so it could sit at filesize of 0 for a very long time (depending upon how fast the writer is going) until the filesize suddenly jumps to the blocksize when the writer passes the first block boundary. The only real way to know how much data is in the file is to read it – we cannot rely on what the namenode reports. After a scan of an empty file/failed parse, it gets loaded again, next scan round? Or is it removed from the scan list? The file is always revisted, errors or not, on the next scan round as long as the application is active. It opens the file then seeks to the last successfully read byte offset and tries to read more. If data is successfully read then it updates the byte offset for the next round, rinse, repeat. Really a failure to parse the JSON or an empty file should be treated the same: try later if the file size increases Again, we cannot rely on the file size to be updated. To reduce load on the namenode, the writer is simply pushing the data out to the datanode – it's not also making an RPC call to the namenode to update the filesize. The only actors involved are the writer, the datanode, and the reader. The namenode is oblivious to what's happening until the next block is allocated, which could take a really long time if the writer is writing slowly. Note that for these files a slow writer is not a rare case, as it only writes when tasks change state. I agree we need to handle this better, probably by making the error a bit less scary in the log.
          Hide
          stevel@apache.org Steve Loughran added a comment -

          OK. so HDFS has guaranteed flush but no guarantees on modtime or size propagation; in contrast, the local file:// FS is consistent with FileStatus.length and actual length, but doesn't flush when told to, so can delay its writes until a CRC-worth of data has been written —and there is no obvious way to turn this off for testing via config files.

          On HDFS then: empty files length can't be interpreted as a reason to skip; so failures to read are an error. An attempt must be made to read it, but any EOFexception or similar is not a failure. That is: you can't skip on empty, just swallow the failure. Maybe at DEBUG list the exception and current file status value. or just attempt to read() byte 0 after opening file; an EOFException means "still empty"

          That essentially means that until such a switch is provided, you cannot use the localfs as a back end for ATS1.5 —even for testing. Or at least, you can write with it, but the data won't be guaranteed to be visible until close() is called. You may not get any view of incomplete apps —which is precisely what I'm seeing.

          If this is the case, then that's something ATS1.5 can't fix: it will have to be in the documentation.

          Show
          stevel@apache.org Steve Loughran added a comment - OK. so HDFS has guaranteed flush but no guarantees on modtime or size propagation; in contrast, the local file:// FS is consistent with FileStatus.length and actual length, but doesn't flush when told to, so can delay its writes until a CRC-worth of data has been written —and there is no obvious way to turn this off for testing via config files. On HDFS then: empty files length can't be interpreted as a reason to skip; so failures to read are an error. An attempt must be made to read it, but any EOFexception or similar is not a failure. That is: you can't skip on empty, just swallow the failure. Maybe at DEBUG list the exception and current file status value. or just attempt to read() byte 0 after opening file; an EOFException means "still empty" That essentially means that until such a switch is provided, you cannot use the localfs as a back end for ATS1.5 —even for testing. Or at least, you can write with it, but the data won't be guaranteed to be visible until close() is called. You may not get any view of incomplete apps —which is precisely what I'm seeing. If this is the case, then that's something ATS1.5 can't fix: it will have to be in the documentation.
          Hide
          jlowe Jason Lowe added a comment -

          Another workaround could be a scheme to get a raw local filesystem instead of one with a CRC and use that for the unit tests, e.g.: something like rawfile:// instead of file://. Then flushes would work as ATS1.5 expects.

          Agree that we could move the errors encountered during "live" file reads to debug to avoid scaring users and document that ATS1.5 relies on flush actually flushing which means it can't be used with the crc'd local filesystem.

          Show
          jlowe Jason Lowe added a comment - Another workaround could be a scheme to get a raw local filesystem instead of one with a CRC and use that for the unit tests, e.g.: something like rawfile:// instead of file:// . Then flushes would work as ATS1.5 expects. Agree that we could move the errors encountered during "live" file reads to debug to avoid scaring users and document that ATS1.5 relies on flush actually flushing which means it can't be used with the crc'd local filesystem.
          Hide
          stevel@apache.org Steve Loughran added a comment -

          yeah, thought of that... tried patch file:// but it didn't take. I'll delve some more. IMO we should consider something like rawfile:// everywhere, but the windows URI handling stuff may turn out to be extra fun there.

          Show
          stevel@apache.org Steve Loughran added a comment - yeah, thought of that... tried patch file:// but it didn't take. I'll delve some more. IMO we should consider something like rawfile:// everywhere, but the windows URI handling stuff may turn out to be extra fun there.
          Hide
          jlowe Jason Lowe added a comment -

          Agreed that rawfile:// could prove useful outside of this scenario and probably makes sense to have in common. Seems like it should be very straightforward to add the new scheme and direct it to RawLocalFileSystem, but I'm not familiar with how windows could mess with that.

          Show
          jlowe Jason Lowe added a comment - Agreed that rawfile:// could prove useful outside of this scenario and probably makes sense to have in common. Seems like it should be very straightforward to add the new scheme and direct it to RawLocalFileSystem, but I'm not familiar with how windows could mess with that.
          Hide
          gtCarrera9 Li Lu added a comment -

          Once the WARN log is gone I'm fine with it. One quick question is, are we planning to use rawfile to test ATS v1.5? Will there be any differences on the semantic side between raw file system and HDFS? I think it's fine if the test only relates to the upper level logic of ATS v1.5, but we may need to be careful to test the storage layer with a different FS?

          Show
          gtCarrera9 Li Lu added a comment - Once the WARN log is gone I'm fine with it. One quick question is, are we planning to use rawfile to test ATS v1.5? Will there be any differences on the semantic side between raw file system and HDFS? I think it's fine if the test only relates to the upper level logic of ATS v1.5, but we may need to be careful to test the storage layer with a different FS?
          Hide
          stevel@apache.org Steve Loughran added a comment -

          YARN-4696 contains my current logic to handle failures to parse things. :

          If the JSON parser fails then an info message is printed if we know the file is non-empty (i.e. either length>0 or offset > 0)

          I think there are some possible race conditions in the code as is, certainly FNFEs ought to downgrade to info,

          For other IOEs, I think they should be caught & logged per file, rather than stop the entire scan loop. Otherwise bad permissions on one file would be enough to break the scanning.

          Regarding trying to work with Raw vs HDFS...I've not been able to get at raw, am trying to disable caching in file://, but am close to accepting defeat and spinning up a single mini yarn cluster across all my test cases. That or add a config option to turn off checksumming in localFS. The logic is there, but you can only set it in an FS instance which must be used directly or propagated to the code-under-test via the FS cache.

          The local FS does work for picking up completed work; the problem is that as flush() doesn't, it doesn't reliably read the updates of incomplete jobs. And when it does, unless the JSON is aligned on a buffer boundary, the parser is going to fail, which is going to lead to lots and lots of info messages, unless the logging is tuned further to only log if the last operation was not a failure.

          We only need to really worry about other cross-cluster filesystems for production use here. Single node with local FS? Use the 1.0 APIs. Production: Distributed FS which is required to implement flush() (even a delayed/async flush) if you want to see incomplete applications. I believe GlusterFS supports that, as does any POSIX FS if the checksum FS doesn't get in the way. What does jay vyas have to say about his filesystem's consistency model?

          It will mean that the object stores, S3 and swift can't work as destinations for logs. They are dangerous anyway as if the app crashes before out.close() is called all data is lost. If we care about that, then you'd really want to write to an FS (local or HDFS) then copy to the blobstore for long-term histories.

          Show
          stevel@apache.org Steve Loughran added a comment - YARN-4696 contains my current logic to handle failures to parse things. : If the JSON parser fails then an info message is printed if we know the file is non-empty (i.e. either length>0 or offset > 0) I think there are some possible race conditions in the code as is, certainly FNFEs ought to downgrade to info, For other IOEs, I think they should be caught & logged per file, rather than stop the entire scan loop. Otherwise bad permissions on one file would be enough to break the scanning. Regarding trying to work with Raw vs HDFS...I've not been able to get at raw, am trying to disable caching in file:// , but am close to accepting defeat and spinning up a single mini yarn cluster across all my test cases. That or add a config option to turn off checksumming in localFS. The logic is there, but you can only set it in an FS instance which must be used directly or propagated to the code-under-test via the FS cache. The local FS does work for picking up completed work; the problem is that as flush() doesn't, it doesn't reliably read the updates of incomplete jobs. And when it does, unless the JSON is aligned on a buffer boundary, the parser is going to fail, which is going to lead to lots and lots of info messages, unless the logging is tuned further to only log if the last operation was not a failure. We only need to really worry about other cross-cluster filesystems for production use here. Single node with local FS? Use the 1.0 APIs. Production: Distributed FS which is required to implement flush() (even a delayed/async flush) if you want to see incomplete applications. I believe GlusterFS supports that, as does any POSIX FS if the checksum FS doesn't get in the way. What does jay vyas have to say about his filesystem's consistency model? It will mean that the object stores, S3 and swift can't work as destinations for logs. They are dangerous anyway as if the app crashes before out.close() is called all data is lost. If we care about that, then you'd really want to write to an FS (local or HDFS) then copy to the blobstore for long-term histories.
          Hide
          jayunit100 jay vyas added a comment -

          Ah the GlusterFS consistency model ? From my experience its not strongly consistent all the time in all cases. I'd cc Huamin Chen and @childsb as well on this ... they are currently working on these filesystems.

          Show
          jayunit100 jay vyas added a comment - Ah the GlusterFS consistency model ? From my experience its not strongly consistent all the time in all cases. I'd cc Huamin Chen and @childsb as well on this ... they are currently working on these filesystems.
          Hide
          stevel@apache.org Steve Loughran added a comment -

          All we need to know is "does flush() write data back so that other code can eventually see it?"

          Show
          stevel@apache.org Steve Loughran added a comment - All we need to know is "does flush() write data back so that other code can eventually see it?"

            People

            • Assignee:
              Unassigned
              Reporter:
              stevel@apache.org Steve Loughran
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:

                Development