Affects Version/s: 3.0.1
Fix Version/s: None
Component/s: Spark Core
We have run into an issue where our history server fails to load new applications and, when restarted, fails to load any applications at all. This happens when it encounters invalid rolling event log files, which we see with long-running streaming applications. There seem to be two issues here that lead to the problem:
- It looks like the event log directory of our long-running streaming applications is being cleaned up. The next time the application logs event data, it recreates the event log directory but does not recreate the "appstatus" file. I don't know the full extent of this behavior or whether something "wrong" is happening here.
- The history server then reads this new folder and throws an exception because the "appstatus" file doesn't exist in the rolling event log folder. This exception breaks the entire listing process, so no new applications are read, and if the history server is restarted, no applications are read at all.
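The second point can be reproduced in miniature without Spark. The following is only a sketch under stated assumptions: `eventLogSize`, `listApps`, and the file names `appstatus` and `events_1` are hypothetical stand-ins, and `java.io.File` replaces the Hadoop file system API that the real history server uses. It shows how a single `require` failure inside a listing filter aborts the scan for every application, including the valid ones.

```scala
import java.io.File
import java.nio.file.Files
import scala.util.Try

object ListingFailureSketch {
  // Hypothetical stand-in for a rolling event log reader. Like the reader in
  // the stack trace, it uses `require` to insist on an appstatus file.
  def eventLogSize(logDir: File): Long = {
    val files = Option(logDir.listFiles()).getOrElse(Array.empty[File])
    require(files.exists(_.getName.startsWith("appstatus")),
      "Log directory must contain an appstatus file!")
    files.filter(_.getName.startsWith("events")).map(_.length).sum
  }

  // Naive listing: the exception from one invalid directory escapes the
  // filter, so no directory is reported at all.
  def listApps(root: File): Seq[String] =
    Option(root.listFiles()).getOrElse(Array.empty[File]).toSeq
      .filter(_.isDirectory)
      .filter(dir => eventLogSize(dir) > 0)
      .map(_.getName)
      .sorted

  // Creates a fake rolling event log directory, optionally without appstatus.
  def makeApp(root: File, name: String, withAppStatus: Boolean): Unit = {
    val dir = new File(root, name)
    dir.mkdirs()
    Files.write(new File(dir, "events_1").toPath, "event data".getBytes("UTF-8"))
    if (withAppStatus) new File(dir, "appstatus").createNewFile()
  }

  val root: File = Files.createTempDirectory("eventlogs").toFile
  makeApp(root, "app-good", withAppStatus = true)
  makeApp(root, "app-recreated", withAppStatus = false)

  // Captures the failure instead of letting it terminate the program.
  val outcome: Try[Seq[String]] = Try(listApps(root))
}
```

Here `ListingFailureSketch.outcome` is a failure carrying an `IllegalArgumentException`, even though `app-good` is a perfectly valid directory.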
There seem to be a couple of ways to go about fixing this, and I'm curious to hear the thoughts of anyone who knows more about how the history server works, specifically with rolling event logs:
- Don't completely fail the check for new applications if one bad rolling event log folder is encountered. This seems like the simplest fix and makes sense to me, since the check already ignores a few other errors. It doesn't necessarily fix the underlying issue that leads to this happening, though.
- Figure out why the in-progress event log folder is being deleted and make sure that doesn't happen. Maybe this is supposed to happen? Or maybe we don't want to delete the top-level folder and should only delete event log files within the folders? Again, I don't know the exact current behavior here.
- When writing new event log data, make sure the folder and appstatus file exist every time, creating them again if not.
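As a rough illustration of the first and third options, here is a minimal sketch, not Spark's actual implementation. All names (`eventLogSize`, `listApps`, `appendEvent`, the `appstatus` and `events_1` file names) are hypothetical, and `java.io.File` stands in for the Hadoop file system API. `listApps` evaluates each directory in isolation and skips the ones that fail the appstatus check, while `appendEvent` recreates the folder and the appstatus file before every write.

```scala
import java.io.File
import java.nio.file.{Files, StandardOpenOption}
import scala.util.{Failure, Success, Try}

object TolerantListingSketch {
  // Hypothetical reader that mirrors the `require` seen in the stack trace.
  def eventLogSize(logDir: File): Long = {
    val files = Option(logDir.listFiles()).getOrElse(Array.empty[File])
    require(files.exists(_.getName.startsWith("appstatus")),
      "Log directory must contain an appstatus file!")
    files.filter(_.getName.startsWith("events")).map(_.length).sum
  }

  // First option: a bad directory is logged and skipped, not fatal.
  def listApps(root: File): Seq[String] =
    Option(root.listFiles()).getOrElse(Array.empty[File]).toSeq
      .filter(_.isDirectory)
      .flatMap { dir =>
        Try(eventLogSize(dir)) match {
          case Success(size) if size > 0 => Some(dir.getName)
          case Success(_) => None
          case Failure(e: IllegalArgumentException) =>
            System.err.println(s"Skipping ${dir.getName}: ${e.getMessage}")
            None
          case Failure(e) => throw e // unrelated errors still propagate
        }
      }
      .sorted

  // Third option: make sure the folder and appstatus file exist on each write.
  def appendEvent(logDir: File, event: String): Unit = {
    logDir.mkdirs()
    new File(logDir, "appstatus").createNewFile() // returns false if present
    Files.write(new File(logDir, "events_1").toPath,
      (event + "\n").getBytes("UTF-8"),
      StandardOpenOption.CREATE, StandardOpenOption.APPEND)
  }

  // Demo data: one valid directory, one recreated without appstatus.
  val root: File = Files.createTempDirectory("eventlogs").toFile
  val good = new File(root, "app-good")
  val recreated = new File(root, "app-recreated")
  appendEvent(good, "event data")
  recreated.mkdirs()
  Files.write(new File(recreated, "events_1").toPath, "event data".getBytes("UTF-8"))

  val before: Seq[String] = listApps(root) // bad directory is skipped
  appendEvent(recreated, "more event data") // the next write repairs it
  val after: Seq[String] = listApps(root)
}
```

In this sketch `before` contains only `app-good`, and `after` contains both applications once the next write has restored the appstatus file. Only the appstatus failure is swallowed; whether other exception types should also be tolerated is a separate design question.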
Here's the stack trace we encounter when this happens, from 3.0.1 with a couple extra MRs backported that I hoped would fix the issue:
2020-10-13 12:10:31,751 ERROR history.FsHistoryProvider: Exception in checking for event log updates
java.lang.IllegalArgumentException: requirement failed: Log directory must contain an appstatus file!
    at scala.Predef$.require(Predef.scala:281)
    at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.files$lzycompute(EventLogFileReaders.scala:214)
    at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.files(EventLogFileReaders.scala:211)
    at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.eventLogFiles$lzycompute(EventLogFileReaders.scala:221)
    at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.eventLogFiles(EventLogFileReaders.scala:220)
    at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.lastEventLogFile(EventLogFileReaders.scala:272)
    at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.fileSizeForLastIndex(EventLogFileReaders.scala:240)
    at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7(FsHistoryProvider.scala:524)
    at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7$adapted(FsHistoryProvider.scala:466)
    at scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:256)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:255)
    at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:249)
    at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
    at scala.collection.TraversableLike.filter(TraversableLike.scala:347)
    at scala.collection.TraversableLike.filter$(TraversableLike.scala:347)
    at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
    at org.apache.spark.deploy.history.FsHistoryProvider.checkForLogs(FsHistoryProvider.scala:466)
    at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$startPolling$3(FsHistoryProvider.scala:287)
    at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1302)
    at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$getRunner$1(FsHistoryProvider.scala:210)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)