Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-45138

Define a new error class and apply for the case where checkpointing state to DFS fails




      [From Neil Ramswamy]:

      When there is an exception during storing state from DFS (Hadoop, object store, etc), we just propagate the exception without the context. There is a context if customers look into stack trace, but that is definitely not something we want customers to look into by themselves.

      For example, let’s say file system related exception happened during checkpointing state. Since we just let the exception be bubbled up to the very top without adding any context, it is quite confusing for customers to quickly indicate whether there is an issue with source/sink (if source/sink data source is based on file), or offset/commit log, or state load/checkpoint. Too many operations can throw the same exception.

      The ticket aims to wrap the exception during the commit of the state, to assign error class properly. With assigning error class, we can classify the errors which help us to determine what errors customers are struggling much.

      StateStore.commit() is the entry point. Each state store provider has its own implementation, but use the same error class across implementations, as we want to categorize it as the same. Even better if we can put it to the higher-level of caller, but would be OK to handle it in built-in impls.


        Issue Links



              neilramaswamy Neil Ramaswamy
              rangadi Raghu Angadi
              0 Vote for this issue
              2 Start watching this issue