[SPARK-45138] Define a new error class and apply for the case where checkpointing state to DFS fails - ASF JIRA

Details

Type: Task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 4.0.0
Fix Version/s: 4.0.0
Component/s: Structured Streaming
Labels:
- pull-request-available

Description

[From Neil Ramaswamy]:

When an exception occurs while storing state to DFS (Hadoop, object store, etc.), we just propagate the exception without any context. The context is there if customers look into the stack trace, but that is definitely not something we want customers to have to dig into by themselves.

For example, let's say a file-system-related exception happens while checkpointing state. Since we just let the exception bubble up to the very top without adding any context, it is hard for customers to quickly identify whether the issue is with the source/sink (if the source/sink data source is file-based), the offset/commit log, or state load/checkpoint. Too many operations can throw the same exception.

This ticket aims to wrap the exception thrown during the commit of the state and assign a proper error class to it. With an error class assigned, we can classify the errors, which helps us determine which errors customers are struggling with the most.

StateStore.commit() is the entry point. Each state store provider has its own implementation, but they should all use the same error class, as we want to categorize these failures as the same. It would be even better to handle this at a higher level in the caller, but it would be OK to handle it in the built-in implementations.
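
To illustrate the idea, here is a minimal, language-neutral sketch in Python (not Spark's actual Scala code). The names `StateStoreCommitError`, `commit_with_context`, and the error class string `CANNOT_COMMIT_STATE_STORE` are hypothetical placeholders chosen for this example; the error class actually defined in the pull request may be named differently. The sketch only shows the pattern: wrap whatever the provider-specific commit throws, attach one shared error class, and keep the original exception as the cause.

```python
class StateStoreCommitError(Exception):
    """Hypothetical wrapper that adds an error class and provider context."""

    # Placeholder error class name; an assumption made for this sketch.
    ERROR_CLASS = "CANNOT_COMMIT_STATE_STORE"

    def __init__(self, provider_name, cause):
        super().__init__(
            f"[{self.ERROR_CLASS}] Failed to commit state for provider "
            f"{provider_name}: {cause}"
        )
        self.error_class = self.ERROR_CLASS
        self.provider_name = provider_name


def commit_with_context(provider_name, do_commit):
    """Run a provider-specific commit and wrap any failure in the shared error class."""
    try:
        return do_commit()
    except StateStoreCommitError:
        # Already classified by an inner layer; do not wrap it twice.
        raise
    except Exception as cause:
        # Keep the original exception reachable via __cause__.
        raise StateStoreCommitError(provider_name, cause) from cause


def failing_dfs_commit():
    # Stand-in for a file system failure while writing the state file.
    raise OSError("No space left on device")


def classify(provider_name, do_commit):
    """Return the error class of a failed commit, or None if it succeeded."""
    try:
        commit_with_context(provider_name, do_commit)
        return None
    except StateStoreCommitError as e:
        return e.error_class
```

Because every provider goes through the same wrapper, a failure in any implementation surfaces under the same error class, while the underlying file system exception stays available as the cause for debugging.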

Issue Links

links to

GitHub Pull Request #42895

People

Assignee: Neil Ramaswamy

Reporter: Raghu Angadi

Votes: 0

Watchers: 2

Dates

Created: 12/Sep/23 21:56

Updated: 21/Sep/23 03:24

Resolved: 21/Sep/23 03:01