[FLINK-10289] Classify Exceptions to different category for apply different failover strategy - ASF JIRA

Details

Type: Sub-task
Status: Closed
Priority: Major
Resolution: Implemented
Affects Version/s: None
Fix Version/s: 1.7.0
Component/s: Runtime / Coordination
Labels:
- pull-request-available

Description

We need to classify exceptions and handle each category with a different failover strategy. To do this, we propose introducing the following throwable types and their corresponding exceptions:

NonRecoverable
- We should not retry if an exception is classified as NonRecoverable.
- For example, NoResourceAvailableException is a NonRecoverable exception.
- Introduce a new exception, UserCodeException, to wrap all exceptions thrown from user code.

PartitionDataMissingError
- In certain scenarios, producer data is transferred in blocking mode or saved in a persistent store. If the partition is missing, we need to revoke/rerun the producer task to regenerate the data.
- Introduce a new exception, PartitionDataMissingException, to wrap all issues of this kind.

EnvironmentError
- These errors are caused by hardware or software issues tied to a specific environment. The assumption is that a task will succeed if we run it in a different environment, and that other tasks run in the bad environment will very likely fail. If multiple tasks fail on the same machine due to EnvironmentError, we should consider adding that machine to a blacklist and avoid scheduling tasks on it.
- Introduce a new exception, EnvironmentException, to wrap all issues of this kind.
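
The blacklisting idea above could be sketched roughly as follows. This is only an illustration, not the implementation from the linked pull request: the class name `EnvironmentFailureTracker`, the per-host failure threshold, and the use of a host name string as the machine key are all assumptions made for this example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical helper: counts EnvironmentError failures per machine and
 * blacklists a machine once it reaches a configurable threshold.
 */
final class EnvironmentFailureTracker {
    private final int threshold;
    private final Map<String, Integer> failuresByHost = new HashMap<>();
    private final Set<String> blacklist = new HashSet<>();

    EnvironmentFailureTracker(int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be at least 1");
        }
        this.threshold = threshold;
    }

    /**
     * Records one environment-related task failure on the given host.
     * Returns true if the host is blacklisted after this failure.
     */
    boolean recordEnvironmentFailure(String host) {
        int count = failuresByHost.merge(host, 1, Integer::sum);
        if (count >= threshold) {
            blacklist.add(host);
        }
        return blacklist.contains(host);
    }

    /** A scheduler could consult this before placing a task on a host. */
    boolean isBlacklisted(String host) {
        return blacklist.contains(host);
    }
}
```

A threshold greater than one reflects the "multiple task failures" wording: a single failure is not treated as proof that the machine is bad.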

Recoverable
- We assume other issues are recoverable.
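
As a minimal sketch of how the four categories might drive failover decisions, the Java below classifies a throwable and maps the result to an action. The enum constants reuse the category names from this description. Everything else is an assumption for illustration: the exception classes are simplified stand-ins, `ThrowableClassifierSketch` and `FailoverAction` are hypothetical names, and the choice to walk the cause chain and to treat UserCodeException as NonRecoverable is one reading of the proposal, not a confirmed design.

```java
/** The four categories named in this issue. */
enum ThrowableType {
    NonRecoverable,
    PartitionDataMissingError,
    EnvironmentError,
    Recoverable
}

/** Hypothetical actions a failover strategy might take. */
enum FailoverAction {
    FAIL_JOB,
    RERUN_PRODUCER,
    RESTART_ON_OTHER_MACHINE,
    RESTART
}

/** Simplified stand-ins for the exceptions proposed above. */
class UserCodeException extends RuntimeException {
    UserCodeException(String message, Throwable cause) { super(message, cause); }
}

class PartitionDataMissingException extends RuntimeException {
    PartitionDataMissingException(String message) { super(message); }
}

class EnvironmentException extends RuntimeException {
    EnvironmentException(String message) { super(message); }
}

final class ThrowableClassifierSketch {
    private static final int MAX_CAUSE_DEPTH = 32; // guards against cyclic cause chains

    /** Returns the category of the first recognized exception in the cause chain. */
    static ThrowableType classify(Throwable throwable) {
        Throwable current = throwable;
        for (int depth = 0; current != null && depth < MAX_CAUSE_DEPTH; depth++) {
            if (current instanceof UserCodeException) {
                return ThrowableType.NonRecoverable; // assumption: user-code failures are not retried
            }
            if (current instanceof PartitionDataMissingException) {
                return ThrowableType.PartitionDataMissingError;
            }
            if (current instanceof EnvironmentException) {
                return ThrowableType.EnvironmentError;
            }
            current = current.getCause();
        }
        return ThrowableType.Recoverable; // everything else is assumed recoverable
    }

    /** Maps a category to a hypothetical failover action. */
    static FailoverAction actionFor(ThrowableType type) {
        switch (type) {
            case NonRecoverable:
                return FailoverAction.FAIL_JOB;
            case PartitionDataMissingError:
                return FailoverAction.RERUN_PRODUCER;
            case EnvironmentError:
                return FailoverAction.RESTART_ON_OTHER_MACHINE;
            default:
                return FailoverAction.RESTART;
        }
    }
}
```

For instance, `ThrowableClassifierSketch.classify(new RuntimeException(new EnvironmentException("disk full")))` yields `EnvironmentError` because the wrapped cause is inspected, while a plain `RuntimeException` falls through to `Recoverable`.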

Issue Links

relates to

FLINK-6227 Introduce the abstract PartitionException for downstream task failure

Closed

links to

GitHub Pull Request #6739

People

Assignee: JIN SUN
Reporter: JIN SUN
Votes: 2
Watchers: 7

Dates

Created: 06/Sep/18 07:21
Updated: 29/Mar/19 12:53
Resolved: 09/Oct/18 08:19