[FLUME-1030] Retry logic for failover sink processor to handle downstream exceptions in a predictable manner

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.2.0
Component/s: Sinks+Sources
Labels: None

Description

One may want to refer to FLUME-984 for some history of this.

As it stands, a sink can have several outcomes:

OK - successfully transferred some data
TRY_LATER - no data to transfer
throw EventDeliveryException - the sink is given a short breather to recover, then tried again
throw anything else - the exception gets logged and more or less ignored
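
To make the four outcomes concrete, here is a minimal sketch of how a runner might map each one to an action. Every type below (SketchStatus, SketchDeliveryException, SketchSink, RunnerAction, OutcomeClassifier) is a hypothetical stand-in written for illustration and is not taken from the Flume codebase.

```java
// Hypothetical stand-ins for the outcomes listed above; not Flume's real API.
enum SketchStatus { OK, TRY_LATER }

class SketchDeliveryException extends Exception {
    SketchDeliveryException(String message) { super(message); }
}

interface SketchSink {
    SketchStatus process() throws SketchDeliveryException;
}

// What the runner does next, one value per outcome in the list.
enum RunnerAction { POLL_AGAIN, IDLE_WAIT, SHORT_BACKOFF, LOG_AND_CONTINUE }

final class OutcomeClassifier {
    private OutcomeClassifier() {}

    static RunnerAction classify(SketchSink sink) {
        try {
            SketchStatus status = sink.process();
            // OK: data was moved, so poll again; TRY_LATER: nothing to move, so idle.
            return status == SketchStatus.OK ? RunnerAction.POLL_AGAIN : RunnerAction.IDLE_WAIT;
        } catch (SketchDeliveryException e) {
            // The one exception the contract knows about: give the sink a short breather.
            return RunnerAction.SHORT_BACKOFF;
        } catch (Exception e) {
            // Anything else is only logged, which is the case this issue objects to.
            return RunnerAction.LOG_AND_CONTINUE;
        }
    }
}
```

Note that the last branch cannot tell a brief timeout from a permanently broken sink, since both arrive as "some other exception".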

I don't think the last option in particular is a good idea, as it encourages throwing sink-specific exceptions. Further, there is no distinction between a temporary loss of connectivity (e.g. HBase timing out because of a compaction) and more permanent problems (e.g. being unable to write to a file).

One solution to this is to add a second type of exception that delivery mechanisms can throw: ConnectivityException, FatalException or something similar. For the purposes of any failover/load-balancing mechanism, this would signal that a component is out of order for a more significant amount of time and that constant polling should therefore stop (perhaps retrying every 5 minutes instead, or using an exponentially increasing retry time).
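
The exponentially increasing retry time mentioned above could look roughly like the following sketch. The class name FailoverBackoff, its methods, and the base delay are assumptions made for illustration; only the 5 minute figure comes from the description, where it is used here as a cap.

```java
// Hypothetical helper; a processor would keep one instance per sink.
final class FailoverBackoff {
    private final long baseDelayMillis;
    private final long maxDelayMillis;
    private int consecutiveFailures = 0;
    private long retryAtMillis = 0L;

    FailoverBackoff(long baseDelayMillis, long maxDelayMillis) {
        this.baseDelayMillis = baseDelayMillis;
        this.maxDelayMillis = maxDelayMillis;
    }

    // Called when the sink signals a longer-term failure.
    void recordFailure(long nowMillis) {
        consecutiveFailures++;
        // Double the delay per failure; the shift is bounded so that a
        // base delay below 2^32 ms cannot overflow a long.
        int shift = Math.min(consecutiveFailures - 1, 30);
        long delay = Math.min(baseDelayMillis << shift, maxDelayMillis);
        retryAtMillis = nowMillis + delay;
    }

    // A successful call clears the penalty entirely.
    void recordSuccess() {
        consecutiveFailures = 0;
        retryAtMillis = 0L;
    }

    // The processor skips the sink until the penalty has elapsed.
    boolean shouldTry(long nowMillis) {
        return nowMillis >= retryAtMillis;
    }

    long retryAtMillis() {
        return retryAtMillis;
    }
}
```

With a base of 1 second and a cap of 300000 ms (5 minutes), the delays run 1 s, 2 s, 4 s and so on until they reach the cap, so a dead sink ends up being polled every 5 minutes rather than constantly.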

If adding another exception is not deemed acceptable, there is always the possibility of expecting SinkProcessors to figure out whether a sink is dead, e.g. by counting sequential failures, though I do not think this is ideal. I would prefer to see a clear contract defined by SinkRunner that well-behaved sinks could adhere to, getting graceful handling of temporary and long-term failures in return.
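
For comparison, the counting alternative might be sketched as below. SequentialFailureCounter and its threshold are hypothetical names chosen for illustration, not part of any existing SinkProcessor.

```java
// Hypothetical per-sink counter a processor could use to guess that a sink is dead.
final class SequentialFailureCounter {
    private final int deadThreshold;
    private int sequentialFailures = 0;

    SequentialFailureCounter(int deadThreshold) {
        this.deadThreshold = deadThreshold;
    }

    void onFailure() {
        sequentialFailures++;
    }

    // Any success breaks the run of failures.
    void onSuccess() {
        sequentialFailures = 0;
    }

    // The sink is only presumed dead: the processor is inferring state
    // that the sink itself never reported.
    boolean isPresumedDead() {
        return sequentialFailures >= deadThreshold;
    }
}
```

The weakness is visible in the sketch: the threshold is a guess, and a sink that already knows it cannot recover still has to fail that many times before the processor reacts.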

If someone has other suggestions for distinguishing between temporary and longer-term failure, please let me know. As it stands, components that are unresponsive can and do get called constantly, and some components trigger retries that can block a SinkRunner thread for a fair while.

Attachments

FLUME-1030.2.patch, 23/Mar/12 02:14, 7 kB, Juhani Connolly
FLUME-1030.3.patch, 23/Mar/12 02:41, 7 kB, Juhani Connolly
FLUME-1030.4.patch, 23/Mar/12 05:30, 9 kB, Juhani Connolly

People

Assignee: Juhani Connolly

Reporter: Juhani Connolly

Votes: 0

Watchers: 0

Dates

Created: 15/Mar/12 00:29

Updated: 23/Mar/12 17:16

Resolved: 23/Mar/12 16:53