  Spark / SPARK-9947

Separate Metadata and State Checkpoint Data


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.4.1
    • Fix Version/s: None
    • Component/s: DStreams
    • Labels: None
    • Flags: Important

    Description

      Problem: When redeploying an application that has checkpointing enabled (to support updateStateByKey and 24/7 operation), you may want to keep the state data across restarts while deleting the metadata that captures the execution state.

      If checkpoint data survives a code redeployment, the program may not execute properly, or at all. My current workaround is to wrap updateStateByKey with my own function that persists the state after every update to a separate directory of my own. That allows me to delete the checkpoint, with its metadata, before redeploying; when I restart the application, I initialize the state from this persisted data. This incurs extra overhead because the same data is persisted twice: once in the checkpoint and once in my persisted data folder. A minimal sketch of that workaround is shown below.
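
      The sketch below illustrates the double-persistence workaround on a word-count style job. The HDFS paths, the socket source, and the "latest" snapshot location are placeholders rather than the actual application, and locating the most recent snapshot on restart is elided.

{code:scala}
// Minimal sketch of the workaround, not project code. Paths, the socket source,
// and the "latest" snapshot location are placeholders.
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

object SeparateStatePersistence {
  val checkpointDir = "hdfs:///app/checkpoint"       // metadata + state checkpoint, deleted on redeploy
  val stateDir      = "hdfs:///app/state-snapshots"  // separate state directory that survives redeploys

  def updateFunc(values: Seq[Long], state: Option[Long]): Option[Long] =
    Some(values.sum + state.getOrElse(0L))

  def createContext(): StreamingContext = {
    val ssc = new StreamingContext(new SparkConf().setAppName("stateful-app"), Seconds(10))
    ssc.checkpoint(checkpointDir)

    // Seed updateStateByKey from the last snapshot written to stateDir (placeholder path;
    // picking the most recent snapshot directory is elided here).
    val initialState: RDD[(String, Long)] =
      ssc.sparkContext.objectFile[(String, Long)](stateDir + "/latest")

    val words  = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
    val counts = words.map(w => (w, 1L)).updateStateByKey(
      updateFunc _,
      new HashPartitioner(ssc.sparkContext.defaultParallelism),
      initialState)

    // Persist the full state outside the checkpoint after every batch, so the
    // checkpoint directory can be deleted before redeploying without losing state.
    counts.foreachRDD { (rdd: RDD[(String, Long)], time: Time) =>
      rdd.saveAsObjectFile(s"$stateDir/${time.milliseconds}")
    }
    ssc
  }

  def main(args: Array[String]): Unit = {
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
{code}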

      If the Kafka direct API offsets could also be stored in a separate checkpoint directory, that would likewise avoid having to blow them away between code redeployments. A sketch of tracking those offsets outside the checkpoint follows.
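
      Below is a rough sketch of tracking direct-stream offsets outside the checkpoint with the spark-streaming-kafka 0.8 API available in Spark 1.4. The topic name and broker list are placeholders, and the println stands in for writing the ranges to a durable store that would be fed back in through the createDirectStream overload accepting fromOffsets after a redeploy.

{code:scala}
// Sketch only: record the offset ranges consumed by each batch so they can be
// persisted independently of the checkpoint directory.
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

object OffsetTracking {
  def trackOffsets(ssc: StreamingContext): Unit = {
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // placeholder brokers
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))  // placeholder topic

    stream.foreachRDD { rdd =>
      // Offsets consumed by this batch; persisting them separately means the checkpoint
      // directory can be deleted between deployments without losing the consumer position.
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      ranges.foreach(r => println(s"${r.topic}-${r.partition}: ${r.fromOffset} -> ${r.untilOffset}"))
    }
  }
}
{code}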

            People

              Assignee: Unassigned
              Reporter: Dan Dutrow (dutrow)
              Votes: 0
              Watchers: 4


              Time Tracking

                Original Estimate: 168h
                Remaining Estimate: 168h
                Time Spent: Not Specified