[KAFKA-10369] Introduce Distinct operation in KStream - ASF JIRA

Details

Type: Improvement
Status: In Progress
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: streams
Labels:
- kip

Description

Message deduplication is a common task.

One example: we might have multiple data sources, each reporting its state periodically at a relatively high frequency, and their current states should be stored in a database. If the state actually changes less often than it is reported, we might want to filter out duplicate messages with Kafka Streams in order to reduce the number of writes to the database.
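To make the use case concrete, here is a minimal plain-Java sketch of the filtering described above. It does not use Kafka Streams; the class name, the method name and the in-memory map (a stand-in for a state store) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StateChangeFilterSketch {
    // Keeps only the reports whose state differs from the last state
    // seen for the same source. Each entry is (sourceId, state).
    static List<Map.Entry<String, String>> changesOnly(List<Map.Entry<String, String>> reports) {
        // Hypothetical stand-in for a state store: last reported state per source.
        Map<String, String> lastState = new HashMap<>();
        List<Map.Entry<String, String>> writes = new ArrayList<>();
        for (Map.Entry<String, String> report : reports) {
            // put() returns the previous state, or null for a source seen for the first time.
            String previous = lastState.put(report.getKey(), report.getValue());
            if (!report.getValue().equals(previous)) {
                writes.add(report); // state changed, so this report would reach the database
            }
        }
        return writes;
    }
}
```

With five reports of which two repeat the previous state of their source, only three writes would remain.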

A 'distinct' operation is common in data processing, e.g.:

- Java Stream has the distinct() operation,
- SQL has the DISTINCT keyword.
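For reference, the Java Stream operation mentioned above behaves as in the following small sketch. The class name, method name and sample values are illustrative only.

```java
import java.util.List;
import java.util.stream.Collectors;

class StreamDistinctExample {
    // Removes duplicates, keeping the first occurrence of each element in encounter order.
    static List<String> distinctStates(List<String> states) {
        return states.stream().distinct().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [UP, DOWN]
        System.out.println(distinctStates(List.of("UP", "UP", "DOWN", "UP", "DOWN")));
    }
}
```

This works because the input is finite: distinct() has to remember every element it has already seen, which is not possible for an unbounded stream without some limit on how long elements are remembered.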

Hence it is natural to expect similar functionality from Kafka Streams.

Although Kafka Streams Tutorials contains an example of how distinct can be emulated, this example is complicated: it involves low-level coding with a local state store and a custom transformer. It would be much more convenient to have distinct as a first-class DSL operation.

Due to the 'infinite' nature of a KStream, the distinct operation should be windowed, similar to windowed joins and aggregations for KStreams.
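The following plain-Java sketch shows one possible reading of "windowed distinct": a record is forwarded only if no record with the same key was forwarded within the preceding window. This is an illustration under stated assumptions, not the semantics or API proposed in KIP-655. The class, its fields and the `accept` method are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```java
import java.util.HashMap;
import java.util.Map;

class WindowedDistinctSketch {
    private final long windowMs;
    // Hypothetical stand-in for a windowed state store:
    // timestamp at which each key was last forwarded.
    private final Map<String, Long> firstSeen = new HashMap<>();

    WindowedDistinctSketch(long windowMs) {
        this.windowMs = windowMs;
    }

    // Returns true if the record should be forwarded, false if it is a
    // duplicate of a record with the same key inside the current window.
    boolean accept(String key, long timestampMs) {
        // Forget keys whose window has expired, so the state stays bounded.
        // A full scan per record is for clarity only, not efficiency.
        firstSeen.values().removeIf(seenAt -> timestampMs - seenAt >= windowMs);
        // putIfAbsent returns null only when the key is not currently remembered.
        return firstSeen.putIfAbsent(key, timestampMs) == null;
    }
}
```

With a window of 1000 ms, a key forwarded at time 0 would be dropped again at time 500 and forwarded again at time 1000, because its entry has expired by then. Expiring entries is what keeps the remembered state finite on an unbounded stream.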

See KIP-655

Issue Links

links to

GitHub Pull Request #9210

People

Assignee: Ayoub Omari
Reporter: Ivan Ponomarev
Votes: 0
Watchers: 6

Dates

Created: 06/Aug/20 18:15
Updated: 13/May/24 20:52