[NIFI-6047] Add DetectDuplicateRecord Processor - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.16.0
Component/s: Core Framework
Labels:
- features

Flags: Important

Description

Add a new standard NiFi processor to supplement the DetectDuplicate processor. The difference is that this one works at the record level.

DetectDuplicateRecord

Caches records from each incoming FlowFile and determines whether each record has already been seen. The names of the user-defined properties determine the RecordPath values used to decide whether a record is unique. If no user-defined properties are present, the entire record is used as the input to determine uniqueness. Duplicate records are routed to 'duplicate'; all other records are routed to 'non-duplicate'.
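The routing described above can be sketched roughly as follows. This is a minimal illustration, not the processor's code: records are modelled as plain dictionaries, top-level field names stand in for RecordPath expressions, and `record_key` and `split_duplicates` are hypothetical helper names.

```python
import hashlib
import json


def record_key(record, key_fields):
    """Build a stable digest from the chosen fields, or from the whole record."""
    if key_fields:
        # Assumption: plain field names stand in for RecordPath expressions.
        selected = [record.get(name) for name in key_fields]
    else:
        # No key fields configured: the entire record decides uniqueness.
        selected = record
    canonical = json.dumps(selected, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def split_duplicates(records, key_fields=()):
    """Return (non_duplicates, duplicates), preserving input order."""
    seen = set()
    non_duplicates, duplicates = [], []
    for record in records:
        key = record_key(record, key_fields)
        if key in seen:
            duplicates.append(record)      # would go to 'duplicate'
        else:
            seen.add(key)
            non_duplicates.append(record)  # would go to 'non-duplicate'
    return non_duplicates, duplicates
```

Calling `split_duplicates(records, ["id"])` treats two records with the same `id` as duplicates even if other fields differ, while `split_duplicates(records)` compares entire records.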

This processor offers two filtering data structures, chosen according to the precision required and the number of records to be processed:

- A HashSet filter type guarantees 100% duplicate detection at the expense of storing one hash per record.
- A BloomFilter filter type uses constant space in exchange for probabilistic guarantees. This is useful when processing an extremely large number of records and some false positives are acceptable (i.e. some records may be marked as duplicate even though they have not been seen before).
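To make the trade-off concrete, here is a small Bloom filter sketch. It is an illustration under stated assumptions rather than the processor's implementation: the `BloomFilter` class, its parameters, and the SHA-256 double-hashing scheme are choices made for this example, and the sizing uses the standard Bloom filter formulas.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size probabilistic set: no false negatives, some false positives."""

    def __init__(self, expected_items, false_positive_rate):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2.
        self.num_bits = max(1, math.ceil(
            -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key):
        # Derive k bit positions from one digest via double hashing.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        """Insert key; return True if it may have been seen before."""
        maybe_seen = True
        for pos in self._positions(key):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                maybe_seen = False
                self.bits[byte] |= 1 << bit
        return maybe_seen
```

The bit array is allocated once and never grows, whereas an exact set of hashes grows with every new record. A `True` result from `add` for a key that was never inserted is the false positive mentioned above; a key that was inserted always yields `True` on later calls.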

Issue Links

Dependency

NIFI-6166 Add `Distributed HashSet Filter` Type to DetectDuplicateRecord Processor

Open

is duplicated by

NIFI-6014 Create a record-oriented version of DetectDuplicate

Resolved

links to

GitHub Pull Request #3317

GitHub Pull Request #4646

People

Assignee: Adam Fisher

Reporter: Adam Fisher

Votes: 1

Watchers: 6

Dates

Created: 16/Feb/19 23:48

Updated: 10/Mar/22 00:09

Resolved: 10/Mar/22 00:09

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 19h