[NIFI-7745] Add a SampleRecord processor - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.13.0
Component/s: Extensions
Labels: None

Description

Sampling records in a flowfile is a helpful way to test with "real" data, especially when the source systems contain large datasets. It may not be possible to sample the data on the source system, or to test NiFi flows against smaller datasets from the source system(s). Sampling in NiFi may already be possible (for example, QueryRecord with row numbers), but it is likely done in memory (in the QueryRecord case) or in a simplistic fashion.

This Jira proposes a SampleRecord processor that should offer at least the following sampling options:

Interval Sampling (every Nth record)
Probabilistic Sampling (each record has a probability P of being chosen)
Reservoir Sampling (a sample of size K, with each record having an equal probability of being chosen)
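
The three options listed above can be sketched in a few lines of Python. This is only an illustration of the sampling techniques, not the processor's implementation: the function names, the `rng` parameter, and the choice that interval sampling starts counting at the first record are all assumptions made for this example. The reservoir variant uses the classic "Algorithm R".

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def interval_sample(records: Iterable[T], n: int) -> Iterator[T]:
    """Yield every n-th record (the n-th, 2n-th, ...). Hypothetical helper."""
    if n < 1:
        raise ValueError("n must be >= 1")
    for position, record in enumerate(records, start=1):
        if position % n == 0:
            yield record


def probabilistic_sample(
    records: Iterable[T], p: float, rng: Optional[random.Random] = None
) -> Iterator[T]:
    """Yield each record independently with probability p. Hypothetical helper."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if rng is None:
        rng = random.Random()
    for record in records:
        # random() returns a float in [0.0, 1.0), so p=1.0 keeps everything
        # and p=0.0 keeps nothing.
        if rng.random() < p:
            yield record


def reservoir_sample(
    records: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return up to k records, each equally likely to be kept (Algorithm R)."""
    if k < 0:
        raise ValueError("k must be >= 0")
    if rng is None:
        rng = random.Random()
    reservoir: List[T] = []
    for index, record in enumerate(records):
        if index < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Replace a random slot with probability k / (index + 1).
            slot = rng.randint(0, index)  # both bounds inclusive
            if slot < k:
                reservoir[slot] = record
    return reservoir
```

In this sketch, interval and probabilistic sampling stream through the records one at a time, and reservoir sampling holds at most K records in memory, so none of the three needs the full dataset loaded at once.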

Issue Links

links to

GitHub Pull Request #4482

People

Assignee: Matt Burgess

Reporter: Matt Burgess

Votes: 0

Watchers: 3

Dates

Created: 15/Aug/20 17:04

Updated: 02/Sep/20 16:26

Resolved: 02/Sep/20 16:25

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 2h 40m