Apache NiFi / NIFI-7745

Add a SampleRecord processor


    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.13.0
    • Component/s: Extensions
    • Labels:
      None

      Description

      Sampling the records in a flowfile is a useful way to test with "real" data, especially when the source systems hold large datasets. The source system may not be able to sample its own data, and it may not be practical to test NiFi flows against smaller datasets from the source system(s). Sampling may already be possible in NiFi (for example, QueryRecord with row numbers), but it is likely done in memory (in the QueryRecord case) or in a simplistic fashion.

      This Jira proposes a SampleRecord processor that should offer at least the following sampling options:

      Interval Sampling (every Nth record)
      Probabilistic Sampling (each record has a probability P of being chosen)
      Reservoir Sampling (A sample of size K with each record having equal probability of being chosen)
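
      The three options above can be sketched as follows. This is a minimal illustration in Python, not the SampleRecord implementation: the function names, parameters, and the choice of which record starts an interval are all assumptions made for the example. The reservoir variant uses the classic "Algorithm R", which the issue does not specify; it is shown because it needs only K records in memory regardless of input size.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def interval_sample(records: Iterable[T], n: int) -> Iterator[T]:
    """Interval sampling: yield the n-th, 2n-th, 3n-th, ... record.

    Starting at the n-th record (rather than the first) is an assumption.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    for position, record in enumerate(records, start=1):
        if position % n == 0:
            yield record


def probabilistic_sample(
    records: Iterable[T], p: float, rng: Optional[random.Random] = None
) -> Iterator[T]:
    """Probabilistic sampling: keep each record independently with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if rng is None:
        rng = random.Random()
    for record in records:
        # rng.random() is in [0.0, 1.0), so p=1.0 keeps all and p=0.0 keeps none.
        if rng.random() < p:
            yield record


def reservoir_sample(
    records: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Reservoir sampling (Algorithm R): a uniform sample of up to k records.

    Only k records are held in memory, however long the input stream is.
    """
    if k < 0:
        raise ValueError("k must be >= 0")
    if rng is None:
        rng = random.Random()
    reservoir: List[T] = []
    for index, record in enumerate(records):
        if index < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Replace a random slot with probability k / (index + 1).
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < k:
                reservoir[slot] = record
    return reservoir
```

      All three functions read their input in a single pass, so they work on streams too large to hold in memory. Passing a seeded `random.Random` makes the probabilistic and reservoir samples repeatable, which can help when comparing test runs.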

            People

            • Assignee:
              mattyb149 Matt Burgess
              Reporter:
              mattyb149 Matt Burgess

                Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 2h 40m
