Apache NiFi / NIFI-399

Rename EvaluateRegularExpression to ExtractText and optimize

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.1.0
    • Component/s: Extensions
    • Labels:

      Description

      The EvaluateRegularExpression processor extracts text from flow file content using regular expressions. It currently stores only a single result per expression. It should be updated to support multiple capture groups per expression. The current behavior can remain the default, and in addition all matching groups 0..n can be stored, each with its index appended to the base name of the attribute.
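
A minimal sketch of the proposed attribute naming, using only java.util.regex. The class name CaptureGroupAttributes, the method extract, and the "." separator between base name and index are assumptions made for illustration; the ticket only says the index is appended to the base name.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; this is not the NiFi processor itself.
final class CaptureGroupAttributes {

    /**
     * Finds the first match of the pattern in the content.
     * Capture group 1 is stored under baseName (the existing default behavior).
     * Every group 0..n is also stored under baseName + "." + index
     * (the "." separator is an assumption).
     */
    static Map<String, String> extract(String baseName, Pattern pattern, String content) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher matcher = pattern.matcher(content);
        if (!matcher.find()) {
            return attributes; // no match, so no attributes are added
        }
        for (int i = 0; i <= matcher.groupCount(); i++) {
            String value = matcher.group(i);
            if (value == null) {
                continue; // optional group that did not take part in the match
            }
            attributes.put(baseName + "." + i, value);
            if (i == 1) {
                attributes.put(baseName, value); // keep the current default
            }
        }
        return attributes;
    }
}
```

With the pattern `(\w+)@(\w+)` and the base name `addr`, the content `user@example` would yield `addr.0`, `addr.1`, `addr.2`, and `addr`, where `addr` and `addr.1` hold the same value.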

      In addition, the name of this processor (and possibly its tags) needs to be updated. Because the processor is used to extract text from a given document, it should be named 'ExtractText'. We can deprecate the old processor in 0.1.0 and remove it in 0.2.0.

      In addition this processor should:

      • Precompile all patterns when the processor is scheduled to run.
      • Create memory buffers no larger than the minimum of the flow file content size and the specified maximum buffer size.
      • Support more than one capturing group. The default behavior of storing capture group 1 under the given name is good, but there is also benefit in supporting multiple capture groups in a single execution.
      • Allow the user to specify the maximum length of a capturing group value.
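
The optimizations in the list above could look roughly like the following sketch. It is plain Java with no NiFi dependencies. The names ExtractTextSketch, onScheduled, bufferSize, and extract are hypothetical, UTF-8 decoding is an assumption, and only capture group 1 is stored here to keep the example short.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the processor; names and structure are assumptions.
final class ExtractTextSketch {
    private final Map<String, Pattern> compiled = new LinkedHashMap<>();
    private final int maxBufferSize;    // upper bound on bytes read into memory
    private final int maxCaptureLength; // upper bound on stored characters per value

    ExtractTextSketch(int maxBufferSize, int maxCaptureLength) {
        this.maxBufferSize = maxBufferSize;
        this.maxCaptureLength = maxCaptureLength;
    }

    // Stand-in for the scheduling hook: compile each expression once, not per flow file.
    void onScheduled(Map<String, String> expressionsByAttributeName) {
        compiled.clear();
        for (Map.Entry<String, String> e : expressionsByAttributeName.entrySet()) {
            compiled.put(e.getKey(), Pattern.compile(e.getValue()));
        }
    }

    // The buffer never exceeds the content size or the configured maximum.
    static int bufferSize(long contentSize, int maxBufferSize) {
        return (int) Math.min(contentSize, (long) maxBufferSize);
    }

    // Stand-in for per-flow-file work: match against a bounded prefix of the content.
    Map<String, String> extract(byte[] content) {
        int size = bufferSize(content.length, maxBufferSize);
        // Assumes UTF-8; a multi-byte character cut at the boundary decodes to U+FFFD.
        String text = new String(content, 0, size, StandardCharsets.UTF_8);
        Map<String, String> attributes = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> entry : compiled.entrySet()) {
            Matcher matcher = entry.getValue().matcher(text);
            if (matcher.find() && matcher.groupCount() >= 1 && matcher.group(1) != null) {
                attributes.put(entry.getKey(), truncate(matcher.group(1)));
            }
        }
        return attributes;
    }

    // Enforce the maximum capture length on a stored value.
    private String truncate(String value) {
        return value.length() > maxCaptureLength ? value.substring(0, maxCaptureLength) : value;
    }
}
```

Compiling in the scheduling step means a malformed expression fails once, before any flow file is processed, and the bounded prefix keeps memory use predictable for large content.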

      This also creates the need for a StandardValidator that can build a validator performing a bounds check on a given DataSize.
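
One way such a bounds check might work is sketched below as standalone Java. It does not use or reproduce any NiFi API: the class DataSizeBoundsValidator, its methods, the accepted units (B, KB, MB, GB), and the 1024-based multipliers are all assumptions for illustration.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical validator; not the NiFi StandardValidator API.
final class DataSizeBoundsValidator {
    // Accepts values such as "512 B", "10KB", or "1 mb" (assumed syntax).
    private static final Pattern DATA_SIZE =
            Pattern.compile("(\\d+)\\s*(B|KB|MB|GB)", Pattern.CASE_INSENSITIVE);

    private final long minBytes;
    private final long maxBytes;

    DataSizeBoundsValidator(long minBytes, long maxBytes) {
        this.minBytes = minBytes;
        this.maxBytes = maxBytes;
    }

    // Converts a data size string to bytes, assuming 1024-based units.
    static long toBytes(String input) {
        Matcher m = DATA_SIZE.matcher(input.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a data size: " + input);
        }
        long value = Long.parseLong(m.group(1));
        switch (m.group(2).toUpperCase(Locale.ROOT)) {
            case "KB": return Math.multiplyExact(value, 1024L);
            case "MB": return Math.multiplyExact(value, 1024L * 1024L);
            case "GB": return Math.multiplyExact(value, 1024L * 1024L * 1024L);
            default:   return value; // plain bytes
        }
    }

    // True only when the input parses and falls within [minBytes, maxBytes].
    boolean isValid(String input) {
        if (input == null) {
            return false;
        }
        try {
            long bytes = toBytes(input);
            return bytes >= minBytes && bytes <= maxBytes;
        } catch (IllegalArgumentException | ArithmeticException e) {
            return false; // unparseable value or overflow
        }
    }
}
```

A validator built with bounds of 1 byte and 1 MB would accept "1 MB" and reject "2 MB", which is the kind of check a max buffer size property would need.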

        Attachments

        1. NIFI-399.patch
          68 kB
          Joe Witt

            People

            • Assignee:
              joewitt Joe Witt
              Reporter:
              joewitt Joe Witt