The EvaluateRegularExpression processor extracts text from data using regular expressions. It currently records only a single match result per expression. It should be updated to support multiple capture groups per expression. The current behavior can stay as the default, with the option of also including all matching groups 0..n, each stored with its index appended to the base name of the attribute.
In addition, the name of this processor (and possibly its tags) needs to be updated. The processor extracts text from a given document, so the name should be 'ExtractText'. We can deprecate the old processor in 0.1.0 and remove it in 0.2.0.
In addition, this processor should:
- Precompile all patterns when the processor is scheduled to run.
- Create memory buffers no larger than the smaller of the flow file content size and the specified maximum buffer size.
- Support more than one capturing group. The default behavior of storing capture group 1 under the given name is good, but there is also benefit in supporting multiple capture groups in a single execution.
- Allow the user to specify the maximum length of a capturing group value.
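The four requirements above could be sketched roughly as follows. This is a minimal illustration in plain Java using only `java.util.regex`, not the processor's actual code. The class and method names (`CaptureGroupSketch`, `precompile`, `extract`), the `.` separator between base name and group index, and the UTF-8 decoding are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; none of these names come from the processor itself.
final class CaptureGroupSketch {

    // Compile every configured expression once, e.g. when the processor is scheduled.
    static Map<String, Pattern> precompile(Map<String, String> regexByAttribute) {
        Map<String, Pattern> compiled = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : regexByAttribute.entrySet()) {
            compiled.put(e.getKey(), Pattern.compile(e.getValue()));
        }
        return compiled;
    }

    // Evaluate the precompiled patterns against at most maxBufferSize bytes of content.
    static Map<String, String> extract(byte[] content, int maxBufferSize,
                                       Map<String, Pattern> patterns, int maxGroupLength) {
        // The buffer never exceeds the smaller of the content size and the configured maximum.
        int bufferLength = Math.min(content.length, maxBufferSize);
        // Assumption: content is UTF-8. A multi-byte character cut at the buffer
        // boundary decodes to a replacement character.
        String text = new String(content, 0, bufferLength, StandardCharsets.UTF_8);

        Map<String, String> attributes = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> e : patterns.entrySet()) {
            String baseName = e.getKey();
            Matcher matcher = e.getValue().matcher(text);
            if (!matcher.find()) {
                continue; // no match: no attributes for this expression
            }
            for (int i = 0; i <= matcher.groupCount(); i++) {
                String value = matcher.group(i);
                if (value == null) {
                    continue; // optional group that did not participate in the match
                }
                if (value.length() > maxGroupLength) {
                    value = value.substring(0, maxGroupLength); // cap the group value length
                }
                attributes.put(baseName + "." + i, value); // groups 0..n, index appended
                if (i == 1) {
                    attributes.put(baseName, value); // default: group 1 under the given name
                }
            }
        }
        return attributes;
    }
}
```

With an expression `(\w+)@(\w+)\.com` registered under the name `user`, this sketch would produce `user`, `user.0`, `user.1`, and `user.2`. Whether truncation or rejection is the right response to an over-long group value is a design choice the ticket leaves open; truncation is used here only to keep the example short.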
This also creates the need for a StandardValidator that allows creation of a validator which performs a bounds check on a given DataSize.
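One possible shape for such a bounds check is sketched below in plain Java. It is illustrative only: `DataSizeBoundsSketch`, `toBytes`, and `boundsValidator` are hypothetical names, the accepted units (B, KB, MB, GB) and the 1024-based multipliers are assumptions, and the sketch returns a simple `Predicate<String>` rather than the framework's validator type.

```java
import java.util.Locale;
import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a data-size bounds check; not the framework's API.
final class DataSizeBoundsSketch {

    // Assumed input format: a whole number followed by a unit, e.g. "10 KB".
    private static final Pattern DATA_SIZE =
            Pattern.compile("(\\d+)\\s*(B|KB|MB|GB)", Pattern.CASE_INSENSITIVE);

    // Convert a data-size string to bytes (overflow of very large values is not handled).
    static long toBytes(String input) {
        Matcher m = DATA_SIZE.matcher(input.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a data size: " + input);
        }
        long count = Long.parseLong(m.group(1));
        switch (m.group(2).toUpperCase(Locale.ROOT)) {
            case "KB": return count * 1024L;
            case "MB": return count * 1024L * 1024L;
            case "GB": return count * 1024L * 1024L * 1024L;
            default:   return count; // plain bytes
        }
    }

    // Build a check that accepts only sizes within [minBytes, maxBytes], inclusive.
    static Predicate<String> boundsValidator(long minBytes, long maxBytes) {
        return input -> {
            if (input == null) {
                return false;
            }
            try {
                long bytes = toBytes(input);
                return bytes >= minBytes && bytes <= maxBytes;
            } catch (IllegalArgumentException e) {
                return false; // unparseable input is treated as invalid
            }
        };
    }
}
```

A maximum buffer size property could then be constrained with something like `boundsValidator(1, 1024L * 1024L)`, which would accept `"1 KB"` and reject `"2 MB"`.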
|Remove EvaluateRegularExpression (breaking change)||Resolved||Unassigned|