Uploaded image for project: 'Apache NiFi'
  1. Apache NiFi
  2. NIFI-9892

Update Azure storage related processors to align with NiFi best practices

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Patch Available
    • Major
    • Resolution: Unresolved
    • None
    • None
    • Extensions
    • None

    Description

      The Azure Storage related processors have been through many iterations and updates. At this point, they are not following best practices for NiFi processors and, as a result, make configuring the processors difficult for users. Specifically, I have found the following problems in using them:

      • Properties are ordered inconsistently. Some processors have Credentials Service as the first property, some as 4th property. The ordering of 'directory' vs 'blob name' vs 'file system' are random. Processors should take great care when ordering properties in order to expose the most important properties first, in order to make configuration as simple as possible for the user.
      • Default values are inconsistent. Some processors use a default value for "Blob", some don't. The same sort of inconsistency exists for many properties.
      • Default values are largely missing. It does not make sense to have a default value for the "Container" property for "ListAzureBlobStorage" but the default value should absolutely be set for "FetchAzureBlobStorage", as well as move, delete, put, etc.
      • Poor default values. Some property values do have defaults. But they default to values like "${azure.blob}" for the filename when the NiFI Core Attribute of "filename" makes more sense.
      • Inconsistent property names. Some properties use the property name "Blob" while others use "Blob Name" to mean the same thing. There may be other, similar examples.
      • List processors do not populate core attributes. Listing processors should always populate attributes such as "filename" and "path", rather than just more specific attributes such as "azure.blob"
      • Abstract processors exist and implement `getSupportedPropertyDescriptors()`.  This is an anti-pattern. While it makes sense in many cases to implement `Set<Relationship> getRelationships()`, it should NOT implement `List<PropertyDescriptor> getSupportedPropertyDescriptors()`. This is because the point of the abstract class is for others to extend it. Others that extend it will have more customized Lists of Property Descriptors. This often leads to a pattern of calling `super.getSupportedPropertyDescriptors()` and then adding specific properties to the end of the List. This is bad because, as noted above, properties should be ordered in a way that makes the most sense for that particular processor, adding the most important properties first and keeping like properties together.
      • Directory Name property is awkward and confusing. The Directory Name property is required for several properties. A value of "/" is invalid and no value can have a leading "/". To write to the root directory, the user must go in and click the checkbox for "Empty Value". The fact that this needed to be explicitly called out in the documentation is a key indicator that it violates user expectations. While the Azure client may not allow a leading `/`, the property should absolutely allow it. And the property should not be required, allowing an unset value to default to the root directory.
      • Code cleanup
        • Processors use old API for session read/write callbacks. Then create `AtomicReference<Exception>` to hold Exceptions so that they can be inspected later, etc. This can be cleaned up by using newer API methods that return `InputStream` / `OutputStream`.
        • Code should mark variables `final` when possible.
      • Integration tests no longer work
        • Some of the Integration tests do not work. Some work sometimes and fail intermittently. They need to either be fixed or deleted
      • FetchAzureDatalakeStorage emits both CONTENT_MODIFIED and FETCH provenance events. These two should not be emitted by the same processor, as FETCH implies that the content of the FlowFile was modified to match that of the data that was FETCHed.

      Attachments

        Issue Links

          Activity

            People

              markap14 Mark Payne
              markap14 Mark Payne
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 2h 10m
                  2h 10m