Azure Blob Storage, AWS S3, and Google Cloud Storage all support retrieving byte ranges of stored objects. Current versions of NiFi processors for these services do not support fetching by byte range.
Supporting fetches by byte range would enable several enhancements:
- Parallelized downloads
- Faster transfers when a single connection cannot saturate the available bandwidth (for example, when throughput is limited by the connection's bandwidth-delay product)
- Load distribution across a cluster
- Cost savings:
  - If only part of a large file is needed, just that range can be downloaded, avoiding bandwidth charges for unnecessary bytes
  - A failed download would only need to retry the failed segment, rather than the full file
- Downloading extremely large files:
  - Files larger than the available content repository could be fetched one segment at a time, moving each segment off to a system with more capacity before downloading the next
Some of these enhancements would require an upstream processor to generate multiple FlowFiles, each covering a different part of the overall byte range. For example:
ListS3 -> ExecuteGroovyScript (to split into multiple FlowFiles with different range attributes) -> FetchS3Object
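The splitting step could be sketched as follows. This is plain Python rather than NiFi Groovy, and `segment_ranges` plus the 4 MB segment size are illustrative names and values, not part of any NiFi API; the idea is that each emitted FlowFile would carry one of these values in a range attribute, which the fetch processor would pass as an HTTP `Range` header:

```python
def segment_ranges(object_size: int, segment_size: int) -> list[str]:
    """Split an object of object_size bytes into HTTP Range header values.

    HTTP Range headers use inclusive byte offsets, so a 4-byte segment
    starting at offset 0 is expressed as "bytes=0-3".
    """
    if object_size <= 0 or segment_size <= 0:
        raise ValueError("object_size and segment_size must be positive")
    return [
        f"bytes={start}-{min(start + segment_size, object_size) - 1}"
        for start in range(0, object_size, segment_size)
    ]

# A 10 MB object split into 4 MB segments yields three ranges;
# the final segment is shorter than the others.
print(segment_ranges(10 * 1024 * 1024, 4 * 1024 * 1024))
```

The script in ExecuteGroovyScript would apply the same logic to the object size reported by ListS3, cloning the incoming FlowFile once per range.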