It would be useful to have a data source implementation for binary files, which can be used to build features to load images, audio, and videos.
Microsoft has an implementation at https://github.com/Azure/mmlspark/tree/master/src/io/binary. It would be great if we could merge it into the main Spark repo.
Format name: "binaryFile"
Schema:
- content: BinaryType
- status (following Hadoop FileStatus):
  - path: StringType
  - modificationTime: TimestampType
  - length: LongType (size limit 2GB)
Options:
- pathGlobFilter: only include files with path matching the glob pattern
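To make the proposed record shape concrete, here is a minimal sketch in plain Python (standard library only, no Spark involved) that produces records with the fields listed above and applies a glob filter. The names read_binary_files, BinaryFileRecord and FileStatus are hypothetical and exist only for this illustration. Matching the glob against the file name rather than the full path is an assumption of the sketch.

```python
import fnmatch
import os
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterator, Optional


@dataclass
class FileStatus:
    # Mirrors the proposed "status" struct (snake_case used for Python style).
    path: str                    # path: StringType
    modification_time: datetime  # modificationTime: timestamp
    length: int                  # length: LongType


@dataclass
class BinaryFileRecord:
    content: bytes               # content: BinaryType (whole file in memory)
    status: FileStatus


def read_binary_files(
    root: str, path_glob_filter: Optional[str] = None
) -> Iterator[BinaryFileRecord]:
    """Yield one record per file under `root` (hypothetical helper).

    Assumption: the glob is matched against the file name only.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if path_glob_filter is not None and not fnmatch.fnmatchcase(
                name, path_glob_filter
            ):
                continue  # skip files that do not match the glob
            full_path = os.path.join(dirpath, name)
            stat = os.stat(full_path)
            with open(full_path, "rb") as handle:
                content = handle.read()
            yield BinaryFileRecord(
                content=content,
                status=FileStatus(
                    path=full_path,
                    modification_time=datetime.fromtimestamp(
                        stat.st_mtime, tz=timezone.utc
                    ),
                    length=stat.st_size,
                ),
            )
```

Because each record carries the whole file as a single byte array, very large files are a poor fit for this shape, which is consistent with the 2GB size limit noted above.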
Input partition size can be controlled by the common SQL confs spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes.
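As a rough illustration of how these two confs interact, the following plain-Python sketch approximates the way Spark's file sources are generally understood to pick a target partition size and pack files into partitions. It is not the actual Spark code. The function names and the default_parallelism value are hypothetical, and the sketch assumes binary files are never split, so each file lands in exactly one partition.

```python
from typing import List

MB = 1024 * 1024


def max_split_bytes(
    file_lengths: List[int],
    max_partition_bytes: int = 128 * MB,  # default of maxPartitionBytes
    open_cost_in_bytes: int = 4 * MB,     # default of openCostInBytes
    default_parallelism: int = 8,         # hypothetical cluster parallelism
) -> int:
    """Approximate target size of one input partition."""
    # Every file is charged its length plus a fixed cost for opening it.
    total_bytes = sum(length + open_cost_in_bytes for length in file_lengths)
    bytes_per_core = total_bytes // default_parallelism
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))


def pack_files(
    file_lengths: List[int], max_split: int, open_cost_in_bytes: int
) -> List[List[int]]:
    """Greedily pack whole files into partitions, largest files first."""
    partitions: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for length in sorted(file_lengths, reverse=True):
        # Close the current partition if adding this file would overflow it.
        if current and current_size + length > max_split:
            partitions.append(current)
            current = []
            current_size = 0
        current.append(length)
        current_size += length + open_cost_in_bytes
    if current:
        partitions.append(current)
    return partitions
```

Under these assumptions, raising openCostInBytes makes many small files look more expensive and spreads them over more partitions, while maxPartitionBytes caps how much data is grouped together. A single file larger than the cap still ends up alone in its own partition.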