Description
It would be useful to have a data source implementation for binary files, which can be used to build features to load images, audio, and videos.
Microsoft has an implementation at https://github.com/Azure/mmlspark/tree/master/src/io/binary. It would be great if we could merge it into the main Spark repo.
Proposed API:
Format name: "binaryFile"
Schema:
- content: BinaryType
- status (following Hadoop FileStatus):
  - path: StringType
  - modificationTime: TimestampType
  - length: LongType (size limit 2GB)
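To make the proposed row shape concrete, here is a minimal pure-Python sketch of one record for a local file. It does not use Spark and is not the proposed implementation. The helper name `binary_file_row`, the nested dict layout, and reading "2GB" as 2^31 - 1 bytes are assumptions made for illustration.

```python
import os
from datetime import datetime, timezone

# Assumed interpretation of the "2GB" limit mentioned in the schema.
MAX_CONTENT_BYTES = 2**31 - 1


def binary_file_row(path):
    """Return a dict mirroring the proposed schema for one local file (hypothetical helper)."""
    st = os.stat(path)
    if st.st_size > MAX_CONTENT_BYTES:
        raise ValueError(f"{path} is {st.st_size} bytes, over the assumed 2GB limit")
    with open(path, "rb") as f:
        content = f.read()
    return {
        "content": content,  # BinaryType
        "status": {
            "path": os.path.abspath(path),  # StringType
            # Timestamp of the last modification, in UTC
            "modificationTime": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),
            "length": st.st_size,  # LongType
        },
    }
```

Calling `binary_file_row("some/file.png")` would return the raw bytes together with the file's path, modification time, and length.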
Options:
- pathGlobFilter: only include files with path matching the glob pattern
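Assuming the format name and option land as proposed above, reading a directory from PySpark might look roughly like the sketch below. The wrapper `read_binary_files` and its parameters are hypothetical. Only `format`, `option`, and `load` are existing `DataFrameReader` calls.

```python
def read_binary_files(spark, path, glob_filter=None):
    """Load binary files through the proposed "binaryFile" format.

    `spark` is an existing SparkSession. `path` and `glob_filter` are
    illustrative parameters, not part of the proposal itself.
    """
    reader = spark.read.format("binaryFile")
    if glob_filter is not None:
        # Proposed option: only include files whose path matches the glob.
        reader = reader.option("pathGlobFilter", glob_filter)
    return reader.load(path)
```

For example, `read_binary_files(spark, "/data/images", "*.png")` would be expected to yield one row per matching PNG file, with the columns listed under Schema.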
Input partition size can be controlled by the common SQL confs spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes.
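As a rough illustration of how these two confs interact, the sketch below packs whole (non-splittable) files into partitions. It is a simplified model, not Spark's actual code: the function name `pack_files` is hypothetical, and the real planner also takes available parallelism into account. The default values are Spark's defaults of 128MB and 4MB.

```python
def pack_files(file_sizes,
               max_partition_bytes=128 * 1024 * 1024,
               open_cost_in_bytes=4 * 1024 * 1024):
    """Greedily group file sizes (in bytes) into partitions (simplified model)."""
    partitions = []
    current = []
    current_size = 0
    # Largest files first, so big files do not end up stacked at the end.
    for size in sorted(file_sizes, reverse=True):
        # Close the current partition if this file would overflow it.
        if current and current_size + size > max_partition_bytes:
            partitions.append(current)
            current, current_size = [], 0
        current.append(size)
        # Each file is charged its length plus a fixed cost for opening it.
        current_size += size + open_cost_in_bytes
    if current:
        partitions.append(current)
    return partitions
```

In this model, raising the open cost puts fewer small files into each partition, and raising the maximum partition size puts more.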
Issue Links
- blocks
  - SPARK-27534 Do not load `content` column in binary data source if it is not selected (Resolved)
  - SPARK-27472 Document binary file data source in Spark user guide (Resolved)
  - SPARK-27473 Support filter push down for status fields in binary file data source (Resolved)
- relates to
  - SPARK-22666 Spark datasource for image format (Resolved)
- links to