  1. Spark
  2. SPARK-25348

Data source for binary files

    Details

    • Type: Story
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.0
    • Component/s: SQL
    • Labels:
      None

      Description

      It would be useful to have a data source implementation for binary files, which can be used to build features to load images, audio, and videos.

      Microsoft has an implementation at https://github.com/Azure/mmlspark/tree/master/src/io/binary. It would be great if we could merge it into the main Spark repo.

      cc: Mark Hamilton and Ilya Matiach

      Proposed API:

      Format name: "binaryFile"

      Schema:

      • content: BinaryType
      • status (following Hadoop FileStatus):
        • path: StringType
        • modificationTime: TimestampType
        • length: LongType (size limit 2GB)

      Options:

      • pathGlobFilter: only include files with path matching the glob pattern
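
      A minimal PySpark sketch of how the proposed format name and option might be used from the DataFrame reader. The helper name load_binary_files, the default pattern "*.png", and the directory /tmp/images are hypothetical and chosen only for illustration; the column layout printed by printSchema depends on the Spark version installed.

```python
def load_binary_files(spark, directory, glob_pattern="*.png"):
    """Read files under `directory` whose paths match `glob_pattern`.

    `spark` is an existing SparkSession. The format name "binaryFile" and the
    option "pathGlobFilter" are the ones proposed in this issue; everything
    else here is an illustrative assumption.
    """
    return (
        spark.read.format("binaryFile")
        .option("pathGlobFilter", glob_pattern)
        .load(directory)
    )


def main():
    # Imported here so the helper above carries no import-time dependency on PySpark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("binary-file-demo").getOrCreate()
    df = load_binary_files(spark, "/tmp/images")  # hypothetical directory
    df.printSchema()  # shows the content column and the file metadata fields
    spark.stop()
```

      Calling main() requires a working PySpark installation; the helper itself only needs a session object passed in.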

      Input partition size can be controlled by the common SQL confs spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes.
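
      To make the effect of the two confs concrete, here is a simplified pure-Python model of how file-based sources in Spark appear to pack files into input partitions. It assumes each binary file is read whole (never split) and that files are packed largest-first; the function names, the parallelism parameter, and the defaults of 128 MB and 4 MB are stated here as assumptions rather than taken from this issue, and the real planner may differ in details.

```python
MB = 1024 * 1024


def max_split_bytes(file_sizes, max_partition_bytes, open_cost_in_bytes, parallelism):
    """Target partition size: capped by maxPartitionBytes, floored by openCostInBytes."""
    # Every file is charged its length plus a fixed "open cost".
    total_bytes = sum(size + open_cost_in_bytes for size in file_sizes)
    bytes_per_core = total_bytes // parallelism
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))


def pack_files(file_sizes, max_partition_bytes=128 * MB,
               open_cost_in_bytes=4 * MB, parallelism=8):
    """Group whole files into partitions, largest first (a sketch, not Spark's code)."""
    limit = max_split_bytes(file_sizes, max_partition_bytes,
                            open_cost_in_bytes, parallelism)
    partitions, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        # Close the current partition if adding this file would exceed the limit.
        if current and current_size + size > limit:
            partitions.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size + open_cost_in_bytes
    if current:
        partitions.append(current)
    return partitions
```

      In this model, a larger open cost makes each file "weigh" more, so fewer small files share a partition, while a smaller maxPartitionBytes lowers the cap on how much data one partition holds.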

              People

              • Assignee: Weichen Xu (weichenxu123)
              • Reporter: Xiangrui Meng (mengxr)