Spark / SPARK-25348

Data source for binary files


Details

    • Type: Story
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.0
    • Component/s: SQL
    • Labels: None

    Description

      It would be useful to have a data source implementation for binary files, which can be used to build features to load images, audio, and videos.

      Microsoft has an implementation at https://github.com/Azure/mmlspark/tree/master/src/io/binary. It would be great to merge it into the main Spark repository.

      cc: mhamilton and imatiach

      Proposed API:

      Format name: "binaryFile"

      Schema:

      • content: BinaryType
      • status (following Hadoop FileStatus):
        • path: StringType
        • modificationTime: TimestampType
        • length: LongType (size limit: 2 GB)

      Options:

      • pathGlobFilter: only include files with path matching the glob pattern
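
      To make the proposed row shape concrete, here is a minimal pure-Python sketch that emulates it outside Spark. It is not the proposed implementation: the function name `read_binary_files` is hypothetical, the glob is applied to the file name using Python's `fnmatch` rules rather than Hadoop's, and the 2 GB limit is read as 2^31 - 1 bytes. All three are assumptions.

```python
import fnmatch
import os
from datetime import datetime, timezone

# Assumption: "size limit 2GB" is taken to mean at most 2**31 - 1 bytes.
MAX_LENGTH = 2**31 - 1


def read_binary_files(directory, path_glob_filter="*"):
    """Return one record per matching file, shaped like the proposed schema.

    Hypothetical helper for illustration only; it is not part of Spark.
    """
    records = []
    for name in sorted(os.listdir(directory)):
        full_path = os.path.join(directory, name)
        if not os.path.isfile(full_path):
            continue
        # Assumption: the filter is matched against the file name.
        if not fnmatch.fnmatchcase(name, path_glob_filter):
            continue
        stat = os.stat(full_path)
        if stat.st_size > MAX_LENGTH:
            raise ValueError(f"{full_path} exceeds the assumed 2 GB limit")
        with open(full_path, "rb") as f:
            content = f.read()
        records.append({
            "content": content,  # BinaryType
            "status": {
                "path": full_path,  # StringType
                "modificationTime": datetime.fromtimestamp(
                    stat.st_mtime, tz=timezone.utc),  # TimestampType
                "length": stat.st_size,  # LongType
            },
        })
    return records
```

      Under the proposed API, the equivalent Spark read would presumably be expressed through the usual reader interface, with "binaryFile" as the format name and pathGlobFilter passed as an option.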

      Input partition size can be controlled by the common SQL confs spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes.
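
      As a rough illustration of how these two settings interact, the sketch below models the split-size rule and greedy packing that Spark's built-in file sources apply, treating each binary file as unsplittable. The function names are hypothetical, the defaults of 128 MB and 4 MB are the standard values for these confs, and the logic is an approximation for reasoning about partition counts, not the source's actual code.

```python
MB = 1024 * 1024


def max_split_bytes(file_lengths, default_parallelism,
                    max_partition_bytes=128 * MB,
                    open_cost_in_bytes=4 * MB):
    """Approximate target partition size for a set of files."""
    # Every file is charged its length plus a fixed open cost.
    total_bytes = sum(length + open_cost_in_bytes for length in file_lengths)
    bytes_per_core = total_bytes // default_parallelism
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))


def pack_files(file_lengths, split_bytes, open_cost_in_bytes=4 * MB):
    """Greedily pack whole files (largest first) into partitions."""
    partitions, current, current_size = [], [], 0
    for length in sorted(file_lengths, reverse=True):
        # Close the current partition if this file would overflow it.
        if current and current_size + length > split_bytes:
            partitions.append(current)
            current, current_size = [], 0
        current.append(length)
        current_size += length + open_cost_in_bytes
    if current:
        partitions.append(current)
    return partitions
```

      For example, eight 10 MB files on four cores give a 28 MB target under this model, so the files pack two per partition. Raising the open cost pushes many small files into fewer, larger partitions.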

            People

              Assignee: Weichen Xu (weichenxu123)
              Reporter: Xiangrui Meng (mengxr)