SPARK-20528

Add BinaryFileReader and Writer for DataFrames

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      It would be very useful to have a binary data reader/writer for DataFrames, presumably called via spark.read.binaryFiles, etc.

      Currently, going through RDDs is annoying since it requires different code paths for Scala vs Python:

      Scala:

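      // sc.binaryFiles yields (path, PortableDataStream) pairs; read each stream into an Array[Byte]
      val binaryFilesRDD = sc.binaryFiles("mypath").mapValues(_.toArray)
      val binaryFilesDF = spark.createDataFrame(binaryFilesRDD)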
      

      Python:

      # In PySpark, sc.binaryFiles yields (path, bytes) pairs; recast the content to bytearray
      binaryFilesRDD = sc.binaryFiles("mypath")
      binaryFilesRDD_recast = binaryFilesRDD.map(lambda x: (x[0], bytearray(x[1])))
      binaryFilesDF = spark.createDataFrame(binaryFilesRDD_recast)
      

      This is because sc.binaryFiles returns different types in Scala and Python: (path, PortableDataStream) pairs in Scala and (path, bytes) pairs in Python. That difference makes sense for RDDs but not for DataFrames.
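
      The Python recast can be pulled into a small named helper, which makes the type conversion explicit and easy to check without a cluster. This is only a sketch: to_binary_row and the column names "path" and "content" are hypothetical and not part of Spark. It relies on PySpark's schema inference mapping bytearray to BinaryType.

```python
def to_binary_row(pair):
    """Return (path, bytearray(content)) for one (path, content) pair.

    Hypothetical helper, not part of Spark. In PySpark, sc.binaryFiles yields
    the file content as bytes; wrapping it in bytearray is the same recast
    used in the Python snippet above.
    """
    path, content = pair
    return (path, bytearray(content))


# Assumed usage with an existing SparkContext `sc` and SparkSession `spark`
# (neither is created here; the column names are illustrative):
#   rows = sc.binaryFiles("mypath").map(to_binary_row)
#   df = spark.createDataFrame(rows, ["path", "content"])

# Local check on a made-up pair, no Spark required.
sample = to_binary_row(("file:/tmp/a.bin", b"\x00\x01\x02"))
```

      A named function rather than a lambda keeps the conversion reusable across jobs and testable on plain tuples.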

      My motivation here is working with images in Spark.

      Attachments

        1. part-00000-5ae00646-8400-4b45-aa6f-d6f27068972c-c000.json
          0.1 kB
          vishal kumar yadav
        2. stocklist.json
          0.1 kB
          vishal kumar yadav
        3. stocklist.pdub
          0.0 kB
          vishal kumar yadav

          People

            Assignee: Unassigned
            Reporter: Joseph K. Bradley (josephkb)
            Votes: 0
            Watchers: 3
