Details
- Type: New Feature
- Status: Resolved
- Priority: Major
- Resolution: Won't Fix
- Affects Version/s: 2.2.0
- Fix Version/s: None
- Component/s: None
Description
It would be very useful to have a binary data reader/writer for DataFrames, presumably called via spark.read.binaryFiles, etc.
Currently, going through RDDs is annoying since it requires different code paths for Scala vs Python:
Scala:
val binaryFilesRDD = sc.binaryFiles("mypath")
// PortableDataStream is not a supported DataFrame type, so materialize the bytes first
val binaryFilesDF = spark.createDataFrame(binaryFilesRDD.map { case (path, stream) => (path, stream.toArray) })
Python:
binaryFilesRDD = sc.binaryFiles("mypath")
binaryFilesRDD_recast = binaryFilesRDD.map(lambda x: (x[0], bytearray(x[1])))
binaryFilesDF = spark.createDataFrame(binaryFilesRDD_recast)
This is because Scala's sc.binaryFiles returns RDD[(String, PortableDataStream)] while Python's returns (path, bytes) pairs. That difference makes sense in RDD land, but in DataFrame land it forces divergent conversion code for the same task.
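The recast step in the Python snippet exists only so that DataFrame type inference maps the content column to BinaryType: wrapping each raw byte string in a bytearray gives createDataFrame a type it recognizes. A minimal pure-Python sketch of that per-record conversion (the sample path and bytes below are made up for illustration):

```python
def recast(record):
    """Mirror of the lambda in the workaround: keep the path,
    wrap the raw content bytes in a bytearray."""
    path, content = record
    return (path, bytearray(content))

# Hypothetical records of the shape sc.binaryFiles yields: (path, content)
records = [("file:/mypath/a.bin", b"\x00\x01\x02")]

converted = [recast(r) for r in records]
# converted[0] is ("file:/mypath/a.bin", bytearray(b"\x00\x01\x02"))
```

A built-in spark.read.binaryFiles would hide exactly this kind of per-language conversion boilerplate.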
My motivation here is working with images in Spark.