Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.3.0
    • Component/s: ML
    • Labels:

      Description

      Background and motivation

      As Apache Spark is being used more and more in the industry, some new use cases are emerging for different data formats beyond the traditional SQL types or the numerical types (vectors and matrices). Deep Learning applications commonly deal with image processing. A number of projects add some Deep Learning capabilities to Spark (see list below), but they struggle to communicate with each other or with MLlib pipelines because there is no standard way to represent an image in Spark DataFrames. We propose to federate efforts for representing images in Spark by defining a representation that caters to the most common needs of users and library developers.

      This SPIP proposes a specification to represent images in Spark DataFrames and Datasets (based on existing industrial standards), and an interface for loading sources of images. It is not meant to be a full-fledged image processing library, but rather the core description that other libraries and users can rely on. Several packages already offer various processing facilities for transforming images or doing more complex operations, and each has various design tradeoffs that make them better as standalone solutions.

      This project is a joint collaboration between Microsoft and Databricks, which have been testing this design in two open source packages: MMLSpark and Deep Learning Pipelines.

      The proposed image format is an in-memory, decompressed representation that targets low-level applications. It uses significantly more memory than compressed representations such as JPEG or PNG, but it allows easy communication with popular image processing libraries and incurs no decoding overhead.

      Target users and personas:

      Data scientists, data engineers, library developers.
      The following libraries define primitives for loading and representing images, and will gain from a common interchange format (in alphabetical order):

      • BigDL
      • DeepLearning4J
      • Deep Learning Pipelines
      • MMLSpark
      • TensorFlow (Spark connector)
      • TensorFlowOnSpark
      • TensorFrames
      • Thunder

      Goals:

      • Simple representation of images in Spark DataFrames, based on pre-existing industrial standards (OpenCV)
      • This format should eventually allow the development of high-performance integration points with image processing libraries such as libOpenCV, Google TensorFlow, CNTK, and other C libraries.
      • The reader should be able to read popular formats of images from distributed sources.

      Non-Goals:

      Images are a versatile medium and encompass a very wide range of formats and representations. This SPIP explicitly aims at the most common use case in the industry currently: multi-channel matrices of binary, int32, int64, float or double data that can fit comfortably in the heap of the JVM:

      • the total size of an image should be restricted to less than 2GB (roughly)
      • the meaning of color channels is application-specific and is not mandated by the standard (in line with the OpenCV standard)
      • specialized formats used in meteorology, the medical field, etc. are not supported
      • this format is specialized to images and does not attempt to solve the more general problem of representing n-dimensional tensors in Spark

      Proposed API changes

      We propose to add a new package in the package structure, under the MLlib project:
      org.apache.spark.image

      Data format

      We propose to add the following structure:

      imageSchema = StructType([

      • StructField("mode", StringType(), False),
        • The exact representation of the data.
        • The values follow the OpenCV type convention (see the guide linked below). The type encodes both the depth and the number of channels: for example, "CV_8UC3" means three channels of unsigned bytes, and a BGRA image would be "CV_8UC4" (value 32 in the OpenCV type table), with the channel order fixed by convention.
        • The exact channel ordering and meaning of each channel are dictated by convention. By default, the order is RGB (3 channels) and BGRA (4 channels).
          If the image fails to load, the value is the empty string "".
      • StructField("origin", StringType(), True),
        • Some information about the origin of the image. The content of this is application-specific.
        • When the image is loaded from files, users should expect to find the file name in this field.
      • StructField("height", IntegerType(), False),
        • the height of the image, in pixels
        • If the image fails to load, the value is -1.
      • StructField("width", IntegerType(), False),
        • the width of the image, in pixels
        • If the image fails to load, the value is -1.
      • StructField("nChannels", IntegerType(), False),
        • The number of channels in this image: it is typically a value of 1 (B&W), 3 (RGB), or 4 (BGRA)
        • If the image fails to load, the value is -1.
      • StructField("data", BinaryType(), False)
        • Packed array content. Due to an implementation limitation, it cannot currently store more than 2 billion pixels.
        • The data is stored in a pixel-by-pixel BGR row-wise order. This follows the OpenCV convention.
        • If the image fails to load, this array is empty.
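
      For convenience, the same structure can be written out as a single Spark SQL schema. The sketch below is an illustrative transcription of the field list above into Scala (the list in this SPIP uses the Python StructType syntax); the exact fields and nullability could still change during implementation:

      import org.apache.spark.sql.types._

      // Sketch: the proposed image schema, transcribed from the field list above.
      val imageSchema: StructType = StructType(Seq(
        StructField("mode", StringType, nullable = false),       // OpenCV type string, e.g. "CV_8UC3"
        StructField("origin", StringType, nullable = true),      // application-specific origin, e.g. the file name
        StructField("height", IntegerType, nullable = false),    // -1 if the image failed to load
        StructField("width", IntegerType, nullable = false),     // -1 if the image failed to load
        StructField("nChannels", IntegerType, nullable = false), // typically 1, 3, or 4
        StructField("data", BinaryType, nullable = false)        // packed pixel data, row-major BGR(A)
      ))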

      For more information about image types, here is an OpenCV guide on types: http://docs.opencv.org/2.4/modules/core/doc/intro.html#fixed-pixel-types-limited-use-of-templates
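
      To make the packed layout concrete, here is a small sketch of how a consumer could index into the data array. It assumes 8-bit unsigned channels (e.g. CV_8UC3 or CV_8UC4) and the row-major, channel-interleaved ordering described above; it is illustrative only:

      // Sketch only: offset of channel c of pixel (x, y) in the packed,
      // row-major, channel-interleaved data array (8-bit channels assumed).
      def pixelOffset(width: Int, nChannels: Int, x: Int, y: Int, c: Int): Int =
        (y * width + x) * nChannels + c

      // Example: the green channel (index 1 in BGR order) of pixel (10, 20)
      // in a 640x480 CV_8UC3 image would be data(pixelOffset(640, 3, 10, 20, 1)).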

      The reference implementation provides some functions to convert popular formats (JPEG, PNG, etc.) to the image specification above, and some functions to verify if an image is valid.
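
      As a hedged illustration (not the reference implementation's code), a minimal conversion from compressed bytes to a row of this schema could look like the sketch below. It uses javax.imageio, assumes 8-bit channels, follows the BGR(A) packing described above, and ignores grayscale and other color models for brevity:

      import java.io.ByteArrayInputStream
      import javax.imageio.ImageIO
      import org.apache.spark.sql.Row

      // Sketch only: decode compressed bytes and map them onto the proposed row layout.
      def decodeToRow(origin: String, bytes: Array[Byte]): Row = {
        val img = ImageIO.read(new ByteArrayInputStream(bytes))
        if (img == null) {
          Row("", origin, -1, -1, -1, Array.empty[Byte])          // failed to load
        } else {
          val h = img.getHeight
          val w = img.getWidth
          val nChannels = if (img.getColorModel.hasAlpha) 4 else 3
          val data = new Array[Byte](h * w * nChannels)
          var i = 0
          for (y <- 0 until h; x <- 0 until w) {
            val argb = img.getRGB(x, y)                           // packed ARGB int
            data(i) = (argb & 0xff).toByte                        // B
            data(i + 1) = ((argb >> 8) & 0xff).toByte             // G
            data(i + 2) = ((argb >> 16) & 0xff).toByte            // R
            if (nChannels == 4) data(i + 3) = ((argb >> 24) & 0xff).toByte // A
            i += nChannels
          }
          val mode = if (nChannels == 4) "CV_8UC4" else "CV_8UC3"
          Row(mode, origin, h, w, nChannels, data)
        }
      }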

      Image ingest API

      We propose the following function to load images from a remote distributed source as a DataFrame. Here is the signature in Scala; the Python interface is similar. For compatibility with Java, this function should be made available through a builder pattern or through the DataSource API. The exact mechanics can be discussed during implementation; the goal of the proposal below is to specify the behavior.

      def readImages(
          path: String,
          session: SparkSession = null,
          recursive: Boolean = false,
          numPartitions: Int = 0,
          dropImageFailures: Boolean = false,
          // Experimental options
          sampleRatio: Double = 1.0): DataFrame
      

      The schema of the returned DataFrame should be the structure type above, with the expectation that the origin field is filled with the file name for every row.

      Mandatory parameters:

      • path: a directory on a file system that contains images

      Optional parameters:

      • session (SparkSession, default null): the Spark session to use to create the DataFrame. If not provided, the current default Spark session is used via SparkSession.builder.getOrCreate().
      • recursive (bool, default false): whether to take only the top-level images or to look into directories recursively
      • numPartitions (int, default 0): the number of partitions of the resulting DataFrame. By default, Spark's default number of partitions is used.
      • dropImageFailures (bool, default false): drop the files that failed to load. If false (do not drop), invalid images are kept, with the failure values described in the schema above.

      Parameters that are experimental and may be deprecated quickly. These would be useful to have but are not critical for a first cut:

      • sampleRatio (float, in (0, 1], default 1.0): if less than 1, returns a fraction of the data. There is no statistical guarantee about how the sampling is performed. This proved to be very helpful for fast prototyping. Marked as experimental since it should eventually be pushed into Spark core.

      The implementation is expected to be in Scala for performance, with a wrapper for Python.
      This function should be lazy to the extent possible: it should not trigger access to the data when called. Ideally, any file system supported by Spark should be supported when loading images. There may be restrictions for some options such as zip files, etc.
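
      As an illustration, a call against the proposed signature might look like the sketch below; the directory path and argument values are assumptions made up for the example, and the final API may differ:

      // Hypothetical usage of the proposed reader; path and values are illustrative only.
      val images = readImages(
        path = "hdfs:///datasets/images",
        recursive = true,
        dropImageFailures = true,
        sampleRatio = 0.1)

      images.printSchema()                              // expected to match the schema above
      images.select("origin", "width", "height").show(5)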

      The reference implementation also has some experimental options (not documented here).

      Reference implementation

      A reference implementation is available as an open-source Spark package in this repository (Apache 2.0 license):
      https://github.com/Microsoft/spark-images

      This Spark package will also be published in binary form on spark-packages.org.

      Comments about the API should be addressed in this ticket.

      Optional Rejected Designs

      The use of User-Defined Types was considered. It adds some burden to the implementations in the various language bindings and does not provide significant advantages.


          Activity

          tomas.nykodym Tomas Nykodym added a comment -

          I've created a separate ticket to add support for non-integer based images in SPARK-22730

          timhunter Timothy Hunter added a comment -

          Joseph K. Bradley I have created a separate ticket to continue progress on the reader interface in SPARK-22666.

          apachespark Apache Spark added a comment -

          User 'HyukjinKwon' has created a pull request for this issue:
          https://github.com/apache/spark/pull/19835

          rxin@databricks.com Reynold Xin added a comment -

          Why not just declare an image function that loads the image data source?
          The function will throw an exception if one cannot be loaded.

          josephkb Joseph K. Bradley added a comment -

          As far as I know, it shouldn't be a problem. The new datasource can be in mllib since the datasource API permits custom datasources. People will be able to write spark.read.format("image"), though they won't be able to write spark.read.image(...). E.g., https://github.com/databricks/spark-avro lives outside of sql.
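
           For illustration, the data-source-style usage being discussed would look roughly like this (a sketch; the short name "image" and the exact behavior are part of the proposal, not a settled API):

           // Sketch of the data-source-style API discussed in this thread; the path is illustrative.
           val df = spark.read.format("image").load("/data/images")
           df.printSchema()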

          josephkb Joseph K. Bradley added a comment -

          Issue resolved by pull request 19439
          https://github.com/apache/spark/pull/19439

          timhunter Timothy Hunter added a comment -

          Joseph K. Bradley if I am not mistaken, the image code is implemented in the mllib package, which depends on sql. Meanwhile, the data source API is implemented in sql, and if we want it to have some image-specific source, like we do for csv or json, we will need to depend on mllib. This dependency should not happen, first because it introduces a circular dependency (causing compile time issues), and second because sql (one of the core modules) should not depend on mllib, which is large and not related to SQL.

           Reynold Xin suggested that we add a runtime dependency using reflection instead, and I am keen on making that change in a second pull request. What are your thoughts?

          josephkb Joseph K. Bradley added a comment -

          Timothy Hunter you made a similar comment above about a "soft dependency," but as I commented there, I don't quite see what that "soft" dependency will be. Why would it make core depend upon mllib?

          timhunter Timothy Hunter added a comment -

           Adding spark.read.image is going to create a (soft) dependency between the core and mllib, which hosts the implementation of the current reader methods. This is fine and can be dealt with using reflection, but since this would involve adding a core API to Spark, I suggest we do it as a follow-up task.
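
           For context, the reflection-based indirection mentioned here usually looks something like the sketch below; the class name is hypothetical and only illustrates how core code could resolve an implementation at runtime without a compile-time dependency on mllib:

           // Hedged sketch: resolve the implementation class by name at runtime
           // instead of importing it. The class name below is hypothetical.
           val cls = Class.forName("org.apache.spark.ml.image.ImageReader")
           val reader = cls.getDeclaredConstructor().newInstance()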

          josephkb Joseph K. Bradley added a comment -

          Weichen Xu I prefer a datasource API to an ad-hoc API for 2 reasons:

          • APIs: I'd like to use familiar, existing APIs (SQL datasources), rather than introducing new ones (static read functions).
          • Optimizations: I agree we don't need tons of optimizations right now, but it would be nice to leave the option open for the future.
          hyukjin.kwon Hyukjin Kwon added a comment -

           I came here to suggest the data source approach and just found this discussion. +1 for it from me.

          WeichenXu123 Weichen Xu added a comment -

          Joseph K. Bradley
           The data source API has the advantage of exploiting the SQL optimizer (filter push-down and column pruning), e.g.:

          spark.read.image(...).filter("image.width > 100").cache()
          

           The data source API would allow us to do some optimization and avoid scanning images for which "image.width <= 100" (i.e., we can get the filter information through the data source reader interface).
           But do we really need such optimization?

          apachespark Apache Spark added a comment -

          User 'imatiach-msft' has created a pull request for this issue:
          https://github.com/apache/spark/pull/19439

          yuhaoyan yuhao yang added a comment -

          My two cents,

           1. In most scenarios, deep learning applications use rescaled/cropped images (typically 256, 224, or smaller). I would add an extra parameter "smallSideSize" to the readImages method, which is more convenient for users and means we don't need to cache the image at its original size (which could be 100 times larger than the scaled image).

          2. Not sure about the reason to include path info into the image data. Based on my experience, path info serves better as a separate column in the DataFrame.

           3. After some augmentation and normalization, the image data will be floating point numbers rather than bytes. That's fine if the current format is only for reading image data, but not if it is meant to be the standard image feature exchange format in Spark.

           4. I don't see the parameter "recursive" as necessary. Existing wildcard matching provides more functionality.

          Part of the image pre-processing code I used (a little stale) is available from https://github.com/hhbyyh/SparkDL, just for reference.

          timhunter Timothy Hunter added a comment -

          Putting this code under org.apache.spark.ml.image sounds good to me. Based on the initial exploration, it should not be too hard to integrate this in the data source framework. I am going to submit this proposal to a vote on the dev mailing list.

          josephkb Joseph K. Bradley added a comment -

          1. For the namespace, here are my thoughts:

          I don't feel too strongly about this, but I'd vote for putting it under org.apache.spark.ml.image.
          Pros:

           • The image package will be in the spark-ml sub-project, and this fits that structure.
           • This will avoid polluting the o.a.s namespace, and we do not yet have any other data types listed under o.a.s.

           Cons:

           • Images are more general than ML. We might want to move the image package out of spark-ml eventually.

          2. For the SQL data source, HUGE +1 for making a data source

          I'm glad it's mentioned in the SPIP, but I would really like to see it prioritized. There's no need to make a dependency between SQL and ML by adding options to the image data source reader; data sources support optional arguments. E.g., the CSV data source has option "delimiter" but that is wholly contained within the data source; it doesn't affect other data sources. Is there an option needed by the image data source which will force us to abuse the data source API?

          yanboliang Yanbo Liang added a comment -

          Timothy Hunter Fair enough.

          timhunter Timothy Hunter added a comment -

           Yanbo Liang thank you for the comments. Regarding your questions:

          1. making image part of ml or not: I do not have a strong preference, but I think that image support is more general than machine learning.

           2. there is no obstacle, but that would create a dependency between the core (spark.read) and an external module. This sort of dependency inversion is not great design, as any change in a sub-package would have API repercussions in the core of Spark. The SQL team is already struggling with such issues.

          yanboliang Yanbo Liang added a comment - - edited

           I would generally support this effort. For Spark, a general image storage format and data source is good to have; it lets users conveniently try different deep neural network models. AFAIK, lots of users would be interested in applying existing deep neural network models to their own datasets, that is to say, model inference, which can be run distributed on Spark. Thanks for this proposal.
           Timothy Hunter I have two questions regarding this SPIP:
           1. As you describe above, org.apache.spark.image is the package structure, under the MLlib project.
           If this package will only contain the common image storage format and data source support, should we organize the package structure as org.apache.spark.ml.image or org.apache.spark.ml.source.image? We already have libsvm support under org.apache.spark.ml.source.
           2. From the API's perspective, could we follow the other Spark SQL data sources to the greatest extent possible? Even if we don't use UDTs, a familiar API would encourage more users to adopt it. For example, the following API would be friendlier to Spark users. Is there any obstacle to implementing it like this?

          spark.read.image(path, recursive, numPartitions, dropImageFailures, sampleRatio)
          spark.write.image(path, ...)
          

           If I have misunderstood anything, please feel free to correct me. Thanks.

          matei Matei Zaharia added a comment -

          Just to chime in on this, I've also seen feedback that the deep learning libraries for Spark are too fragmented: there are too many of them, and people don't know where to start. This standard representation would at least give them a clear way to interoperate. It would let people write separate libraries for image processing, data augmentation and then training for example.

          timhunter Timothy Hunter added a comment -

          Updated authors' list.

          danil.kirsanov@gmail.com Danil Kirsanov added a comment -

          Hi Sean, echoing the previous comments: yes, this is a small project, basically just a schema and the functions for reading images.
          At the same time, figuring it out proved to be quite time consuming, so it's easier to agree on a common format that could be shared among different pipelines and libraries.

          timhunter Timothy Hunter added a comment -

           Sean Owen thank you for the comments. Indeed, this proposal is limited in scope on purpose, because it aims at achieving consensus around multiple libraries. For instance, the MMLSpark project from Microsoft uses this data format to interface with OpenCV (wrapped through JNI), and Deep Learning Pipelines is going to rely on it as its primary mechanism to load and process images. Also, nothing precludes adding common transforms to this package later - it is easier to start small.

          Regarding the spark package, yes, it will be discontinued like the CSV parser. The aim is to offer a working library that can be tried out without having to wait for an implementation to be merged into Spark itself.

          srowen Sean Owen added a comment -

          It makes some sense. I guess I'm mostly trying to match up the scope that a SPIP implies, with the relatively simple functionality here. Is this not just about a page of code to call ImageIO to parse a BufferedImage and to map its fields to a Row? That does look like the substance of https://github.com/Microsoft/spark-images/blob/master/src/main/scala/org/apache/spark/image/ImageSchema.scala Well, maybe this is just a really small SPIP.

          Also why: "This Spark package will also be published in a binary form on spark-packages.org ." It'd be discontinued and included in Spark right, like with the CSV parser?

          josephkb Joseph K. Bradley added a comment -

          It's a valid question, but overall, I'd support this effort. My thoughts:

          Summary: Image processing use cases have become increasingly important, especially because of the rise of interest in deep learning. It's valuable to standardize around a common format, partly for users and partly for developers.

          Q: Are images a common data type? I.e., if we were talking about adding support for storing text in Spark DataFrames, there would be no question that Spark must be able to handle text since it is such a common data format. Are images common enough to merit inclusion in Spark?
          A: I'd argue yes, partly because of the rise in requests around it. But also, if it makes sense for a general purpose language like Java to contain image formats, then it likewise makes sense for a general purpose data processing library like Spark to contain image formats. This does not duplicate functionality from java.awt (or other libraries) since the key elements being added here are Spark-specific: a Spark DataFrame schema and a Spark Data Source.

          Q: Will leaving this functionality in a package, rather than putting it in Spark, be sufficient?
          A: I worry that this will limit adoption, as well as community oversight of such a core piece of functionality. Tooling built upon image formats, including image processing algorithms, could live outside of Spark, but basic image loading and saving should IMO live in Spark.

          Q: Will users really benefit?
          A: My main reason to support this is confusion I've heard about the right way to handle images in Spark. They are sometimes handled outside of Spark's data model (often giving up proper resilience guarantees), are handled by falling back to the RDD API, etc. I hope that standardization will simplify life for users (clarifying and standardizing APIs) and library developers (facilitating collaboration on image ETL).

          srowen Sean Owen added a comment -

          Why would this need to be part of Spark? I assume it's Spark-specific, yes, but it already exists as a standalone library. You're saying it will continue to be a stand-alone package too? It also doesn't seem to add any advantages in representation; this seems like what one would get reading any image into, say, BufferedImage and then picking out its channels.


            People

            • Assignee: imatiach Ilya Matiach
            • Reporter: timhunter Timothy Hunter
            • Shepherd: Joseph K. Bradley
            • Votes: 2
            • Watchers: 36
