SPARK-22666: Spark datasource for image format

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.4.0
    • Component/s: ML
    • Labels: None
    • Target Version/s:

      Description

      The current API for the new image format is implemented as a standalone feature so that it can reside within the mllib package. As discussed in SPARK-21866, users should also be able to load images through the more common Spark data source reader interface.

      This ticket adds image reading support to the Spark data source API, through either of the following interfaces:

      • spark.read.format("image")...
      • spark.read.image....

      The output is a DataFrame that contains the images (and, for example, the file names), following the semantics already discussed in SPARK-21866.
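
      A minimal PySpark sketch of the first interface, for illustration only. The helper name `load_images` is hypothetical, `spark` is assumed to be an existing SparkSession, and the `dropInvalid` option is assumed to be available as in the image source shipped with the fix version (2.4.0):

```python
def load_images(spark, path, drop_invalid=True):
    """Load images under `path` through the generic data source reader.

    `spark` is assumed to be a SparkSession. The result is expected to be
    a DataFrame with a single struct column named `image`.
    """
    return (
        spark.read.format("image")             # select the image data source by name
        .option("dropInvalid", drop_invalid)   # assumed option: skip files that fail to decode
        .load(path)
    )
```

      Under these assumptions, something like `load_images(spark, "/data/images").select("image.origin", "image.width", "image.height")` would list each file's origin and dimensions.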

      A few technical notes:

      • Since the functionality is implemented in mllib, calling this function may fail at runtime if users have not included the spark-mllib dependency.
      • How should very flat directories be handled? It is common to have millions of files in a single "directory" (as in S3), which appears to have caused problems for some users. If this is too complex to handle in this ticket, it can be dealt with separately.
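
      One possible user-side workaround for very flat directories, sketched here purely as an assumption and not as what this ticket implements: if a listing of file paths is already available from elsewhere, the paths could be read in fixed-size batches instead of pointing the reader at the whole directory. The helper names `batched_paths` and `load_images_in_batches` and the default batch size are hypothetical:

```python
from itertools import islice


def batched_paths(paths, batch_size):
    """Yield successive lists of at most `batch_size` paths."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(paths)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def load_images_in_batches(spark, paths, batch_size=10000):
    """Yield one DataFrame per batch of paths (assumes `spark` is a SparkSession)."""
    for batch in batched_paths(paths, batch_size):
        # DataFrameReader.load accepts a list of paths in PySpark
        yield spark.read.format("image").load(batch)
```

      Whether this actually helps depends on where the listing cost comes from; it only avoids asking Spark to enumerate the directory itself.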

              People

              • Assignee: Weichen Xu (WeichenXu123)
              • Reporter: Timothy Hunter (timhunter)
              • Shepherd: Xiangrui Meng
              • Votes: 0
              • Watchers: 11
