SPARK-15693

Write schema definition out for file-based data sources to avoid schema inference


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      Spark supports reading a variety of data formats, many of which don't have a self-describing schema. For these file formats, Spark can often infer the schema by going through all the data. However, schema inference is expensive and does not always infer the intended schema (for example, with JSON data Spark always infers integer types as long rather than int).
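
      The explicit-schema path Spark already exposes illustrates the difference. A minimal Scala sketch, assuming a local SparkSession; the path and column names are made up for illustration:

      {code:scala}
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.types._

      val spark = SparkSession.builder().master("local[*]").getOrCreate()

      // Inference scans the data; JSON integer values come back as LongType.
      val inferred = spark.read.json("/data/events")
      inferred.printSchema()  // count: long, even if every value fits in an int

      // Supplying a schema up front skips the scan and keeps the intended types.
      val schema = StructType(Seq(
        StructField("name", StringType),
        StructField("count", IntegerType)))
      val typed = spark.read.schema(schema).json("/data/events")
      {code}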

      It would be great if Spark could write the schema definition out for file-based formats, so that when reading the data in, the schema can be "inferred" directly by reading the schema definition file instead of going through full schema inference. If the file does not exist, the good old schema inference should be performed.
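
      Until such a feature exists, the behavior can be approximated with the schema JSON round-trip Spark already supports (StructType.json and DataType.fromJson). A sketch, assuming an existing SparkSession spark, a DataFrame df, and a local filesystem; the _schema.json file name is an invented convention here, not something Spark recognizes:

      {code:scala}
      import java.nio.charset.StandardCharsets
      import java.nio.file.{Files, Paths}
      import org.apache.spark.sql.types.{DataType, StructType}

      // When writing the data out, persist its schema alongside it as JSON.
      val schemaPath = Paths.get("/data/events/_schema.json")
      Files.write(schemaPath, df.schema.json.getBytes(StandardCharsets.UTF_8))

      // When reading: use the saved definition if present, otherwise fall
      // back to the good old schema inference.
      val events =
        if (Files.exists(schemaPath)) {
          val json = new String(Files.readAllBytes(schemaPath), StandardCharsets.UTF_8)
          spark.read.schema(DataType.fromJson(json).asInstanceOf[StructType])
            .json("/data/events")
        } else {
          spark.read.json("/data/events")
        }
      {code}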

      This ticket certainly merits a design doc that should discuss the spec for the schema definition, as well as all the corner cases this feature needs to handle (e.g. schema merging, schema evolution, partitioning). It would be great if the schema definition used a human-readable format (e.g. JSON).
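
      For reference, StructType can already render itself as human-readable JSON via prettyJson, which could serve as a starting point for such a spec (the output comment below is approximate):

      {code:scala}
      import org.apache.spark.sql.types._

      val schema = StructType(Seq(StructField("count", IntegerType, nullable = false)))
      println(schema.prettyJson)
      // {
      //   "type" : "struct",
      //   "fields" : [ {
      //     "name" : "count",
      //     "type" : "integer",
      //     "nullable" : false,
      //     "metadata" : { }
      //   } ]
      // }
      {code}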


          People

            Assignee: Unassigned
            Reporter: Reynold Xin (rxin)
            Votes: 1
            Watchers: 13
