SPARK-15693

Write schema definition out for file-based data sources to avoid schema inference


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      Spark supports reading a variety of data formats, many of which don't have a self-describing schema. For these file formats, Spark can often infer the schema by going through all the data. However, schema inference is expensive and does not always infer the intended schema (for example, with JSON data Spark always infers integer types as long rather than int).
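
      The explicit-schema path Spark already exposes illustrates the difference. A minimal Scala sketch, assuming a local SparkSession; the path and column names are made up for illustration:

      {code:scala}
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.types._

      val spark = SparkSession.builder().master("local[*]").getOrCreate()

      // Inference scans the data; JSON integer values come back as LongType.
      val inferred = spark.read.json("/data/events")
      inferred.printSchema()  // count: long, even if every value fits in an int

      // Supplying a schema up front skips the scan and keeps the intended types.
      val schema = StructType(Seq(
        StructField("name", StringType),
        StructField("count", IntegerType)))
      val typed = spark.read.schema(schema).json("/data/events")
      {code}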

      It would be great if Spark could write the schema definition out for file-based formats, so that when reading the data in, the schema can be "inferred" directly by reading the schema definition file instead of going through full schema inference. If the file does not exist, the good old schema inference should be performed.
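
      Until such a feature exists, the behavior can be approximated with the schema JSON round-trip Spark already supports (StructType.json and DataType.fromJson). A sketch, assuming an existing SparkSession spark, a DataFrame df, and a local filesystem; the _schema.json file name is an invented convention here, not something Spark recognizes:

      {code:scala}
      import java.nio.charset.StandardCharsets
      import java.nio.file.{Files, Paths}
      import org.apache.spark.sql.types.{DataType, StructType}

      // When writing the data out, persist its schema alongside it as JSON.
      val schemaPath = Paths.get("/data/events/_schema.json")
      Files.write(schemaPath, df.schema.json.getBytes(StandardCharsets.UTF_8))

      // When reading: use the saved definition if present, otherwise fall
      // back to the good old schema inference.
      val events =
        if (Files.exists(schemaPath)) {
          val json = new String(Files.readAllBytes(schemaPath), StandardCharsets.UTF_8)
          spark.read.schema(DataType.fromJson(json).asInstanceOf[StructType])
            .json("/data/events")
        } else {
          spark.read.json("/data/events")
        }
      {code}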

      This ticket certainly merits a design doc that should discuss the spec for the schema definition, as well as all the corner cases this feature needs to handle (e.g. schema merging, schema evolution, partitioning). It would be great if the schema definition used a human-readable format (e.g. JSON).
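
      For reference, StructType can already render itself as human-readable JSON via prettyJson, which could serve as a starting point for such a spec (the output comment below is approximate):

      {code:scala}
      import org.apache.spark.sql.types._

      val schema = StructType(Seq(StructField("count", IntegerType, nullable = false)))
      println(schema.prettyJson)
      // {
      //   "type" : "struct",
      //   "fields" : [ {
      //     "name" : "count",
      //     "type" : "integer",
      //     "nullable" : false,
      //     "metadata" : { }
      //   } ]
      // }
      {code}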


          People

            Assignee: Unassigned
            Reporter: Reynold Xin (rxin)
            Votes: 1
            Watchers: 13
