[SPARK-2870] Thorough schema inference directly on RDDs of Python dictionaries - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Not A Problem
Affects Version/s: None
Fix Version/s: None
Component/s: PySpark, SQL
Labels: None

Description

Background

I love the SQLContext.jsonRDD() and SQLContext.jsonFile() methods. They process JSON text directly and infer a schema that covers the entire source data set.

This is very important with semi-structured data like JSON since individual elements in the data set are free to have different structures. Matching fields across elements may even have different value types.

For example:

{"a": 5}
{"a": "cow"}

To get a queryable schema that covers the whole data set, you need to infer a schema by looking at the whole data set. The aforementioned SQLContext.json...() methods do this very well.
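
For illustration, here is a minimal plain-Python sketch of what whole-data-set inference involves: every element is visited, field names are unioned, and conflicting value types are widened to a common type. The type labels and the helper names (infer_type, merge_types, infer_schema) are hypothetical, and the widening rule (fall back to string on conflict) is an assumption made for this sketch, not a description of Spark's implementation.

```python
def infer_type(value):
    """Map a Python value to a coarse type label (labels are hypothetical)."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "double"
    return "string"  # strings and anything else are treated as string here


def merge_types(left, right):
    """Widen two type labels to one that can hold values of both."""
    if left == right:
        return left
    if left == "null":
        return right
    if right == "null":
        return left
    if {left, right} == {"integer", "double"}:
        return "double"
    return "string"  # assumed fallback for any other conflict


def infer_schema(records):
    """Scan every record and return {field name: widened type label}."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema[field] = merge_types(schema.get(field, "null"), infer_type(value))
    return schema


records = [{"a": 5}, {"a": "cow"}, {"b": 1.5}]
schema = infer_schema(records)
```

With the two example elements above, field "a" is seen first as an integer and then as a string, so the sketch widens it to string. Field "b" appears only in the third element and is still captured because the scan covers every element.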

Feature Request

What we need is for SQLContext.inferSchema() to do this, too. Alternatively, we need a new SQLContext method that works on RDDs of Python dictionaries and does something functionally equivalent to this:

SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x)))

As of 1.0.2, inferSchema() just looks at the first element in the data set. This won't help much when the structure of the elements in the target RDD is variable.
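
The limitation can be sketched in plain Python, with a list standing in for the RDD. This is only an illustration of what is lost when a single element is sampled; it is not the code of inferSchema(), and the variable names are hypothetical.

```python
# A list stands in for an RDD of dictionaries.
records = [{"a": 5}, {"a": "cow", "b": 1.5}]

# Fields visible when only the first element is inspected.
first_only_fields = set(records[0])

# Fields visible when every element is inspected.
all_fields = set().union(*(record.keys() for record in records))

# Fields that a first-element-only schema would never see.
missed = all_fields - first_only_fields
```

Here field "b" is missed entirely, and the first element alone gives no hint that "a" also holds strings.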

Example Use Case

You have some JSON text data that you want to analyze using Spark SQL.
You would use one of the SQLContext.json...() methods, but you need to do some filtering on the data first to remove bad elements--basically, some minimal schema validation.
You deserialize the JSON objects to Python dicts and filter out the bad ones. You now have an RDD of dictionaries.
From this RDD, you want a SchemaRDD that captures the schema for the whole data set.
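
The steps above can be sketched in plain Python, again with a list standing in for the RDD. The validation rule (each record must carry an "id" field) and the helper names parse_or_none and is_valid are hypothetical, chosen only for illustration. On a real RDD the same functions would be passed to map() and filter(), and the re-serialized strings are what the jsonRDD() expression in the feature request would consume.

```python
import json


def parse_or_none(line):
    """Deserialize one line of JSON text; return None if it is not a JSON object."""
    try:
        obj = json.loads(line)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    return obj if isinstance(obj, dict) else None


def is_valid(record):
    """Hypothetical minimal schema validation: require an "id" field."""
    return record is not None and "id" in record


lines = [
    '{"id": 1, "a": 5}',
    'not json',
    '{"a": "cow"}',
    '{"id": 2, "a": "cow"}',
]

parsed = [parse_or_none(line) for line in lines]
good = [record for record in parsed if is_valid(record)]

# Serializing back to JSON text is the workaround step: it exists only so
# that a JSON-based, whole-data-set schema inference can be reused.
reserialized = [json.dumps(record, sort_keys=True) for record in good]
```

The round trip through json.dumps() is the overhead this request aims to avoid: the dictionaries already exist, yet they have to be turned back into text just to get a schema that covers all of them.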

Issue Links

links to

Related Spark User List discussion

People

Assignee: Unassigned
Reporter: Nicholas Chammas
Votes: 0
Watchers: 5

Dates

Created: 05/Aug/14 23:07
Updated: 12/Dec/15 08:48
Resolved: 12/Dec/15 08:48