Description
Background
I love the SQLContext.jsonRDD() and SQLContext.jsonFile() methods. They process JSON text directly and infer a schema that covers the entire source data set.
This is very important with semi-structured data like JSON since individual elements in the data set are free to have different structures. Matching fields across elements may even have different value types.
For example:
{"a": 5} {"a": "cow"}
To get a queryable schema that covers the whole data set, you need to infer a schema by looking at the whole data set. The aforementioned SQLContext.json...() methods do this very well.
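To make the contrast concrete, here is a minimal plain-Python sketch (not Spark code) of whole-data-set inference: every record contributes to the schema, and a field seen with conflicting value types is widened to "string", which is loosely how the json...() methods resolve the example above.

```python
import json

def infer_schema(records):
    """Infer a {field: type_name} schema covering every record.

    Fields seen with conflicting value types are widened to "string",
    loosely mirroring how jsonRDD resolves type conflicts.
    """
    schema = {}
    for record in records:
        for field, value in record.items():
            value_type = type(value).__name__
            if field not in schema:
                schema[field] = value_type
            elif schema[field] != value_type:
                schema[field] = "string"  # conflicting types widen to string
    return schema

# The example from above: "a" is an int in one record, a str in another.
records = [json.loads(line) for line in ['{"a": 5}', '{"a": "cow"}']]
print(infer_schema(records))  # -> {'a': 'string'}
```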
Feature Request
What we need is for SQLContext.inferSchema() to do this, too. Alternatively, we need a new SQLContext method that works on RDDs of Python dictionaries and does something functionally equivalent to this:
SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x)))
As of 1.0.2, inferSchema() just looks at the first element in the data set. This won't help much when the structure of the elements in the target RDD is variable.
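A quick plain-Python illustration of that limitation (a hypothetical helper that mimics the first-element behavior, not the actual inferSchema() implementation):

```python
def infer_from_first(records):
    # Mimics the 1.0.2 behavior: only the first element defines the schema.
    first = records[0]
    return {field: type(value).__name__ for field, value in first.items()}

records = [{"a": 5}, {"a": "cow", "b": True}]
print(infer_from_first(records))  # -> {'a': 'int'}
# Field "b" is missed entirely, and "a" is typed as int even though a
# later record holds a string.
```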
Example Use Case
- You have some JSON text data that you want to analyze using Spark SQL.
- You would use one of the SQLContext.json...() methods, but you need to do some filtering on the data first to remove bad elements--basically, some minimal schema validation.
- You deserialize the JSON objects to Python dicts and filter out the bad ones. You now have an RDD of dictionaries.
- From this RDD, you want a SchemaRDD that captures the schema for the whole data set.
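The steps above can be sketched as follows, with plain Python standing in for the RDD operations and a made-up validation rule (the field names are illustrative, not from the original report):

```python
import json

raw_lines = [
    '{"user": "alice", "score": 10}',
    'not valid json',                    # bad element: unparseable
    '{"user": "bob"}',                   # bad element: missing "score"
    '{"user": "carol", "score": "9"}',   # valid, but "score" is a string
]

def parse(line):
    try:
        return json.loads(line)
    except ValueError:
        return None

def is_valid(record):
    # Minimal schema validation: keep records that carry both fields.
    return record is not None and "user" in record and "score" in record

# In Spark this would be raw_rdd.map(parse).filter(is_valid).
# The missing piece is turning the surviving RDD of dicts into a
# SchemaRDD whose schema covers the whole data set.
records = [r for r in (parse(line) for line in raw_lines) if is_valid(r)]
print([r["user"] for r in records])  # -> ['alice', 'carol']
```

Note that "score" is still an int in one surviving record and a string in another, which is exactly why first-element inference is not enough here.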