I love the SQLContext.jsonRDD() and SQLContext.jsonFile() methods. They process JSON text directly and infer a schema that covers the entire source data set.
This is very important with semi-structured data like JSON since individual elements in the data set are free to have different structures. Matching fields across elements may even have different value types.
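For example (a hypothetical sample), two elements might share fields while differing in structure, with "zip" holding an int in one record and a string in the other:

```python
# Two JSON elements from the same data set: one has a field the other
# lacks, and the shared "zip" field holds different value types.
records = [
    '{"name": "Alice", "zip": 43210}',
    '{"name": "Bob", "zip": "10001", "phone": "555-1234"}',
]
```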
To get a queryable schema that covers the whole data set, you need to infer a schema by looking at the whole data set. The aforementioned SQLContext.json...() methods do this very well.
What we need is for SQLContext.inferSchema() to do this, too. Alternatively, we need a new SQLContext method that works on RDDs of Python dictionaries and does something functionally equivalent to this:
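A minimal sketch of that equivalent, assuming sqlContext is an existing SQLContext and rdd_of_dicts is an RDD of Python dictionaries:

```python
import json

# Serialize each dict back to JSON text so jsonRDD() can infer a
# schema that covers the entire data set, not just the first element.
schema_rdd = sqlContext.jsonRDD(rdd_of_dicts.map(json.dumps))
```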
As of Spark 1.0.2, inferSchema() only looks at the first element in the data set. That won't help much when the structure of the elements in the target RDD varies.
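To illustrate the limitation (a hypothetical example; sc and sqlContext are assumed to exist):

```python
rdd = sc.parallelize([
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 35, "city": "Columbus"},  # "city" appears only here
])

# The schema is inferred from the first dict alone, so the
# resulting SchemaRDD has no "city" column.
people = sqlContext.inferSchema(rdd)
```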
Example use case:
- You have some JSON text data that you want to analyze using Spark SQL.
- You would like to use one of the SQLContext.json...() methods, but you first need to filter out bad elements (basically, some minimal schema validation).
- You deserialize the JSON objects to Python dicts and filter out the bad ones. You now have an RDD of dictionaries.
- From this RDD, you want a SchemaRDD that captures the schema of the whole data set (see the sketch after this list).
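A rough sketch of that workflow (the input path and the is_valid rules are placeholders):

```python
import json

def is_valid(d):
    # Minimal schema validation; real rules would go here.
    return isinstance(d, dict) and "name" in d

raw = sc.textFile("hdfs:///data/events.json")   # JSON text, one object per line
dicts = raw.map(json.loads).filter(is_valid)    # RDD of Python dictionaries

# What we need: thorough schema inference directly on the RDD of dicts,
# i.e. sqlContext.inferSchema(dicts) covering the whole data set.
# Today's workaround: serialize back to JSON and use jsonRDD().
schema_rdd = sqlContext.jsonRDD(dicts.map(json.dumps))
```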