Description
This is strange and looks like a regression from Spark 1.3.
```python
import json

daterz = [
    {'name': 'Nick', 'stats': {'age': 28}},
    {'name': 'George', 'stats': {'age': 31}},
]

df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x)))

df.select('stats.age').show()
df['stats.age']  # 1.4 fails on this line
```
On 1.3 this works and yields:
```
age
28
31

Out[1]: Column<stats.age AS age#2958L>
```
On 1.4, however, this gives an error on the last line:
```
+---+
|age|
+---+
| 28|
| 31|
+---+

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-1-04bd990e94c6> in <module>()
     19
     20 df.select('stats.age').show()
---> 21 df['stats.age']

/path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item)
    678         if isinstance(item, basestring):
    679             if item not in self.columns:
--> 680                 raise IndexError("no such column: %s" % item)
    681             jc = self._jdf.apply(item)
    682             return Column(jc)

IndexError: no such column: stats.age
```
This means, among other things, that you can't join DataFrames on nested columns.