Description
This is strange and looks like a regression from Spark 1.3.
```python
import json

daterz = [
    {'name': 'Nick', 'stats': {'age': 28}},
    {'name': 'George', 'stats': {'age': 31}},
]

df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x)))

df.select('stats.age').show()
df['stats.age']  # 1.4 fails on this line
```
On 1.3 this works and yields:
```
age
28
31

Out[1]: Column<stats.age AS age#2958L>
```
On 1.4, however, this gives an error on the last line:
```
+---+
|age|
+---+
| 28|
| 31|
+---+

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-1-04bd990e94c6> in <module>()
     19
     20 df.select('stats.age').show()
---> 21 df['stats.age']

/path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item)
    678         if isinstance(item, basestring):
    679             if item not in self.columns:
--> 680                 raise IndexError("no such column: %s" % item)
    681             jc = self._jdf.apply(item)
    682             return Column(jc)

IndexError: no such column: stats.age
```
This means, among other things, that you can't join DataFrames on nested columns.