Details
- Type: Improvement
- Status: Closed
- Priority: Minor
- Resolution: Invalid
- Affects Version/s: 1.3.1, 1.4.1, 1.5.0
- Fix Version/s: None
- Component/s: None
Description
The PySpark docs for DataFrame need the following fixes and improvements:
- Per SPARK-7035, we should encourage the use of __getitem__ over __getattr__ and change all our examples accordingly.
- We should say clearly that the API is experimental. (That is currently not the case for the PySpark docs.)
- We should provide an example of how to join and select from two DataFrames that have identically named columns, because it is not obvious:

>>> df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}']))
>>> df2 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I dunno"}']))
>>> df12 = df1.join(df2, df1['a'] == df2['a'])
>>> df12.select(df1['a'], df2['other']).show()
a other
4 I dunno

- DataFrame.orderBy and DataFrame.sort should be marked as aliases if that's what they are.
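The reason for preferring __getitem__ can be illustrated without Spark. The sketch below uses a toy stand-in class (not PySpark) to show the trap: __getattr__ only fires when normal attribute lookup fails, so a column whose name collides with an existing DataFrame method (such as count) is silently shadowed by the method, while bracket access stays unambiguous.

```python
class Frame:
    """Toy stand-in for a DataFrame (hypothetical, not PySpark),
    illustrating why df['col'] is safer than df.col."""

    def __init__(self, columns):
        self._columns = set(columns)

    def count(self):
        # A real method on the class, analogous to DataFrame.count().
        return 42

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails,
        # so it can never reach a column named like a method.
        if name in self._columns:
            return f"Column<{name}>"
        raise AttributeError(name)

    def __getitem__(self, name):
        # Bracket access always resolves against the columns.
        if name in self._columns:
            return f"Column<{name}>"
        raise KeyError(name)


df = Frame(["a", "count"])
print(df.a)           # attribute access works for a non-clashing name
print(df["count"])    # bracket access returns the column, unambiguously
print(callable(df.count))  # True: df.count is the method, not the column
```

Here df.count never reaches __getattr__ because the method is found first by normal lookup; this is the kind of surprise the docs would avoid by standardizing examples on df['count'].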
Attachments
Issue Links
- is related to
  - SPARK-7544 pyspark.sql.types.Row should implement __getitem__ (Resolved)
  - SPARK-7035 Drop __getattr__ on pyspark.sql.DataFrame (Closed)