Details
- Type: Improvement
- Status: Closed
- Priority: Minor
- Resolution: Invalid
- Affects Version/s: 1.3.1, 1.4.1, 1.5.0
- Fix Version/s: None
- Component/s: None
Description
The PySpark docs for DataFrame need the following fixes and improvements:
- Per SPARK-7035, we should encourage the use of __getitem__ over __getattr__ and change all our examples accordingly.
- We should say clearly that the API is experimental. (That is currently not the case for the PySpark docs.)
- We should provide an example of how to join and select from two DataFrames that have identically named columns, because it is not obvious:

>>> df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}']))
>>> df2 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I dunno"}']))
>>> df12 = df1.join(df2, df1['a'] == df2['a'])
>>> df12.select(df1['a'], df2['other']).show()
a other
4 I dunno

- DataFrame.orderBy and DataFrame.sort should be marked as aliases if that's what they are.
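The reason for preferring __getitem__ can be illustrated without Spark. The sketch below uses a toy stand-in class (not PySpark) to show the trap: __getattr__ only fires when normal attribute lookup fails, so a column whose name collides with an existing DataFrame method (such as count) is silently shadowed by the method, while bracket access stays unambiguous.

```python
class Frame:
    """Toy stand-in for a DataFrame (hypothetical, not PySpark),
    illustrating why df['col'] is safer than df.col."""

    def __init__(self, columns):
        self._columns = set(columns)

    def count(self):
        # A real method on the class, analogous to DataFrame.count().
        return 42

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails,
        # so it can never reach a column named like a method.
        if name in self._columns:
            return f"Column<{name}>"
        raise AttributeError(name)

    def __getitem__(self, name):
        # Bracket access always resolves against the columns.
        if name in self._columns:
            return f"Column<{name}>"
        raise KeyError(name)


df = Frame(["a", "count"])
print(df.a)           # attribute access works for a non-clashing name
print(df["count"])    # bracket access returns the column, unambiguously
print(callable(df.count))  # True: df.count is the method, not the column
```

Here df.count never reaches __getattr__ because the method is found first by normal lookup; this is the kind of surprise the docs would avoid by standardizing examples on df['count'].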
Attachments
Issue Links
- is related to
  - SPARK-7544 pyspark.sql.types.Row should implement __getitem__ (Resolved)
  - SPARK-7035 Drop __getattr__ on pyspark.sql.DataFrame (Closed)