Spark / SPARK-7505

Update PySpark DataFrame docs: encourage __getitem__, mark as experimental, etc.


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Invalid
    • Affects Versions: 1.3.1, 1.4.1, 1.5.0
    • Fix Version/s: None
    • Components: Documentation, PySpark, SQL
    • Labels: None

    Description

      The PySpark docs for DataFrame need the following fixes and improvements:

      1. Per SPARK-7035, we should encourage the use of __getitem__ over __getattr__ and change all our examples accordingly.
      2. We should say clearly that the API is experimental. (That is currently not the case for the PySpark docs.)
      3. We should provide an example of how to join and select from two DataFrames that have identically named columns, since the required disambiguation is not obvious:
        >>> df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}']))
        >>> df2 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I dunno"}']))
        >>> df12 = df1.join(df2, df1['a'] == df2['a'])
        >>> df12.select(df1['a'], df2['other']).show()
        a other                                                                               
        4 I dunno  
      4. DF.orderBy and DF.sort should be marked as aliases if that's what they are.
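      The rationale behind item 1 can be illustrated without a Spark cluster. Bracket indexing (`df['col']`) always reaches the column, while attribute access (`df.col`) goes through normal Python attribute lookup first, so a column whose name collides with a DataFrame method (e.g. `count`) is silently shadowed. The toy class below is a minimal sketch of that lookup order; it is an illustration of the Python data model, not PySpark's actual implementation.

```python
class ToyFrame:
    """Minimal stand-in for a DataFrame: columns live in a dict, and
    attribute access falls back to them only when no real attribute wins."""

    def __init__(self, columns):
        self._columns = columns

    def count(self):
        # A real method on the class; it shadows any column named 'count'
        # under attribute access, because __getattr__ is only invoked
        # when normal lookup fails.
        return sum(len(v) for v in self._columns.values())

    def __getitem__(self, name):
        # Bracket access: unambiguous, always returns the column.
        return self._columns[name]

    def __getattr__(self, name):
        # Fallback: only reached for names that are not methods/attributes.
        try:
            return self._columns[name]
        except KeyError:
            raise AttributeError(name)


df = ToyFrame({'a': [1, 2], 'count': [3, 4]})

print(df['count'])  # bracket access returns the column: [3, 4]
print(df.count)     # attribute access returns the bound method, not the column
print(df.a)         # non-colliding names still work via __getattr__: [1, 2]
```

      This is exactly why examples in the docs should prefer `df['col']`: it behaves the same whether or not the column name happens to match a method.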


People

    • Assignee: Unassigned
    • Reporter: nchammas (Nicholas Chammas)
    • Votes: 0
    • Watchers: 2
