Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.4.0
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None
    • Sprint: Spark 1.5 doc/QA sprint

Description

      I think the __getattr__ method on the DataFrame should be removed.

There is no point in being able to address a DataFrame's columns as df.column, other than the questionable goal of pleasing R developers. And it seems R users will be able to use Spark from their native API in the future.

      I see the following problems with __getattr__ for column selection:

• It's un-Pythonic: there should be only one obvious way to solve a problem, and we can already address columns on a DataFrame via the __getitem__ method, which in my opinion is far superior and a lot more intuitive.
• It leads to confusing exceptions: when we mistype a method name, the AttributeError will say 'No such column ... '.
• And most importantly: we cannot safely load DataFrames that have columns with the same name as any attribute on the DataFrame object. Imagine a DataFrame with a column named cache or filter. Calling df.cache will be ambiguous, with the method shadowing the column, and lead to broken code.
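The shadowing problem can be demonstrated with a toy sketch (this is a hypothetical minimal class, not the actual PySpark implementation; the column data and names are made up for illustration):

```python
class Frame:
    """Toy sketch of DataFrame-style attribute access to columns."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __getitem__(self, name):
        # Explicit, unambiguous column access: frame["name"]
        return self._columns[name]

    def __getattr__(self, name):
        # Only invoked when normal attribute lookup fails, so any real
        # method or attribute shadows a column of the same name.
        try:
            return self._columns[name]
        except KeyError:
            # A mistyped *method* name also ends up here, producing the
            # confusing "No such column" message described above.
            raise AttributeError("No such column: %r" % name)

    def cache(self):
        # A real method: a column literally named "cache" is unreachable
        # via attribute access, because this method wins the lookup.
        return self


f = Frame({"age": [1, 2], "cache": [3, 4]})
f.age        # attribute access falls through to the column: [1, 2]
f["cache"]   # __getitem__ still reaches the shadowed column: [3, 4]
f.cache      # the bound method, NOT the column
```

Note that f.cache silently returns the method rather than raising, which is exactly why the breakage is hard to spot.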

People

    • Assignee: Unassigned
    • Reporter: kalle (Karl-Johan Wettin)
    • Votes: 1
    • Watchers: 6
