  1. Spark
  2. SPARK-6116 DataFrame API improvement umbrella ticket (Spark 1.5)
  3. SPARK-8568

Prevent accidental use of "and" and "or" to build invalid expressions in Python

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4.1, 1.5.0
    • Component/s: SQL
    • Labels:
      None
    • Target Version/s:
    • Sprint:
      Spark 1.5 doc/QA sprint

      Description

      In Spark DataFrames (and in Pandas as well), the correct way to construct a conjunctive expression is to use the bitwise AND operator "&", e.g. "(x > 5) & (y > 6)".

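      However, many users assume that they should use the Python "and" keyword, i.e. "x > 5 and y > 6". Python's boolean evaluation logic reduces "x > 5 and y > 6" to just "y > 6": "and" returns its second operand whenever the first operand is truthy, and the Column produced by "x > 5" is truthy like any ordinary Python object. The first condition is silently dropped, which is confusing and error prone.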
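The pitfall can be reproduced without Spark. The sketch below uses a hypothetical `Expr` class as a stand-in for a Column expression (it is not the PySpark class, and its string representation is an assumption made for illustration) to show how "and" discards the first condition while "&" keeps both.

```python
class Expr:
    """Hypothetical stand-in for a Column expression (not the PySpark class)."""

    def __init__(self, text):
        self.text = text

    def __gt__(self, other):
        # Comparisons build a new expression instead of returning a bool.
        return Expr(f"({self.text} > {other})")

    def __and__(self, other):
        # `&` is overloadable, so it can combine both operands.
        return Expr(f"({self.text} AND {other.text})")

    def __repr__(self):
        return self.text


x, y = Expr("x"), Expr("y")

# `and` cannot be overloaded: (x > 5) is a truthy object, so Python
# evaluates and returns only the second operand.
wrong = x > 5 and y > 6

# `&` calls Expr.__and__, so both conditions survive.
right = (x > 5) & (y > 6)

print(wrong)  # (y > 6)
print(right)  # ((x > 5) AND (y > 6))
```

The parentheses around each comparison matter in the "&" form, because "&" binds more tightly than ">" in Python.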

      We should override __bool__ and __nonzero__ (the Python 3 and Python 2 truth-value hooks, respectively) for Column to throw an exception if users apply "and" or "or" to Column expressions.
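A minimal sketch of the proposed guard, again on a hypothetical stand-in class rather than the real Column. The class name `GuardedExpr`, the exception type and the message wording are assumptions for illustration; the actual fix in Spark may differ.

```python
class GuardedExpr:
    """Hypothetical expression class that refuses to be used as a bool."""

    def __init__(self, text):
        self.text = text

    def __gt__(self, other):
        return GuardedExpr(f"({self.text} > {other})")

    def __and__(self, other):
        return GuardedExpr(f"({self.text} AND {other.text})")

    def __or__(self, other):
        return GuardedExpr(f"({self.text} OR {other.text})")

    def __bool__(self):
        # Python calls this hook for `and`, `or`, `not` and `if`.
        # Exception type and message are assumed, not taken from Spark.
        raise ValueError(
            "Cannot convert expression to bool: "
            "use '&' for 'and' and '|' for 'or'"
        )

    # Python 2 looks up __nonzero__ instead of __bool__.
    __nonzero__ = __bool__


a, b = GuardedExpr("a"), GuardedExpr("b")

try:
    a > 5 and b > 6      # triggers __bool__ on (a > 5)
    raised = False
except ValueError:
    raised = True

combined = (a > 5) | (b > 6)   # the supported spelling still works

print(raised)
print(combined.text)
```

With the guard in place, the mistake fails loudly at the point where the expression is built instead of silently producing a filter with a missing condition.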

      Background: see this blog post http://www.nodalpoint.com/unexpected-behavior-of-spark-dataframe-filter-method/

              People

              • Assignee:
                davies Davies Liu
                Reporter:
                rxin Reynold Xin