[SPARK-8568] Prevent accidental use of "and" and "or" to build invalid expressions in Python - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.4.1, 1.5.0
Component/s: SQL
Labels: None
Target Version/s: 1.4.1, 1.5.0
Sprint: Spark 1.5 doc/QA sprint

Description

In Spark DataFrames (and in pandas as well), the correct way to construct a conjunctive expression is to use the bitwise AND operator "&", with each comparison in parentheses, i.e.: "(x > 5) & (y > 6)".
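A minimal sketch of why the bitwise operators work, using a hypothetical stand-in class Col rather than the real PySpark Column (its name and string-building behavior are assumptions for illustration only): because "&" and "|" can be overloaded, the class can return a new expression object that records both sides.

```python
class Col:
    """Hypothetical stand-in for a DataFrame column expression."""

    def __init__(self, text):
        self.text = text

    def __gt__(self, other):
        # Build a new expression instead of returning True/False.
        return Col("(%s > %s)" % (self.text, other))

    def __and__(self, other):
        # "&" is overloadable, so both operands are preserved.
        return Col("(%s AND %s)" % (self.text, other.text))

    def __or__(self, other):
        return Col("(%s OR %s)" % (self.text, other.text))


x, y = Col("x"), Col("y")
both = (x > 5) & (y > 6)
either = (x > 5) | (y > 6)
print(both.text)    # ((x > 5) AND (y > 6))
print(either.text)  # ((x > 5) OR (y > 6))
```

The parentheses matter: in Python "&" binds more tightly than ">", so "x > 5 & y > 6" would not group the way it reads.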

However, a lot of users assume that they should be using the Python "and" keyword, i.e. doing "x > 5 and y > 6". Python's boolean evaluation logic reduces "x > 5 and y > 6" to just "y > 6": "x > 5" produces a Column object, which is truthy by default, so "and" returns its second operand and the first condition is silently dropped. This is very confusing and error-prone.
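The pitfall can be reproduced without Spark. The sketch below uses a hypothetical Col class (an assumption, not the PySpark implementation) that defines no truth-value hook, so every instance is truthy and "and"/"or" simply pick one operand.

```python
class Col:
    """Hypothetical expression class with no __bool__ override."""

    def __init__(self, text):
        self.text = text

    def __gt__(self, other):
        return Col("(%s > %s)" % (self.text, other))


x, y = Col("x"), Col("y")

# "and" evaluates the left operand for truthiness; an ordinary object
# is truthy, so the right operand is returned and the left one is lost.
and_result = x > 5 and y > 6
# "or" short-circuits on the truthy left operand and never looks at the right.
or_result = x > 5 or y > 6

print(and_result.text)  # (y > 6)
print(or_result.text)   # (x > 5)
```

No error is raised in either case, which is why a filter built this way quietly returns the wrong rows.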

We should override __bool__ and __nonzero__ for Column to throw an exception if users use "and" or "or" on Column expressions.
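A sketch of the proposed guard, again on a hypothetical Col class; the exception type and message wording here are illustrative assumptions and may differ from what the linked pull request does. Python 3 consults __bool__ for truth testing and Python 2 consults __nonzero__, so both names point at the same method.

```python
class Col:
    """Hypothetical expression class that refuses truth testing."""

    def __init__(self, text):
        self.text = text

    def __gt__(self, other):
        return Col("(%s > %s)" % (self.text, other))

    def __and__(self, other):
        return Col("(%s AND %s)" % (self.text, other.text))

    def __bool__(self):
        # Called by "and", "or", "not" and "if"; fail loudly instead.
        raise ValueError(
            "Cannot convert column into bool: use '&' for 'and', "
            "'|' for 'or' when building boolean expressions."
        )

    __nonzero__ = __bool__  # Python 2 name for the same hook


x, y = Col("x"), Col("y")

try:
    x > 5 and y > 6
    message = None
except ValueError as exc:
    message = str(exc)

print(message is not None)            # True
print(((x > 5) & (y > 6)).text)       # ((x > 5) AND (y > 6))
```

The bitwise form is unaffected because "&" dispatches to __and__ and never asks for the object's truth value.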

Background: see this blog post http://www.nodalpoint.com/unexpected-behavior-of-spark-dataframe-filter-method/

Issue Links

is duplicated by

SPARK-8573 For PySpark's DataFrame API, we need to throw exceptions when users try to use and/or/not (Resolved)

links to

[Github] Pull Request #6961 (davies)

People

Assignee: Davies Liu

Reporter: Reynold Xin

Votes: 0

Watchers: 3

Dates

Created: 23/Jun/15 18:30

Updated: 02/Jul/15 22:37

Resolved: 24/Jun/15 17:51