SPARK-23463

Filter operation fails to handle blank values and evicts even rows that satisfy the filtering condition


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Not A Problem
    • Affects Version/s: 2.2.1
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None

    Description

      Filter operations were updated in Spark 2.2.0: a Cost Based Optimizer was introduced that looks at table stats to decide filter selectivity. However, since then, filter has behaved unexpectedly for blank values. The operation not only drops rows with blank values but also filters out rows that actually satisfy the filter criteria.

      Steps to reproduce

      Consider a simple dataframe with some blank values, as below (a construction sketch follows the table):

      dev   val
      ALL   0.01
      ALL   0.02
      ALL   0.004
      ALL   (blank)
      ALL   2.5
      ALL   4.5
      ALL   45
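
      A minimal sketch for reconstructing this dataframe, assuming the attached sample is read as text so that val lands as a string column and the blank row carries an empty string (this assumption is mine, not stated in the report):

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()

      # Hypothetical reconstruction of the attached "sample": val arrives as a
      # string column, and the blank row is an empty string.
      data = [("ALL", "0.01"), ("ALL", "0.02"), ("ALL", "0.004"),
              ("ALL", ""), ("ALL", "2.5"), ("ALL", "4.5"), ("ALL", "45")]
      df = spark.createDataFrame(data, ["dev", "val"])
      df.printSchema()  # val is string, not numeric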

      Running a simple filter operation over the val column in this dataframe yields unexpected results. For example, the following query returns an empty dataframe:

      df.filter(df["val"] > 0)

      dev val
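
      To see why, it helps to print the full plan for the int comparison; any implicit cast that Spark wraps around val will show up in the analyzed logical plan (an inspection step I am suggesting, not part of the original report):

      # Print parsed/analyzed/optimized/physical plans; look for a cast on val.
      df.filter(df["val"] > 0).explain(True)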

      However, the filter operation works as expected if the 0 in the filter condition is replaced by the float 0.0:

      df.filter(df["val"] > 0.0)

      dev val
      ALL 0.01
      ALL 0.02
      ALL 0.004
      ALL 2.5
      ALL 4.5
      ALL 45
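
      If val is indeed a string column, a workaround sketch (mine, not from the report) is to cast it explicitly before comparing, so the result no longer depends on whether the literal is an int or a float:

      from pyspark.sql import functions as F

      # With an explicit cast, blank strings become NULL and are dropped by the
      # comparison, while numeric strings compare as expected.
      df.filter(F.col("val").cast("double") > 0).show()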

       

      Note that this bug exists only in Spark 2.2.0 and later; earlier versions filter as expected for both int (0) and float (0.0) values in the filter condition.

      Also, if there are no blank values, the filter operation works as expected for all versions.

      Attachments

        1. sample (0.1 kB), Manan Bakshi


      People

        Assignee: Unassigned
        Reporter: Manan Bakshi (m.bakshi11)
        Votes: 0
        Watchers: 2
