[SPARK-26767] Filter on a dropDuplicates dataframe gives inconsistency result - ASF JIRA

Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.3.0
Fix Version/s: None
Component/s: SQL
Labels:
- bulk-closed
Environment:

To reproduce the problem:

(1) Create a CSV file containing records that share the same values for a subset of columns (e.g. colA, colB, colC).

(2) Read the CSV file as a Spark DataFrame, then use dropDuplicates to deduplicate on that subset of columns (e.g. dropDuplicates(["colA", "colB", "colC"])).

(3) Filter the resulting DataFrame with a where clause (e.g. df.where("colA = 'A' and colB='B' and colG='G' and colH='H'").show(100, False)).

=> When (3) is rerun, it returns a different number of rows.
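
The ticket was closed as Incomplete and records no root cause, so the following is only one possible explanation, not a confirmed diagnosis. It is a plain-Python model (no Spark involved) built on the assumption that deduplicating on a subset of columns keeps whichever row of each duplicate group happens to arrive first, and that arrival order can differ between runs of a distributed job. The helpers `drop_duplicates` and `where` and the sample values `X` and `Y` are hypothetical and only mimic the steps above.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, str]


def drop_duplicates(rows: Iterable[Row], subset: List[str]) -> List[Row]:
    """Keep the first row encountered for each distinct value of the subset columns."""
    kept: Dict[Tuple[str, ...], Row] = {}
    for row in rows:
        key = tuple(row[col] for col in subset)
        kept.setdefault(key, row)  # later duplicates of the same key are ignored
    return list(kept.values())


def where(rows: Iterable[Row], **conditions: str) -> List[Row]:
    """Return the rows whose columns equal every given value."""
    return [r for r in rows if all(r[c] == v for c, v in conditions.items())]


# Two records that agree on colA, colB, colC but differ on colG, colH.
rows = [
    {"colA": "A", "colB": "B", "colC": "C", "colG": "G", "colH": "H"},
    {"colA": "A", "colB": "B", "colC": "C", "colG": "X", "colH": "Y"},
]
subset = ["colA", "colB", "colC"]

# Run 1: rows arrive in file order, so the (G, H) record survives deduplication.
run_1 = where(drop_duplicates(rows, subset), colA="A", colB="B", colG="G", colH="H")

# Run 2: rows arrive in the opposite order (assumed to model a different
# task scheduling), so the (X, Y) record survives and the filter matches nothing.
run_2 = where(drop_duplicates(reversed(rows), subset), colA="A", colB="B", colG="G", colH="H")

print(len(run_1), len(run_2))  # 1 0
```

In this model the row count after the filter depends only on which duplicate survives, which would be consistent with the symptom in step (3), where the filter also tests colG and colH, columns outside the deduplicated subset.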


is related to

SPARK-27213 Unexpected results when filter is used after distinct