Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- Affects Version/s: 2.3.0
- Fix Version/s: None
Description
To reproduce the problem:
(1) Create a CSV file whose records hold the same values for a subset of columns (e.g. colA, colB, colC).
(2) Read the CSV file as a Spark DataFrame, then deduplicate on that subset of columns with dropDuplicates(["colA", "colB", "colC"]).
(3) Filter the resulting DataFrame with a where clause, e.g. df.where("colA = 'A' and colB = 'B' and colG = 'G' and colH = 'H'").show(100, False).
=> When step (3) is rerun, it returns a different number of rows each time.
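The instability above follows from dropDuplicates keeping an arbitrary survivor per key: when rows share (colA, colB, colC) but differ in colG/colH, which survivor remains depends on partitioning and scheduling, so a later filter on colG/colH can match a different row count per run. A minimal plain-Python sketch of this mechanism (not Spark code; the row values and processing orders are hypothetical, with run order standing in for Spark's nondeterministic task order):

```python
# Rows as (colA, colB, colC, colG, colH). colA..colC are identical across
# rows, so dedup on that subset must pick exactly one survivor; colG/colH
# differ between the candidates.
rows = [
    ("A", "B", "C", "G", "H"),
    ("A", "B", "C", "x", "y"),
    ("A", "B", "C", "G", "H"),
]

def drop_duplicates(rows, order):
    """Keep the first row seen per (colA, colB, colC) key, in the given order.

    Spark's dropDuplicates keeps an arbitrary row per key; which one survives
    depends on partitioning and task scheduling, modeled here by `order`.
    """
    seen = {}
    for i in order:
        key = rows[i][:3]
        if key not in seen:
            seen[key] = rows[i]
    return list(seen.values())

def count_match(deduped):
    # Mimics: df.where("colA = 'A' and colB = 'B' and colG = 'G' and colH = 'H'")
    return sum(1 for r in deduped if r[:2] == ("A", "B") and r[3:] == ("G", "H"))

# Two "runs" that process the same rows in different orders, as Spark might:
run1 = count_match(drop_duplicates(rows, [0, 1, 2]))  # survivor has colG='G'
run2 = count_match(drop_duplicates(rows, [1, 0, 2]))  # survivor has colG='x'
print(run1, run2)  # → 1 0
```

The same input therefore yields either one matching row or none, depending only on processing order, which matches the varying row counts reported in step (3).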
Issue Links
- is related to: SPARK-27213 Unexpected results when filter is used after distinct (Resolved)