Details
Type: Question
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.3.0
None
Environment: spark2.3 standalone
Description
Dataset<Row> dataset = sparkSession.read()
        .format("csv")
        .option("sep", ",")
        .option("inferSchema", "true")
        .option("escape", Constants.DEFAULT_CSV_ESCAPE)
        .option("header", "true")
        .option("encoding", "UTF-8")
        .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv");
System.out.println("source count="+dataset.count());
Dataset<Row> dropDuplicates = dataset.dropDuplicates(new String[]{"DATE","TIME","VEL","COMPANY"});
System.out.println("dropDuplicates count1="+dropDuplicates.count());
System.out.println("dropDuplicates count2="+dropDuplicates.count());
Dataset<Row> filter = dropDuplicates.filter("jd > 120.85 and wd > 30.666666 and (status = 0 or status = 1)");
System.out.println("filter count1="+filter.count());
System.out.println("filter count2="+filter.count());
System.out.println("filter count3="+filter.count());
System.out.println("filter count4="+filter.count());
System.out.println("filter count5="+filter.count());
------ The above is the code ------
Console output:
source count=459275
dropDuplicates count1=453987
dropDuplicates count2=453987
filter count1=445798
filter count2=445797
filter count3=445797
filter count4=445798
filter count5=445799
Question:
Why does filter.count() return a different result on every call?
If I remove dropDuplicates(), the counts are consistent.
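One possible explanation, which this report does not confirm: when duplicates are dropped on a subset of columns, the row kept for each key may differ from run to run, so a later filter on columns outside that subset (jd, wd, status here) can match a different number of rows. The plain-Java sketch below only simulates that effect without Spark. The Row class, its fields, and both helper methods are hypothetical names made up for illustration, and "first row wins" is an assumed stand-in for however the engine picks the surviving row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DedupThenFilterSketch {

    /** Hypothetical row: a dedup key plus one column that is NOT part of the key. */
    static final class Row {
        final String key; // stands in for (DATE, TIME, VEL, COMPANY)
        final double jd;  // stands in for a non-key column used by the filter
        Row(String key, double jd) {
            this.key = key;
            this.jd = jd;
        }
    }

    /** Keeps the first row seen per key (assumed "first row wins" behavior). */
    static List<Row> dropDuplicatesByKey(List<Row> rows) {
        Map<String, Row> firstPerKey = new LinkedHashMap<>();
        for (Row r : rows) {
            firstPerKey.putIfAbsent(r.key, r);
        }
        return new ArrayList<>(firstPerKey.values());
    }

    /** Counts rows whose jd is strictly above the threshold. */
    static long countJdAbove(List<Row> rows, double threshold) {
        return rows.stream().filter(r -> r.jd > threshold).count();
    }

    public static void main(String[] args) {
        // The same three rows in two arrival orders; key "A" has conflicting jd values.
        List<Row> orderOne = Arrays.asList(
                new Row("A", 121.0), new Row("A", 120.0), new Row("B", 121.5));
        List<Row> orderTwo = Arrays.asList(
                new Row("A", 120.0), new Row("A", 121.0), new Row("B", 121.5));

        // The number of distinct keys is the same for both orders.
        System.out.println("dedup count, order one = " + dropDuplicatesByKey(orderOne).size());
        System.out.println("dedup count, order two = " + dropDuplicatesByKey(orderTwo).size());

        // The filtered count depends on which duplicate survived.
        System.out.println("filter count, order one = "
                + countJdAbove(dropDuplicatesByKey(orderOne), 120.85));
        System.out.println("filter count, order two = "
                + countJdAbove(dropDuplicatesByKey(orderTwo), 120.85));
    }
}
```

In this sketch the dedup count is 2 for both orders, while the filter count is 2 for the first order and 1 for the second, which mirrors the stable dropDuplicates counts and unstable filter counts in the console output. If this is indeed the cause, filtering before dropDuplicates(), or including the filtered columns in the dedup key, should give repeatable counts.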
Issue Links
- is related to SPARK-27213 Unexpected results when filter is used after distinct (Resolved)