  1. Spark
  2. SPARK-25420

Dataset.count() returns a different result on each call

    Details

    • Type: Question
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels:
    • Environment: Spark 2.3, standalone

    • Target Version/s:

      Description

      // Requires org.apache.spark.sql.Dataset and org.apache.spark.sql.Row.
      // Constants.DEFAULT_CSV_ESCAPE is the reporter's own constant; its value is not shown.

      // Read the CSV file with a header row and inferred column types.
      Dataset<Row> dataset = sparkSession.read().format("csv")
          .option("sep", ",").option("inferSchema", "true")
          .option("escape", Constants.DEFAULT_CSV_ESCAPE).option("header", "true")
          .option("encoding", "UTF-8")
          .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv");

      System.out.println("source count=" + dataset.count());

      // Deduplicate on four of the columns.
      Dataset<Row> dropDuplicates = dataset.dropDuplicates(new String[]{"DATE", "TIME", "VEL", "COMPANY"});
      System.out.println("dropDuplicates count1=" + dropDuplicates.count());
      System.out.println("dropDuplicates count2=" + dropDuplicates.count());

      // Filter on columns that are not part of the deduplication key.
      Dataset<Row> filter = dropDuplicates.filter("jd > 120.85 and wd > 30.666666 and (status = 0 or status = 1)");

      // Count the same filtered Dataset five times.
      System.out.println("filter count1=" + filter.count());
      System.out.println("filter count2=" + filter.count());
      System.out.println("filter count3=" + filter.count());
      System.out.println("filter count4=" + filter.count());
      System.out.println("filter count5=" + filter.count());

       

       

      Console output:

      source count=459275
      dropDuplicates count1=453987
      dropDuplicates count2=453987
      filter count1=445798
      filter count2=445797
      filter count3=445797
      filter count4=445798
      filter count5=445799

       

      Question:

      Why does filter.count() return a different value on each call?

      If I remove dropDuplicates(), the counts are consistent.
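
A possible explanation, offered as an assumption rather than something confirmed in this report: dropDuplicates on a subset of columns keeps one row per key but does not fix which one. The number of distinct keys would then stay stable, while a later filter on non-key columns (jd, wd, status) could see different surviving rows on each evaluation. The plain-Java sketch below does not use Spark. The class DedupOrderDemo, its Row type, and the helper names are hypothetical and only imitate that behavior by feeding the same rows in two different orders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DedupOrderDemo {
    // Hypothetical row: "key" stands in for (DATE, TIME, VEL, COMPANY),
    // "status" stands in for a column that is not part of the key.
    static final class Row {
        final String key;
        final int status;

        Row(String key, int status) {
            this.key = key;
            this.status = status;
        }
    }

    // Keep the first row encountered for each key. Which row wins
    // depends entirely on the order in which rows arrive.
    static List<Row> dropDuplicatesByKey(List<Row> rows) {
        Map<String, Row> firstSeen = new LinkedHashMap<>();
        for (Row r : rows) {
            firstSeen.putIfAbsent(r.key, r);
        }
        return new ArrayList<>(firstSeen.values());
    }

    // Count rows whose status is 0 or 1, like the filter in the report.
    static long countStatusZeroOrOne(List<Row> rows) {
        return rows.stream().filter(r -> r.status == 0 || r.status == 1).count();
    }

    public static void main(String[] args) {
        Row a = new Row("k1", 0);
        Row b = new Row("k1", 2); // same key as a, different status
        Row c = new Row("k2", 1);

        // Same rows, two arrival orders (assumed to model different task scheduling).
        List<Row> dedup1 = dropDuplicatesByKey(Arrays.asList(a, b, c));
        List<Row> dedup2 = dropDuplicatesByKey(Arrays.asList(b, a, c));

        System.out.println("dedup sizes: " + dedup1.size() + ", " + dedup2.size());
        System.out.println("filter counts: " + countStatusZeroOrOne(dedup1)
                + ", " + countStatusZeroOrOne(dedup2));
    }
}
```

In this sketch both orders give the same deduplicated size, but the filtered counts differ because a different row survives for key k1. If this is what happens in the reported job, caching the deduplicated Dataset before filtering, or choosing the surviving row with an explicit ordering, would be the usual ways to get repeatable counts.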

       

              People

              • Assignee: Unassigned
              • Reporter: zhiyin1233 huanghuai
              • Votes: 0
              • Watchers: 4
