  1. Spark
  2. SPARK-26767

Filter on a dropDuplicates DataFrame gives inconsistent results

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: SQL

    Description

      To reproduce the problem:

      (1) Create a CSV file in which several records share the same values for a subset of columns (e.g. colA, colB, colC).

      (2) Read the CSV file as a Spark DataFrame, then use dropDuplicates to deduplicate on that subset of columns (i.e. dropDuplicates(["colA", "colB", "colC"])).

      (3) Filter the resulting DataFrame with a where clause (i.e. df.where("colA = 'A' and colB='B' and colG='G' and colH='H'").show(100, False)).

      => When (3) is rerun, it returns a different number of rows.
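
      A likely explanation, stated here as an assumption rather than something confirmed in this report: dropDuplicates on a subset of columns keeps one row per key but does not promise which one, so the surviving values of colG and colH can differ between runs, and a filter on those columns then matches a different number of rows. The sketch below simulates that effect in plain Python (it does not call Spark). The sample rows and the helper names drop_duplicates, drop_duplicates_stable and count_matches are hypothetical.

```python
# Hypothetical rows: same (colA, colB, colC), different colG/colH.
ROWS = [
    {"colA": "A", "colB": "B", "colC": "C", "colG": "G", "colH": "H"},
    {"colA": "A", "colB": "B", "colC": "C", "colG": "X", "colH": "Y"},
]
KEY = ("colA", "colB", "colC")


def drop_duplicates(rows, subset):
    """Keep the first row seen for each key (order-dependent dedup)."""
    kept = {}
    for row in rows:
        kept.setdefault(tuple(row[c] for c in subset), row)
    return list(kept.values())


def drop_duplicates_stable(rows, subset):
    """Sort on every column first so the surviving row is always the same."""
    ordered = sorted(rows, key=lambda r: tuple(r[c] for c in sorted(r)))
    return drop_duplicates(ordered, subset)


def matches(row):
    """The filter from step (3): it touches colG/colH, outside the dedup key."""
    return (row["colA"] == "A" and row["colB"] == "B"
            and row["colG"] == "G" and row["colH"] == "H")


def count_matches(dedup, rows):
    """Deduplicate with the given function, then count rows passing the filter."""
    return sum(1 for row in dedup(rows, KEY) if matches(row))


# Two "runs" that see the same rows in a different order.
print(count_matches(drop_duplicates, ROWS))
print(count_matches(drop_duplicates, ROWS[::-1]))

# With a fixed tie-break the count no longer depends on arrival order.
print(count_matches(drop_duplicates_stable, ROWS))
print(count_matches(drop_duplicates_stable, ROWS[::-1]))
```

      If this is the cause, making the choice of surviving row deterministic (for example, by ranking rows within each key on an explicit ordering and keeping the first) should make step (3) repeatable.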

            People

              Assignee: Unassigned
              Reporter: Jeffrey (jeffrey.mak)
              Votes: 0
              Watchers: 3

              Dates

                Created:
                Updated:
                Resolved: