SPARK-26767

Filter on a dropDuplicates dataframe gives inconsistent results

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:
      None
    • Environment:

      Description

      To reproduce the problem:

      (1) Create a CSV file in which several records share the same values for a subset of columns (e.g. colA, colB, colC).

      (2) Read the CSV file as a Spark dataframe, then use dropDuplicates to deduplicate on that subset of columns (i.e. dropDuplicates(["colA", "colB", "colC"])).

      (3) Filter the resulting dataframe with a where clause (i.e. df.where("colA = 'A' and colB='B' and colG='G' and colH='H'").show(100,False)).

      => When (3) is rerun, it returns a different number of rows.
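
      The steps above can be modeled with a small plain-Python sketch. This is not Spark code and not a confirmed diagnosis. It assumes that deduplicating on a column subset keeps whichever matching row is encountered first, and that the encounter order can differ between runs. The helper names (drop_duplicates, where) and the sample rows are hypothetical. Under those assumptions, a filter on columns outside the subset (colG, colH) can match in one run and not in another.

```python
# Plain-Python model, not Spark code: keep the first row seen per key,
# as an illustration of a dedup that retains one arbitrary row per group.
def drop_duplicates(rows, subset):
    kept = {}
    for row in rows:
        key = tuple(row[c] for c in subset)
        kept.setdefault(key, row)  # first row seen for this key wins
    return list(kept.values())


def where(rows, **conditions):
    # Keep rows whose columns equal every requested value.
    return [r for r in rows if all(r[c] == v for c, v in conditions.items())]


# Hypothetical data: both rows share colA/colB/colC but differ in colG/colH.
rows = [
    {"colA": "A", "colB": "B", "colC": "C", "colG": "G", "colH": "H"},
    {"colA": "A", "colB": "B", "colC": "C", "colG": "X", "colH": "Y"},
]
subset = ["colA", "colB", "colC"]

# Same input, two different arrival orders (assumed to vary across reruns).
run1 = where(drop_duplicates(rows, subset),
             colA="A", colB="B", colG="G", colH="H")
run2 = where(drop_duplicates(list(reversed(rows)), subset),
             colA="A", colB="B", colG="G", colH="H")

print(len(run1), len(run2))  # prints "1 0"
```

      If this model matches what happens in the report, sorting or aggregating explicitly before deduplicating, so that the retained row is chosen deterministically, would be one way to check it.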

              People

              • Assignee: Unassigned
              • Reporter: Jeffrey (jeffrey.mak)
              • Votes: 0
              • Watchers: 4

                Dates

                • Created:
                  Updated: