Spark / SPARK-19732

DataFrame.fillna() does not work for bools in PySpark

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.1.0
    • Fix Version/s: 2.3.0
    • Component/s: PySpark
    • Labels: None

      Description

      In PySpark, DataFrame.fillna inadvertently converts a bool fill value to a float (Python treats bools as ints), so fillna cannot be used to fill nulls with True/False.

      For example, `spark.createDataFrame([Row(a=True), Row(a=None)]).fillna(True).collect()`
      yields
      `[Row(a=True), Row(a=None)]`
      The second Row should be a=True.

      The cause is this bit of code:

      if isinstance(value, (int, long)):
          value = float(value)
      

      There needs to be a separate isinstance(value, bool) check before this one, since in Python bool is a subclass of int, so bools pass the int check too.
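The ordering problem can be reproduced in plain Python without Spark. The sketch below is only an illustration: `normalize_fill_value` is a hypothetical helper name, not the function in PySpark and not the patch that resolved this issue. It assumes Python 3, where `long` no longer exists, so only `int` is checked.

```python
def normalize_fill_value(value):
    """Hypothetical helper, not Spark's actual code."""
    # bool must be tested before int: isinstance(True, int) is True in Python.
    if isinstance(value, bool):
        return value
    # Plain ints are still widened to float, as in the snippet quoted above.
    if isinstance(value, int):
        return float(value)
    return value


print(isinstance(True, int))             # True
print(repr(float(True)))                 # 1.0
print(repr(normalize_fill_value(True)))  # True
print(repr(normalize_fill_value(3)))     # 3.0
```

Without the bool branch, True would be turned into 1.0 before it ever reaches the column-type matching.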

      Additionally there's another anomaly:
      Spark (and PySpark) supports filling bools if you specify the arguments as a map:

      fillna({"a": False})
      

      but not if you pass the value directly:

      fillna(False)
      

      This is because Spark's Scala API has no

      def fill(value: Boolean): DataFrame = fill(value, df.columns)
      

      method. I find that strange, and possibly a bug.
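Since the report says the map form does work, one possible workaround on affected versions is to build that map from the boolean columns. The sketch below is an assumption-laden illustration: `bool_fill_map` is a hypothetical helper name, and the input is a plain list standing in for `df.dtypes` so that it runs without a Spark session.

```python
def bool_fill_map(dtypes, value):
    """Hypothetical helper: build a {column: value} map for boolean columns.

    `dtypes` is a list of (column_name, type_name) pairs, the shape that
    DataFrame.dtypes returns in PySpark.
    """
    if not isinstance(value, bool):
        raise TypeError("value must be a bool")
    return {name: value for name, type_name in dtypes if type_name == "boolean"}


# Stand-in for df.dtypes so the sketch runs without a Spark session.
example_dtypes = [("a", "boolean"), ("b", "bigint"), ("c", "boolean")]
fill_map = bool_fill_map(example_dtypes, False)
print(fill_map)  # {'a': False, 'c': False}
```

With a real DataFrame this would be used as `df.fillna(bool_fill_map(df.dtypes, False))`, which relies on the map form behaving as described above.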

            People

            • Assignee: Ruben Berenguel (RBerenguel)
            • Reporter: Len Frodgers (lenfrodge)