Spark / SPARK-19732

DataFrame.fillna() does not work for bools in PySpark


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.1.0
    • Fix Version/s: 2.3.0
    • Component/s: PySpark
    • Labels: None

    Description

      In PySpark, DataFrame.fillna() inadvertently treats a bool fill value as a number, so it cannot be used to fill nulls with True or False.

      For example, `spark.createDataFrame([Row(a=True), Row(a=None)]).fillna(True).collect()`
      yields
      `[Row(a=True), Row(a=None)]`
      The second Row should be a=True.

      The cause is this bit of code:

      if isinstance(value, (int, long)):
          value = float(value)
      

      There needs to be a separate isinstance(value, bool) check before this one, because in Python bool is a subclass of int, so True and False also match the int test.
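
A minimal sketch of the ordering such a check would need, assuming Python 3 (so `long` is dropped). The name `normalize_fill_value` is hypothetical and this is not PySpark's actual code:

```python
def normalize_fill_value(value):
    """Hypothetical helper: same idea as the snippet above, with bool tested first."""
    # bool must be tested before int, because isinstance(True, int) is True.
    if isinstance(value, bool):
        return value
    if isinstance(value, int):
        return float(value)
    return value


print(isinstance(True, int))       # bool is a subclass of int
print(normalize_fill_value(True))  # stays a bool
print(normalize_fill_value(3))     # becomes a float
```

Testing for bool first leaves the numeric conversion unchanged for genuine ints while letting True/False pass through untouched.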

      Additionally, there is another anomaly:
      Spark (and PySpark) supports filling bools if you pass the arguments as a map:

      fillna({"a": False})
      

      but not if you pass the value as a bare scalar:

      fillna(False)
      

      This is because Spark's Scala API has no

      def fill(value: Boolean): DataFrame = fill(value, df.columns)
      

      method. I find that strange, and possibly a bug.
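
Since the map form is reported to work, one possible workaround is to build that map for every boolean column. The sketch below assumes the `(name, type_string)` pairs that `DataFrame.dtypes` returns, with `"boolean"` as the type string for bool columns. `bool_fill_map` is a hypothetical helper, not part of PySpark:

```python
def bool_fill_map(dtypes, value):
    """Build a {column: value} dict covering only the boolean columns.

    `dtypes` is a list of (column_name, type_string) pairs, the shape
    returned by DataFrame.dtypes. Hypothetical helper for illustration.
    """
    if not isinstance(value, bool):
        raise TypeError("value must be a bool")
    return {name: value for name, dtype in dtypes if dtype == "boolean"}


# Stand-in for df.dtypes so the sketch runs without a Spark session.
example_dtypes = [("a", "boolean"), ("b", "int"), ("c", "boolean")]
fill_map = bool_fill_map(example_dtypes, False)
print(fill_map)

# With a real DataFrame this would be used as:
#   df.fillna(bool_fill_map(df.dtypes, False))
```

Restricting the map to boolean columns avoids pushing a bool into numeric or string columns.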


          People

            Assignee: Ruben Berenguel (RBerenguel)
            Reporter: Len Frodgers (lenfrodge)
            Votes: 0
            Watchers: 6
