  1. Spark
  2. SPARK-31600

Error message from DataFrame creation is misleading.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 2.4.5
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None
    • Environment: Databricks 6.4, Spark 2.4.5, Scala 2.11

      Description

      Creating a Spark DataFrame from a pandas.DataFrame fails when one of the columns contains only NaN values. The failure itself is acceptable.

      However, the error message names the wrong column as the culprit, which makes the root cause hard to find.

      How to reproduce:

       

      import numpy as np
      import pandas as pd
      df2 = pd.DataFrame({'a': np.array([np.nan, np.nan], dtype=np.object_), 'b': [np.nan, 'aaa']})
      display(spark.createDataFrame(df2[['b']]))   # Works fine
      spark.createDataFrame(df2)            # Raises TypeError.
      

      In the code above, column 'a' is the problematic one: it holds only NaN values with object dtype. However, the `TypeError` raised by the last command names column 'b' as the culprit:

      TypeError: field b: Can not merge type <class 'pyspark.sql.types.DoubleType'> and <class 'pyspark.sql.types.StringType'>
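
      Because the message points at 'b', it can help to inspect the input before handing it to Spark. The following is a minimal sketch, assuming the reporter's diagnosis that an all-NaN column is the trigger. It uses only the standard library and does not reproduce Spark's schema inference; the helper names `is_missing` and `suspect_columns` are hypothetical and exist only for this illustration.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def suspect_columns(columns):
    """Report columns that are entirely missing or that mix Python types.

    `columns` maps a column name to the list of its values.
    Hypothetical helper: a pre-flight check, not part of any Spark API.
    """
    report = {}
    for name, values in columns.items():
        present = [v for v in values if not is_missing(v)]
        kinds = sorted({type(v).__name__ for v in present})
        if not present:
            report[name] = "all values missing"
        elif len(kinds) > 1:
            report[name] = "mixed types: " + ", ".join(kinds)
    return report


# Same data as the reproduction above, written as plain Python lists.
data = {"a": [float("nan"), float("nan")], "b": [float("nan"), "aaa"]}
print(suspect_columns(data))  # {'a': 'all values missing'}
```

      With pandas available, the same check could be run on the original frame via `suspect_columns(df2.to_dict('list'))`. Here it flags column 'a', the one the reporter identifies as bad, rather than 'b'.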

       

            People

            • Assignee: Unassigned
            • Reporter: Olexiy Oryeshko (olexiy)
            • Votes: 0
            • Watchers: 2
