Spark / SPARK-13748

Document behavior of createDataFrame and rows with omitted fields

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 1.6.0
    • Fix Version/s: 2.2.0
    • Component/s: Documentation, PySpark, SQL
    • Labels: None

    Description

      I found it confusing that a Row with an omitted field is treated differently from a Row with the field present but its value missing (None). This was originally a problem for JSON files with varying fields, but it comes down to something like:

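      from pyspark.sql import Row

      # Assumes `sc` and `sqlContext` are already defined, as in the pyspark shell.
      def test(rows):
          ds = sc.parallelize(rows)
          # schema=None, samplingRatio=1: infer the schema from the rows
          df = sqlContext.createDataFrame(ds, None, 1)
          print(df[['y']].collect())

      test([Row(x=1, y=None), Row(x=2, y='asdf')])  # Works
      test([Row(x=1), Row(x=2, y='asdf')])  # Fails with an ArrayIndexOutOfBoundsException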

      Maybe more could be said in the documentation for createDataFrame or Row about what is expected. Validation or automatic correction would be helpful, as would a function that creates a well-formed Row from a StructType and a dictionary.
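
      As a rough illustration of the kind of helper requested above, the sketch below pads each dictionary to a fixed list of field names, using None for omitted fields. The name normalize_record is hypothetical and not part of PySpark, and the field list is written out by hand here; with a real StructType it could presumably be taken from schema.names.

```python
def normalize_record(record, field_names):
    """Return a tuple with one value per field name, using None for omitted fields.

    Hypothetical helper for illustration only; it is not part of PySpark.
    """
    # Reject keys that the schema does not know about instead of dropping them silently.
    unknown = set(record) - set(field_names)
    if unknown:
        raise ValueError("unexpected fields: %s" % sorted(unknown))
    # dict.get returns None when the key is absent, which fills omitted fields.
    return tuple(record.get(name) for name in field_names)


# Assumed field order; every output tuple follows it.
field_names = ["x", "y"]
records = [{"x": 1}, {"x": 2, "y": "asdf"}]

normalized = [normalize_record(r, field_names) for r in records]
print(normalized)  # [(1, None), (2, 'asdf')]
```

      Because every tuple then has the same length and field order, the result could be passed to createDataFrame together with an explicit schema rather than relying on inference from rows of differing shapes.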

            People

              Assignee: Hyukjin Kwon (gurwls223)
              Reporter: Ethan Aubin (eaubin234)
              Votes: 0
              Watchers: 2

              Dates

                Created:
                Updated:
                Resolved: