Spark / SPARK-13748

Document behavior of createDataFrame and rows with omitted fields


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 1.6.0
    • Fix Version/s: 2.2.0
    • Component/s: Documentation, PySpark, SQL
    • Labels: None

      Description

      I found it confusing that a Row with an omitted field is different from a Row with the field present but its value set to None. This was originally a problem with JSON files whose records have varying fields, but it comes down to something like:

      from pyspark.sql import Row

      def test(rows):
          ds = sc.parallelize(rows)
          # schema=None, samplingRatio=1: infer the schema from the data
          df = sqlContext.createDataFrame(ds, None, 1)
          print(df[['y']].collect())

      test([Row(x=1, y=None), Row(x=2, y='asdf')])  # Works
      test([Row(x=1), Row(x=2, y='asdf')])          # Fails with an ArrayIndexOutOfBoundsException
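
      As a side note (not in the original report, but easy to check in a PySpark shell): the two Row forms are genuinely different objects rather than two spellings of the same thing, which is presumably why the inferred two-field schema cannot be applied to the shorter row:

      from pyspark.sql import Row

      # Row is a tuple subclass; omitting a field produces a shorter tuple,
      # while an explicit None keeps the field and its position.
      print(len(Row(x=1)))          # 1
      print(len(Row(x=1, y=None)))  # 2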

      Maybe more could be said in the documentation for createDataFrame or Row about what is expected. Validation or correction would also be helpful, as would a function that creates a well-formed row from a StructType and a dictionary (a rough sketch follows).
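
      Below is a minimal sketch of the kind of helper asked for above, assuming an explicit StructType is available; row_from_dict is a hypothetical name, not part of the PySpark API. It produces a value for every field in the schema, filling None for missing keys, so every row has the same width:

      from pyspark.sql.types import StructType, StructField, LongType, StringType

      def row_from_dict(schema, d):
          # Return a tuple in schema field order; keys missing from d become None.
          return tuple(d.get(f.name) for f in schema.fields)

      schema = StructType([StructField('x', LongType()), StructField('y', StringType())])
      rows = [row_from_dict(schema, d) for d in [{'x': 1}, {'x': 2, 'y': 'asdf'}]]
      df = sqlContext.createDataFrame(rows, schema)
      print(df[['y']].collect())  # y comes back as None for the first row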

    People

    • Assignee: hyukjin.kwon (Hyukjin Kwon)
    • Reporter: eaubin234 (Ethan Aubin)
    • Votes: 0
    • Watchers: 2
