Spark / SPARK-13748

Document behavior of createDataFrame and rows with omitted fields


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 1.6.0
    • Fix Version/s: 2.2.0
    • Component/s: Documentation, PySpark, SQL
    • Labels: None

      Description

      I found it confusing that a Row with an omitted field is different from a Row with the field present but its value set to None. This was originally a problem with JSON files whose records have varying fields, but it comes down to something like:

      from pyspark.sql import Row

      def test(rows):
          ds = sc.parallelize(rows)
          # schema=None, samplingRatio=1: infer the schema from the data
          df = sqlContext.createDataFrame(ds, None, 1)
          print(df[['y']].collect())

      test([Row(x=1, y=None), Row(x=2, y='asdf')])  # Works
      test([Row(x=1), Row(x=2, y='asdf')])          # Fails with an ArrayIndexOutOfBoundsException
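
      As a side note (not in the original report, but easy to check in a PySpark shell): the two Row forms are genuinely different objects rather than two spellings of the same thing, which is presumably why the inferred two-field schema cannot be applied to the shorter row:

      from pyspark.sql import Row

      # Row is a tuple subclass; omitting a field produces a shorter tuple,
      # while an explicit None keeps the field and its position.
      print(len(Row(x=1)))          # 1
      print(len(Row(x=1, y=None)))  # 2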

      Maybe more could be said in the documentation for createDataFrame or Row about what is expected. Validation or correction would also be helpful, as would a function that creates a well-formed row from a StructType and a dictionary (a rough sketch follows).
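
      Below is a minimal sketch of the kind of helper asked for above, assuming an explicit StructType is available; row_from_dict is a hypothetical name, not part of the PySpark API. It produces a value for every field in the schema, filling None for missing keys, so every row has the same width:

      from pyspark.sql.types import StructType, StructField, LongType, StringType

      def row_from_dict(schema, d):
          # Return a tuple in schema field order; keys missing from d become None.
          return tuple(d.get(f.name) for f in schema.fields)

      schema = StructType([StructField('x', LongType()), StructField('y', StringType())])
      rows = [row_from_dict(schema, d) for d in [{'x': 1}, {'x': 2, 'y': 'asdf'}]]
      df = sqlContext.createDataFrame(rows, schema)
      print(df[['y']].collect())  # y comes back as None for the first row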

    People

    • Assignee: hyukjin.kwon (Hyukjin Kwon)
    • Reporter: eaubin234 (Ethan Aubin)
    • Votes: 0
    • Watchers: 2
