Details
- Improvement
- Status: Resolved
- Trivial
- Resolution: Fixed
- 1.6.0
- None
Description
I found it confusing that a Row with an omitted field is treated differently from a Row where the field is present but its value is missing. This was originally a problem with JSON files that have varying fields, but it comes down to something like:
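# Assumes the pyspark shell, where sc and sqlContext are predefined.
from pyspark.sql import Row

def test(rows):
    ds = sc.parallelize(rows)
    # schema=None, samplingRatio=1: infer the schema from all rows
    df = sqlContext.createDataFrame(ds, None, 1)
    print(df[['y']].collect())

test([Row(x=1, y=None), Row(x=2, y='asdf')])  # Works
test([Row(x=1), Row(x=2, y='asdf')])  # Fails with an ArrayIndexOutOfBoundsException.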
Maybe more could be said in the documentation for createDataFrame or Row about what is expected. Validation or correction would be helpful, as would a function that creates a well-formed Row from a StructType and a dictionary.
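As a rough illustration of the kind of helper meant here, the sketch below pads each record so that every expected field is present, using None for anything omitted. It is plain Python and not part of Spark: the name fill_missing_fields and the field_names list are hypothetical, and with PySpark the field names could presumably be taken from the schema's names attribute.

```python
def fill_missing_fields(field_names, record):
    """Return a dict with exactly the keys in field_names.

    Fields absent from record are filled with None; fields that are not
    in field_names are rejected rather than silently dropped.
    """
    unknown = set(record) - set(field_names)
    if unknown:
        raise ValueError("unexpected fields: %s" % sorted(unknown))
    return {name: record.get(name) for name in field_names}


# Hypothetical field list standing in for a StructType's field names.
field_names = ["x", "y"]
records = [{"x": 1}, {"x": 2, "y": "asdf"}]

# Every normalized record now carries both x and y.
normalized = [fill_missing_fields(field_names, r) for r in records]
print(normalized)
```

Each normalized dict could then be turned into a row with Row(**d), so that all rows passed to createDataFrame have the same fields.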
Issue Links
- is related to
- SPARK-12624 When schema is specified, we should give better error message if actual row length doesn't match
- Resolved
- links to