Details
- Type: Sub-task
- Status: Closed
- Priority: Major
- Resolution: Won't Fix
- Affects Version/s: 1.3.0
Description
In pandas it is common to use numpy.nan as the null value, for example to mark missing data.
http://pandas.pydata.org/pandas-docs/dev/gotchas.html#nan-integer-na-values-and-na-type-promotions
http://stackoverflow.com/questions/17534106/what-is-the-difference-between-nan-and-none
http://pandas.pydata.org/pandas-docs/dev/missing_data.html#filling-missing-values-fillna
createDataFrame, however, only accepts None as a null value; None entries are carried through as None in the RDD.
I suggest adding support for np.nan values in pandas DataFrames.
Current stack trace when calling createDataFrame on a pandas DataFrame whose object-type columns contain np.nan values (which are floats):
TypeError                                 Traceback (most recent call last)
<ipython-input-38-34f0263f0bf4> in <module>()
----> 1 sqldf = sqlCtx.createDataFrame(df_, schema=schema)

/opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
    339             schema = self._inferSchema(data.map(lambda r: row_cls(*r)), samplingRatio)
    340
--> 341         return self.applySchema(data, schema)
    342
    343     def registerDataFrameAsTable(self, rdd, tableName):

/opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in applySchema(self, rdd, schema)
    246
    247         for row in rows:
--> 248             _verify_type(row, schema)
    249
    250         # convert python objects to sql data

/opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType)
   1064                     "length of fields (%d)" % (len(obj), len(dataType.fields)))
   1065         for v, f in zip(obj, dataType.fields):
-> 1066             _verify_type(v, f.dataType)
   1067
   1068 _cached_cls = weakref.WeakValueDictionary()

/opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType)
   1048         if type(obj) not in _acceptable_types[_type]:
   1049             raise TypeError("%s can not accept object in type %s"
-> 1050                             % (dataType, type(obj)))
   1051
   1052     if isinstance(dataType, ArrayType):

TypeError: StringType can not accept object in type <type 'float'>
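A possible workaround, sketched here under assumptions rather than taken from Spark itself, is to replace NaN with None in the rows before they reach createDataFrame. The helper names nan_to_none and clean_rows are hypothetical, and the sample rows stand in for tuples taken from a pandas DataFrame. Note that this also turns NaN in genuinely numeric columns into null, which may or may not be what you want.

```python
import math


def nan_to_none(value):
    """Return None for a float NaN (np.nan is a plain float); pass anything else through."""
    if isinstance(value, float) and math.isnan(value):
        return None
    return value


def clean_rows(rows):
    """Apply nan_to_none to every field of every row, returning a list of tuples."""
    return [tuple(nan_to_none(v) for v in row) for row in rows]


# Hypothetical sample data: a string column and a float column, each with a NaN.
rows = [("a", 1.0), (float("nan"), 2.0), ("c", float("nan"))]
cleaned = clean_rows(rows)
print(cleaned)
```

With pandas, the input rows could be taken from df_.itertuples(index=False), and the cleaned list passed as sqlCtx.createDataFrame(cleaned, schema=schema), mirroring the call in the trace above. This has not been verified against Spark 1.3.0.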
Issue Links
- is related to SPARK-8797 Sorting float/double column containing NaNs can lead to "Comparison method violates its general contract!" errors (Resolved)