
Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.3.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

In pandas it is common to use numpy.nan as the null value for missing data:

      http://pandas.pydata.org/pandas-docs/dev/gotchas.html#nan-integer-na-values-and-na-type-promotions
      http://stackoverflow.com/questions/17534106/what-is-the-difference-between-nan-and-none
      http://pandas.pydata.org/pandas-docs/dev/missing_data.html#filling-missing-values-fillna
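
      For illustration (a sketch of mine, not from the original report): one of the promotions described in the gotchas link above is that introducing a missing entry into an integer Series silently promotes it to float, with np.nan filling the hole.

      import numpy as np
      import pandas as pd

      s = pd.Series([1, 2, 3])        # dtype: int64
      s2 = s.reindex([0, 1, 2, 3])    # index label 3 has no value
      print(s2.dtype)                 # float64 -- promoted from int64
      print(np.isnan(s2[3]))          # True -- the hole is np.nan, not None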

createDataFrame, however, only works with None as the null value, parsing it as None in the RDD.

I suggest adding support for np.nan values in pandas DataFrames.
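
      A minimal reproduction of what I mean (my sketch against the 1.3.0 API; the column name "name" is made up, and sqlCtx is the SQLContext from the trace below):

      import numpy as np
      import pandas as pd
      from pyspark.sql.types import StructType, StructField, StringType

      # object-dtype column where the missing entry is np.nan (a float)
      df_ = pd.DataFrame({"name": ["alice", np.nan]})
      schema = StructType([StructField("name", StringType(), True)])

      # fails with the TypeError shown below; with None instead of np.nan it succeeds
      sqldf = sqlCtx.createDataFrame(df_, schema=schema)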

Current stack trace when calling createDataFrame on a pandas DataFrame whose object-type columns contain np.nan values (which are floats):

      TypeError                                 Traceback (most recent call last)
      <ipython-input-38-34f0263f0bf4> in <module>()
      ----> 1 sqldf = sqlCtx.createDataFrame(df_, schema=schema)
      
      /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
          339             schema = self._inferSchema(data.map(lambda r: row_cls(*r)), samplingRatio)
          340 
      --> 341         return self.applySchema(data, schema)
          342 
          343     def registerDataFrameAsTable(self, rdd, tableName):
      
      /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in applySchema(self, rdd, schema)
          246 
          247         for row in rows:
      --> 248             _verify_type(row, schema)
          249 
          250         # convert python objects to sql data
      
      /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType)
         1064                              "length of fields (%d)" % (len(obj), len(dataType.fields)))
         1065         for v, f in zip(obj, dataType.fields):
      -> 1066             _verify_type(v, f.dataType)
         1067 
         1068 _cached_cls = weakref.WeakValueDictionary()
      
      /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType)
         1048     if type(obj) not in _acceptable_types[_type]:
         1049         raise TypeError("%s can not accept object in type %s"
      -> 1050                         % (dataType, type(obj)))
         1051 
         1052     if isinstance(dataType, ArrayType):
      
      TypeError: StringType can not accept object in type <type 'float'>
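
      As a workaround until np.nan is supported, one can swap NaN for None on the pandas side before the conversion. A sketch, not a guaranteed fix across pandas versions: df.where keeps values where the condition is True and substitutes the second argument elsewhere, which upcasts numeric columns to object dtype.

      import pandas as pd

      # substitute None wherever the frame is null, then convert as before
      df_clean = df_.where(pd.notnull(df_), None)
      sqldf = sqlCtx.createDataFrame(df_clean, schema=schema)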

People

    • Assignee: Davies Liu (davies)
    • Reporter: Fabian Boehnlein (fabboe)
    • Votes: 3
    • Watchers: 8

