
Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.3.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

In pandas it is common to use numpy.nan as the null value for missing data:

      http://pandas.pydata.org/pandas-docs/dev/gotchas.html#nan-integer-na-values-and-na-type-promotions
      http://stackoverflow.com/questions/17534106/what-is-the-difference-between-nan-and-none
      http://pandas.pydata.org/pandas-docs/dev/missing_data.html#filling-missing-values-fillna
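
      For illustration (a sketch of mine, not from the original report): one of the promotions described in the gotchas link above is that introducing a missing entry into an integer Series silently promotes it to float, with np.nan filling the hole.

      import numpy as np
      import pandas as pd

      s = pd.Series([1, 2, 3])        # dtype: int64
      s2 = s.reindex([0, 1, 2, 3])    # index label 3 has no value
      print(s2.dtype)                 # float64 -- promoted from int64
      print(np.isnan(s2[3]))          # True -- the hole is np.nan, not None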

createDataFrame, however, only works with None as the null value, parsing it as None in the RDD.

I suggest adding support for np.nan values in pandas DataFrames.
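
      A minimal reproduction of what I mean (my sketch against the 1.3.0 API; the column name "name" is made up, and sqlCtx is the SQLContext from the trace below):

      import numpy as np
      import pandas as pd
      from pyspark.sql.types import StructType, StructField, StringType

      # object-dtype column where the missing entry is np.nan (a float)
      df_ = pd.DataFrame({"name": ["alice", np.nan]})
      schema = StructType([StructField("name", StringType(), True)])

      # fails with the TypeError shown below; with None instead of np.nan it succeeds
      sqldf = sqlCtx.createDataFrame(df_, schema=schema)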

Current stack trace when calling createDataFrame on a pandas DataFrame whose object-type columns contain np.nan values (which are floats):

      TypeError                                 Traceback (most recent call last)
      <ipython-input-38-34f0263f0bf4> in <module>()
      ----> 1 sqldf = sqlCtx.createDataFrame(df_, schema=schema)
      
      /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
          339             schema = self._inferSchema(data.map(lambda r: row_cls(*r)), samplingRatio)
          340 
      --> 341         return self.applySchema(data, schema)
          342 
          343     def registerDataFrameAsTable(self, rdd, tableName):
      
      /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in applySchema(self, rdd, schema)
          246 
          247         for row in rows:
      --> 248             _verify_type(row, schema)
          249 
          250         # convert python objects to sql data
      
      /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType)
         1064                              "length of fields (%d)" % (len(obj), len(dataType.fields)))
         1065         for v, f in zip(obj, dataType.fields):
      -> 1066             _verify_type(v, f.dataType)
         1067 
         1068 _cached_cls = weakref.WeakValueDictionary()
      
      /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType)
         1048     if type(obj) not in _acceptable_types[_type]:
         1049         raise TypeError("%s can not accept object in type %s"
      -> 1050                         % (dataType, type(obj)))
         1051 
         1052     if isinstance(dataType, ArrayType):
      
      TypeError: StringType can not accept object in type <type 'float'>
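
      As a workaround until np.nan is supported, one can swap NaN for None on the pandas side before the conversion. A sketch, not a guaranteed fix across pandas versions: df.where keeps values where the condition is True and substitutes the second argument elsewhere, which upcasts numeric columns to object dtype.

      import pandas as pd

      # substitute None wherever the frame is null, then convert as before
      df_clean = df_.where(pd.notnull(df_), None)
      sqldf = sqlCtx.createDataFrame(df_clean, schema=schema)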

People

    • Assignee: Davies Liu (davies)
    • Reporter: Fabian Boehnlein (fabboe)
    • Votes: 3
    • Watchers: 8

