Spark / SPARK-10392

Pyspark - Wrong DateType support on JDBC connection


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.1
    • Fix Version/s: 1.5.1, 1.6.0
    • Component/s: PySpark, SQL
    • Labels:
      None

      Description

      I have the following problem. I created a table:

      CREATE TABLE `spark_test` (
      	`id` INT(11) NULL,
      	`date` DATE NULL
      )
      COLLATE='utf8_general_ci'
      ENGINE=InnoDB
      ;
      INSERT INTO `spark_test` (`id`, `date`) VALUES (1, '1970-01-01');
      

      Then, when I try to read the data, the date '1970-01-01' is converted to an int. This makes the data frame incompatible with its own schema.

      df = sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", 'spark_test')
      print(df.collect())
      df = sqlCtx.createDataFrame(df.rdd, df.schema)
      
      [Row(id=1, date=0)]
      ---------------------------------------------------------------------------
      TypeError                                 Traceback (most recent call last)
      <ipython-input-36-ebc1d94e0d8c> in <module>()
            1 df = sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", 'spark_test')
            2 print(df.collect())
      ----> 3 df = sqlCtx.createDataFrame(df.rdd, df.schema)
      
      /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
          402 
          403         if isinstance(data, RDD):
      --> 404             rdd, schema = self._createFromRDD(data, schema, samplingRatio)
          405         else:
          406             rdd, schema = self._createFromLocal(data, schema)
      
      /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, schema, samplingRatio)
          296             rows = rdd.take(10)
          297             for row in rows:
      --> 298                 _verify_type(row, schema)
          299 
          300         else:
      
      /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
         1152                              "length of fields (%d)" % (len(obj), len(dataType.fields)))
         1153         for v, f in zip(obj, dataType.fields):
      -> 1154             _verify_type(v, f.dataType)
         1155 
         1156 
      
      /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
         1136         # subclass of them can not be fromInternald in JVM
         1137         if type(obj) not in _acceptable_types[_type]:
      -> 1138             raise TypeError("%s can not accept object in type %s" % (dataType, type(obj)))
         1139 
         1140     if isinstance(dataType, ArrayType):
      
      TypeError: DateType can not accept object in type <class 'int'>
      
      
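      As a workaround until the fix lands, the integer can be mapped back to a `datetime.date` before round-tripping the RDD through `createDataFrame`. This is a sketch only: it assumes the buggy JDBC reader returns the raw value as days since the Unix epoch, which is consistent with `0` coming back for '1970-01-01'. The helper name `days_to_date` is hypothetical.

```python
import datetime

EPOCH = datetime.date(1970, 1, 1)

def days_to_date(n):
    # Convert a days-since-epoch integer (the raw value the buggy
    # JDBC reader appears to return) back into a datetime.date.
    return EPOCH + datetime.timedelta(days=n)

# Hypothetical fix-up before re-creating the DataFrame
# (sqlCtx and df are the names from the repro above):
#   fixed = df.rdd.map(lambda r: (r.id, days_to_date(r.date)))
#   df = sqlCtx.createDataFrame(fixed, df.schema)

print(days_to_date(0))
```

      With this conversion applied, each row's `date` field is a `datetime.date`, which `_verify_type` accepts for `DateType`.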


            People

            • Assignee:
              0x0fff Alexey Grishchenko
              Reporter:
              maver1ck Maciej BryƄski
            • Votes:
              0
              Watchers:
              3
