Spark / SPARK-10392

Pyspark - Wrong DateType support on JDBC connection


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.1
    • Fix Version/s: 1.5.1, 1.6.0
    • Component/s: PySpark, SQL
    • Labels:
      None

      Description

      I have the following problem. I created a table:

      CREATE TABLE `spark_test` (
      	`id` INT(11) NULL,
      	`date` DATE NULL
      )
      COLLATE='utf8_general_ci'
      ENGINE=InnoDB
      ;
      INSERT INTO `spark_test` (`id`, `date`) VALUES (1, '1970-01-01');
      

      Then, when I try to read the data, the date '1970-01-01' is converted to an int. This makes the data frame incompatible with its own schema.

      df = sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", 'spark_test')
      print(df.collect())
      df = sqlCtx.createDataFrame(df.rdd, df.schema)
      
      [Row(id=1, date=0)]
      ---------------------------------------------------------------------------
      TypeError                                 Traceback (most recent call last)
      <ipython-input-36-ebc1d94e0d8c> in <module>()
            1 df = sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", 'spark_test')
            2 print(df.collect())
      ----> 3 df = sqlCtx.createDataFrame(df.rdd, df.schema)
      
      /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
          402 
          403         if isinstance(data, RDD):
      --> 404             rdd, schema = self._createFromRDD(data, schema, samplingRatio)
          405         else:
          406             rdd, schema = self._createFromLocal(data, schema)
      
      /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, schema, samplingRatio)
          296             rows = rdd.take(10)
          297             for row in rows:
      --> 298                 _verify_type(row, schema)
          299 
          300         else:
      
      /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
         1152                              "length of fields (%d)" % (len(obj), len(dataType.fields)))
         1153         for v, f in zip(obj, dataType.fields):
      -> 1154             _verify_type(v, f.dataType)
         1155 
         1156 
      
      /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
         1136         # subclass of them can not be fromInternald in JVM
         1137         if type(obj) not in _acceptable_types[_type]:
      -> 1138             raise TypeError("%s can not accept object in type %s" % (dataType, type(obj)))
         1139 
         1140     if isinstance(dataType, ArrayType):
      
      TypeError: DateType can not accept object in type <class 'int'>
      
      
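      As a workaround until the fix lands, the integer can be mapped back to a `datetime.date` before round-tripping the RDD through `createDataFrame`. This is a sketch only: it assumes the buggy JDBC reader returns the raw value as days since the Unix epoch, which is consistent with `0` coming back for '1970-01-01'. The helper name `days_to_date` is hypothetical.

```python
import datetime

EPOCH = datetime.date(1970, 1, 1)

def days_to_date(n):
    # Convert a days-since-epoch integer (the raw value the buggy
    # JDBC reader appears to return) back into a datetime.date.
    return EPOCH + datetime.timedelta(days=n)

# Hypothetical fix-up before re-creating the DataFrame
# (sqlCtx and df are the names from the repro above):
#   fixed = df.rdd.map(lambda r: (r.id, days_to_date(r.date)))
#   df = sqlCtx.createDataFrame(fixed, df.schema)

print(days_to_date(0))
```

      With this conversion applied, each row's `date` field is a `datetime.date`, which `_verify_type` accepts for `DateType`.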


            People

            • Assignee:
              0x0fff Alexey Grishchenko
              Reporter:
              maver1ck Maciej BryƄski
            • Votes:
              0
              Watchers:
              3
