[SPARK-22566] Better error message for `_merge_type` in Pandas to Spark DF conversion - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.2.0
Fix Version/s: 2.3.0
Component/s: PySpark
Labels:
None

Description

When a Spark DataFrame is created from a Pandas DataFrame without an explicit schema, the schema is inferred. Inference can fail when a column contains values of two different types, which is expected. The problem is that the error message does not say which column caused the failure.

This makes such failures painful to debug, because the message names only the two conflicting types and gives no hint about where in the data they occur.

I plan to submit a PR that fixes this by providing a better error message for such cases, one that contains the column name (and possibly the problematic values too).

>>> spark_session.createDataFrame(pandas_df)
  File "redacted/pyspark/sql/session.py", line 541, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "redacted/pyspark/sql/session.py", line 401, in _createFromLocal
    struct = self._inferSchemaFromList(data)
  File "redacted/pyspark/sql/session.py", line 333, in _inferSchemaFromList
    schema = reduce(_merge_type, map(_infer_schema, data))
  File "redacted/pyspark/sql/types.py", line 1124, in _merge_type
    for f in a.fields]
  File "redacted/pyspark/sql/types.py", line 1118, in _merge_type
    raise TypeError("Can not merge type %s and %s" % (type(a), type(b)))
TypeError: Can not merge type <class 'pyspark.sql.types.LongType'> and <class 'pyspark.sql.types.StringType'>
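
To illustrate the kind of message proposed here, the following is a minimal pure-Python sketch, not PySpark's actual code. The helpers `infer_type`, `merge_type`, and `infer_schema` are hypothetical stand-ins for the inference and merge steps seen in the traceback, and the `name` parameter and the "field ...:" prefix are assumptions about how a column name might be threaded into the error.

```python
from functools import reduce


def infer_type(value):
    """Toy stand-in for per-value type inference (hypothetical helper)."""
    if value is None:
        return "null"
    return type(value).__name__


def merge_type(a, b, name=None):
    """Merge two inferred types; on conflict, name the offending column.

    The `name` argument is an assumption made for this sketch.
    """
    if a == "null":
        return b
    if b == "null":
        return a
    if a != b:
        prefix = "" if name is None else "field %s: " % name
        raise TypeError("%sCan not merge type %s and %s" % (prefix, a, b))
    return a


def infer_schema(rows):
    """Infer a {column: type} mapping from dict rows, merging row by row."""
    def merge_row(schema, row):
        merged = dict(schema)
        for col, value in row.items():
            merged[col] = merge_type(
                merged.get(col, "null"), infer_type(value), name=col
            )
        return merged

    return reduce(merge_row, rows, {})


if __name__ == "__main__":
    rows = [{"id": 1, "name": "a"}, {"id": "x", "name": "b"}]
    try:
        infer_schema(rows)
    except TypeError as exc:
        print(exc)
```

With the column name carried through the merge, a conflict in the hypothetical `id` column above is reported against that column instead of only listing the two types.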

Issue Links

links to

[Github] Pull Request #19792 (gberger)

People

Assignee: Guilherme Berger

Reporter: Guilherme Berger

Votes: 0

Watchers: 3

Dates

Created: 20/Nov/17 19:06

Updated: 08/Jan/18 05:33

Resolved: 08/Jan/18 05:33