Spark / SPARK-10847

PySpark - DataFrame - Optional Metadata with `None` triggers cryptic failure

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.5.3, 1.6.1, 2.0.0
    • Component/s: PySpark, SQL
    • Labels: None
    • Environment: Windows 7
      java version "1.8.0_60" (64-bit)
      Python 3.4.x

      Standalone cluster mode (not local[n]; a full local cluster)

    Description

      If the optional metadata passed to `pyspark.sql.types.StructField` includes a Pythonic `None`, then `pyspark.sql.SQLContext.createDataFrame` fails with a very cryptic, unhelpful error.

      Here is a minimal reproducible example:

      # Assumes a SparkContext `sc` already exists
      import pyspark.sql.types as types
      from pyspark.sql import SQLContext

      sqlContext = SQLContext(sc)

      literal_metadata = types.StructType([
          types.StructField(
              'name',
              types.StringType(),
              nullable=True,
              metadata={'comment': 'From accounting system.'}
              ),
          types.StructField(
              'age',
              types.IntegerType(),
              nullable=True,
              metadata={'comment': None}  # this None triggers the failure
              ),
          ])

      literal_rdd = sc.parallelize([
          ['Bob', 34],
          ['Dan', 42],
          ])
      print(literal_rdd.take(2))

      # Fails on affected versions with the error below
      failed_dataframe = sqlContext.createDataFrame(
          literal_rdd,
          literal_metadata,
          )
      

      This produces the following stack trace:

      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "<string>", line 28, in <module>
        File "S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\pyspark\sql\context.py", line 408, in createDataFrame
          jdf = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
        File "S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 538, in __call__
        File "S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\pyspark\sql\utils.py", line 36, in deco
          return f(*a, **kw)
        File "S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
      py4j.protocol.Py4JJavaError: An error occurred while calling o757.applySchemaToPythonRDD.
      : java.lang.RuntimeException: Do not support type class scala.Tuple2.
      	at org.apache.spark.sql.types.Metadata$$anonfun$fromJObject$1.apply(Metadata.scala:160)
      	at org.apache.spark.sql.types.Metadata$$anonfun$fromJObject$1.apply(Metadata.scala:127)
      	at scala.collection.immutable.List.foreach(List.scala:318)
      	at org.apache.spark.sql.types.Metadata$.fromJObject(Metadata.scala:127)
      	at org.apache.spark.sql.types.DataType$.org$apache$spark$sql$types$DataType$$parseStructField(DataType.scala:173)
      	at org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:148)
      	at org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:148)
      	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
      	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
      	at scala.collection.immutable.List.foreach(List.scala:318)
      	at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
      	at scala.collection.AbstractTraversable.map(Traversable.scala:105)
      	at org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:148)
      	at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:96)
      	at org.apache.spark.sql.SQLContext.parseDataType(SQLContext.scala:961)
      	at org.apache.spark.sql.SQLContext.applySchemaToPythonRDD(SQLContext.scala:970)
      	at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
      	at java.lang.reflect.Method.invoke(Unknown Source)
      	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
      	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
      	at py4j.Gateway.invoke(Gateway.java:259)
      	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
      	at py4j.commands.CallCommand.execute(CallCommand.java:79)
      	at py4j.GatewayConnection.run(GatewayConnection.java:207)
      	at java.lang.Thread.run(Unknown Source)
      

      I believe the most important line of the traceback is this one:

      py4j.protocol.Py4JJavaError: An error occurred while calling o757.applySchemaToPythonRDD.
      : java.lang.RuntimeException: Do not support type class scala.Tuple2.
      

      But it wasn't enough for me to figure out the problem; I had to steadily simplify my program until I could identify the cause.
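
      The message makes more sense once you see what actually crosses the Python/JVM boundary. As the traceback shows, `createDataFrame` ships the schema to the JVM as JSON via `schema.json()`, so the Pythonic `None` arrives at `Metadata.fromJObject` as a JSON `null` it does not know how to handle. A quick way to confirm this (a sketch reusing `literal_metadata` from above):

      # The None in the 'age' field's metadata is serialized as a JSON null
      print(literal_metadata.json())
      # ... "metadata": {"comment": null} ...

      Until a fix is available, a minimal workaround on affected versions is to drop `None`-valued entries from each field's metadata before creating the DataFrame. The `strip_none_metadata` helper below is hypothetical, not part of PySpark:

      import pyspark.sql.types as types

      def strip_none_metadata(schema):
          """Return a copy of `schema` with None-valued metadata keys removed."""
          fields = [
              types.StructField(
                  f.name,
                  f.dataType,
                  nullable=f.nullable,
                  metadata={k: v for k, v in f.metadata.items() if v is not None},
                  )
              for f in schema.fields
              ]
          return types.StructType(fields)

      # createDataFrame succeeds once the None-valued keys are gone
      working_dataframe = sqlContext.createDataFrame(
          literal_rdd,
          strip_none_metadata(literal_metadata),
          )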

        Activity

          Jason C Lee added a comment -

          I would like to work on this.

          Shea Parkes added a comment -

          This issue caused me to learn enough Scala only to discover that the exception still wasn't helpful even once I knew what a scala.Tuple2 was.

          I'm not planning on doing any further work on this, so to the extent you were waiting to avoid duplicating effort with me, feel free to go ahead and knock it out. I'm not entirely familiar with the contribution guidelines, but I'm sure you can work them out.

          In case it wasn't clear above, the line that triggers the error is:

          metadata={'comment': None}
          

          Thanks for the interest!

          Jason C Lee added a comment -

          Instead of
          Py4JJavaError: An error occurred while calling o757.applySchemaToPythonRDD.
          : java.lang.RuntimeException: Do not support type class scala.Tuple2.

          Would it be helpful if the error message were this:
          Py4JJavaError: An error occurred while calling o76.applySchemaToPythonRDD.
          : java.lang.RuntimeException: Do not support type class java.lang.String : class org.json4s.JsonAST$JNull$.

          Apache Spark added a comment -

          User 'jasoncl' has created a pull request for this issue:
          https://github.com/apache/spark/pull/8969

          Shea Parkes added a comment -

          I appreciate your assistance! I think your proposal is an improvement, but I think it would be better if the failure were triggered upon the creation of the StructType object - that's where the error actually occurred.

          The distance between the definition of the metadata and the import was much larger in my project; I think your new error message would still have me looking for NULL values in my data (instead of my metadata). That's likely partly due to my unfamiliarity with Scala, but I chased as far down the pyspark code as I could and still didn't figure it out without trial and error.

          I realize this might mean traversing an arbitrary dictionary in the StructType initialization looking for disallowed types, which might be unacceptable. It would still be much more in line with the "Crash Early, Crash Often" philosophy if it were possible to bomb at the creation of the metadata.

          Thanks again for the assistance!
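
          For illustration, a minimal sketch of the eager check proposed above - a hypothetical validate_metadata helper, not part of PySpark or the eventual patch - that walks a metadata dict at schema-definition time and fails fast on value types the JVM-side Metadata class cannot represent (which, before the fix, included None):

          # Scalar types the JVM-side Metadata class is known to accept
          _ALLOWED_SCALARS = (bool, int, float, str)

          def validate_metadata(md, path='metadata'):
              # Recursively check that every value is a supported scalar,
              # a nested dict, or a list of supported scalars, and report
              # the exact offending key on failure.
              for key, value in md.items():
                  where = "%s[%r]" % (path, key)
                  if isinstance(value, dict):
                      validate_metadata(value, where)
                  elif isinstance(value, list):
                      for i, item in enumerate(value):
                          if not isinstance(item, _ALLOWED_SCALARS):
                              raise TypeError("%s[%d] has unsupported type %s" %
                                              (where, i, type(item).__name__))
                  elif not isinstance(value, _ALLOWED_SCALARS):
                      raise TypeError("%s has unsupported type %s" %
                                      (where, type(value).__name__))

          validate_metadata({'comment': None})
          # TypeError: metadata['comment'] has unsupported type NoneType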

          Shea Parkes added a comment -

          My apologies, I just read your patch and see you made it work even with Pythonic Nulls. You rule sir; thanks a bunch.

          Jason C Lee added a comment -

          You're welcome!

          Yin Huai added a comment -

          Issue resolved by pull request 8969
          https://github.com/apache/spark/pull/8969

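          With pull request 8969 merged (fix versions 1.5.3, 1.6.1 and 2.0.0), the original example runs as-is. Assuming, per Shea's comment above, that the patch lets the Pythonic None survive the round trip through the JVM, this can be checked like so (a sketch reusing the objects from the description):

          # On a fixed version this no longer raises, and the None in the
          # 'age' field's metadata should come back intact.
          fixed_dataframe = sqlContext.createDataFrame(literal_rdd, literal_metadata)
          print(fixed_dataframe.schema.fields[1].metadata)
          # expected: {'comment': None}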

          People

            Assignee: Jason C Lee
            Reporter: Shea Parkes
            Votes: 0
            Watchers: 4
