Spark / SPARK-10847

PySpark - DataFrame - Optional Metadata with `None` triggers cryptic failure

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.5.3, 1.6.1, 2.0.0
    • Component/s: PySpark, SQL
    • Labels: None
    • Environment: Windows 7
      java version "1.8.0_60" (64-bit)
      Python 3.4.x

      Standalone cluster mode (not local[n]; a full local cluster)

    Description

      If the optional metadata passed to `pyspark.sql.types.StructField` includes a Pythonic `None`, then `pyspark.sql.SQLContext.createDataFrame` fails with a very cryptic, unhelpful error.

      Here is a minimal reproducible example:

      # Assumes a SparkContext `sc` already exists
      import pyspark.sql.types as types
      from pyspark.sql import SQLContext

      sqlContext = SQLContext(sc)

      literal_metadata = types.StructType([
          types.StructField(
              'name',
              types.StringType(),
              nullable=True,
              metadata={'comment': 'From accounting system.'}
              ),
          types.StructField(
              'age',
              types.IntegerType(),
              nullable=True,
              metadata={'comment': None}  # this None triggers the failure
              ),
          ])

      literal_rdd = sc.parallelize([
          ['Bob', 34],
          ['Dan', 42],
          ])
      print(literal_rdd.take(2))

      # Fails on affected versions with the error below
      failed_dataframe = sqlContext.createDataFrame(
          literal_rdd,
          literal_metadata,
          )
      

      This produces the following stack trace:

      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "<string>", line 28, in <module>
        File "S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\pyspark\sql\context.py", line 408, in createDataFrame
          jdf = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
        File "S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 538, in __call__
        File "S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\pyspark\sql\utils.py", line 36, in deco
          return f(*a, **kw)
        File "S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
      py4j.protocol.Py4JJavaError: An error occurred while calling o757.applySchemaToPythonRDD.
      : java.lang.RuntimeException: Do not support type class scala.Tuple2.
      	at org.apache.spark.sql.types.Metadata$$anonfun$fromJObject$1.apply(Metadata.scala:160)
      	at org.apache.spark.sql.types.Metadata$$anonfun$fromJObject$1.apply(Metadata.scala:127)
      	at scala.collection.immutable.List.foreach(List.scala:318)
      	at org.apache.spark.sql.types.Metadata$.fromJObject(Metadata.scala:127)
      	at org.apache.spark.sql.types.DataType$.org$apache$spark$sql$types$DataType$$parseStructField(DataType.scala:173)
      	at org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:148)
      	at org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:148)
      	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
      	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
      	at scala.collection.immutable.List.foreach(List.scala:318)
      	at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
      	at scala.collection.AbstractTraversable.map(Traversable.scala:105)
      	at org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:148)
      	at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:96)
      	at org.apache.spark.sql.SQLContext.parseDataType(SQLContext.scala:961)
      	at org.apache.spark.sql.SQLContext.applySchemaToPythonRDD(SQLContext.scala:970)
      	at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
      	at java.lang.reflect.Method.invoke(Unknown Source)
      	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
      	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
      	at py4j.Gateway.invoke(Gateway.java:259)
      	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
      	at py4j.commands.CallCommand.execute(CallCommand.java:79)
      	at py4j.GatewayConnection.run(GatewayConnection.java:207)
      	at java.lang.Thread.run(Unknown Source)
      

      I believe the most important line of the traceback is this one:

      py4j.protocol.Py4JJavaError: An error occurred while calling o757.applySchemaToPythonRDD.
      : java.lang.RuntimeException: Do not support type class scala.Tuple2.
      

      But it wasn't enough for me to figure out the problem; I had to steadily simplify my program until I could identify the cause.
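
      The message makes more sense once you see what actually crosses the Python/JVM boundary. As the traceback shows, `createDataFrame` ships the schema to the JVM as JSON via `schema.json()`, so the Pythonic `None` arrives at `Metadata.fromJObject` as a JSON `null` it does not know how to handle. A quick way to confirm this (a sketch reusing `literal_metadata` from above):

      # The None in the 'age' field's metadata is serialized as a JSON null
      print(literal_metadata.json())
      # ... "metadata": {"comment": null} ...

      Until a fix is available, a minimal workaround on affected versions is to drop `None`-valued entries from each field's metadata before creating the DataFrame. The `strip_none_metadata` helper below is hypothetical, not part of PySpark:

      import pyspark.sql.types as types

      def strip_none_metadata(schema):
          """Return a copy of `schema` with None-valued metadata keys removed."""
          fields = [
              types.StructField(
                  f.name,
                  f.dataType,
                  nullable=f.nullable,
                  metadata={k: v for k, v in f.metadata.items() if v is not None},
                  )
              for f in schema.fields
              ]
          return types.StructType(fields)

      # createDataFrame succeeds once the None-valued keys are gone
      working_dataframe = sqlContext.createDataFrame(
          literal_rdd,
          strip_none_metadata(literal_metadata),
          )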

        Activity

          Jason C Lee added a comment -

          I would like to work on this.

          Shea Parkes added a comment -

          This issue caused me to learn enough Scala only to discover that the exception still wasn't helpful even once I knew what a scala.Tuple2 was.

          I'm not planning on doing any further work on this, so to the extent you were waiting to avoid duplicating effort with me, feel free to go ahead and knock it out. I'm not entirely familiar with the contribution guidelines, but I'm sure you can work them out.

          In case it wasn't clear above, the line that triggers the error is:

          metadata={'comment': None}
          

          Thanks for the interest!

          Jason C Lee added a comment -

          Instead of
          Py4JJavaError: An error occurred while calling o757.applySchemaToPythonRDD.
          : java.lang.RuntimeException: Do not support type class scala.Tuple2.

          Would it be helpful if the error message were this:
          Py4JJavaError: An error occurred while calling o76.applySchemaToPythonRDD.
          : java.lang.RuntimeException: Do not support type class java.lang.String : class org.json4s.JsonAST$JNull$.

          Apache Spark added a comment -

          User 'jasoncl' has created a pull request for this issue:
          https://github.com/apache/spark/pull/8969

          Shea Parkes added a comment -

          I appreciate your assistance! I think your proposal is an improvement, but I think it would be better if the failure were triggered upon the creation of the StructType object - that's where the error actually occurred.

          The distance between the definition of the metadata and the import was much larger in my project; I think your new error message would still have me looking for NULL values in my data (instead of my metadata). That's likely partly due to my unfamiliarity with Scala, but I chased as far down the pyspark code as I could and still didn't figure it out without trial and error.

          I realize this might mean traversing an arbitrary dictionary in the StructType initialization looking for disallowed types, which might be unacceptable. It would still be much more in line with the "Crash Early, Crash Often" philosophy if it were possible to bomb at the creation of the metadata.

          Thanks again for the assistance!
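
          For illustration, a minimal sketch of the eager check proposed above - a hypothetical validate_metadata helper, not part of PySpark or the eventual patch - that walks a metadata dict at schema-definition time and fails fast on value types the JVM-side Metadata class cannot represent (which, before the fix, included None):

          # Scalar types the JVM-side Metadata class is known to accept
          _ALLOWED_SCALARS = (bool, int, float, str)

          def validate_metadata(md, path='metadata'):
              # Recursively check that every value is a supported scalar,
              # a nested dict, or a list of supported scalars, and report
              # the exact offending key on failure.
              for key, value in md.items():
                  where = "%s[%r]" % (path, key)
                  if isinstance(value, dict):
                      validate_metadata(value, where)
                  elif isinstance(value, list):
                      for i, item in enumerate(value):
                          if not isinstance(item, _ALLOWED_SCALARS):
                              raise TypeError("%s[%d] has unsupported type %s" %
                                              (where, i, type(item).__name__))
                  elif not isinstance(value, _ALLOWED_SCALARS):
                      raise TypeError("%s has unsupported type %s" %
                                      (where, type(value).__name__))

          validate_metadata({'comment': None})
          # TypeError: metadata['comment'] has unsupported type NoneType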

          Shea Parkes added a comment -

          My apologies, I just read your patch and see you made it work even with Pythonic Nulls. You rule sir; thanks a bunch.

          Jason C Lee added a comment -

          You're welcome!

          Yin Huai added a comment -

          Issue resolved by pull request 8969
          https://github.com/apache/spark/pull/8969

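          With pull request 8969 merged (fix versions 1.5.3, 1.6.1 and 2.0.0), the original example runs as-is. Assuming, per Shea's comment above, that the patch lets the Pythonic None survive the round trip through the JVM, this can be checked like so (a sketch reusing the objects from the description):

          # On a fixed version this no longer raises, and the None in the
          # 'age' field's metadata should come back intact.
          fixed_dataframe = sqlContext.createDataFrame(literal_rdd, literal_metadata)
          print(fixed_dataframe.schema.fields[1].metadata)
          # expected: {'comment': None}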

          People

            Assignee: Jason C Lee
            Reporter: Shea Parkes
            Votes: 0
            Watchers: 4
