Spark / SPARK-5063

Display more helpful error messages for several invalid operations

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2.1, 1.3.0
    • Component/s: Spark Core
    • Labels: None
    • Target Version/s:

      Description

      Spark does not support nested RDDs or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow; the ones that I've answered personally are just a sample, and there are many others.

      I think we can detect these errors by adding logic to RDD to check whether sc is null (e.g. by turning sc into a getter function); we can use this check to raise a better error message.
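A minimal Python sketch of that getter idea (the actual fix belongs in Spark's Scala RDD class; the class, field, and message below are illustrative, not Spark's real API):

```python
class RDD:
    """Illustrative stand-in for Spark's RDD; not the real API."""

    def __init__(self, sc):
        # _sc ends up null/None when an RDD is serialized into a task
        # closure, which is how nested-RDD usage manifests at runtime.
        self._sc = sc

    @property
    def sc(self):
        # Turning the bare field into a getter lets us replace a
        # confusing NullPointerException with an actionable message.
        if self._sc is None:
            raise RuntimeError(
                "This RDD lacks a SparkContext. It could be that RDD "
                "transformations and actions are being invoked inside "
                "of other transformations, which is not supported.")
        return self._sc

    def count(self):
        # Every action goes through self.sc, so it hits the check
        # before anything can dereference a null context.
        return self.sc  # placeholder for the real job submission
```

Because every action and transformation reads the context through the getter, the check fires at the first misuse rather than deep inside the scheduler.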

      In PySpark, these errors manifest themselves slightly differently. Attempting to nest RDDs or perform actions inside of transformations results in pickle-time errors:

      rdd1 = sc.parallelize(range(100))
      rdd2 = sc.parallelize(range(100))
      rdd1.mapPartitions(lambda x: [rdd2.map(lambda x: x)])
      

      produces

      [...]
        File "/Users/joshrosen/anaconda/lib/python2.7/pickle.py", line 306, in save
          rv = reduce(self.proto)
        File "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
        File "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 304, in get_return_value
      py4j.protocol.Py4JError: An error occurred while calling o21.__getnewargs__. Trace:
      py4j.Py4JException: Method __getnewargs__([]) does not exist
      	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
      	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
      	at py4j.Gateway.invoke(Gateway.java:252)
      	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
      	at py4j.commands.CallCommand.execute(CallCommand.java:79)
      	at py4j.GatewayConnection.run(GatewayConnection.java:207)
      	at java.lang.Thread.run(Thread.java:745)
      

      We get the same error when attempting to broadcast an RDD in PySpark. For Python, improved error reporting could be as simple as overriding the __getnewargs__ method to throw an UnsupportedOperation-style exception with a more helpful error message.
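A sketch of that approach, assuming a simplified RDD wrapper (the real pyspark.RDD wraps a Java object through Py4J; the wording of the message here is illustrative):

```python
class RDD:
    """Simplified stand-in for pyspark.RDD; names are illustrative."""

    def __init__(self, jrdd):
        self._jrdd = jrdd  # the Py4J handle that cannot be pickled

    def __getnewargs__(self):
        # Pickle (protocol 2+) calls __getnewargs__ while reducing the
        # object, so this runs whenever an RDD is captured in a closure
        # or broadcast -- failing loudly here replaces the cryptic
        # "Method __getnewargs__([]) does not exist" Py4JError.
        raise Exception(
            "It appears that you are attempting to broadcast an RDD "
            "or reference an RDD from an action or transformation. "
            "RDD transformations and actions can only be invoked by "
            "the driver, not inside of other transformations.")
```

Any attempt to pickle such an RDD (which is what closure serialization and broadcasting do under the hood) now fails immediately with the descriptive message instead of the Py4J stack trace above.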

      Users may also see confusing NPEs when calling methods on stopped SparkContexts, so I've added checks for that as well.
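A sketch of such a guard, again with illustrative names rather than the real SparkContext internals:

```python
class SparkContext:
    """Illustrative sketch, not the real pyspark.SparkContext."""

    def __init__(self):
        self._stopped = False

    def stop(self):
        self._stopped = True

    def _assert_not_stopped(self):
        # Called at the top of every method that needs a live context,
        # turning a later NullPointerException into an immediate,
        # descriptive error.
        if self._stopped:
            raise RuntimeError(
                "Cannot call methods on a stopped SparkContext.")

    def parallelize(self, data):
        self._assert_not_stopped()
        return list(data)  # placeholder for creating a real RDD
```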

              People

              • Assignee: joshrosen Josh Rosen
              • Reporter: joshrosen Josh Rosen
              • Votes: 0
              • Watchers: 3
