Spark / SPARK-5063

Display more helpful error messages for several invalid operations


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2.1, 1.3.0
    • Component/s: Spark Core
    • Labels: None

    Description

      Spark does not support nested RDDs or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on Stack Overflow; I've personally answered a sample of them, and there are many others.

      I think we can detect these errors by adding logic to RDD to check whether sc is null (e.g. turn sc into a getter function); we can use this to add a better error message.
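
The getter idea can be illustrated outside of Spark. The following is a minimal Python sketch, not Spark's actual implementation (which is written in Scala). `ToyRDD`, `ToyContext`, and `MissingContextError` are hypothetical names, and constructing a `ToyRDD` with `None` stands in for the copy of an RDD that arrives on a worker without its context.

```python
class MissingContextError(RuntimeError):
    """Hypothetical error type for an RDD that is used without its context."""


class ToyContext:
    """Stand-in for a SparkContext; it only exists on the driver."""

    def default_parallelism(self):
        return 4


class ToyRDD:
    """Stand-in for an RDD that reaches its context through a getter."""

    def __init__(self, context):
        # On a worker, the context reference would be missing (None here).
        self._sc = context

    @property
    def sc(self):
        # The getter is the single place where a missing context is detected,
        # so every method that needs the context gets the same clear error.
        if self._sc is None:
            raise MissingContextError(
                "This RDD lacks a context. Transformations and actions can "
                "only be invoked by the driver, not inside other transformations."
            )
        return self._sc

    def num_partitions(self):
        return self.sc.default_parallelism()


driver_rdd = ToyRDD(ToyContext())
worker_copy = ToyRDD(None)  # simulates an RDD referenced inside a closure

driver_result = driver_rdd.num_partitions()
try:
    worker_copy.num_partitions()
    worker_error = None
except MissingContextError as exc:
    worker_error = str(exc)
```

Routing every access through one getter means the check does not have to be repeated in each method, and a direct null dereference is replaced by a message that names the likely cause.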

      In PySpark, these errors manifest themselves slightly differently. Attempting to nest RDDs or perform actions inside of transformations results in pickle-time errors:

      rdd1 = sc.parallelize(range(100))
      rdd2 = sc.parallelize(range(100))
      rdd1.mapPartitions(lambda x: [rdd2.map(lambda x: x)])
      

      produces

      [...]
        File "/Users/joshrosen/anaconda/lib/python2.7/pickle.py", line 306, in save
          rv = reduce(self.proto)
        File "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
        File "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 304, in get_return_value
      py4j.protocol.Py4JError: An error occurred while calling o21.__getnewargs__. Trace:
      py4j.Py4JException: Method __getnewargs__([]) does not exist
      	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
      	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
      	at py4j.Gateway.invoke(Gateway.java:252)
      	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
      	at py4j.commands.CallCommand.execute(CallCommand.java:79)
      	at py4j.GatewayConnection.run(GatewayConnection.java:207)
      	at java.lang.Thread.run(Thread.java:745)
      

      We get the same error when attempting to broadcast an RDD in PySpark. For Python, improved error reporting could be as simple as overriding the __getnewargs__ method to throw an UnsupportedOperation exception with a more helpful error message.
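
The `__getnewargs__` approach can be tried with the standard `pickle` module alone. This is a minimal sketch under stated assumptions: `ToyRDD` and `UnpicklableHandleError` are hypothetical names, the error text is illustrative wording, and the real PySpark class and its serializer are not involved.

```python
import pickle


class UnpicklableHandleError(Exception):
    """Hypothetical exception type; the name is chosen for this sketch."""


class ToyRDD:
    """Stand-in for a driver-side RDD handle that must not be serialized."""

    def __getnewargs__(self):
        # pickle consults __getnewargs__ while reducing the object, so raising
        # here replaces an obscure failure with an actionable message.
        raise UnpicklableHandleError(
            "It appears that you are attempting to broadcast an RDD or "
            "reference an RDD from an action or transformation."
        )


def try_pickle(obj):
    """Return the error message produced by pickling obj, or None on success."""
    try:
        pickle.dumps(obj)
    except UnpicklableHandleError as exc:
        return str(exc)
    return None


message = try_pickle(ToyRDD())
```

Because the exception is raised at pickle time, the user sees it at the point where the closure capturing the RDD is serialized, which is the same place the original Py4J error appeared.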

      Users may also see confusing NPEs when calling methods on stopped SparkContexts, so I've added checks for that as well.
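
A check for a stopped context could follow the same pattern. The sketch below is illustrative only: `ToyContext`, its `_assert_not_stopped` helper, and the error message are assumptions made for this example and are not taken from Spark's code.

```python
class ToyContext:
    """Hypothetical stand-in for a SparkContext with an explicit stopped flag."""

    def __init__(self):
        self._stopped = False

    def stop(self):
        self._stopped = True

    def _assert_not_stopped(self):
        # Fail early with a clear message instead of letting a later call
        # hit internal state that was torn down by stop().
        if self._stopped:
            raise RuntimeError("Cannot call methods on a stopped context.")

    def parallelize(self, data):
        self._assert_not_stopped()
        return list(data)


ctx = ToyContext()
before = ctx.parallelize(range(3))

ctx.stop()
try:
    ctx.parallelize(range(3))
    error = None
except RuntimeError as exc:
    error = str(exc)
```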

          People

            Assignee: Josh Rosen (joshrosen)
            Reporter: Josh Rosen (joshrosen)